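The same code and the same test page work on my Mac but fail on Ubuntu. Both machines run Ruby 2.0.0, and the locale settings are identical.
I'm parsing the page with Hpricot.
The page is encoded in gb2312, and I convert it to UTF-8 when reading it.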
require 'iconv'  # iconv is no longer in the standard library as of Ruby 2.0; it is a separate gem
y = Iconv.iconv("UTF-8//IGNORE", "gb2312", f.read)[0]
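For comparison, here is a minimal sketch of the same conversion using Ruby's built-in String#encode instead of Iconv. The sample bytes are made up for illustration (they stand in for f.read), and the replace options only roughly approximate what //IGNORE does, so treat this as one possible approach rather than a drop-in equivalent.

```ruby
# Hypothetical sample input: the GB2312 bytes for two Chinese characters,
# standing in for the result of f.read on the real page.
raw = "\xD6\xD0\xCE\xC4".b

# Tag the raw bytes as GB2312, then transcode to UTF-8.
# invalid/undef => :replace with an empty replacement drops bytes that
# cannot be converted, which is roughly what //IGNORE asks Iconv to do.
utf8 = raw.force_encoding("GB2312")
          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")

puts utf8.encoding         # the encoding the result is tagged with
puts utf8.valid_encoding?  # whether the bytes are valid for that encoding
```

Checking valid_encoding? right after the conversion helps rule out the conversion step itself before looking at the parser.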
irb(main):069:0> y.encoding
=> #<Encoding:UTF-8>
irb(main):070:0> d1 = Hpricot(y)
ArgumentError: invalid byte sequence in US-ASCII
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `gsub'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `uxs'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:63:in `block (2 levels) in pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `block in pretty_print_stag'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:56:in `pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:41:in `block in pretty_print'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:40:in `pretty_print'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `block in pp'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `pp'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block (2 levels) in pretty_print'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block in pretty_print'
... 11 levels...
Judging from the error message, it looks like the string is being treated as US-ASCII?
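The trace shows the exception coming from a gsub call reached through pretty_print, i.e. while irb is printing the parsed document. Below is a small sketch that reproduces the same error message outside Hpricot. It assumes (this is a guess, not something the trace proves) that a string holding UTF-8 bytes ends up tagged as US-ASCII somewhere along the way; the sample text and the "&" pattern are made up for illustration.

```ruby
# Assumption: a string containing UTF-8 bytes gets tagged as US-ASCII.
# The sample text is hypothetical, not taken from the real page.
s = "\u4E2D\u6587".dup.force_encoding("US-ASCII")

message = nil
begin
  # Matching a regexp against bytes that are invalid for the tagged
  # encoding raises ArgumentError.
  s.gsub(/&/, "&amp;")
rescue ArgumentError => e
  message = e.message
end
puts message  # prints "invalid byte sequence in US-ASCII"

# Worth comparing between the two machines: the default external
# encoding Ruby derived from the environment.
puts Encoding.default_external

# Re-tagging the same bytes as UTF-8 makes them valid again.
retagged = s.dup.force_encoding("UTF-8")
puts retagged.valid_encoding?
```

If Encoding.default_external prints US-ASCII on the Ubuntu machine and UTF-8 on the Mac, that difference would be a plausible suspect, even when the locale settings look the same.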
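Next time you're job hunting, check whether the company's homepage is encoded in UTF-8. If it isn't, you might want to think twice.
Also, the Hpricot project has been closed, so everyone should just use nokogiri instead.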