Ruby was defeated by this encoding problem

sakura79 · March 22, 2014 · Last by sakura79 replied at April 01, 2014 · 2967 hits

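The same code and the same test page work on a Mac but fail on Ubuntu. Both machines run Ruby 2.0.0 and have identical locale settings.

I am parsing the page with Hpricot.

The page is gb2312-encoded, and I convert it to UTF-8 when reading it.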

y = Iconv.iconv("UTF-8//IGNORE", "gb2312",f.read)[0]
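For comparison, the same conversion can be sketched with Ruby's built-in `String#encode`, which avoids the `iconv` dependency. This is only a sketch: `gb_to_utf8` is a hypothetical helper name, the sample bytes stand in for `f.read`, and using `GBK` as the source encoding is an assumption (it is a superset of gb2312). Dropping invalid or unmappable bytes roughly mirrors what `//IGNORE` does.

```ruby
# Hypothetical helper: transcode raw GB2312/GBK bytes to UTF-8 without Iconv.
# source_encoding defaults to GBK, which is an assumption about the page.
def gb_to_utf8(raw, source_encoding = 'GBK')
  raw.dup
     .force_encoding(source_encoding)   # label the bytes, nothing is converted yet
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Stand-in for f.read: two Chinese characters (U+4F60 U+597D) as raw GBK bytes.
raw = "\u4F60\u597D".encode('GBK').b

y = gb_to_utf8(raw)
puts y.encoding
puts y.valid_encoding?
```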

irb(main):069:0> y.encoding
=> #<Encoding:UTF-8>

irb(main):070:0> d1 = Hpricot(y)
ArgumentError: invalid byte sequence in US-ASCII
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `gsub'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `uxs'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:63:in `block (2 levels) in pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `block in pretty_print_stag'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:56:in `pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:41:in `block in pretty_print'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:40:in `pretty_print'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `block in pp'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `pp'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block (2 levels) in pretty_print'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block in pretty_print'
        ... 11 levels...

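Judging from the error message, it looks like the string is being treated as US-ASCII?

From now on, when you are job hunting, check whether the company's homepage is UTF-8 encoded. If it is not, think twice.

Also, the Hpricot project has been closed, so everyone had better use nokogiri instead.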
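The same message can be reproduced without Hpricot, which may help confirm that guess. The sketch below assumes only that some string reaching `gsub` carries a US-ASCII label while holding non-ASCII bytes; `try_gsub` is a hypothetical helper, and the pattern is an arbitrary example rather than the one Hpricot uses.

```ruby
# UTF-8 bytes for two Chinese characters, deliberately mislabelled as US-ASCII.
mislabeled = "\u4F60\u597D".dup.force_encoding('US-ASCII')

# Hypothetical helper: run a regexp gsub and report either :ok or the error text.
def try_gsub(str)
  str.gsub(/&/, '&amp;')
  :ok
rescue ArgumentError => e
  e.message
end

result_bad = try_gsub(mislabeled)
result_ok  = try_gsub(mislabeled.dup.force_encoding('UTF-8'))

puts result_bad
puts result_ok
```

The bytes are identical in both calls. Only the encoding label differs, and that alone decides whether `gsub` raises.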

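If you are sure the page is GBK-encoded, then first relabel the string and afterwards transcode it:

```ruby
# Tell Ruby which encoding the bytes are really in (no bytes are changed).
site_body = site_body.force_encoding('gb2312')

# Then transcode to UTF-8.
site_body_utf8 = site_body.encode('utf-8')
```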
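To make the difference between the two calls concrete, here is a small self-contained sketch. The sample text and the choice of `GBK` are assumptions for illustration; `force_encoding` only swaps the label on the same bytes, while `encode` produces new bytes.

```ruby
# Stand-in for a downloaded page body: raw GBK bytes for U+4F60 U+597D.
raw = "\u4F60\u597D".encode('GBK').b

tagged = raw.dup.force_encoding('GBK')  # same bytes, new label
utf8   = tagged.encode('UTF-8')         # new bytes, converted character by character

same_bytes = (tagged.bytes == raw.bytes)

puts raw.encoding
puts tagged.encoding
puts utf8.encoding
puts same_bytes
puts raw.bytesize
puts utf8.bytesize
```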

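I don't know what went wrong inside the library above, but after processing the string it told Ruby that the string was ASCII. When Ruby then worked on it and found bytes that are not valid in that encoding, it raised the error.

Does that iconv call the command-line iconv? If so, it seems that gb2312-encoded content has to be read as gbk.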
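One way to check this on both machines is to print a string's label, whether its bytes are valid under that label, and the process-wide default. `encoding_report` is a hypothetical helper. `Encoding.default_external` is derived from the locale environment and is US-ASCII under the C/POSIX locale, so it is worth comparing even when the locale settings look identical. Whether Hpricot actually depends on that default is an assumption I have not verified.

```ruby
# Hypothetical helper: summarise how Ruby currently sees a string.
def encoding_report(str)
  {
    label: str.encoding.name,            # the encoding Ruby believes the bytes are in
    valid: str.valid_encoding?,          # do the bytes actually fit that encoding?
    ascii_only: str.ascii_only?,         # true only for pure 7-bit content
    default_external: Encoding.default_external.name
  }
end

# UTF-8 bytes carrying a wrong US-ASCII label, as described in the reply above.
suspect = "\u4F60\u597D".dup.force_encoding('US-ASCII')
report  = encoding_report(suspect)

# Relabelling (not transcoding) is enough when the bytes are already UTF-8.
fixed_report = encoding_report(suspect.dup.force_encoding('UTF-8'))

p report
p fixed_report
```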

y = Iconv.iconv("UTF-8//IGNORE", "gbk",f.read)[0]
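The reason gbk is the safer source encoding is that GBK is a superset of GB2312, and pages declared as gb2312 often contain characters that only GBK covers. The sketch below uses Ruby's built-in transcoder rather than Iconv. `encodable?` is a hypothetical helper, and U+9555 is used on the assumption that it lies outside GB2312 but inside GBK.

```ruby
# Hypothetical helper: can this text be represented in the given encoding?
def encodable?(text, encoding)
  text.encode(encoding)
  true
rescue EncodingError   # covers undefined-conversion and invalid-byte errors
  false
end

common = "\u4F60"   # a very common character, present in both character sets
rare   = "\u9555"   # assumed to be in GBK but not in GB2312

puts encodable?(common, 'GB2312')
puts encodable?(common, 'GBK')
puts encodable?(rare, 'GB2312')
puts encodable?(rare, 'GBK')
```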

Reply #2 @bom_d_van Only the execution environment differs. Everything else is the same, including the target page.
