Ruby got beaten by this encoding problem

sakura79 · March 22, 2014 · Last reply by sakura79 on April 1, 2014 · 2971 views

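The same code and the same test page work on a Mac but fail on Ubuntu. Both machines run Ruby 2.0.0, and the locale settings are identical.

I'm parsing the page with Hpricot.

The page is encoded in gb2312, and I convert it to UTF-8 when reading it.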

y = Iconv.iconv("UTF-8//IGNORE", "gb2312",f.read)[0]

irb(main):069:0> y.encoding
=> #<Encoding:UTF-8>

irb(main):070:0> d1 = Hpricot(y)
ArgumentError: invalid byte sequence in US-ASCII
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `gsub'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/builder.rb:10:in `uxs'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:63:in `block (2 levels) in pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:60:in `block in pretty_print_stag'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:56:in `pretty_print_stag'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:41:in `block in pretty_print'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:40:in `pretty_print'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `block in pp'
        from /usr/local/lib/ruby/2.0.0/prettyprint.rb:381:in `group'
        from /usr/local/lib/ruby/2.0.0/pp.rb:155:in `pp'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block (2 levels) in pretty_print'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `each'
        from /usr/local/lib/ruby/gems/2.0.0/gems/hpricot-0.8.6/lib/hpricot/inspect.rb:47:in `block in pretty_print'
        ... 11 levels...

Judging from the error message, it looks like the string is being treated as US-ASCII?
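One thing that may be worth comparing on the two machines is which encodings Ruby itself picked up from the environment, since an unset or `C` locale in the shell that launches irb typically leaves `Encoding.default_external` at US-ASCII. This is only a guess at the cause, not a confirmed diagnosis. A minimal sketch, where the helper name `encoding_report` is made up for illustration:

```ruby
# Hypothetical helper: collect the encoding-related settings of the current process.
def encoding_report
  internal = Encoding.default_internal
  {
    default_external: Encoding.default_external.name,
    default_internal: internal && internal.name, # nil unless explicitly set
    source_encoding:  __ENCODING__.name,
    locale_charmap:   Encoding.locale_charmap,
    lang:             ENV["LANG"],
    lc_all:           ENV["LC_ALL"]
  }
end

# Run this on both the Mac and the Ubuntu box and compare the output.
encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

If the two reports differ, starting irb with something like `ruby -E UTF-8:UTF-8 -S irb`, or fixing the locale of the shell, would be the next thing to try.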

From now on, when you're job hunting, check whether the company's homepage is UTF-8 encoded. If it isn't, think twice.

Also, the Hpricot project has been closed, so everyone should just use nokogiri.

If you're sure the page is gbk-encoded, then do this:

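```ruby
# Label the raw bytes with their real encoding (this does not change the bytes).
site_body = site_body.force_encoding('gb2312')

# Then transcode to UTF-8.
site_body_utf8 = site_body.encode('utf-8')
```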
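For what it's worth, Iconv was dropped from the standard library in Ruby 2.0 and only lives on as a gem, so the same conversion can be done with `String#encode` alone. Below is a rough sketch, not a drop-in fix for the original script: the helper name `gbk_to_utf8` and the sample bytes are made up for illustration, and the `invalid:`/`undef:` options are meant to approximate what `//IGNORE` does.

```ruby
# Hypothetical helper: convert GBK/GB2312 bytes to UTF-8 without Iconv.
# Bytes that are invalid or have no UTF-8 mapping are dropped.
def gbk_to_utf8(raw_bytes)
  raw_bytes.dup
           .force_encoding("GBK")
           .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# Simulated file contents: D6 D0 CE C4 are the GBK bytes for "\u4E2D\u6587".
raw = [0xD6, 0xD0, 0xCE, 0xC4].pack("C*")

utf8 = gbk_to_utf8(raw)
puts utf8.encoding.name
puts utf8.valid_encoding?
```

GBK is a superset of GB2312, so labeling a GB2312 page as GBK is safe and also covers the extra characters that real-world "gb2312" pages often contain.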

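I don't know what went wrong inside the library above, but after it processed the string it told Ruby the string was ASCII. As soon as Ruby worked on the string, it found bytes that are not valid in that encoding and raised the error.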
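That failure mode is easy to reproduce outside Hpricot. The sketch below is not Hpricot's code; it only shows that a string whose bytes are non-ASCII but whose label is US-ASCII raises the same error as soon as a regexp-based `gsub` touches it. The helper name `gsub_error_for` is made up for illustration.

```ruby
# Hypothetical helper: run a regexp-based gsub and return the error message, if any.
def gsub_error_for(string)
  string.gsub(/&/, "&amp;") # regexp matching has to decode the bytes
  nil
rescue ArgumentError => e
  e.message
end

# UTF-8 bytes of a two-character Chinese word, deliberately mislabeled as US-ASCII.
mislabeled = "\u4E2D\u6587".dup.force_encoding("US-ASCII")

puts mislabeled.valid_encoding?          # the bytes are outside the ASCII range
puts gsub_error_for(mislabeled).inspect

# Relabeling (not converting) makes the string usable again: the bytes never changed.
relabeled = mislabeled.dup.force_encoding("UTF-8")
puts relabeled.valid_encoding?
puts gsub_error_for(relabeled).inspect
```

So checking `valid_encoding?` and `encoding` on whatever string reaches the failing `gsub` would show whether it is the label or the bytes that are wrong.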

Does that Iconv call the command-line iconv? If so, it seems gb2312-encoded content has to be read as gbk.

y = Iconv.iconv("UTF-8//IGNORE", "gbk",f.read)[0]

#2 @bom_d_van Only the execution environment is different. Everything else is the same, including the target page.
