Garbled text when scraping web pages with Ruby and Nokogiri

lrbnew · February 17, 2014 · 2788 hits

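While scraping web page content today I ran into garbled text (mojibake). The issue is discussed at length at http://ruby-china.org/topics/2484, but that thread is long and disorganized and hard to read, so I am posting the few snippets that solve the problem here for easy reference: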

@hooopo's approach 1:

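    require 'open-uri'
    require 'nokogiri'

    html = URI.open(url).read        # on Ruby older than 2.5, use open(url).read
    html.force_encoding("gbk")       # label the raw bytes as GBK (no conversion yet)
    html.encode!("utf-8")            # convert from GBK to UTF-8 in place
    doc = Nokogiri::HTML.parse html  # parse only after transcoding
    doc.css("body")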

@hooopo's approach 2:

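    require 'open-uri'
    require 'nokogiri'
    require 'iconv'  # removed from the standard library in Ruby 2.0; needs the iconv gem there

    html = URI.open(url).read                # on Ruby older than 2.5, use open(url).read
    html = Iconv.conv("utf-8", "gbk", html)  # arguments are (to, from, string)
    doc = Nokogiri::HTML.parse html
    doc.css("body")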

Note that both snippets transcode first and only then parse. Finally, thanks to hooopo for sharing.
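
To make the "transcode first, then parse" step concrete, here is a minimal sketch that uses only the Ruby standard library. The helper name gbk_to_utf8 and the sample bytes are my own illustration and do not come from the original thread; the replacement options are one possible way to keep a stray bad byte from raising an exception, not something hooopo suggested.

```ruby
# Hypothetical helper (not from the original thread): turn raw GBK bytes
# into a UTF-8 string before handing them to an HTML parser.
def gbk_to_utf8(raw)
  # dup so the caller's string is left alone; force_encoding only relabels
  # the bytes as GBK, it does not change them.
  labeled = raw.dup.force_encoding("GBK")
  # encode does the real byte conversion; invalid or unmappable sequences
  # are replaced with "?" instead of raising an exception.
  labeled.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Simulate a downloaded GBK page body: four bytes, two GBK characters.
raw = "\xD6\xD0\xCE\xC4".b
utf8 = gbk_to_utf8(raw)

puts utf8.encoding
puts utf8.valid_encoding?
```

The returned string can then be passed to Nokogiri::HTML.parse exactly as in the two snippets above.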
