Ruby Mechanize: Chinese text comes out garbled when scraping a page

esseak · 2014-01-11 · Last reply by esseak on 2014-01-16 · 3776 views

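I wanted to pull some data from JD (jd.com), so I wrote a small crawler with Mechanize. When the code prints body, all of the Chinese text comes out garbled. I have used Mechanize to crawl another, simpler site, and the Chinese there was parsed correctly, so I suspect an encoding problem. I have changed every encoding setting I could think of and nothing helped, so I am asking for help here. The code is below: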

require 'mechanize'
require 'nokogiri'
require 'active_record'
require 'json'

# encoding: utf-8

class Jd_spider_engine
  def run
    Mechanize::Util::CODE_DIC[:SJIS] = "utf-8"
    agent = Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    agent.max_history = 1
    agent.open_timeout = 10
    #agent.page.encoding = 'utf-8'

    page = agent.get("http://list.jd.com/670-671-672-0-0-0-0-0-0-0-1-1-1-1-1-72-4137-0.html",nil,nil,{ 'Accept-Charset' => 'utf-8' }
    )
    page.encoding = 'utf-8'

    # Test the Chinese text
    page.search("div.iloading").children.each { |c|
      #puts c.to_s.force_encoding("utf-8")
       puts c
    }

    # Body content
    puts page.body
  end
end

jd_spider_engine = Jd_spider_engine.new
jd_spider_engine.run

esseak #0 2014-01-11

Uh, is nobody going to give this a try?

lululau #1 2014-01-12

This page declares UTF-8, but the content is actually GBK.
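
One way to check a mismatch like this is to test whether the raw bytes are valid in each candidate encoding. The following is a minimal sketch, not part of the original post: `guess_encoding` is a hypothetical helper, and the sample bytes stand in for `page.body`. It is only a heuristic, since some byte sequences are valid in both encodings, which is why UTF-8 is tried first.

```ruby
# Hypothetical helper: returns the first candidate encoding in which the
# raw bytes form a valid string, or nil if none fits.
def guess_encoding(bytes, candidates = %w[UTF-8 GBK])
  candidates.find { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

# The GBK bytes of "中文", used here as a stand-in for a mislabeled page body.
gbk_bytes = "\xD6\xD0\xCE\xC4".b

puts guess_encoding(gbk_bytes)
```

If the helper reports GBK for a body whose headers or meta tag claim UTF-8, the declared encoding is probably wrong.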

cisolarix #2 2014-01-12

OP, try the suggestion from the reply above and see if it works?!

michael_roshen #3 2014-01-12

The encoding: utf-8 comment has to go on the first line, doesn't it?
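
For reference, a small sketch of where the magic comment belongs. Ruby only honors it on the first line of the file (or the second, when the first is a shebang). Note that it sets the encoding of the source file and its string literals; it does not change how bytes fetched from a remote page are interpreted. The variable name `literal` is just for illustration.

```ruby
# encoding: utf-8
# The magic comment above controls the source encoding of this file only.

literal = "中文"

# Both values reflect the source encoding, not any downloaded data.
puts __ENCODING__
puts literal.encoding
```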

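esseak #4 2014-01-16

@lululau @cisolarix @michael_roshen Thanks, everyone. Mechanize's encoding setting does not seem to transcode anything for you, so setting it to gbk had no effect. My current solution is to convert the data with iconv after fetching it: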

contents = Iconv.conv('utf-8', 'gbk', page.body)
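
Iconv was dropped from the Ruby standard library in Ruby 2.0, so on newer versions the same conversion can be done with the built-in `String#encode`. This is a minimal sketch under that assumption: `gbk_to_utf8` is a hypothetical helper, and the sample bytes stand in for `page.body`.

```ruby
# Hypothetical helper: treat the raw bytes as GBK and transcode to UTF-8.
# Bytes that cannot be converted are replaced instead of raising an error.
def gbk_to_utf8(bytes)
  bytes.encode("UTF-8", "GBK", invalid: :replace, undef: :replace)
end

# The GBK bytes of "中文", used here as a stand-in for page.body.
gbk_body = "\xD6\xD0\xCE\xC4".b

contents = gbk_to_utf8(gbk_body)
puts contents
```

In the crawler above, this would correspond to calling `gbk_to_utf8(page.body)` in place of the `Iconv.conv` line.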

