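I scrape web pages with Nokogiri. It works fine when run locally, but after deploying to a remote Linux server Nokogiri returns an empty result. I have been debugging this for several days and still cannot find the cause, so any pointers would be much appreciated. The Nokogiri version is 1.5.5.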
The code is as follows:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'real_estate_model'

    class RealEstateCrawler
      # Crawl the information for Beijing
      def crawlRealEstateInfoBeijing
        # ...
        doc = Nokogiri::HTML(open('http://www.amazon.com/'))
        p doc.text  # ==> returns empty ""
#2 @zj0713001 Yes, the response is:

    HTTP/1.1 405 MethodNotAllowed
    Date: Sun, 10 Mar 2013 04:27:39 GMT
    Server: Server
    Set-Cookie: skin=noskin; path=/; domain=.amazon.com; expires=Sun, 10-Mar-2013 04:27:39 GMT
    x-amz-id-1: 0CD805WS74CX4T5MQEHG
    allow: POST, GET
    x-amz-id-2: 0v59rEQ7p6cTT7iWHW36eeLSdzKyNKJjTopSJGM+8N0Dk7BjtLw6HGwXLiQGd1ot
    Vary: Accept-Encoding,User-Agent
    Content-Type: text/html; charset=ISO-8859-1
Does that count as a normal status code? Heh. I also tried google.com, which returns:

    HTTP/1.1 302 Found
    Location: http://www.google.co.in/
    Cache-Control: private
    Content-Type: text/html; charset=UTF-8
    Set-Cookie: PREF=ID=e215277947701215:FF=0:TM=1362889888:LM=1362889888:S=u5wkHr-Q9TBvmpyv; expires=Tue, 10-Mar-2015 04:31:28 GMT; path=/; domain=.google.com
    Set-Cookie: NID=67=NsonXLs34AafWa2INZAbNuc6WKGZKrGE38V5x0TkpkcWJIgRRjQ2nCNCpcFjNPlKHx0qlMAn924emQ_NQqYq7krbOI3uTcLeN5PoAEud-qzNETYF4WhX2gUvNdx4XO2p; expires=Mon, 09-Sep-2013 04:31:28 GMT; path=/; domain=.google.com; HttpOnly
    P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
    Date: Sun, 10 Mar 2013 04:31:28 GMT
    Server: gws
    Content-Length: 221
    X-XSS-Protection: 1; mode=block
    X-Frame-Options: SAMEORIGIN
#2 @zj0713001 But Nokogiri also returns an empty result for www.google.com:

    doc = Nokogiri::HTML(open('http://www.google.com/'))
    p doc.text
#5 @liveljack That 302 from Google should be fine. Try mechanize:

    puts Mechanize.new.get("http://www.google.com/").body
#8 @zj0713001 The mechanize result is correct: it returns the page content. But Nokogiri still parses it to an empty result. Could this be a Nokogiri bug?
#11 @liveljack How about trying this, with the encoding passed explicitly?

    Nokogiri::HTML.parse(Mechanize.new.get("http://www.google.com/").body, nil, 'utf-8')
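#13 @liveljack This is probably hard to sort out without looking at the actual machine... My guess is that when `open` (open-uri) acts as the user agent, the remote server refuses the request... but that is only a guess...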
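A rough way to test that guess is to send a browser-like User-Agent through open-uri and look at what the server actually returned before handing anything to Nokogiri. This is only a sketch under assumptions: `BROWSER_UA`, `crawler_headers` and `fetch_page` are hypothetical names, the User-Agent string is an arbitrary example, and `URI.open` assumes a newer Ruby than the one in this thread (on older versions plain `open` plays the same role).

```ruby
require 'open-uri'

# Hypothetical browser-like User-Agent; any realistic value would do.
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'

# Build the request headers that open-uri will send.
def crawler_headers(user_agent = BROWSER_UA)
  { 'User-Agent' => user_agent, 'Accept' => 'text/html' }
end

# Fetch a page and report what came back, so the raw response can be
# inspected before any HTML parsing happens.
def fetch_page(url, headers = crawler_headers)
  URI.open(url, headers) do |io|
    body = io.read
    { status: io.status, content_type: io.content_type,
      charset: io.charset, bytes: body.bytesize, body: body }
  end
rescue OpenURI::HTTPError => e
  # open-uri raises on 4xx/5xx; the response is still available via e.io.
  body = e.io.read
  { status: e.io.status, content_type: e.io.content_type,
    charset: e.io.charset, bytes: body.bytesize, body: body }
end
```

Calling `fetch_page('http://www.amazon.com/')` on both the local machine and the server and comparing `:status` and `:bytes` should show whether the server gets a different response. If `:bytes` is zero or the status is an error, the empty `doc.text` comes from the fetch rather than from Nokogiri.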