Fetching web pages in Ruby with open-uri

crazyjin · 2013-10-19 · Last reply by huacnlee on 2013-11-06 · 6963 views

#! /usr/local/bin/ruby
require 'date'
require 'nokogiri'
require 'open-uri'

c = 0 # number of pages fetched so far
puts DateTime.now.to_s

begin
  # Fetch the same page in a loop until the server answers with an HTTP error.
  loop do
    # Nokogiri::HTML(URI.open("http://www.example.com/"))
    data = URI.open("http://www.example.com") { |f| f.read }
    c += 1
    print '.'
  end
rescue OpenURI::HTTPError => e
  $stderr.puts e.to_s
  puts c.to_s
  puts DateTime.now.to_s
end

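On average it only manages a little over 100 pages per 60 seconds. Compared with a browser, that is a snail's pace. On some sites I also run into 403 Forbidden.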

I wrote a test case with em-http-request, and it feels even slower than open-uri.

#! /usr/local/bin/ruby

require 'eventmachine'
require 'em-http'
require 'nokogiri'

@count = 0 # requests finished so far, successful or not
@topic_ids = %w{14854 11168 14769 14875}
@conn = EventMachine::HttpRequest.new('http://www.ruby-china.org')


EventMachine.run{
  @topic_ids.each do |id|
    options = {
      :redirects => 5,
      :keepalive => true,
      :path => "",
      :head => {
        'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'accept-encoding' => 'gzip,deflate,sdch',
        'accept-language' => 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        'cache-control' => 'max-age=0',
        'user-agent' => 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36'
      }
    }

    options[:path] = "/topics/#{id}"
    req1 = @conn.get options
    # Count failed requests too, so the reactor stops once every request has finished.
    req1.errback {
      @count += 1
      EventMachine.stop if @topic_ids.length == @count
    }
    req1.callback {
      # doc = Nokogiri::HTML(req1.response)
      puts req1.response
      @count += 1
      EventMachine.stop if @topic_ids.length == @count
    }
  end
}


chunlea #0 2013-10-19

This way of measuring time isn't accurate, is it...
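The reply does not say what a better measurement would look like. One option, sketched here purely as an assumption, is to time a fixed number of fetches with a monotonic clock instead of printing wall-clock timestamps around a loop that only ends on an error. The helper name `pages_per_second` is hypothetical.

```ruby
# Hypothetical helper: run the given block `count` times and
# return the observed throughput in calls per second.
def pages_per_second(count)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  count.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  count / elapsed
end

# Possible usage (commented out because it needs network access):
# require 'open-uri'
# rate = pages_per_second(20) { URI.open('http://www.example.com') { |f| f.read } }
# puts format('%.1f pages/s', rate)
```

Timing a fixed batch keeps the result independent of when, or whether, the server eventually returns an error.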

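xiaogui #1 2013-10-19

The 403 is probably the other side imposing restrictions. For what it's worth, I once crawled 6 million pages with 'nokogiri' and 'open-uri' without any problems.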
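What kind of restriction the site applies is not known from this thread. A minimal sketch, assuming the server rejects requests that lack a browser-like User-Agent (only one of several possible causes), shows how open-uri can send that header and how the status code can be read from `OpenURI::HTTPError`. The helper name `fetch_with_user_agent` is hypothetical.

```ruby
require 'open-uri'

# Hypothetical helper: fetch `url` while sending the given User-Agent.
# Returns [status_code, body]; body is nil when the server answers with an error.
def fetch_with_user_agent(url, user_agent)
  # String keys in the options hash are sent as HTTP request headers.
  URI.open(url, 'User-Agent' => user_agent) do |f|
    [f.status.first, f.read]
  end
rescue OpenURI::HTTPError => e
  # e.io is the error response; its status looks like ["403", "Forbidden"].
  [e.io.status.first, nil]
end
```

Returning the status instead of letting the exception propagate lets a crawler log a 403 and carry on with the next URL.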

sevk #2 2013-10-19

The faster you go, the more likely the server is to ban you.
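If being banned is the concern, one possible approach is to cap the request rate on the client side. The sketch below is an illustration only: the `Throttle` class and the interval value are assumptions, not something proposed in the thread.

```ruby
# Hypothetical helper: enforces a minimum delay between consecutive calls.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval # seconds that must pass between two calls
    @last_call = nil
  end

  # Sleeps if the previous call was too recent, then runs the block
  # and returns its result.
  def call
    if @last_call
      wait = @min_interval - (now - @last_call)
      sleep(wait) if wait > 0
    end
    @last_call = now
    yield
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Possible usage (commented out because it needs network access):
# require 'open-uri'
# throttle = Throttle.new(0.5) # at most about two requests per second
# page = throttle.call { URI.open('http://www.example.com') { |f| f.read } }
```

A suitable interval depends on the target site; the value above is arbitrary.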

crazyjin #3 2013-10-19

#2 @xiaogui Could you describe what machine and method you used, and how fast it was?

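crazyjin #4 2013-10-19

#3 @sevk Browsers are very fast: they use HTTP/1.1 and run several connections at the same time. My guess is that our open establishes a new connection for every page it downloads, which the other side's server certainly can't put up with. I'm wondering whether there is some agent that fetches pages over HTTP/1.1.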
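Whether connection setup is really the bottleneck is crazyjin's guess and is not verified here. If it is, one standard-library option is `Net::HTTP.start`, which keeps a single HTTP/1.1 connection open for every request made inside its block, as long as the server allows keep-alive. The helper name `fetch_all` and its parameters are hypothetical.

```ruby
require 'net/http'

# Hypothetical helper: fetch several paths from one host over a single
# persistent connection. Returns a hash of path => response body.
def fetch_all(host, port, paths)
  bodies = {}
  # Net::HTTP.start opens one TCP connection and reuses it for all
  # requests in the block, provided the server permits keep-alive.
  Net::HTTP.start(host, port) do |http|
    paths.each do |path|
      response = http.get(path)
      bodies[path] = response.body if response.is_a?(Net::HTTPSuccess)
    end
  end
  bodies
end

# Possible usage (commented out because it needs network access):
# pages = fetch_all('www.example.com', 80, ['/', '/index.html'])
```

This only removes the per-request connection setup; the requests are still issued one after another.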

jjym #5 2013-10-19

Use mechanize. If you need speed, then look into em-http.

as181920 #6 2013-10-19

Piggybacking on this thread: more and more page content is now loaded through JS. How do you crawl that?

crazyjin #7 2013-10-19

#6 @jjym I skimmed the introduction, and em-http is exactly what I was looking for. Thanks.

xiaogui #8 2013-10-19

#4 @crazyjin I just used my own MacBook and didn't time it. It wasn't fast, but it was nowhere near the 60-odd seconds you describe.

xiaogui #9 2013-10-19

#4 @crazyjin Here is the original code, with some simple logic in it

# encoding: utf-8
desc 'Fetch file_record'
task :fetch_file_record => :environment do
  require 'rubygems'
  require 'nokogiri'
  require 'mechanize'
  require 'open-uri'
  require 'json'

  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'

  base_path = 'http://www.example.com'

  categories = Category.all

  categories.each do |category|
    continue = true

    category_path = '/all/' + category.path + '/'
    page_num = 1

    # Walk the category's pages until the site answers with a 404 page.
    while continue
      url = base_path + category_path + page_num.to_s
      begin
        doc = Nokogiri::HTML.parse(agent.get(url).body, nil, 'utf-8')
      rescue Mechanize::ResponseCodeError => e
        # Mechanize raises on error status codes; the error carries the page it received.
        doc = e.page
      end

      if doc.title == '404 Not Found'
        continue = false
        puts 'Got an empty page'
        puts 'Empty page URL: ' + url
      else
        url_list = doc.css("#content a")
        i = 0
        url_list.each do |item|
          href = item[:href]
          # Skip anchors without an href, the home page link, and links inside this category.
          if href && href != 'http://www.example.com' && !href.index(category_path)
            i += 1
            text = item.text
            # Drop the trailing "[category]" part of the link text, if there is one.
            category_name_begin = text.rindex('[')
            if category_name_begin
              title = text[0, category_name_begin]
            else
              title = text
            end

            file_record = FileRecord.create(
                :title =>  title,
                :category_id => category.id,
                :url  =>  href,
                :status  =>  0,
                :page_num =>  page_num,
                :page_internal_num => i
            )

            puts  category.path + '_file_record.id:' + file_record.id.to_s
          end
        end
      end

      puts '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'

      page_num += 1

    end

    puts category.path + ' finished'

  end
end

crazyjin #10 2013-10-23

#6 @jjym Could you take a look at this code for me? Why is it so slow? The headers in options were copied from Chrome.

#! /usr/local/bin/ruby

require 'eventmachine'
require 'em-http'
require 'nokogiri'

@count = 0 # requests finished so far, successful or not
@topic_ids = %w{14854 11168 14769 14875}
@conn = EventMachine::HttpRequest.new('http://www.ruby-china.org')
@options = {
  :redirects => 5,
  :keepalive => true,
  :path => "",
  :head => {
    'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-encoding' => 'gzip,deflate,sdch',
    'accept-language' => 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
    'cache-control' => 'max-age=0',
    'user-agent' => 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36'
  }
}

EventMachine.run{
  @topic_ids.each do |id|
    @options[:path] = "/topics/#{id}"
    req1 = @conn.get @options
    req1.errback { |req|
      puts 'errback..............'
      p req
      @count += 1
      EventMachine.stop if @topic_ids.length == @count
    }
    req1.callback {
      puts req1.req.path
      # doc = Nokogiri::HTML(req1.response)
      p req1.response_header
      @count += 1
      EventMachine.stop if @topic_ids.length == @count
    }
  end
}