Rails single-machine data scraping: a summary of performance improvements

michael_roshen · December 10, 2014 · Last by neocanable replied at February 24, 2015 · 9644 hits

Topic has been selected as the excellent topic by the admin.

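I wrote some code to scrape data and found it wasn't very fast. Below are the optimizations I made. They sped things up a lot, but it still doesn't feel fast enough. Do you know a faster approach? Please share. Final single-machine result: about 40,000 records per hour on Ubuntu 14, 2 cores, 8 GB RAM.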

Version 1

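There are over a million Teams, so a single loop over them is too slow. Fetching and parsing the data with mechanize takes about 0.4 s per team. The news items are committed in a single transaction, but the database write is still the most time-consuming step.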

Team.find_each do |team|
  begin
    team_id = team.id
    team_name = team.try(:name)
    puts team_name
    news = FetchNews.get_touch_news(team_name)
    # Wrap all inserts for this team in one transaction (a single commit)
    TeamNews.transaction do
      news.each do |params|
        TeamNews.create(params.merge(team_id: team_id))
      end
    end
  rescue => e
    puts "something wrong with team_id #{team_id}: #{e}"
  end
end
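To see which step dominates (the post puts the fetch at about 0.4 s and names the database write as the slowest part), each step can be timed separately. This is only a sketch: `time_steps` is a hypothetical helper, and the two lambdas are placeholders standing in for the `FetchNews.get_touch_news` call and the `TeamNews` writes.

```ruby
require "benchmark"

# Time each named step and return { name => seconds }.
# The steps passed in are stand-ins, not the code from the post.
def time_steps(steps)
  steps.each_with_object({}) do |(name, step), timings|
    timings[name] = Benchmark.realtime { step.call }
  end
end

timings = time_steps(
  fetch: -> { sleep 0.02 },  # placeholder for the network fetch
  store: -> { sleep 0.01 }   # placeholder for the database write
)
timings.each { |name, secs| puts format("%-6s %.3fs", name, secs) }
```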

Version 2

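Store the scraped data temporarily in a file. Once it holds more than 10,000 rows, write them all at once with LOAD DATA LOCAL INFILE, which takes only 0.00x s, then delete the temp file and move on to the next batch, and so on. This saves a lot of database operations and improves performance considerably. Notes on using LOAD DATA LOCAL INFILE:

Uniqueness: to make sure the bulk import does not create duplicates, add a unique constraint on the title column in the database. ALTER TABLE table_names ADD unique(title);
Enable loading local files to the server. Command line: mysql -h ip -u user_name -p --local-infile=1. Rails: set local_infile: true in config/database.yml. Otherwise you get the error: The used command is not allowed with this MySQL version: LOAD DATA LOCAL INFILE
References:
http://dev.mysql.com/doc/refman/5.6/en/load-data.html
http://stackoverflow.com/questions/19819206/load-data-infile-error-1064
http://stackoverflow.com/questions/21256641/enabling-local-infile-for-loading-data-into-remote-mysql-from-rails

There is not much room to speed up the fetch itself (FetchNews.get_touch_news), since network transfer time is hard to reduce, so I'm leaving that step alone for now. The next step is to split the companies (the Team records) into groups and start several processes at once. I used resque: the ids are stored in resque in groups and then read back by the workers.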

file_path = "#{Rails.root}/fetch_news.txt"
batch_size = 10_000 # flush to MySQL once this many rows are buffered
row_count = 0

# Bulk-load the temp file into MySQL, then delete it
flush = lambda do
  next unless File.exist?(file_path)
  ActiveRecord::Base.connection.execute(<<-SQL)
    LOAD DATA LOCAL INFILE '#{file_path}' INTO TABLE db.team_news
    FIELDS TERMINATED BY ';' LINES TERMINATED BY '\\n' (title,url,date,team_id);
  SQL
  File.delete(file_path)
  row_count = 0
end

File.delete(file_path) if File.exist?(file_path)
Team.find_each do |team|
  begin
    team_id = team.id
    team_name = team.try(:name)
    puts team_name
    news = FetchNews.get_touch_news(team_name)
    File.open(file_path, "a") do |file|
      news.each do |a_new|
        file.puts([a_new[:title], a_new[:url], a_new[:date], team_id].join(";"))
      end
    end
    row_count += news.size
  rescue => e
    puts "something wrong with team_id #{team_id}: #{e}"
  end
  flush.call if row_count >= batch_size
end
flush.call # load whatever is left over
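One thing the temp file format above does not handle is a title that itself contains `;`. A possible way around that, sketched here with Ruby's standard `csv` library, is to let the writer quote such fields. `append_rows` is a hypothetical helper and the sample rows are made up; if the file is written this way, the LOAD DATA statement would presumably also need an `OPTIONALLY ENCLOSED BY '"'` clause, which the post's statement does not have.

```ruby
require "csv"
require "tempfile"

# Append rows to a ';'-separated file. CSV quotes any field that
# contains the separator, a double quote, or a newline.
def append_rows(path, rows)
  CSV.open(path, "a", col_sep: ";") do |csv|
    rows.each { |row| csv << row }
  end
end

tmp = Tempfile.new(["fetch_news", ".txt"])
append_rows(tmp.path, [
  ["Plain title", "http://example.com/1", "2014-12-10", 1],
  ["Title; with separator", "http://example.com/2", "2014-12-10", 1]
])
puts File.read(tmp.path)
```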

Version 3

Use the process id to tell the temp files apart; everything else is the same as above. With 6 workers running, throughput went up 6x.

class TeamFetchNewsWorker
  @queue = :team_fetch_news_worker

  def self.perform(team_ids)
    pid = Process.pid # same value as $$
    file_path = "#{Rails.root}/fetch_news_#{pid}.txt"
    ...
  end
end

Version 4

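Use Ruby threads and give each worker a few of them. In the resque job the ids are stored as shown below, in 4 groups, because I want to start 4 threads (I later tested with 10 and ran out of database connections, so...). This gave another 4x speedup!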

[[1,2,3,4...],[5,6,7,8...],[9,10,11,12...],..]
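A nested layout like this could be produced with `each_slice`. The sketch below is an assumption about how the grouping might be done, not the author's code: `build_jobs` and both sizes are illustrative, and the commented-out `Resque.enqueue` line shows where each job would be queued.

```ruby
# Split a flat list of ids into jobs, each job holding up to `threads`
# groups of at most `group_size` ids. Both sizes are illustrative.
def build_jobs(ids, group_size:, threads:)
  ids.each_slice(group_size).each_slice(threads).to_a
end

jobs = build_jobs((1..20).to_a, group_size: 4, threads: 4)
jobs.each { |team_ids_group| p team_ids_group }
# In the post each job would then be enqueued, for example:
#   Resque.enqueue(TeamFetchNewsWorker, team_ids_group)
```

With 20 ids this yields one job with four groups and a second job holding the remaining group.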

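class TeamFetchNewsWorker
  @queue = :team_fetch_news_worker

  def self.perform(team_ids_group)
    pid = Process.pid
    threads = []
    team_ids_group.each_with_index do |ids, group_index|
      # One temp file per process and per thread
      file_path = "#{Rails.root}/fetch_news_#{pid}_#{group_index}.txt"
      File.delete(file_path) if File.exist?(file_path)
      threads << Thread.new do
        # Check out a connection for this thread and return it to the pool when done;
        # the pool size in database.yml must cover the number of threads
        ActiveRecord::Base.connection_pool.with_connection do
          Team.find(ids).each_with_index do |team, index|
            ...
          end
        end
      end
    end
    threads.each(&:join)
  end
end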
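Since 10 threads exhausted the database connections, one option is a fixed number of threads pulling work from a shared queue, so the thread count stays at or below the connection pool size no matter how many ids there are. This is a different arrangement from the one-thread-per-group code above, sketched in plain Ruby: `process_in_threads` is a hypothetical helper and the block body is a placeholder for the real fetch.

```ruby
# Run the block over `items` with at most `thread_count` threads.
# Results come back in completion order. Items must not be nil or
# false, because a nil from the closed queue ends each thread's loop.
def process_in_threads(items, thread_count:, &work)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the queue is drained
  results = Queue.new

  threads = Array.new(thread_count) do
    Thread.new do
      while (item = queue.pop)
        results << work.call(item)
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end

fetched = process_in_threads((1..12).to_a, thread_count: 4) do |team_id|
  sleep 0.01 # placeholder for the network fetch
  [team_id, "news for #{team_id}"]
end
puts fetched.size
```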

Version 5

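Approach it from the business side: on the first run, tag each Team with a crawl level based on what the fetch returned, then set the crawl frequency by level. After this initialization a lot of time is saved. It does not improve raw performance, but it achieves the goal, because a large amount of useless fetching is avoided.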
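A rough sketch of what level-based scheduling could look like. The post does not say which levels, thresholds, or frequencies were used, so everything here (`FETCH_INTERVALS`, `level_for`, `due?`, and all the numbers) is hypothetical.

```ruby
# Hypothetical levels and intervals in seconds.
FETCH_INTERVALS = {
  high: 60 * 60,          # hourly
  normal: 24 * 60 * 60,   # daily
  low: 7 * 24 * 60 * 60   # weekly
}.freeze

# Assign a level from how many news items the last fetch returned.
def level_for(news_count)
  if news_count >= 10 then :high
  elsif news_count > 0 then :normal
  else :low
  end
end

# True when a team at `level` is due to be fetched again.
def due?(level, last_fetched_at, now = Time.now)
  now - last_fetched_at >= FETCH_INTERVALS.fetch(level)
end

last_run = Time.now - 2 * 60 * 60 # fetched two hours ago
puts due?(level_for(12), last_run) # a team with many results is due again
puts due?(level_for(0), last_run)  # a team with no results is not
```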

More blog posts: http://michael-roshen.iteye.com/blog/2164721 WeChat: ruby 程序员

18 likes

winnie #0 December 10, 2014

What if 1 Sidekiq process could do the work of 20 Resque or DelayedJob processes?

Try Sidekiq + JRuby.

huhongda #1 December 10, 2014

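Try multiple machines + sidekiq for asynchronous scraping: one machine writes the data while several machines fetch and parse it! Sidekiq's multithreading can make full use of the CPU!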

wppurking #2 December 10, 2014

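I don't think you need to squeeze that much out of a single server. With Sidekiq + Redis + Docker around, scaling horizontally is very convenient. For a similar workload we scaled out to 14 Docker machines for scraping and update 2 to 3 million records a day (that counts effective updates; the amount scraped is more than that).

The only annoyance is being flagged as a bot by the other side...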

2 likes

huhongda #3 December 10, 2014

@wppurking You can scrape through proxies: write a program that collects proxies, then iterate over the proxy IPs to fetch.

xiaogui #4 December 10, 2014

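Floor #3 @wppurking Gradually lower the crawl rate against a given site until you find the fastest rate that does not get you banned. Over time, work out the site's limits for a single IP and build them into the crawler. Then intelligently interleave several sites in one crawl sequence to raise both the success rate and the speed.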
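One way to approximate this search for a safe rate is a delay that backs off sharply when a request is blocked and shrinks slowly while requests succeed. This is a generic back-off sketch rather than anything described in the thread; the `AdaptiveDelay` class and all of its constants are assumptions.

```ruby
# Tracks a per-site delay in seconds: double it when blocked,
# shave off 10% after each success. All constants are illustrative.
class AdaptiveDelay
  attr_reader :delay

  def initialize(initial: 1.0, min: 0.1, max: 60.0)
    @delay = initial
    @min = min
    @max = max
  end

  def record_success
    @delay = [@delay * 0.9, @min].max
  end

  def record_blocked
    @delay = [@delay * 2, @max].min
  end
end

throttle = AdaptiveDelay.new(initial: 1.0)
throttle.record_blocked # delay doubles
throttle.record_success # delay shrinks by 10%
puts throttle.delay
```

A crawler would sleep for `throttle.delay` between requests to that site and call one of the two record methods after each response.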

rfei #5 December 10, 2014

Looks like I'm not the only one who often gets data by scraping.

michael_roshen #6 December 10, 2014

Floor #5 @xiaogui So clever, I'm going to report you.

michael_roshen #7 December 10, 2014

Floor #3 @wppurking Floor #2 @huhongda Floor #1 @winnie

Looks like Sidekiq is worth a try. Sidekiq KOs Resque

Resque:

Pros:

does not require thread safety (works with pretty much any gem out there); has no interpreter preference (you can use any ruby); Resque currently supports MRI 1.9.3 or later; loads of plugins.

Cons

runs a process per worker (uses more memory); does not retry jobs (out of the box, anyway).

Sidekiq:

Pros

runs thread per worker (uses much less memory); less forking (works faster); more options out of the box.

Cons

[huge] requires thread-safety of your code and all dependencies. If you run thread-unsafe code with threads, you're asking for trouble; works on some rubies better than others (jruby and rubinius are recommended, efficiency on MRI is decreased due to GVL (global VM lock)).

rocLv #8 December 10, 2014

Floor #4 @huhongda Doesn't seem to be much use.

huhongda #9 December 10, 2014

@rocLv It should work~ I've used it before.

winnie #10 December 11, 2014

OP, last night I dreamed you told me the results after switching to JRuby + Sidekiq. Looking forward to it.

xiaogui #11 December 11, 2014

Floor #7 @michael_roshen With this kind of thing you have to consider ease of use, efficiency, and longevity.

michael_roshen #12 December 11, 2014

Floor #2 @huhongda How do you do multithreading with sidekiq? I couldn't find any material on it.

huhongda #13 December 11, 2014

@michael_roshen Take a look at the sidekiq source! From what I see in the source, it is multithreaded.

seeyoup #14 December 11, 2014

If Ruby could do asynchronous non-blocking IO the way node.js does, this would be much simpler.

michael_roshen #15 December 12, 2014

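Floor #11 @winnie Floor #10 @huhongda

Point me in the right direction. I tried sidekiq, but how do I write the multithreaded part? I read the source and still couldn't work out how to do multithreading, though there is a sidekiq_pro.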

hooopo #16 December 12, 2014

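This post gave me a lot to think about. Data scraping is really an ETL process too, and the parallelism problem ultimately comes down to how you do Map Reduce.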

Floor 17 has been deleted

wppurking #18 December 12, 2014

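Floor #5 @xiaogui That is roughly our overall approach now, but this feature has lower priority than the business side... So we slowly find the per-second crawl ceiling, and after that we increase volume by adding machines~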

wppurking #19 December 12, 2014

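Floor #4 @huhongda Rather than implementing proxy rotation and hunting for proxy IPs, I actually prefer what we do now: create cloud hosts through an API and deploy them automatically into the scraping server cluster.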

wppurking #20 December 12, 2014

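Floor #13 @michael_roshen "How do you do multithreading with sidekiq? I couldn't find any material on it." => Have a look at the introduction; it should solve your problem.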

michael_roshen #21 December 12, 2014

Floor #21 @wppurking Nothing there? Google Drive The app is currently unreachable.

wppurking #22 December 12, 2014

Floor #22 @michael_roshen You need a way past the firewall...

michael_roshen #23 December 12, 2014

Floor #23 @wppurking

winnie #24 December 12, 2014

https://github.com/mperham/sidekiq/wiki/Advanced-Options

In config/sidekiq.yml, the concurrency option is the number of threads.

xiaogui #25 December 12, 2014

Floor #19 @wppurking Right, handle the main requirements first.

michael_roshen #26 December 12, 2014

Floor #21 @wppurking Could you send me a copy? I can't climb over the wall, [email protected], tx!!!

michael_roshen #27 December 12, 2014

Floor #25 @winnie

sidekiq starts 25 threads by default, but perform still runs one job at a time. Maybe I'm writing it wrong; I'll look for more material.

cinic #28 December 12, 2014

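Floor #9 @rocLv It might be a problem with the proxy IPs. I have also scraped proxy IPs to scrape data with, and found that not every proxy IP works. So validate each scraped proxy IP first (use it to fetch one web page) and keep only the ones that succeed. The proxy IPs that remain are mostly usable; pick among them at random while scraping.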
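The validate-then-rotate idea might look roughly like this with Ruby's standard `net/http`. It is a sketch under assumptions: `proxy_works?` and `usable_proxies` are hypothetical helpers, the test URL and timeout are arbitrary, and the proxy addresses are made up. The usage example passes a stubbed check so it runs without network access.

```ruby
require "net/http"
require "uri"

# Default check: fetch `test_url` through the proxy and require an
# HTTP success response. Any error counts as a dead proxy.
def proxy_works?(proxy, test_url: "http://example.com/", timeout: 5)
  uri = URI(test_url)
  http = Net::HTTP.new(uri.host, uri.port, proxy[:host], proxy[:port])
  http.open_timeout = timeout
  http.read_timeout = timeout
  http.get(uri.request_uri).is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Keep only the proxies that pass the check. A custom check can be
# given as a block; otherwise proxy_works? is used.
def usable_proxies(candidates, &check)
  check ||= method(:proxy_works?)
  candidates.select { |proxy| check.call(proxy) }
end

candidates = [{ host: "10.0.0.1", port: 8080 }, { host: "10.0.0.2", port: 3128 }]
# Stubbed check so the example runs offline.
good = usable_proxies(candidates) { |proxy| proxy[:port] == 8080 }
proxy = good.sample # pick a random working proxy for the next request
puts proxy.inspect
```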

nxbtch #29 December 12, 2014

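How is the network? Network conditions have a fairly big effect on this rate. At 40,000 per hour, that is an average of 90 ms per item, so the network has to be better than that.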
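The 90 ms figure is just 3600 s divided by 40,000 items. With concurrent workers the time available per item grows proportionally, so the sketch below takes the concurrency as a parameter. The value 24 is only an illustration (6 workers times 4 threads from the post); the thread does not say how many were actually running for the 40,000/hour result.

```ruby
# Average wall-clock time available per item at a given hourly
# throughput, when `concurrency` items are processed at once.
def seconds_per_item(items_per_hour, concurrency: 1)
  3600.0 * concurrency / items_per_hour
end

puts seconds_per_item(40_000)                  # 0.09
puts seconds_per_item(40_000, concurrency: 24) # 2.16
```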

nxbtch #30 December 12, 2014

I tested it myself: scraping jandan.net with nodejs I got to around 10,000 per hour, on the lowest-tier Alibaba Cloud instance, with ping to jandan.net at around 20-30 ms.

michael_roshen #31 December 15, 2014

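Floor #30 @nxbtch Each request fetches one page of 10 items (some pages may have none) and takes around 120 ms. For business reasons there are database writes, which hurt performance; with only fetching and storing it could be a bit faster.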

huhongda #32 December 17, 2014

@michael_roshen Floor 25 gave the answer!

h_minghe #33 December 26, 2014

Great post, thumbs up.

dimi #34 January 29, 2015

VERY useful

neocanable #35 February 24, 2015

Actually, rather than using sidekiq, you could write your own multi-process script and just fork. A simple example that starts 10 processes:

(1..10).each do |i|
  fork do
     # do your job!
  end
end
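The fork loop above does not wait for its children. A slightly fuller sketch, combining it with the per-pid temp file idea from Version 3, could look like this. `run_in_processes` is a hypothetical helper, the id groups are made up, and writing ids to a file stands in for the real fetch. `fork` is not available on Windows.

```ruby
require "tmpdir"

# Fork one child per group. Each child writes to a file named after
# its own pid; the parent waits for every child before returning.
def run_in_processes(groups, dir)
  pids = groups.map do |group|
    fork do
      path = File.join(dir, "fetch_news_#{Process.pid}.txt")
      File.open(path, "a") { |f| group.each { |id| f.puts(id) } }
    end
  end
  pids.each { |pid| Process.wait(pid) }
  pids
end

dir = Dir.mktmpdir
pids = run_in_processes([[1, 2], [3, 4], [5, 6]], dir)
puts Dir.children(dir).sort
```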

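Here is a screenshot of the multi-process download at work

My machine is ubuntu12.04, i7, 16G. I did not run it through rake, to avoid loading too many resources. It topped out at 180 processes before the network gave out; with a reliable network it should go higher.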
