Open Source Projects · Is there a reasonably mature web crawler framework in Ruby?

vus520 · October 7, 2013 · Last reply by pathbox on May 21, 2015 · 18310 views

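These are my current requirements:

Crawling, content extraction, downloading, and storage
Support for a proxy library and a proxy pool, ideally with second-level proxy support, able to scrape proxy lists on its own and verify that the proxies work
Distributed operation: a master server can distribute data to worker servers and collect the results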
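
To make the proxy pool requirement concrete, here is a minimal sketch, not an existing gem. `ProxyPool`, `Proxy`, `refresh!`, and `next_proxy` are hypothetical names, and the probe URL in the default check is an assumption. The checker is injectable so validation can be stubbed for offline runs.

```ruby
require "net/http"
require "uri"

# Hypothetical proxy pool: validates candidates, then hands them out round-robin.
class ProxyPool
  Proxy = Struct.new(:host, :port)

  # checker: any callable that takes a Proxy and returns true if it is usable.
  def initialize(candidates, checker: nil)
    @candidates = candidates
    @checker = checker || method(:reachable?)
    @live = []
    @cursor = 0
  end

  # Re-validate every candidate and keep only the ones that pass the check.
  def refresh!
    @live = @candidates.select { |proxy| @checker.call(proxy) }
    @cursor = 0
    self
  end

  # Round-robin over the validated proxies; nil when none are usable.
  def next_proxy
    return nil if @live.empty?
    proxy = @live[@cursor % @live.size]
    @cursor += 1
    proxy
  end

  def size
    @live.size
  end

  private

  # Default check: a HEAD request through the proxy with short timeouts.
  # The probe URL is an assumption; use a page you are allowed to hit.
  def reachable?(proxy, probe: URI("http://example.com/"))
    http = Net::HTTP.new(probe.host, probe.port, proxy.host, proxy.port)
    http.open_timeout = 3
    http.read_timeout = 3
    http.head(probe.path).is_a?(Net::HTTPSuccess)
  rescue StandardError
    false
  end
end

# Offline usage with a stubbed checker: pretend only port 8080 works.
candidates = [ProxyPool::Proxy.new("10.0.0.1", 8080),
              ProxyPool::Proxy.new("10.0.0.2", 3128)]
pool = ProxyPool.new(candidates, checker: ->(p) { p.port == 8080 }).refresh!
puts pool.next_proxy.host
```

A long-running service would probably call `refresh!` on a timer and drop a proxy as soon as a real request through it fails, but that policy is left out here.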

I saw someone in the community recommend Mr. Huang's work. I took a look, and content extraction shouldn't be much of a problem with it: https://github.com/code4craft/webmagic

As for the distributed and proxy parts, does Ruby have any reasonably stable gems? The goal is a stable, long-running data collection service.

10 likes

tiseheaini #0 October 7, 2013

I don't have much experience with crawlers. Do you really need a framework to write one? Please enlighten me.

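doun #1 October 7, 2013

Some people say Nokogiri works well for scraping, but only when you target specific sites. For a general-purpose crawler, don't you have to worry about performance with Ruby?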

046569 #2 October 7, 2013

#2 @doun Trust that it is fast enough. At that point the biggest bottleneck is probably the network.

1 like

juanito #3 October 7, 2013

https://github.com/hooopo/direct_web_spider

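vus520 #4 October 7, 2013

#1 @tiseheaini The point of a framework is to turn everything other than the site-specific extraction rules (scheduling, storage, proxies) into reusable modules, so that a single framework can serve several different scraping needs.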
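
A rough sketch of that split, with made-up names (`Crawler`, `fetcher`, `extractor`, `store` are all hypothetical and not taken from any gem). Only the extractor is site-specific; fetching and storage are swappable callables.

```ruby
# Hypothetical skeleton: the crawler only wires the modules together.
class Crawler
  def initialize(fetcher:, extractor:, store:)
    @fetcher = fetcher
    @extractor = extractor
    @store = store
  end

  # Fetch each URL, run the site-specific extractor, hand the result to storage.
  def run(urls)
    urls.each do |url|
      body = @fetcher.call(url)
      next if body.nil? # fetch failed or page missing; skip it
      @store.call(url, @extractor.call(body))
    end
  end
end

# In-memory stand-ins so the example runs without network or database.
pages = { "http://example.com/a" => "<title>A</title>" }
fetcher = ->(url) { pages[url] }                            # a real one would do HTTP, possibly via a proxy
extractor = ->(body) { body[/<title>(.*?)<\/title>/, 1] }   # the only site-specific part
results = {}
store = ->(url, data) { results[url] = data }               # a real one would write to a database

Crawler.new(fetcher: fetcher, extractor: extractor, store: store)
       .run(pages.keys + ["http://example.com/missing"])
p results
```

Supporting a second site then means writing a new extractor and leaving the rest alone.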

1 like

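vus520 #5 October 7, 2013

#3 @046569 Right, performance isn't the concern for now. Network and compute bottlenecks can be broken up with a distributed setup and a task dispatch system.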
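
As a single-process stand-in for task dispatch, the same idea can be sketched with a shared queue and worker threads. `crawl_in_parallel` is a hypothetical helper; in a truly distributed setup the in-memory `Queue` would be replaced by some shared job store, which is an assumption beyond what this thread specifies.

```ruby
# Workers pull URLs from a shared queue until it is empty.
# The block passed in does the actual fetching (network-bound work).
def crawl_in_parallel(urls, workers: 4, &fetch)
  queue = Queue.new
  urls.each { |url| queue << url }
  results = Queue.new # thread-safe collection of [url, result] pairs

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, fetch.call(url)]
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Offline usage: the block stands in for an HTTP request.
p crawl_in_parallel(%w[a b c d e], workers: 3) { |url| url.upcase }
```

Because the workers mostly wait on I/O, threads help here even under a global interpreter lock, which fits the point above that the network is the main bottleneck.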

huacnlee #6 October 8, 2013

Anemone web-spider framework https://github.com/chriskite/anemone

Multi-threaded design for high performance
Tracks 301 HTTP redirects
Built-in BFS algorithm for determining page depth
Allows exclusion of URLs based on regular expressions
Choose the links to follow on each page with focus_crawl()
HTTPS support
Records response time for each page
CLI program can list all pages in a domain, calculate page depths, and more
Obey robots.txt
In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
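
To illustrate two of the listed features (BFS page depth and regex-based URL exclusion), here is a small pure-Ruby sketch. It is not Anemone's code or API: `page_depths`, `skip`, and `depth_limit` are hypothetical names, and the block stands in for "fetch the page and return its links".

```ruby
# Breadth-first walk from a start URL, recording the depth at which each
# page is first reached. The block returns the outgoing links of a URL.
def page_depths(start, skip: nil, depth_limit: nil, &links)
  depths = { start => 0 }
  queue = [start]

  until queue.empty?
    url = queue.shift
    next if depth_limit && depths[url] >= depth_limit # do not expand further

    links.call(url).each do |link|
      next if depths.key?(link)      # already seen at an equal or smaller depth
      next if skip && link =~ skip   # excluded by pattern
      depths[link] = depths[url] + 1
      queue << link
    end
  end

  depths
end

# Offline usage: a tiny in-memory site instead of real HTTP requests.
site = {
  "/"  => ["/a", "/b"],
  "/a" => ["/c", "/"],
  "/b" => ["/c", "/admin"],
  "/c" => ["/d"]
}
p page_depths("/", skip: /admin/) { |url| site.fetch(url, []) }
```

BFS guarantees that the recorded depth is the shortest link distance from the start page, which is why it suits depth limits.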

2 likes

cisolarix #7 October 8, 2013

#7 @huacnlee It doesn't seem to get many updates anymore?

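vus520 #8 October 8, 2013

#7 @huacnlee That's pretty much what I need, haha! Thanks.

I see a lot of pull requests sitting there without updates. I wonder whether anyone who forked it maintains an active version.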

huacnlee #9 October 8, 2013

#8 @cisolarix I haven't used it myself yet, but judging by the description it is quite feature-rich.

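changwu #10 October 8, 2013

#9 @vus520 Anemone seems to be meant for whole-site crawling. For targeted crawling of specific sites, I'd suggest using https://github.com/hooopo/direct_web_spider as a reference and building your own. We built on that framework for a while and got by well enough. That said, none of the frameworks above have been maintained for a long time, so you're better off rolling your own. mechanize + Nokogiri can handle everything.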