Share: Kimurai - a web scraping framework written in Ruby

Rei · 2018-08-12 · last reply by cao7113 on 2019-06-15 · 7993 views

Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites

It borrows a lot from Scrapy, but here is where it does better than Scrapy:

Scrapy didn't support out of box easy scraping of Javascript rendered websites. It has Splash https://github.com/scrapy-plugins/scrapy-splash (Their own Headless browser, special for Scrapy) but even if so, it's not that easy to interact with Splash browser (click buttons, filling forms, etc.), you have to provide Lua (yes, not Python) script for that.

https://www.reddit.com/r/ruby/comments/95y0ru/kimurai_is_a_modern_web_scraping_framework/e3xnin7/

I have no scraping needs myself, so I haven't used it. Anyone who does can give it a try.

15 likes

alixiaomiao #0 2018-08-12

I'm using it to crawl hexpm right now.

alixiaomiao #1 2018-08-12

nokogiri still feels the most natural to use. With beautifulsoup and lxml I have to look up the API every time.

gaicitadie #2 2018-08-12

Why has nobody wrapped up a jQuery-style selector syntax? A big reason people use Python for scraping is pyquery. Ruby needs an rbquery too.

1 like

hooopo #3 2018-08-12

In reply to gaicitadie:

nokogiri supports CSS selectors.

1 like

cqcn1991 #4 2018-08-12

Will give it a try when I get the chance~

imwildcat #5 2018-08-12

Honestly, xpath is pretty good too.

Rei #6 2018-08-12

In reply to imwildcat:

xpath can do some things that CSS selectors cannot, such as matching text nodes.
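
To illustrate the text-node point, here is a minimal sketch. It uses REXML, which ships with Ruby, rather than Nokogiri, so that it runs without extra gems. The sample markup and variable names are made up for illustration. The same XPath expression can be passed to Nokogiri's `xpath` method.

```ruby
require "rexml/document"

# Hypothetical markup: the text we want sits directly inside <p>,
# next to a child element, so no element selector can isolate it.
markup = "<div><p>Price: <b>42</b> USD</p></div>"
doc = REXML::Document.new(markup)

# text() selects the text nodes that are direct children of <p>;
# the text inside <b> is not included.
text_nodes = REXML::XPath.match(doc, "//p/text()")
direct_text = text_nodes.map { |node| node.value.strip }

puts direct_text.inspect
```

A CSS selector such as `p` would return the whole element, including the nested `<b>`, and the caller would have to separate the pieces afterwards.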

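luikore #7 2018-08-12

Using Scrapy has been nothing but tears. A few pain points:

Using the browser (Splash) together with proxies is a huge pain. If you want some requests to go through a proxy and others not, you have to control it with the hidden dont_proxy option.
The web service (Scrapyd) is just a single-machine service and gives no thought to multi-machine deployment (usually one machine is enough, but scaling out means building a lot of things yourself).
Scrapy's parse callback design is really not flexible. It is hard for the callback to receive parameters from the request, and if you try a lambda you run into a pile of restrictions (for example, no statements inside), so you end up splitting out yet another class...
Python's string handling has too many restrictions, regular expressions are extremely inconvenient to use, and some urllib helper functions change a lot between versions.

Given all that, I think kimurai is quite good (even if not many people use it).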
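
For contrast with the callback complaint above, here is a small sketch of how a Ruby lambda can carry per-request parameters and contain ordinary statements. `Request`, `build_request`, and the `category` parameter are hypothetical names for illustration only. They are not Kimurai's or Scrapy's API.

```ruby
# Hypothetical request object: a URL plus the callback to run on its response.
Request = Struct.new(:url, :callback)

# Build a request whose callback closes over an extra parameter (category).
def build_request(url, category)
  callback = lambda do |body|
    # A Ruby lambda may contain several statements, including assignments.
    title = body[/<title>(.*?)<\/title>/m, 1]
    title = title.strip if title
    { url: url, category: category, title: title }
  end
  Request.new(url, callback)
end

request = build_request("https://example.com/books/1", "books")

# Simulate a framework handing the downloaded body to the callback.
item = request.callback.call("<html><title> Ruby Book </title></html>")
puts item.inspect
```

The callback needs no extra class because the closure keeps `url` and `category` in scope.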

4 likes

duobei #8 2018-08-13

It could serve as a complement to Scrapy.

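nightire #9 2018-08-13

In reply to gaicitadie:

Back in 2011, the first time I tried nokogiri, I exclaimed: "Wow, this thing feels just like jQuery...". And yet today I still see you complaining that nobody has wrapped up a jQuery selector syntax. Evidently my earlier assessment of you was not biased at all.

2 likes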

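gaicitadie #10 2018-08-13

In reply to nightire:

Having a CSS selector is not enough to be called jQuery-like. If that alone counts as "feels just like jQuery", all I can say is that your imagination is running wild.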

chrishyman #11 2018-08-13

@gaicitadie

1 like

alixiaomiao #12 2018-08-13

In reply to gaicitadie:

So what features do you think it should have?

luikore #13 2018-08-13

before(), after(), prependTo(), html(), text(), attr(), ... are not there. The Nokogiri API is not as easy to remember as jQuery's.

Rei #14 2018-08-13

In reply to luikore:

"Why has nobody wrapped up a jQuery-style selector syntax"

The earlier comment was only about selector syntax.

fredwu #15 2018-08-13

In reply to alixiaomiao:

For crawling hexpm, why not use this: https://github.com/fredwu/crawler

nightire #16 2018-08-13

In reply to gaicitadie:

We are talking about scraping here, not DOM manipulation. Apart from selector syntax, where else do you expect Nokogiri to be like jQuery?

nine #17 2018-08-14

When I tried building scrapers in Ruby before, the biggest problem was character encoding.
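
As a rough illustration of the encoding issue, here is a sketch of converting a non-UTF-8 response body to UTF-8 with Ruby's built-in `String#force_encoding` and `String#encode`. The helper name `to_utf8` is hypothetical, and the sketch assumes the page's charset (GBK here) is already known, for example from the Content-Type header.

```ruby
# Convert raw response bytes in a known charset to UTF-8.
# Bytes that cannot be converted are replaced with "?" instead of raising.
def to_utf8(bytes, charset)
  bytes.dup.force_encoding(charset)
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Simulate a GBK-encoded body arriving as untagged binary data.
# "\u4E2D\u6587" is the two-character string for "Chinese (language)".
raw_body = "\u4E2D\u6587".encode("GBK").b

page = to_utf8(raw_body, "GBK")
puts page
puts page.encoding
```

Tagging the bytes with their real charset before transcoding is the key step. Calling `encode("UTF-8")` on a binary-tagged string would otherwise fail or produce garbage.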

lucifer #18 2018-08-18

I'm using https://github.com/gocolly/colly

1 like

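JGpirateKing #19 2018-08-18

lol... it was forced to be taken down. The project borrowed some code the author wrote half a year ago while freelancing for a company. See here for details: https://github.com/vfreefly/kimurai#repository-was-removed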

Rei #20 2018-08-18

In reply to JGpirateKing:

Damn

2 likes

Post #21 has been deleted

JGpirateKing #22 2018-08-19

In reply to Rei:

however... there are plenty of forks here: https://github.com/vfreefly/kimurai/network/members

1 like

jonnoj #23 2018-08-19

The author's latest update:

Kimurai will be reopen soon

and will keep the same design and features. I have plans to rewrite a few parts of a framework so it will be 100% open source without any doubts. Also, there is work to do to allow run multiple crawlers from inside a single Ruby process (run crawlers using background jobs). And, there will be added "mechanize-only" mode (mechanize engine without Capybara dependency).

3 likes