Gem: a simple spider for downloading website documentation

wongyouth · June 1, 2013 · last reply by simlegate on August 1, 2013 · 3804 views

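I couldn't find a good spider for downloading a whole website, so I wrote a small, simple one. Its main advantage is speed, and it works well for mirroring technical documentation sites. Please don't abuse it, though: it can bring a site down!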

https://github.com/wongyouth/speed_spider

Installation

Install it with RubyGems:

gem install 'speed_spider'

Usage

Usage: spider [options] start_url

options:
    -S, --slient                     silent output
    -D, --dir String                 directory to save downloaded files to, "download" by default
    -b, --base_url String            any URL that does not start with base_url will not be saved
    -t, --threads Integer            number of threads used to fetch pages, 4 by default
    -u, --user_agent String          value sent in the User-Agent request header
    -d, --delay Integer              delay between requests
    -o, --obey_robots_text           obey the robots exclusion protocol
    -l, --depth_limit                limit the depth of the crawl
    -r, --redirect_limit Integer     number of times HTTP redirects will be followed
    -a, --accept_cookies             accept cookies from the server and send them back
    -s, --skip_query_strings         skip any link with a query string, e.g. http://foo.com/?u=user
    -H, --proxy_host String          proxy server hostname
    -P, --proxy_port Integer         proxy server port number
    -T, --read_timeout Integer       HTTP read timeout in seconds
    -V, --version                    show version
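
To make the -t and -d options a bit more concrete, here is a rough Ruby sketch of a worker pool that fetches pages with a fixed number of threads and an optional pause after each request. It is not taken from speed_spider's source: the method name fetch_all and the block that performs the actual fetch are hypothetical, chosen so the sketch runs without network access.

```ruby
# Hypothetical sketch, not speed_spider's actual implementation.
# Processes every URL with a fixed pool of worker threads (cf. -t)
# and sleeps `delay` seconds after each request (cf. -d).
# The caller's block does the real fetching and returns the page body.
def fetch_all(urls, threads: 4, delay: 0, &fetcher)
  queue = Queue.new
  urls.each { |url| queue << url }
  # One stop marker per worker so every thread terminates cleanly.
  threads.times { queue << :done }

  results = {}
  lock = Mutex.new

  workers = Array.new(threads) do
    Thread.new do
      while (url = queue.pop) != :done
        body = fetcher.call(url)
        # Guard the shared hash against concurrent writes.
        lock.synchronize { results[url] = body }
        sleep(delay) if delay > 0
      end
    end
  end

  workers.each(&:join)
  results
end
```

A caller would pass a block that performs the HTTP request, for example one built on Net::HTTP. A non-zero delay is the polite choice when crawling someone else's site.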

Example

spider http://twitter.github.io/bootstrap/

It will download all files within the same domain as twitter.github.io and save them to download/twitter.github.io/.
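
The layout described above (download directory, then host name, then URL path) could be sketched roughly as follows. The helper name local_path_for is hypothetical, and storing directory-style URLs as index.html is an assumption of this sketch, not something confirmed from the gem's source.

```ruby
require "uri"

# Hypothetical helper: map a URL to a file path under the download directory.
# Assumption: a URL path ending in "/" (or an empty path) is stored as
# index.html inside the corresponding directory.
def local_path_for(url, dir: "download")
  uri = URI.parse(url)
  path = uri.path.to_s
  path = "/" if path.empty?
  path += "index.html" if path.end_with?("/")
  # File.join collapses the duplicate separator between host and path.
  File.join(dir, uri.host, path)
end
```

The dir keyword plays the role of the -D option here, so passing dir: "docs" would place the same tree under docs/ instead.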

spider -b http://ruby-doc.org/core-2.0/ http://ruby-doc.org/core-2.0/

It will only download URLs that start with http://ruby-doc.org/core-2.0/. Note that asset files such as images, CSS, JS, and fonts are not subject to the base_url rule.
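
As an illustration of the base_url rule, the check might look something like the sketch below. The method name save_url? and the list of asset extensions are assumptions made for this example; the gem's real list and logic may differ.

```ruby
require "uri"

# Hypothetical predicate: should this URL be saved?
# Pages must start with base_url; assets are exempt from that rule.
# The default extension list is an assumption of this sketch.
def save_url?(url, base_url, asset_exts: %w[.png .jpg .jpeg .gif .css .js .woff .ttf .svg])
  return true if base_url.nil?              # no -b given: nothing is filtered
  return true if url.start_with?(base_url)  # inside the allowed prefix

  # Outside the prefix: keep the URL only if it looks like an asset file.
  ext = File.extname(URI.parse(url).path.to_s).downcase
  asset_exts.include?(ext)
end
```

With base_url set to http://ruby-doc.org/core-2.0/, a page under another path such as /stdlib-2.0/ would be skipped, while a stylesheet under /css/ would still be saved so that the downloaded pages render properly.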

2 likes

yesmeck #0 June 1, 2013

wget can handle this, can't it?

$ wget -m -p -E -k -K -np http://twitter.github.io/bootstrap/

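3 likes

wongyouth #1 June 2, 2013

@yesmeck I didn't know wget could be used this way. I looked into it a little and it really is powerful! By comparison, though, it downloads somewhat more slowly because it is single-threaded.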

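zlx_star #2 June 2, 2013

If you need multiple threads: axel -o ~/Downloads/bootstrap http://twitter.github.io/bootstrap/

1 like

wongyouth #3 June 2, 2013

@zlx_star That looks like it is meant for downloading a single large file in segments, so the use case is a bit different.

simlegate #4 August 1, 2013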

axel -n 1000 ************************


5 replies in total
