I couldn't find a good spider for downloading whole websites, so I wrote a simple, compact one myself. Its main strength is speed, which makes it handy for grabbing technical documentation sites. Don't abuse it, though: it can easily drag a site down!
https://github.com/wongyouth/speed_spider
Install it with RubyGems:
gem install speed_spider
Usage: spider [options] start_url
options:
    -S, --slient                    silent output
    -D, --dir String                directory for downloaded files to save to, "download" by default
    -b, --base_url String           any url not starting with base_url will not be saved
    -t, --threads Integer           threads to run for fetching pages, 4 by default
    -u, --user_agent String         words for request header USER_AGENT
    -d, --delay Integer             delay between requests
    -o, --obey_robots_text          obey robots exclusion protocol
    -l, --depth_limit               limit the depth of the crawl
    -r, --redirect_limit Integer    number of times HTTP redirects will be followed
    -a, --accept_cookies            accept cookies from the server and send them back?
    -s, --skip_query_strings        skip any link with a query string? e.g. http://foo.com/?u=user
    -H, --proxy_host String         proxy server hostname
    -P, --proxy_port Integer        proxy server port number
    -T, --read_timeout Integer      HTTP read timeout in seconds
    -V, --version                   Show version
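For example, to fetch a documentation site with 8 threads and save it under docs/ instead of the default download/ directory, an invocation like the following should work (the URL and flag values here are purely illustrative):
spider -t 8 -D docs http://example.com/guide/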
spider http://twitter.github.io/bootstrap/
It will download all files within the same domain as twitter.github.io, and save them to download/twitter.github.io/.
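If you are behind a proxy, the proxy flags can presumably be added to the same command; the hostname and port below are made up for illustration:
spider -H proxy.example.com -P 8080 http://twitter.github.io/bootstrap/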
spider -b http://ruby-doc.org/core-2.0/ http://ruby-doc.org/core-2.0/
It will only download URLs that start with http://ruby-doc.org/core-2.0/. Note that asset files such as images, CSS, JS, and fonts will not obey the base_url rule.
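Given the warning at the top about dragging a site down, a politer crawl would slow things down and respect robots.txt. The settings below are only a guess at reasonable values (the delay is assumed to be in seconds):
spider -t 2 -d 1 -o -b http://ruby-doc.org/core-2.0/ http://ruby-doc.org/core-2.0/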