<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>imtinge (imtinge)</title>
    <link>https://ruby-china.org/imtinge</link>
    <description/>
    <language>en-us</language>
    <item>
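      <title>Server returns 403 when I crawl too frequently; how can I get around it?</title>
      <description>&lt;p&gt;I'm helping a classmate crawl a local news site. For each article I need the title, body, publication date, and hit count.&lt;/p&gt;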

&lt;p&gt;The hit count is generated by JavaScript:&lt;/p&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/count/index?id=2838670&amp;amp;amp;siteid=199"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
次
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full URL:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://www.trs.gov.cn/count/index?id=2838670&amp;amp;siteid=199
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Requesting that URL returns the following:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document.write("195")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following script can fetch it as well:&lt;/p&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;pubdate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/t(\d{8}_\d+)\./&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'_'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;count_link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://www.xxx.cn/count/index?id=&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;count_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;siteid=&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;site_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
  &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Nokogiri&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTML&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count_link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'User-Agent'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'ruby'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/\d+/&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;
&lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;count_link&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
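&lt;p&gt;But as soon as I request count_link from multiple threads, the server returns 403. Using fewer threads and adding sleep lowers the chance of a 403.
Once a 403 occurs, the hit count does not show up in a browser either (I previously saw the hit count go missing with phantomjs too). I asked a friend in another city to try, and they could not see it either, so it does not look like my IP is being blocked.
The 403 lasts for a while, and once that time has passed, access works normally again.&lt;/p&gt;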
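
&lt;p&gt;A minimal sketch of one possible mitigation, assuming the 403 is a rate limit that clears after a cool-down: retry with exponentially growing waits instead of hammering the endpoint. The names fetch_with_backoff, parse_hits, fetcher and sleeper are hypothetical and not part of the original script, and the delays are placeholders. Since the block seems to last a while, real delays may need to be much longer.&lt;/p&gt;

```ruby
# Hypothetical helper: retry a fetch with exponential backoff on a 403.
# `fetcher` is any callable that takes a URL and either returns the body
# or raises (open-uri raises an error whose message looks like "403 Forbidden").
# `sleeper` is injectable so the waiting can be stubbed out when experimenting.
def fetch_with_backoff(url, fetcher:, max_attempts: 5, base_delay: 1.0,
                       sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    fetcher.call(url)
  rescue StandardError => e
    # Give up once the attempt budget is spent, or if the error is not a 403.
    raise if attempt >= max_attempts || !e.message.include?('403')
    # Wait 1x, 2x, 4x, ... base_delay before trying again.
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Extract the number from a response body such as: document.write("195")
def parse_hits(body)
  match = body.match(/\d+/)
  match ? match[0].to_i : 0
end
```

&lt;p&gt;With open-uri this might be called as fetch_with_backoff(count_link, fetcher: ->(u) { URI.open(u, 'User-Agent' => 'ruby').read }). URI.open needs Ruby 2.5 or later; on older versions the bare open used in the script above plays the same role.&lt;/p&gt;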

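&lt;p&gt;Since this is a crawl, I'd naturally like it to run as fast as possible, but as soon as it speeds up I get 403s. Heartbroken; is there any way around this?
Thanks in advance, everyone.&lt;/p&gt;</description>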
      <author>imtinge</author>
      <pubDate>Mon, 18 Sep 2017 10:09:19 +0800</pubDate>
      <link>https://ruby-china.org/topics/34157</link>
      <guid>https://ruby-china.org/topics/34157</guid>
    </item>
    <item>
      <title>Help: watir raises “`rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)”</title>
      <description>&lt;p&gt;I'm new to watir. The script runs normally for a while and then raises an error; the same thing happens without threads. Any guidance would be much appreciated&lt;img title=":joy:" alt="😂" src="https://twemoji.ruby-china.com/2/svg/1f602.svg" class="twemoji"&gt; &lt;/p&gt;

&lt;p&gt;The code:&lt;/p&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_post_link_by_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_slice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Watir&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="ss"&gt;:phantomjs&lt;/span&gt;

  &lt;span class="n"&gt;page_slice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;

    &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;class: &lt;/span&gt;&lt;span class="s2"&gt;"text-list"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lis&lt;/span&gt;
    &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;href&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_with?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'html'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;include?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'trsyw'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;get_post_link_by_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
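&lt;p&gt;A minimal sketch of one way to keep the loop above alive when a single page load hangs, assuming the Net::ReadTimeout is transient and that repeating the navigation is safe. The helper name with_timeout_retry and its max_attempts parameter are hypothetical; this does not address whatever makes phantomjs stop responding in the first place.&lt;/p&gt;

```ruby
require 'net/http' # loads net/protocol, which defines Net::ReadTimeout

# Hypothetical helper: run the block and retry it a few times when the
# HTTP call to the browser driver times out. The block receives the
# current attempt number, starting at 1.
def with_timeout_retry(max_attempts: 3)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Net::ReadTimeout
    # Re-raise the timeout once the attempt budget is spent.
    raise if attempt >= max_attempts
    retry
  end
end
```

&lt;p&gt;Inside the loop this might wrap the navigation as with_timeout_retry { b.goto page }, so that one slow page does not end the whole thread.&lt;/p&gt;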
&lt;p&gt;The error:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/protocol.rb:176:in `rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/protocol.rb:154:in `readuntil'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/protocol.rb:164:in `readline'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http/response.rb:40:in `read_status_line'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http/response.rb:29:in `read_new'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1446:in `block in transport_request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1443:in `catch'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1443:in `transport_request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1416:in `request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1409:in `block in request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:877:in `start'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/2.4.0/net/http.rb:1407:in `request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/http/default.rb:124:in `response_for'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/http/default.rb:78:in `request'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/http/common.rb:61:in `call'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/bridge.rb:170:in `execute'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/oss/bridge.rb:581:in `execute'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/remote/oss/bridge.rb:52:in `get'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/selenium-webdriver-3.5.2/lib/selenium/webdriver/common/navigation.rb:32:in `to'
    from /Users/tanagisa/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/watir-6.8.4/lib/watir/browser.rb:82:in `goto'
    from test.rb:41:in `block in get_post_link_by_slice'
    from test.rb:40:in `each'
    from test.rb:40:in `get_post_link_by_slice'
    from test.rb:60:in `block in &amp;lt;main&amp;gt;'
    from test.rb:58:in `each_slice'
    from test.rb:58:in `with_index'
    from test.rb:58:in `&amp;lt;main&amp;gt;'
&lt;/code&gt;&lt;/pre&gt;</description>
      <author>imtinge</author>
      <pubDate>Wed, 13 Sep 2017 16:45:37 +0800</pubDate>
      <link>https://ruby-china.org/topics/34109</link>
      <guid>https://ruby-china.org/topics/34109</guid>
    </item>
  </channel>
</rss>
