Share: Sidekiq: Two Risks That Lead to Frozen Workers

xiaoronglv · January 15, 2015 · Last by pathbox replied at January 27, 2016 · 7321 hits

Something strange happened to Sidekiq in production today: all 16 threads were stuck in the busy state.

[boohee@tiger ~]$ ps aux|grep plan
boohee   sidekiq 3.2.6 plan [16 of 16 busy]

All asynchronous jobs were stuck, and nothing was being written to sidekiq.log.

So what had gone wrong? I started guessing.

Guess 1: Had the Sidekiq process hung?

The project is deployed on two servers, tiger and lion.

On tiger, every Sidekiq thread was busy.

[boohee@tiger ~]$ ps aux|grep plan
boohee   sidekiq 3.2.6 plan [16 of 16 busy]

On lion, every Sidekiq thread was busy too.

[boohee@lion ~]$ ps aux|grep plan
boohee   sidekiq 3.2.6 plan [16 of 16 busy]

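It seemed too much of a coincidence for Sidekiq to hang on both servers at the same time.

I restarted Sidekiq with Capistrano. The restart succeeded and the backlog of jobs started running again. (So the Sidekiq process was still responding to signals.)

Five minutes later, all the Sidekiq threads were busy again, and everything was stuck once more!

Guess 2: Was some slow job blocking all the threads?

After some Googling I found the Sidekiq document "Problems and Troubleshooting", which describes two common causes of frozen workers: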

If all your workers are "frozen" or no jobs seem to be finishing, it's possible a remote network call is pending forever. This is common in two scenarios:

DNS lookup - resolving a hostname might hang. This has a serious side effect in MRI of locking up everything because of the way MRI uses DNS by default. The solution is to run require 'resolv-replace' in your initializer, which installs a pure Ruby DNS resolver that works concurrently.
Net::HTTP - unresponsive remote servers can cause a Net::HTTP call to hang and lock up your workers. Set open_timeout to ensure your code raises an exception rather than hanging forever.
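
The second scenario in the quoted passage can be sketched with plain Net::HTTP. This is only a minimal illustration, not code from the Sidekiq document: the helper name build_http, the host api.example.com, and the 5 and 10 second values are all made up for the example.

```ruby
require 'net/http'

# Hypothetical helper: build a Net::HTTP client with explicit timeouts.
# Creating the object does not open a connection yet.
def build_http(host, port = 80, open_timeout: 5, read_timeout: 10)
  http = Net::HTTP.new(host, port)
  http.open_timeout = open_timeout  # max seconds to wait for the connection to open
  http.read_timeout = read_timeout  # max seconds to wait for each read from the socket
  http
end

# api.example.com is a placeholder host, not a real endpoint from this post.
http = build_http('api.example.com')
```

With both values set, a request made through this client should raise a timeout exception instead of blocking the worker thread indefinitely.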

As soon as I finished reading it, I realized where the problem in my code was.

class XiaomiUserProjectNotifier
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform(project_id)

    # some business code omitted here

    begin
      # no explicit timeout is passed to this call
      res = RestClient.post url, params
    rescue => err
      raise err
    ensure
      # some business code omitted here
    end
  end
end

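I was calling a Xiaomi API with RestClient, but that API is not very stable. When it did not respond, the Sidekiq thread just sat there waiting: minutes... hours... days...

In the end, every thread was lost to this problem (16 of 16 busy).

The fix

Once the cause was known, the fix was simple.

I set the RestClient timeouts to 10 seconds and deployed, and the problem went away.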

class XiaomiUserProjectNotifier
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform(project_id)

    # some business code omitted here

    begin
      res = RestClient::Request.execute(
        :method => :post,
        :url => url,
        :payload => params,
        :timeout => 10,       # seconds to wait for the response
        :open_timeout => 10   # seconds to wait for the connection to open
      )
    rescue => err
      raise err
    ensure
      # some business code omitted here
    end
  end
end

Summary

When an asynchronous job calls an external API, always set a timeout; otherwise Sidekiq is at risk of freezing.

20 likes

xiaoronglv #1 January 15, 2015

After writing this post, I still have two questions:

Does Rails already require 'resolv-replace'?
Do I need to require 'resolv-replace' in Sidekiq?

Rei #2 January 15, 2015

When calling the Alipay API, you need not only a timeout but also a retry.

1 likes
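
The "timeout plus retry" idea above could look roughly like the sketch below. It is an assumption-laden illustration rather than anyone's production code: with_retries is a hypothetical helper, and the failing call is simulated instead of hitting a real API. Note also that the `sidekiq_options retry: 3` setting in the original post already makes Sidekiq re-run a job that raises, so an in-job retry like this is optional.

```ruby
require 'net/http'

# Hypothetical helper: run the block and retry it, up to `attempts` times,
# when it raises one of the listed timeout errors. Other errors propagate.
def with_retries(attempts: 3, on: [Net::OpenTimeout, Net::ReadTimeout])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    retry if tries < attempts
    raise  # attempts exhausted: re-raise the last timeout error
  end
end

# Simulated flaky call: times out on the first two tries, then succeeds.
result = with_retries(attempts: 3) do |try|
  raise Net::ReadTimeout if try < 3
  "ok on try #{try}"
end
```

The retry count is bounded on purpose, so a server that never answers still ends in an exception rather than an endless loop.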

serco #3 January 15, 2015

#1 @xiaoronglv

With multithreading, you always need to watch out for C extensions that hold the GIL and block the whole process.

gazeldx #4 January 16, 2015

Great article!

hanluner #5 January 16, 2015

Looks like this is for sending Mi Push messages to Xiaomi users.

zgm #6 January 16, 2015

An outstanding article.

huacnlee #7 January 16, 2015

Every IO request needs a timeout.

hooopo #8 January 16, 2015

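#7 @huacnlee Yes. The unicorn documentation stresses this many times... Another common trap is putting external calls in callbacks such as after_save/after_update, which keeps the transaction open for a long time.

Always set a timeout on every external call.
Always make external calls in after_commit.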

12 likes

dddd1919 #9 January 16, 2015

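Reading this was a good reminder for me too. When I did multithreaded work in the past, processes often hung and I blamed the threads for being unstable; I should have been setting timeouts :plus1: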

small_fish__ #10 January 16, 2015

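#1 @xiaoronglv You no longer need to worry about resolv-replace here, because of this: https://github.com/mperham/sidekiq/blob/master/Changes.md#2160

#7 @huacnlee IO operations are indeed something to watch out for, and the problem in this post is probably an instance of that. There is one thing I am curious about.

Suppose the cause were the first scenario mentioned above: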

DNS lookup - resolving a hostname might hang. This has a serious side effect in MRI of locking up everything because of the way MRI uses DNS by default. The solution is to run require 'resolv-replace' in your initializer, which installs a pure Ruby DNS resolver that works concurrently.

Then even with only 1 of 16 busy, would the whole Sidekiq process be stuck and unable to take any more jobs?

＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝

I just discussed this with a colleague and, after some testing, wrote a blog post about it.

serco #11 January 16, 2015

#1 @xiaoronglv My guess is that the first scenario should no longer happen.

2.16.1 (sidekiq changelog)
Revert usage of resolv-replace. MRI's native DNS lookup releases the GIL.

1 likes

jimrokliu #12 January 16, 2015

I have run into a case where document processing hung. It was not network related, so I had to add a separate timeout myself.

pathbox #13 January 27, 2016

I read this post again today. If I use Resque with multiple processes, will I run into the same kind of problem as with Sidekiq?
