Ruby: what is a good way to pick a key at random according to given probabilities?

FenRagwort · 2012-03-31 · last reply by quakewang on 2012-04-02 · 5899 views

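Say I have a hash like {a: 0.5, b: 0.15, c: 0.2, d: 0.15}, where each value is the probability of its key. I want to pick one key at random so that the chance of picking it equals its value: for example, :a has a 50% chance of being picked and :c has a 20% chance. What is a good way to implement this?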

4 likes

fredwu #1 2012-03-31

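hash = { :a => 0.5, :b => 0.15, :c => 0.2, :d => 0.15 }
keys = []

# fill 100 slots, giving each key as many slots as its percentage
hash.each do |k, v|
  # round, not to_i: e.g. 0.29 * 100 is 28.999999999999996 and would truncate to 28
  (v * 100).round.times { keys << k }
end

# a uniform pick over the slots is a weighted pick over the keys
p keys.sample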

hayeah #2 2012-03-31

This algorithm works with relative weights. O(N) time, O(1) space.

pairs = {a: 1, b: 2}

# weights must be positive integers (relative proportions)
def freq(pairs)
  total = pairs.values.inject(0) { |sum,n| sum + n }
  # pick a number (1..total) inclusive
  target = rand(total)+1
  pairs.each do |key,weight|
    if target <= weight
      return key
    else
      # skip past this key's share and keep looking
      target -= weight
    end
  end
end

p freq(pairs)

fredwu #3 2012-03-31

Here's a one-liner as well ;)

hash = { :a => 0.5, :b => 0.15, :c => 0.2, :d => 0.15 }

p hash.to_a.map { |el| Array.new(el[1]*100, el[0]) }.flatten.sample

1 like

doitian #4 2012-03-31

Google "Alias Method": O(1) time per sample.

doitian #5 2012-03-31

This blog post by @miloyip has a very detailed explanation and comparison.

FenRagwort #6 2012-03-31

Re #5 @doitian: This is a great resource, thank you very much!

ashchan #7 2012-03-31

Re #2 @hayeah: Compared with @fredwu's approach of pre-allocating slots, this one is still far slower overall, even if the total is hoisted out and computed only once.

hayeah #8 2012-03-31

Re #7 @ashchan @fredwu: The main point is saving space. With a 1500:1 ratio, you don't have to create an array of length 1501.

ashchan #9 2012-03-31

Re #8 @hayeah: I believe an integer scale of 100 covers most cases. A 1500:1 ratio is too extreme :)

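FenRagwort #10 2012-03-31

Re #1 @fredwu: Mine was only an example. The real hash may have several thousand samples, each with a probability of around 0.001. Generating the array this way would need a huge array, which isn't worth it.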

FenRagwort #11 2012-03-31

Re #2 @hayeah: I've seen a feature in NLTK that is implemented the same way.

FenRagwort #12 2012-03-31

Re #4 @doitian: I can't read the language used on the page you linked. I only know Ruby, Python, and Perl. How would the alias method be implemented in Ruby?

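doitian #13 2012-03-31

Re #11 @FenRagwort: Binary search is the easier one to implement. You only need to compute the accumulated probability (the cumulative distribution function, CDF). For your example that means computing two parallel arrays:

[:a,  :b,   :c,   :d]
[0.5, 0.65, 0.85, 1]

Sample a number in (0, 1], binary-search for it in the second array, and use the index you find to take the value from the first array. "Accumulated" means the sum of all probabilities up to and including the current position.

Constructing the alias method is a bit complicated. I once wrote an article analyzing it, but I searched and couldn't find a Ruby implementation. Binary search is probably enough anyway: for a few thousand entries it only takes about 10 to 14 steps.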
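A minimal sketch of the binary-search idea described above, not code from the thread. The names `build_cdf` and `sample_key` are made up for illustration. It assumes the probabilities sum to roughly 1 and Ruby 2.3+ for `Array#bsearch_index`. Because Ruby's `rand` returns a value in [0, 1) rather than (0, 1], the search looks for the first cumulative value strictly greater than the draw.

```ruby
# Hypothetical helper: turn {key => probability} into two parallel arrays.
def build_cdf(weights)
  keys = weights.keys
  running = 0.0
  cdf = weights.values.map { |w| running += w } # running sum up to each position
  cdf[-1] = 1.0 # absorb float drift; assumes the weights sum to about 1
  [keys, cdf]
end

# Hypothetical helper: r is a number in [0, 1); defaults to a fresh random draw.
def sample_key(keys, cdf, r = rand)
  # find-minimum mode: first index whose cumulative value exceeds r
  index = cdf.bsearch_index { |c| c > r }
  keys[index]
end

weights = { a: 0.5, b: 0.15, c: 0.2, d: 0.15 }
keys, cdf = build_cdf(weights)
p sample_key(keys, cdf)
```

Building the arrays is O(N) and is done once; each draw afterwards is O(log N).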

fredwu #14 2012-03-31

If you need high performance and have to handle large amounts of data, you can do this:

hash = { :a => 0.5, :b => 0.15, :c => 0.2, :d => 0.15 }
marker = rand

p hash.inject{ |s, n| s[1] > marker ? s : [n[0], s[1]+n[1]] }[0]

This algorithm has one small flaw: strictly speaking, float precision makes the results slightly off. For example:

0.5+0.15+0.2
# => 0.8500000000000001

fredwu #15 2012-03-31

I benchmarked the algorithms posted so far (100000 runs each): my three plus @hayeah's.

                           user     system      total        real
fredwu's traditional   1.610000   0.010000   1.620000 (  1.622086)
fredwu's one liner     1.920000   0.020000   1.940000 (  1.946811)
fredwu's enumerator    0.180000   0.000000   0.180000 (  0.180438)
hayeah's               0.480000   0.000000   0.480000 (  0.480785)

doitian #16 2012-03-31

That post is rather long-winded; read this other one instead: http://www.cnblogs.com/miloyip/archive/2010/04/21/1717109.html (it uses JavaScript)

hayeah #17 2012-04-01

Re #14 @fredwu: Nice~

vkill #18 2012-04-01

Re #14 @fredwu: You could use rand(100); that way the float precision problem doesn't come up.
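One way to read the rand(100) suggestion, as a sketch only: keep the weights as whole percentages so every comparison is integer arithmetic. `pick_by_percent` is a hypothetical name, and the code assumes each probability is a whole number of percent and Ruby 2.4+ for `Array#sum`. It is essentially the same walk as in #2, shifted to a zero-based roll.

```ruby
# Hypothetical helper: percents maps key => integer weight (here, whole percent).
# roll is an integer in 0...total; it defaults to a fresh random draw.
def pick_by_percent(percents, roll = rand(percents.values.sum))
  percents.each do |key, weight|
    return key if roll < weight # roll landed inside this key's share
    roll -= weight              # otherwise skip past this key's share
  end
end

percents = { a: 50, b: 15, c: 20, d: 15 }
p pick_by_percent(percents)
```

With integers there is no accumulated rounding error, at the cost of limiting the resolution to whatever scale the weights are expressed in.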

doitian #19 2012-04-01

Re #14 @fredwu: The CDF can be computed in advance. If the PDF is also sorted in descending order beforehand:

values: :a   :c   :b   :d
PDF:    0.5  0.2  0.15 0.15
CDF:    0.5  0.7  0.85 1

a linear search becomes much faster.

doitian #20 2012-04-01

hash = { :a => 0.5, :b => 0.15, :c => 0.2, :d => 0.15 }

cdf = hash.to_a.sort_by {|e| -e.last}
cumulative_probability = 0
cdf.each { |e| cumulative_probability = e[1] = e[1] + cumulative_probability }
cdf.last[1] = 1 # fix float error

marker = rand
result = cdf.find {|e| marker < e.last }
result.first

fredwu #21 2012-04-01

@doitian It still doesn't run as fast as the previous two ;)

                           user     system      total        real
fredwu's traditional   1.790000   0.010000   1.800000 (  1.832200)
fredwu's one liner     1.840000   0.020000   1.860000 (  1.864844)
fredwu's enumerator    0.180000   0.000000   0.180000 (  0.182008)
hayeah's               0.480000   0.000000   0.480000 (  0.478611)
doitian's              0.780000   0.000000   0.780000 (  0.785653)

doitian #22 2012-04-01

Re #21 @fredwu: Did you run the whole thing 100000 times, or only the part that generates the random value? Could you post your test code?

fredwu #23 2012-04-01

Re #22 @doitian: The whole thing. hash.to_a.sort_by takes up a lot of the run time. Actually, I don't think the sort is needed.
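The two measurements being discussed (the whole procedure on every run versus only the draw) can be separated with a harness along these lines. This is a hypothetical sketch, not the benchmark code actually used in the thread; `build_cdf` and `draw` are illustrative names wrapped around the approach from #20, and no timings are claimed.

```ruby
require 'benchmark'

# Build the sorted cumulative table: [[key, cumulative_probability], ...]
def build_cdf(hash)
  running = 0.0
  cdf = hash.sort_by { |_, prob| -prob }.map { |key, prob| [key, running += prob] }
  cdf.last[1] = 1.0 # absorb float drift, as in #20
  cdf
end

# Draw one key by linear search over a prebuilt table
def draw(cdf)
  marker = rand
  cdf.find { |_, cumulative| marker < cumulative }.first
end

hash = { a: 0.5, b: 0.15, c: 0.2, d: 0.15 }
runs = 100_000

Benchmark.bm(24) do |x|
  # sort + accumulate + draw on every iteration
  x.report('build + draw every run') { runs.times { draw(build_cdf(hash)) } }
  # table built once outside the timed block, only the draw is measured
  prebuilt = build_cdf(hash)
  x.report('build once, draw only')  { runs.times { draw(prebuilt) } }
end
```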

fredwu #24 2012-04-01

With a very large hash, for example:

hash_elements = 10000
big_hash = {}
hash_elements.times { |n| big_hash[n] = 1.0/hash_elements }

and with the benchmark run only once:

                           user     system      total        real
fredwu's traditional   0.010000   0.000000   0.010000 (  0.010952)
fredwu's one liner     0.010000   0.000000   0.010000 (  0.011436)
fredwu's enumerator    0.000000   0.000000   0.000000 (  0.005906)
hayeah's               0.010000   0.000000   0.010000 (  0.007418)
doitian's              0.010000   0.000000   0.010000 (  0.010415)

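doitian #25 2012-04-01

Re #23 @fredwu: If the distribution is uneven and the table is built once and used n times, the effect of sorting becomes very noticeable. For example, with one probability of 0.91 and nine of 0.01, 91% of the time only one comparison is needed.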
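To put a number on that example, here is a small sketch that computes the expected number of comparisons per draw for a linear scan, given the order in which the probabilities are laid out. `expected_comparisons` is a hypothetical helper name; it assumes a key at position i (1-based) costs i comparisons to reach.

```ruby
# Expected comparisons for a linear scan: sum of probability * 1-based position.
def expected_comparisons(probabilities)
  probabilities.each_with_index.map { |prob, i| prob * (i + 1) }.sum
end

probs = [0.91] + [0.01] * 9 # one likely key, nine rare ones

descending = probs.sort.reverse # likely key first
ascending  = probs.sort         # likely key last

puts expected_comparisons(descending).round(2) # about 1.45
puts expected_comparisons(ascending).round(2)  # about 9.55
```

For this distribution the descending order averages about 1.45 comparisons per draw, against about 9.55 when the likely key sits at the end.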

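doitian #26 2012-04-01

Re #24 @fredwu: I ran a combined test covering both benchmark and error. The error is the difference between the frequencies actually produced and the original probability distribution.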
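A sketch of how such an error could be measured. The exact definition used in the test is not given here, so this assumes the largest absolute difference between observed frequency and stated probability; `draw` and `max_error` are hypothetical names, and `draw` is a plain cumulative scan standing in for whichever sampler is being checked.

```ruby
# Stand-in sampler: linear scan over the running sum.
def draw(hash, rng)
  marker = rng.rand
  acc = 0.0
  hash.each { |key, prob| return key if marker < (acc += prob) }
  hash.keys.last # only reached if float drift leaves the sum below marker
end

# Assumed error metric: max |observed frequency - stated probability|.
def max_error(hash, n, rng = Random.new)
  counts = Hash.new(0)
  n.times { counts[draw(hash, rng)] += 1 }
  hash.map { |key, prob| (counts[key].fdiv(n) - prob).abs }.max
end

hash = { a: 0.5, b: 0.15, c: 0.2, d: 0.15 }
puts max_error(hash, 100_000, Random.new(42)) # fixed seed for repeatability
```

The value shrinks as n grows, so comparisons between samplers are only meaningful at the same n.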

doitian #27 2012-04-01

Forgot the link: https://gist.github.com/2272325 (it includes the alias method).
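For readers who want a Ruby version to study, here is a compact sketch of the alias method in the variant usually attributed to Vose. It is written for this thread as an illustration and is not the code in the gist; the class name `AliasSampler` and its internals are assumptions. Setup is O(N), each draw is O(1).

```ruby
# Sketch of the alias method; names are illustrative, not from the gist.
class AliasSampler
  def initialize(weights)
    @keys  = weights.keys
    n      = @keys.size
    total  = weights.values.sum.to_f
    scaled = weights.values.map { |w| w * n / total } # rescale so the mean is 1.0
    @prob        = Array.new(n, 1.0)     # chance of keeping column i's own key
    @alias_index = Array.new(n) { |i| i } # key to use otherwise

    small, large = (0...n).partition { |i| scaled[i] < 1.0 }
    until small.empty? || large.empty?
      s = small.pop
      l = large.pop
      @prob[s]        = scaled[s]
      @alias_index[s] = l
      # column l gave (1 - scaled[s]) of its mass to top up column s
      scaled[l] = scaled[l] + scaled[s] - 1.0
      (scaled[l] < 1.0 ? small : large) << l
    end
    # any leftover columns keep probability 1.0, which absorbs float drift
  end

  # rng may be the Random class itself or a Random instance
  def sample(rng = Random)
    i = rng.rand(@keys.size) # pick a column uniformly
    rng.rand < @prob[i] ? @keys[i] : @keys[@alias_index[i]]
  end
end

sampler = AliasSampler.new({ a: 0.5, b: 0.15, c: 0.2, d: 0.15 })
p sampler.sample
```

Each draw costs one integer draw and one float draw regardless of how many keys there are, which is where the O(1) claim from #4 comes from.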

fredwu #28 2012-04-02

...

fredwu #29 2012-04-02

I just sent this to my teammates as a brain teaser to play with. One colleague wrote a version in Python:

from random import randint

my_dict = { 'a': 0.5, 'b': 0.15, 'c': 0.2, 'd': 0.15 }

# integer in 1..100 inclusive
rand_val = randint(1, 100)

start_range = 0

for element in sorted(my_dict.keys()):
    # upper bound of this key's slice, in whole percent
    current_val = round(my_dict[element] * 100) + start_range
    if start_range < rand_val <= current_val:
        print(element)
        break  # only one slice can match
    start_range = current_val

quakewang #30 2012-04-02

Ruby 2.0 should come with a built-in sample-by-weight method: http://bugs.ruby-lang.org/issues/4147

Re #27 @doitian: The alias method is a really cool algorithm. It took me over an hour to understand it...
