Gem 刚发了个 gem: triez (加了个全文搜索服务器的例子)

luikore · 2013年02月04日 · 最后由 luikore 回复于 2013年02月13日 · 6510 次阅读
本帖已被管理员设置为精华贴

顾名思义是个 trie 的实现... 的包装

使用了当代最牛数据结构之一:HAT trie, 速度完爆双数组,内存踢飞哈希表。

gem ins triez

如果你的程序有巨大 (百万数据之类的) 的 hash, 而且 hash 的键值都是字符串,而且你不关心插入顺序,可以用 triez 替代,可以节省一些内存。

h = Triez.new value_type: :object, default: 'default'
h['key'] = 'value'
h['key'] #=> 'value'
h.has_key? 'key' #=> true
h['not exist key'] #=> 'default'

支持前缀搜索,利用后缀树 + 前缀搜索,就能做高效率的全文搜索了:

require 'triez'

# 3 个 DNA 序列, id 分别为 1, 2, 3
sequences = {
  'ACTGAAAAAAACTG' => 1,
  'ATACGGTCCA' => 2,
  'GCTTGTACGT' => 3
}

t = Triez.new

# 插入文档, 建个后缀树
sequences.each do |seq, id|
  t.change_all(:suffix, seq){ id }
end

# 前缀搜索 (算法复杂度是 O(n), n = 查询字符串的长度而不是文档长度哦)
t.search_with_prefix 'CGGT' do |_, id|
  puts id #=> 2
end

API 详见 https://github.com/luikore/triez

一个朴素的全文搜索服务器: https://github.com/luikore/triez/tree/master/examples/full-text-search-server

当然一般文档多了,也未必能全部挤进内存里... 如果有 魔理沙 那样用上 mmap 写法的话才能做全文搜索索引... 而且 HAT trie 的内存占用虽然小,还是没有 MARISA 那种不可修改的数据结构小。


还能用来查找最大公共子串哦:

sentences = %w[
  万塘路一锅鸡
  文二路一锅鸡
  来一锅鸡顶盒
  一锅鸡胗
]

# 值是用任意长整数代表的 bitset
t = Triez.new value_type: :object, default: 0

sentences.each_with_index do |sentence, i|
  elem = 1 << i
  t.change_all :substring, sentence do |v|
    # 并集
    v | elem
  end
end

# lcs = longest common substring
lcs = ''
# 全集
universe = (1 << sentences.size) - 1
t.each do |k, v|
  lcs = k if (k.size > lcs.size and v == universe)
end
puts lcs #=> 一锅鸡

@hooopo @whitecrow @fleuria 其实发出来是想看有没有各种编译不了或者 segfault 的...

#1 楼 @hooopo #2 楼 @whitecrow #3 楼 @fleuria

我谨代表 @luikore 鄙视你们这些 Mark 党..

› gem i triez
Fetching: triez-0.1.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing triez:
    ERROR: Failed to build gem native extension.

        /Users/saito/.rvm/rubies/ruby-1.9.3-p374/bin/ruby extconf.rb
creating Makefile
/usr/local/bin/gcc-4.2 -O3 -c -I.. ../hat-trie/ahtable.c ../hat-trie/hat-trie.c ../hat-trie/murmurhash3.c
In file included from ../hat-trie/ahtable.h:28,
                 from ../hat-trie/ahtable.c:8:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
In file included from ../hat-trie/ahtable.h:28,
                 from ../hat-trie/hat-trie.c:9:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
In file included from ../hat-trie/murmurhash3.h:7,
                 from ../hat-trie/murmurhash3.c:4:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
ar -r libtries.a
ar: no archive members specified
usage:  ar -d [-TLsv] archive file ...
    ar -m [-TLsv] archive file ...
    ar -m [-abiTLsv] position archive file ...
    ar -p [-TLsv] archive [file ...]
    ar -q [-cTLsv] archive file ...
    ar -r [-cuTLsv] archive file ...
    ar -r [-abciuTLsv] position archive file ...
    ar -t [-TLsv] archive [file ...]
    ar -x [-ouTLsv] archive [file ...]

make
compiling hat-stub.c
compiling triez.cc
cc1plus: warning: command line option "-Wdeclaration-after-statement" is valid for C/ObjC but not for C++
cc1plus: warning: command line option "-Wimplicit-function-declaration" is valid for C/ObjC but not for C++
linking shared-object triez.bundle
ld: library not found for -ltries
collect2: ld returned 1 exit status
make: *** [triez.bundle] Error 1


Gem files will remain installed in /Users/saito/.rvm/gems/ruby-1.9.3-p374/gems/triez-0.1 for inspection.
Results logged to /Users/saito/.rvm/gems/ruby-1.9.3-p374/gems/triez-0.1/ext/gem_make.out

#5 楼 @Saito very helpful! 回帖的模范!

那个 pstdint.h 是 hat-trie 带的,文件上头注释里写了这么长说这个是为了强大的可移植性,结果开门就引了个 windows 绝对没有的 signal.h 囧死...

gem ins triez
Fetching: triez-0.1.gem (100%)
Building native extensions.  This could take a while...
Successfully installed triez-0.1
1 gem installed

ubuntu 装上毫不费力,不过弄懂了要费一番力气了

@Saito fixed, version=0.2 了

#10 楼 @luikore Awesome! Great! cool! Good job!

› gem i tries  
Fetching: tries-0.2.0.gem (100%)
Successfully installed tries-0.2.0
1 gem installed
[2] pry(main)> require "triez"
=> true
[3] pry(main)> t = Triez.new suffix: true
=> #<Triez:0x007f88cc936ff0>
[4] pry(main)> sequences = {
[4] pry(main)*   'ACTGAAAAAAACTG' => 1,  
[4] pry(main)*   'ATACGGTCCA' => 2,  
[4] pry(main)*   'GCTTGTACGT' => 3  
[4] pry(main)* }  
=> {"ACTGAAAAAAACTG"=>1, "ATACGGTCCA"=>2, "GCTTGTACGT"=>3}
[5] pry(main)> sequences.each do |seq, id|
[5] pry(main)*   t[seq] = id  
[5] pry(main)* end  
=> {"ACTGAAAAAAACTG"=>1, "ATACGGTCCA"=>2, "GCTTGTACGT"=>3}
[6] pry(main)> t.search_with_prefix 'CGGT' do |_, id|
[6] pry(main)*   puts id #=> 2  
[6] pry(main)* end  
2

话说这个方法 search_with_prefix 是不是应该叫 fuzzy_search..

这不是 prefix_search 啊..

[7] pry(main)> t.search_with_prefix 'AC' do |_, id|
[7] pry(main)*   puts id
[7] pry(main)* end  
1
1
3
2

每个字符串不应该 search 两次?还是这才是正确的行为。

#13 楼 @Saito 这是个 suffix trie, 插入 'ACACD' 话相当于在正常 trie 里插入 'ACACD', 'CACD', 'ACD', 'CD', 'D' 这 5 项 (参考 readme 里万塘路一锅鸡的例子), 所以前缀搜索就能实现全文搜索的效果...

P.S. 是 triez 不是 tries 啊...

#14 楼 @luikore 好吧,从 trie 看确实是 search_with_prefix 的搜索..

貌似一切都说的通了..

包装 Triez 的 gem 可以是 fuzzy_search ..

我这里:

$ uname -a
Linux John-ThinkPad-X220 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ gem install triez
Fetching: triez-0.2.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing triez:
    ERROR: Failed to build gem native extension.

        /home/john/.rvm/rubies/ruby-1.9.3-p327-falcon/bin/ruby extconf.rb
creating Makefile
gcc -O3 -std=c99 -Wall -pedantic -c -I.. ../hat-trie/murmurhash3.c ../hat-trie/hat-trie.c ../hat-trie/ahtable.c
ar -r libtries.a murmurhash3.o hat-trie.o ahtable.o
ar: 正在创建 libtries.a

make
compiling hat-stub.c
compiling triez.cc
cc1plus: 警告: command line option ‘-Wdeclaration-after-statement’ is valid for C/ObjC but not for C++ [默认启用]
cc1plus: 警告: command line option ‘-Wimplicit-function-declaration’ is valid for C/ObjC but not for C++ [默认启用]
linking shared-object triez.so
/usr/bin/ld: build/libtries.a(hat-trie.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
build/libtries.a: 无法读取符号: 错误的值
collect2: 错误: ld 返回 1
make: *** [triez.so] 错误 1


Gem files will remain installed in /home/john/.rvm/gems/ruby-1.9.3-p327-falcon/gems/triez-0.2 for inspection.
Results logged to /home/john/.rvm/gems/ruby-1.9.3-p327-falcon/gems/triez-0.2/ext/gem_make.out

win xp 下安装成功了

#16 楼 @fsword 好生奇怪... 一般这种问题是只有 RHEL 才有才对... 已 fix, 现在 0.3 了 @keating 同样是 ubuntu... @ywjno xp 威武

Windows 7 32 位 安装没问题

不明觉厉……

好东西收藏 不过,我觉得,下面这个例子更好理解:

require 'triez'
words = %w[readme, rot, red, rah, rasterization]
t = Triez.new
words.each do |word|
  t[word] = 1
end
t.search_with_prefix 're' do |word|
  puts "candidate: #{word}"
end

输出: candidate: readme candidate: red

魔理沙。。。第一个人就想到东方里面的黑白了

不过给的那个 link 在解析出来的时候出现错误了,@luikore

呵呵 学 C 真好 我也要学 呼呼~~

#22 楼 @ywjno thanks, 已修正

#25 楼 @i5ting 我只是利用了大神的成果而已... 翻译如下:

简介:trie 这个主意看起来很简单,但是指针吃了很多内存,burst trie 做得不错,说不定能和几年前发的 array hash 结合起来 背景:列了很多年份 HAT-trie: 主意就是把 burst trie 的二叉搜索树替换成 array hash 的 bucket, 看那张图就可以了... 设计:搞了两个版本,pure 版压缩性一般,hybrid 版不错 性能评测:另外由于链表版二叉搜索树是内存杀手,还专门写了压缩版的二叉搜索树来说明它不能行

tt = Triez.new obj_value: true ArgumentError: Unknown options: [:obj_value], only [:value_type, :default] are allowed from /Users/wenleslie/.rvm/gems/ruby-1.9.3-p374/gems/triez-1.0/lib/triez.rb:27:in `initialize'

版本问题?

#27 楼 @googya 是的,名字不太好,1.0 改了-_- Triez.new value_type: :object

加了个只做短语切分而不分词的全文搜索服务器的简单例子... 刚测试了索引卫斯理全集效果还行...

#29 楼 @luikore 卫斯理索引树建好后,内存消耗了多少啊

#30 楼 @hlxwell 30M 的小说,内存索引吃了 190M -_-

@luikore 那你能持久化到 redis 吗,用硬盘作内存的扩展。

#32 楼 @hlxwell 不能... 不过可以在插入时试试把超长长字符串的后面部分截断持久化

需要 登录 后方可回复, 如果你还没有账号请 注册新账号