顾名思义是个 trie 的实现... 的包装
使用了当代最牛数据结构之一:HAT trie, 速度完爆双数组,内存踢飞哈希表。
gem ins triez
如果你的程序有巨大 (百万数据之类的) 的 hash, 而且 hash 的键值都是字符串,而且你不关心插入顺序,可以用 triez 替代,可以节省一些内存。
h = Triez.new value_type: :object, default: 'default'
h['key'] = 'value'
h['key'] #=> 'value'
h.has_key? 'key' #=> true
h['not exist key'] #=> 'default'
支持前缀搜索,利用后缀树 + 前缀搜索,就能做高效率的全文搜索了:
require 'triez'
# 3 个 DNA 序列, id 分别为 1, 2, 3
sequences = {
'ACTGAAAAAAACTG' => 1,
'ATACGGTCCA' => 2,
'GCTTGTACGT' => 3
}
t = Triez.new
# 插入文档, 建个后缀树
sequences.each do |seq, id|
t.change_all(:suffix, seq){ id }
end
# 前缀搜索 (算法复杂度是 O(n), n = 查询字符串的长度而不是文档长度哦)
t.search_with_prefix 'CGGT' do |_, id|
puts id #=> 2
end
API 详见 https://github.com/luikore/triez
一个朴素的全文搜索服务器: https://github.com/luikore/triez/tree/master/examples/full-text-search-server
当然一般文档多了,也未必能全部挤进内存里... 如果有 魔理沙 那样用上 mmap 写法的话才能做全文搜索索引... 而且 HAT trie 的内存占用虽然小,还是没有 MARISA 那种不可修改的数据结构小。
还能用来查找最大公共子串哦:
sentences = %w[
万塘路一锅鸡
文二路一锅鸡
来一锅鸡顶盒
一锅鸡胗
]
# 值是用任意长整数代表的 bitset
t = Triez.new value_type: :object, default: 0
sentences.each_with_index do |sentence, i|
elem = 1 << i
t.change_all :substring, sentence do |v|
# 并集
v | elem
end
end
# lcs = longest common substring
lcs = ''
# 全集
universe = (1 << sentences.size) - 1
t.each do |k, v|
lcs = k if (k.size > lcs.size and v == universe)
end
puts lcs #=> 一锅鸡
#1 楼 @hooopo #2 楼 @whitecrow #3 楼 @fleuria
我谨代表 @luikore 鄙视你们这些 Mark 党..
› gem i triez
Fetching: triez-0.1.gem (100%)
Building native extensions. This could take a while...
ERROR: Error installing triez:
ERROR: Failed to build gem native extension.
/Users/saito/.rvm/rubies/ruby-1.9.3-p374/bin/ruby extconf.rb
creating Makefile
/usr/local/bin/gcc-4.2 -O3 -c -I.. ../hat-trie/ahtable.c ../hat-trie/hat-trie.c ../hat-trie/murmurhash3.c
In file included from ../hat-trie/ahtable.h:28,
from ../hat-trie/ahtable.c:8:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
In file included from ../hat-trie/ahtable.h:28,
from ../hat-trie/hat-trie.c:9:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
In file included from ../hat-trie/murmurhash3.h:7,
from ../hat-trie/murmurhash3.c:4:
../hat-trie/pstdint.h:711: error: conflicting types for ‘uintptr_t’
/usr/include/i386/types.h:109: error: previous declaration of ‘uintptr_t’ was here
../hat-trie/pstdint.h:712: error: conflicting types for ‘intptr_t’
/usr/include/i386/types.h:105: error: previous declaration of ‘intptr_t’ was here
ar -r libtries.a
ar: no archive members specified
usage: ar -d [-TLsv] archive file ...
ar -m [-TLsv] archive file ...
ar -m [-abiTLsv] position archive file ...
ar -p [-TLsv] archive [file ...]
ar -q [-cTLsv] archive file ...
ar -r [-cuTLsv] archive file ...
ar -r [-abciuTLsv] position archive file ...
ar -t [-TLsv] archive [file ...]
ar -x [-ouTLsv] archive [file ...]
make
compiling hat-stub.c
compiling triez.cc
cc1plus: warning: command line option "-Wdeclaration-after-statement" is valid for C/ObjC but not for C++
cc1plus: warning: command line option "-Wimplicit-function-declaration" is valid for C/ObjC but not for C++
linking shared-object triez.bundle
ld: library not found for -ltries
collect2: ld returned 1 exit status
make: *** [triez.bundle] Error 1
Gem files will remain installed in /Users/saito/.rvm/gems/ruby-1.9.3-p374/gems/triez-0.1 for inspection.
Results logged to /Users/saito/.rvm/gems/ruby-1.9.3-p374/gems/triez-0.1/ext/gem_make.out
那个 pstdint.h
是 hat-trie 带的,文件上头注释里写了这么长说这个是为了强大的可移植性,结果开门就引了个 windows 绝对没有的 signal.h
囧死...
gem ins triez
Fetching: triez-0.1.gem (100%)
Building native extensions. This could take a while...
Successfully installed triez-0.1
1 gem installed
[2] pry(main)> require "triez"
=> true
[3] pry(main)> t = Triez.new suffix: true
=> #<Triez:0x007f88cc936ff0>
[4] pry(main)> sequences = {
[4] pry(main)* 'ACTGAAAAAAACTG' => 1,
[4] pry(main)* 'ATACGGTCCA' => 2,
[4] pry(main)* 'GCTTGTACGT' => 3
[4] pry(main)* }
=> {"ACTGAAAAAAACTG"=>1, "ATACGGTCCA"=>2, "GCTTGTACGT"=>3}
[5] pry(main)> sequences.each do |seq, id|
[5] pry(main)* t[seq] = id
[5] pry(main)* end
=> {"ACTGAAAAAAACTG"=>1, "ATACGGTCCA"=>2, "GCTTGTACGT"=>3}
[6] pry(main)> t.search_with_prefix 'CGGT' do |_, id|
[6] pry(main)* puts id #=> 2
[6] pry(main)* end
2
话说这个方法 search_with_prefix
是不是应该叫 fuzzy_search
..
这不是 prefix_search 啊..
[7] pry(main)> t.search_with_prefix 'AC' do |_, id|
[7] pry(main)* puts id
[7] pry(main)* end
1
1
3
2
每个字符串不应该 search 两次?还是这才是正确的行为。
我这里:
$ uname -a
Linux John-ThinkPad-X220 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ gem install triez
Fetching: triez-0.2.gem (100%)
Building native extensions. This could take a while...
ERROR: Error installing triez:
ERROR: Failed to build gem native extension.
/home/john/.rvm/rubies/ruby-1.9.3-p327-falcon/bin/ruby extconf.rb
creating Makefile
gcc -O3 -std=c99 -Wall -pedantic -c -I.. ../hat-trie/murmurhash3.c ../hat-trie/hat-trie.c ../hat-trie/ahtable.c
ar -r libtries.a murmurhash3.o hat-trie.o ahtable.o
ar: 正在创建 libtries.a
make
compiling hat-stub.c
compiling triez.cc
cc1plus: 警告: command line option ‘-Wdeclaration-after-statement’ is valid for C/ObjC but not for C++ [默认启用]
cc1plus: 警告: command line option ‘-Wimplicit-function-declaration’ is valid for C/ObjC but not for C++ [默认启用]
linking shared-object triez.so
/usr/bin/ld: build/libtries.a(hat-trie.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
build/libtries.a: 无法读取符号: 错误的值
collect2: 错误: ld 返回 1
make: *** [triez.so] 错误 1
Gem files will remain installed in /home/john/.rvm/gems/ruby-1.9.3-p327-falcon/gems/triez-0.2 for inspection.
Results logged to /home/john/.rvm/gems/ruby-1.9.3-p327-falcon/gems/triez-0.2/ext/gem_make.out
好东西收藏 不过,我觉得,下面这个例子更好理解:
require 'triez'
words = %w[readme, rot, red, rah, rasterization]
t = Triez.new
words.each do |word|
t[word] = 1
end
t.search_with_prefix 're' do |word|
puts "candidate: #{word}"
end
输出: candidate: readme candidate: red
tt = Triez.new obj_value: true ArgumentError: Unknown options: [:obj_value], only [:value_type, :default] are allowed from /Users/wenleslie/.rvm/gems/ruby-1.9.3-p374/gems/triez-1.0/lib/triez.rb:27:in `initialize'
版本问题?