Gem: I wrote a Chinese word segmentation gem, nlpir

wujian_hit · 2013-07-29 · last reply by Martin91 on 2014-05-13 · 12238 views

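The nlpir gem wraps ICTCLAS2014, the Chinese word segmenter from the Chinese Academy of Sciences.

ICTCLAS is developed by the Institute of Computing Technology, Chinese Academy of Sciences. Its features include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word detection, and it also supports user dictionaries. It is arguably the best-performing Chinese word segmenter available today.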

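nlpir is wrapped with the Fiddle module from Ruby 2.0. The latest version, 1.0.0, wraps the latest ICTCLAS2014 release and fully supports Linux x86 and x64. The user dictionary bug reported earlier has been fixed. The new version of the gem also provides Ruby-style function names.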
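For readers unfamiliar with Fiddle, here is a minimal sketch of how it binds a C function. It uses libc's strlen purely as an illustration, since the NLPIR shared library is not assumed to be installed; it is not taken from the gem's source, and it assumes a Unix-like system where libc symbols are visible to the Ruby process.

```ruby
require 'fiddle'

# Illustration only: bind libc's strlen, not an NLPIR function.
# Fiddle::Handle::DEFAULT looks up symbols in the libraries already
# loaded into the Ruby process (assumes a Unix-like system).
strlen = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT['strlen'],
  [Fiddle::TYPE_VOIDP],   # argument: const char *
  Fiddle::TYPE_SIZE_T     # return value: size_t
)

length = strlen.call("hello")
puts length
```

A wrapper gem would presumably open its own .so with Fiddle.dlopen and bind each exported function in the same way.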

gem install nlpir

usage:


require 'nlpir'
include Nlpir

s = "坚定不移沿着中国特色社会主义道路前进  为全面建成小康社会而奋斗"

# Initialize, using the directory of the current file as the path
nlpir_init(File.expand_path("../", __FILE__), UTF8_CODE)

# Process a string
text_proc(s)

# Process a file
file_proc("example.txt", "result.txt")

# Import a user dictionary
import_userdict("./userdict.txt")
text_proc(s)

nlpir_exit()

Everyone is welcome to keep enriching Ruby's packages for natural language processing.

For usage, including the usage shown in the tests, see: [email protected] nlpir@github

Some people asked about performance. The official documentation actually includes test data. ICTCLAS test results:

29 likes


discover #0 2013-07-29

Nice work. Which lab are you in, senior?

wujian_hit #1 2013-07-29

#1 @discover I enrolled in 2010.

discover #2 2013-07-29

#2 @wujian_hit ...Then I'm the senior here. Keep it up, junior.

2 likes

wujian_hit #3 2013-07-29

#3 @discover Hello, senior!

blackanger #4 2013-07-29

Good Job

kamiiyu #5 2013-07-29

That example has quite a subtext.

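huacnlee #6 2013-07-29

I looked at the usage in the README and it seems quite complicated. Can't it be made simpler? Also:

NLPIR_AddUserWord
NLPIR_ParagraphProcess
NLPIR_ImportUserDict
...

These method names are not Ruby-like at all, and they feel odd to use.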

wujian_hit #7 2013-07-29

#7 @huacnlee They are named this way to stay consistent with the official documentation.

1 like

wujian_hit #8 2013-07-29

#6 @kamiiyu Hehe. Let's keep it low-key.

kai1248 #9 2013-07-29

Impressive! Nice work.

luikore #10 2013-07-29

Wow, ICTCLAS can actually be used on non-Windows systems now...

luikore #11 2013-07-29

Are the results still about the same as mecab trained on Chinese data?

1 like

wujian_hit #12 2013-07-29

#12 @luikore It is ICTCLAS2013, a .so compiled from pure C.

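wujian_hit #13 2013-07-29

#7 @huacnlee Call init first, then call whichever methods you want, and finally call exit to release resources. I originally wanted to wrap all of this inside the functions, but that could cost flexibility and performance, so in the end I decided it is better to let callers initialize and release resources themselves. It can't be helped, that's just how C is.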
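One possible middle ground, sketched here as an assumption rather than anything the gem provides, is a block helper that initializes once, yields for any number of calls, and always releases resources afterwards. StubNlpir and with_nlpir are hypothetical names; only nlpir_init, text_proc, and nlpir_exit come from the post, and their bodies are fake so the sketch runs without the native library.

```ruby
# Hypothetical stand-in for the gem so this sketch runs anywhere.
# The bodies are fake and merely record the order of calls.
module StubNlpir
  CALLS = []

  def self.nlpir_init(data_dir)
    CALLS << :init
  end

  def self.text_proc(text)
    CALLS << :text_proc
    text.split.join(" / ") # placeholder, not real segmentation
  end

  def self.nlpir_exit
    CALLS << :exit
  end
end

# Hypothetical helper: initialize, yield to the caller's block, and
# release resources even if the block raises.
def with_nlpir(data_dir)
  StubNlpir.nlpir_init(data_dir)
  yield StubNlpir
ensure
  StubNlpir.nlpir_exit
end

result = with_nlpir(".") { |nlp| nlp.text_proc("hello world") }
puts result
```

Because initialization happens once per block rather than once per call, callers can still batch many text_proc calls inside a single block.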

u1371780084 #14 2013-07-29

ICTCLAS2013 is a pretty good segmenter. I wonder how efficient it is.

wujian_hit #15 2013-07-29

#15 @u1371780084 Segmentation speed on a single machine is 996 KB/s, and segmentation accuracy is 98.45%.

u1371780084 #16 2013-07-29

What about memory usage?

wujian_hit #17 2013-07-29

#17 @u1371780084 I have updated the post. You can find the performance figures you asked about there.

ganweiliang #18 2013-07-29

Keep it up.

cisolarix #19 2013-07-29

#6 @kamiiyu What subtext? I don't see it. Please enlighten me.

scott #20 2013-07-30

Nice, I'm going to try it out.

u1360749170 #21 2013-07-31

The first thing I saw when I opened it was this example: require 'nlpir' include Nlpir

s = "坚定不移沿着中国特色社会主义道路前进为全面建成小康社会而奋斗"

wujian_hit #22 2013-07-31

#22 @u1360749170 Across the land and throughout the universe, it is surely our Party that reigns supreme~

yujing_z #23 2013-08-01

#8 @wujian_hit You could add some aliases.
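A minimal sketch of what such aliases could look like, using alias_method on a stand-in module. AliasDemo is a hypothetical name and its method bodies are placeholders; the NLPIR_* and snake_case names appear in this thread, but pairing text_proc with NLPIR_ParagraphProcess is an assumption, not something confirmed from the gem's source.

```ruby
# Hypothetical stand-in module so the sketch runs without the gem.
module AliasDemo
  # Official-style names, as listed in the thread (placeholder bodies).
  def NLPIR_ParagraphProcess(text)
    "processed: #{text}"
  end

  def NLPIR_ImportUserDict(path)
    "imported: #{path}"
  end

  # Ruby-style names layered on top; the original names keep working.
  # The pairing below is assumed for illustration.
  alias_method :text_proc, :NLPIR_ParagraphProcess
  alias_method :import_userdict, :NLPIR_ImportUserDict
end

include AliasDemo

puts text_proc("abc")
puts NLPIR_ParagraphProcess("abc")
```

Keeping both names means code written against the official documentation and code written in Ruby style can coexist.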

wujian_hit #24 2013-08-01

#24 @Yujing_Z Good idea!

staticor #25 2013-08-03

Does it support installation via brew now?

wujian_hit #26 2013-08-03

#26 @staticor I don't have a Mac, but brew should be the same kind of tool as Ubuntu's apt-get, used for installing software. This is just a gem, so gem install is all you need.

bhuztez #27 2013-08-03

I just don't like HHMM. Don't ask me why.

bhuztez #28 2013-08-03

#11 @luikore That has been possible for a very long time, though only on the Java platform.

https://code.google.com/p/ictclas4j/

boostbob #29 2013-08-06

Can it be used with tire and elasticsearch?

CHAO_AERO #30 2013-08-07

Nice, very well written.

wujian_hit #31 2013-08-09

#30 @boostbob May I ask whether elasticsearch requires a segmentation module to implement some specific interface?

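kenshin54 #32 2013-08-09

#32 @wujian_hit Typically you write a Tokenizer by extending the abstract class org.apache.lucene.analysis.Tokenizer. Then you write a factory class that extends org.elasticsearch.index.analysis.AbstractTokenizerFactory; it can receive specific parameters through elasticsearch's index settings and pass them on to your Tokenizer. Finally, you write a class that extends AnalysisModule.AnalysisBinderProcessor to register your TokenizerFactory when elasticsearch starts. That is the basic flow... the mighty Java.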

whh #33 2013-08-16

Can this recognize brand keywords? For example, car brands such as 宝来 and 腾翼. I tried it on the official site and it didn't seem to work.

wujian_hit #34 2013-08-16

#34 @whh Recognizing terms like these requires a dedicated industry-specific dictionary.

changwu #35 2013-11-18

/home/changwu/.rvm/gems/ruby-2.0.0-p0/gems/nlpir-0.0.3-x86-linux/lib/nlpir.rb:177:in `to_s': NULL pointer given (ArgumentError)
        from /home/changwu/.rvm/gems/ruby-2.0.0-p0/gems/nlpir-0.0.3-x86-linux/lib/nlpir.rb:177:in `NLPIR_ParagraphProcess'
        from split_words.rb:11:in `<main>'

I ran into this error and don't know what is causing it. #35 @wujian_hit

The code is as follows:

# encoding: utf-8
require 'nlpir'
include Nlpir

s = "坚定不移沿着中国特色社会主义道路前进  为全面建成小康社会而奋斗"
# First of all, call the NLPIR API NLPIR_Init

NLPIR_Init(nil, UTF8_CODE, File.expand_path("../", __FILE__))

# Example 1: process a paragraph and return the result text, with or without POS tags
puts NLPIR_ParagraphProcess("1989年春夏之交的政治风波1989年政治风波24小时降雪量24小时降雨量863计划ABC防护训练APEC会议BB机BP机C2系统C3I系统C3系统C4ISR系统C4I系统CCITT建议")

bao1018 #36 2013-11-28

@wujian_hit When I tried to install it with Ruby 2.0, I got the error below. Any suggestions? Could not find nlpir-0.0.3 in any of the sources

wujian_hit #37 2013-12-04

#36 @changwu Hello, this happens because the package's license period has expired. I have pushed the latest gem-x86-linux-0.1.0.

wujian_hit #38 2013-12-04

#37 @bao1018 It only supports the x86 platform on Windows and Linux, because NLPIR2013 for x64 has many serious bugs. The good news is that the official NLPIR team will release a polished 2014 version in Dec. 2014, and I will release a bug-free gem in time. Thank you for your support.

wujian_hit #39 2013-12-04

#36 @changwu The push is failing because of network problems... Please wait a bit.

martin91 #40 2014-05-12

Does this gem still not support x64 systems? Installing on a Mac gives:

gem install nlpir
ERROR:  Could not find a valid gem 'nlpir' (>= 0), here is why:
          Found nlpir (1.0.0), but was for platforms x86_64-linux ,x86-linux ,x86-mingw32

I could cry. My graduation project will probably need word segmentation soon.

wujian_hit #41 2014-05-13

#41 @Martin91 It supports 64-bit Linux and Windows but not Mac, because the official team has not compiled it into a *.dylib for Mac.

martin91 #42 2014-05-13

#42 @wujian_hit Got it, thanks for the answer.

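ouyang mentioned this topic in "Machine learning projects in Ruby". 12-08 19:41

wujian_hit mentioned this topic in "Chinese word segmentation gem: nlpir for win released". 04-03 10:57

wujian_hit mentioned this topic in "Chinese word segmentation gem nlpir is now in sync with ICTCLAS2014; all bugs have been fixed." 04-03 10:57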
