A Ruby regex snippet for matching Chinese characters

ruohanc · September 21, 2012 · Last reply by victorialice on March 19, 2016 · 18320 hits

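Spotted this on Twitter today:

"This regex, /[一-龠]+/, matches both Simplified and Traditional Chinese. I've verified it in at least Objective-C, JavaScript, and Ruby, though I don't know how many characters the range covers..." -- tweet by @chrisyipw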

ruby-1.9.3-p194

22 likes


hooopo #0 September 21, 2012

("一".."龥").to_a.size
=> 20902
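
The count above can be cross-checked from the code points of the endpoints. This is only a small sketch; the constant names are made up for illustration. It also shows how many characters the 一-龠 range from the tweet covers.

```ruby
# encoding: utf-8

# Code points of the range endpoints discussed in this thread.
FIRST    = "一".ord   # U+4E00
LAST_YU  = "龥".ord   # U+9FA5
LAST_YUE = "龠".ord   # U+9FA0

# Inclusive range sizes.
SIZE_TO_YU  = LAST_YU  - FIRST + 1   # 20902, same as ("一".."龥").to_a.size
SIZE_TO_YUE = LAST_YUE - FIRST + 1   # 20897, the range in the tweet

puts format("U+%04X..U+%04X: %d characters", FIRST, LAST_YU, SIZE_TO_YU)
puts format("U+%04X..U+%04X: %d characters", FIRST, LAST_YUE, SIZE_TO_YUE)
```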

1 likes

bhuztez #1 September 21, 2012

I think that claim is problematic. According to the Unicode documentation, CJK is split across several blocks:

http://www.unicode.org/versions/Unicode6.0.0/ch12.pdf

0x3400 ~ 0x4DBF
0x4E00 ~ 0x9FFF
0xF900 ~ 0xFAFF
0x20000 ~ 0x2A6DF
0x2A700 ~ 0x2B73F
0x2B740 ~ 0x2B81F
0x2F800 ~ 0x2FA1F

一-龠 only covers 4E00 ~ 9FA0, and 龥 only reaches 9FA5.
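
As a rough sketch of what covering every block listed above could look like on Ruby 1.9+, the ranges can be assembled into one character class. The names CJK_RANGES, CJK_ALL and NARROW are hypothetical, and the block list is simply the one quoted from the Unicode 6.0 chapter; later Unicode versions add more blocks.

```ruby
# encoding: utf-8

# Block boundaries as listed in the post above (Unicode 6.0, chapter 12).
CJK_RANGES = [
  [0x3400,  0x4DBF],
  [0x4E00,  0x9FFF],
  [0xF900,  0xFAFF],
  [0x20000, 0x2A6DF],
  [0x2A700, 0x2B73F],
  [0x2B740, 0x2B81F],
  [0x2F800, 0x2FA1F]
]

# Turn each pair into "lo-hi" using the actual UTF-8 characters,
# then wrap everything in a single character class.
class_body = CJK_RANGES.map { |lo, hi| "#{[lo].pack('U')}-#{[hi].pack('U')}" }.join
CJK_ALL = Regexp.new("[#{class_body}]+")

# The narrower range from the tweet, for comparison.
NARROW = /[一-龠]+/

ext_a = "\u{3400}"   # first code point of the 0x3400 block
puts((ext_a =~ NARROW).inspect)    # nil: outside 4E00 ~ 9FA0
puts((ext_a =~ CJK_ALL).inspect)   # 0: covered by the combined class
```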

2 likes

hooopo #2 September 21, 2012

#2 @bhuztez Yes, that's right.

hbin #3 September 21, 2012

So that's what a regex for matching Simplified and Traditional Chinese looks like.

luikore #4 September 23, 2012

To match Han characters, /\p{Han}+/u is all you need.
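
A minimal sketch of that pattern in use on Ruby 1.9+, with a made-up sample string (TEXT and HAN_RUNS are illustrative names only):

```ruby
# encoding: utf-8

TEXT = "Ruby 匹配中文 and 繁體字 123"

# Collect every run of Han characters, Simplified or Traditional.
HAN_RUNS = TEXT.scan(/\p{Han}+/u)
p HAN_RUNS   # ["匹配中文", "繁體字"]

# A digits-only string contains no Han characters, so there is no match.
p("12321313" =~ /\p{Han}/u)   # nil
```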

35 likes

ruohanc #5 September 23, 2012

#5 @luikore That's magical! Where did you find it? :thumbsup:

luikore #6 September 23, 2012

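#6 @ruohanc PHP, .NET, and Ruby all support Unicode character properties in regexes. Run ri Regexp on the command line and look at the Character Properties section. If it doesn't show up, run rvm docs generate first to build the Ruby documentation.

Some other commonly used ones:

/\p{Word}+/u matches word characters beyond a-z0-9 (that is, anything other than punctuation, tabs, spaces, and similar miscellaneous characters)
/[\p{Hiragana}\p{Katakana}]+/u matches hiragana plus katakana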
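
A quick sketch of both properties on Ruby 1.9+, using a made-up sample string. The kana pattern is written here as a character class that combines the two properties; SAMPLE, WORDS and KANA are illustrative names.

```ruby
# encoding: utf-8

SAMPLE = "こんにちは、カタカナ and 漢字_1!"

# \p{Word} covers letters, marks, digits and connector punctuation such as "_",
# so the ideographic comma, the spaces and "!" act as separators.
WORDS = SAMPLE.scan(/\p{Word}+/u)
p WORDS   # ["こんにちは", "カタカナ", "and", "漢字_1"]

# Hiragana and katakana runs only; kanji and Latin letters are skipped.
KANA = SAMPLE.scan(/[\p{Hiragana}\p{Katakana}]+/u)
p KANA    # ["こんにちは", "カタカナ"]
```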

9 likes

chrisyipw #7 March 02, 2013

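Oops, that tweet wasn't the whole story. The full version is here: http://chrisyip.im/post/regular-expression-for-cjk/

Ruby and some other languages can match a specific script directly with \p{Han} and the like, but in some languages, such as JavaScript, it can't be done that easily. I posted that tweet and wrote that article with the languages I actually use in mind.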

1 likes

praguepp #8 April 14, 2013

#7 @luikore A question: my input is UTF-8 encoded and contains two kinds of strings, "12321313" and "下载". On my system /\p{Han}+/u doesn't match at all. /[\u4e00-\u9fa5]/ does match, but it matches both kinds of strings, so I can't tell them apart.

if hash["serv_crc"] =~ /[\u4e00-\u9fa5]/
  line[3] = crc32(hash["serv_crc"])
  ATT::KeyLog::debug "serv_crc:#{hash["serv_crc"]} convert to crc 3:#{line[3]}"
end

luikore #9 April 14, 2013

#9 @praguepp Your Ruby version is 1.8, right? Either upgrade to 1.9/2.0, or use this:

/(
    \xe4[\xb8-\xbf][\x80-\xbf]
    |[\xe5-\xe8][\x80-\xbf][\x80-\xbf]
    |\xe9[\x80-\xbd][\x80-\xbf]
    |\xe9\xbe[\x80-\xa5]
)+/x

Regexes I wrote years ago that work on 1.8 for various encodings: https://gist.github.com/luikore/149493
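
For anyone who wants to try the byte-level pattern on a newer Ruby, here is a hedged sketch. It assumes Ruby 2.0+ (for String#b) and adds the /n flag so the regex is treated as binary. The helper name han_bytes? and the constant name are made up for illustration. The alternatives follow the UTF-8 encoding of U+4E00 ~ U+9FA5, from E4 B8 80 up to E9 BE A5.

```ruby
# encoding: utf-8

# The same alternatives as in the post above, compiled as a binary (/n) regex.
HAN_UTF8_BYTES = /(
    \xe4[\xb8-\xbf][\x80-\xbf]
    |[\xe5-\xe8][\x80-\xbf][\x80-\xbf]
    |\xe9[\x80-\xbd][\x80-\xbf]
    |\xe9\xbe[\x80-\xa5]
)+/nx

# Hypothetical helper: view the string as raw bytes, then match.
def han_bytes?(str)
  !(str.b =~ HAN_UTF8_BYTES).nil?
end

p han_bytes?("下载")       # true
p han_bytes?("12321313")   # false
```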

praguepp #10 April 14, 2013

#10 @luikore You're amazing, I have a lot to learn from you. It is 1.8.7. I tried it and got an invalid regular expression error:

test_create_mem_u_log(TestAreaCenterClientOperation):
ATT::Exceptions::LoadError: loading C:/operator/keywords/helper/area_center/area_center_client_operation.rb error:
C:/operator/keywords/helper/area_center/area_center_client_operation.rb:231: invalid regular expression: /(\xe4[xb8-\xbf][\x80-\xbf]|[\xe5-xe8][\x80-\xbf][\x80-\xbf]|\xe9[\x80-\xbd][\x80-\xbf]|\xe9\xbe[\x80-\xa5])+/
C:/operator/keywords/helper/area_center/area_center_client_operation.rb:237: invalid regular expression: /(\xe4[xb8-\xbf][\x80-\xbf]|[\xe5-xe8][\x80-\xbf][\x80-\xbf]|\xe9[\x80-\xbd][\x80-\xbf]|\xe9\xb e[\x80-\xa5])+/
C:/ATT_rake_server_ruby187/ruby/lib/ruby/gems/1.8/gems/att-1.1.0/lib/att/load_keyword.rb:76:in `require_file'

luikore #11 April 14, 2013

#11 @praguepp You wrote [\xe5-\xe8] as [\xe5-xe8]... a backslash is missing...

1 likes

praguepp #12 April 14, 2013

#12 @luikore I wasn't careful enough; I'll pay more attention next time. It works now, thanks.

xiaoronglv #13 October 11, 2014

#5 @luikore

What does the /u in /\p{Han}+/u mean? I've been reading the docs for ages and still don't get it.

luikore #14 October 12, 2014

#14 @xiaoronglv It means the regex itself is encoded as UTF-8.

/a/.encoding   # US-ASCII
/a/u.encoding # UTF-8
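
A slightly longer sketch of the same point, assuming Ruby 1.9+. Besides the reported encoding, /u also fixes the regex to UTF-8, which shows up when the subject string uses an incompatible encoding. The constant names and the EUC-JP sample bytes are illustrative.

```ruby
# encoding: utf-8

PLAIN = /a/
UTF8  = /a/u

p PLAIN.encoding          # #<Encoding:US-ASCII>
p UTF8.encoding           # #<Encoding:UTF-8>
p PLAIN.fixed_encoding?   # false: adapts to the string being matched
p UTF8.fixed_encoding?    # true: always treated as UTF-8

# A non-ASCII string tagged as EUC-JP is incompatible with a fixed UTF-8 regex.
EUC = "\xa1\xa2".dup.force_encoding("EUC-JP")

RESULT =
  begin
    UTF8 =~ EUC
    :matched_without_error
  rescue Encoding::CompatibilityError
    :incompatible
  end
p RESULT   # :incompatible
```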

1 likes

victorialice #15 March 19, 2016

So magical!
