Ruby Sunspot Study Notes

easyhappy · September 11, 2014 · last reply by ronger on April 25, 2017 · 11279 views


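What is Sunspot?

Sunspot is a Ruby library for interacting with the Solr search engine. It is built on top of RSolr and provides a convenient DSL for building indexes and running searches.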

Using Sunspot

gem 'sunspot_rails'
gem 'sunspot_solr', github: 'xhj/sunspot', require: 'sunspot_solr'
gem 'progress_bar'

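Notes:

The sunspot_solr gem used here is a fork of sunspot_solr (packaged by @xhj6 on ruby-china). Its purpose is to bundle the mmseg4j 1.9.1 Chinese word segmentation plugin (described in detail below).
The progress_bar gem is added because the rake sunspot:solr:reindex task can leave the terminal waiting for a long time.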

Generate the default configuration

rails generate sunspot_rails:install

Running Sunspot

rake sunspot:solr:start # run in the background
rake sunspot:solr:run   # run in the foreground

Setting up the model

class User
  ...
  searchable do
    text :name
  end
  ...
end

Building the index

rake sunspot:solr:reindex

In the console

s = User.search do
  fulltext '张小三'
end

puts s.results

Chinese word segmentation

Before using mmseg

s = Section.search do
  fulltext '张三'
end

puts s.results  # empty: no results are returned

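Change the configuration so that both indexing and querying use the mmseg4j segmentation algorithm

In wheel-admin/solr/conf/schema.xml, around line 62, change the field type definition to the following: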

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="mmseg4j_dict"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="mmseg4j_dict"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldType>

Rebuild the index

rake sunspot:solr:reindex

In the wheel-web console

 s = Section.search do
   fulltext '张三'
 end

 puts s.results.map(&:title)
 ["张小三",
...]

Note: by default, results are sorted by relevance score. For example:

puts s.hits[0].score

How mmseg works

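The mmseg algorithm has two segmentation methods, simple and complex, both based on forward maximum matching. The complex method adds four filtering rules to resolve ambiguity. According to the official description, the word identification accuracy reaches 98.41%. mmseg4j implements both methods.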
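
To make forward maximum matching concrete, here is a minimal sketch of the simple strategy only. The toy dictionary, the constant names, and the forward_max_match helper are made up for illustration and are not part of mmseg4j; the four rules of the complex method are not implemented.

```ruby
require 'set'

# Hypothetical toy dictionary; mmseg4j ships a much larger one.
DICTIONARY = Set.new(%w[中国 人民 银行 中国人 人民银行]).freeze
MAX_WORD_LENGTH = DICTIONARY.map(&:length).max

# Forward maximum matching: at each position, take the longest dictionary
# word that starts there; fall back to a single character if none matches.
def forward_max_match(text, dictionary = DICTIONARY, max_len = MAX_WORD_LENGTH)
  tokens = []
  pos = 0
  while pos < text.length
    len = [max_len, text.length - pos].min
    # Shrink the candidate until it is a dictionary word or one character.
    len -= 1 while len > 1 && !dictionary.include?(text[pos, len])
    tokens << text[pos, len]
    pos += len
  end
  tokens
end
```

With this toy dictionary, forward_max_match('中国人民银行') greedily takes 中国人 first and cannot recover, giving 中国人 | 民 | 银行. Ambiguity of this kind is what the extra rules in the complex method are meant to resolve.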

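In version 1.5, the simple algorithm segments at about 1100 KB/s and the complex algorithm at about 700 KB/s (test machine: AMD Athlon 64 2800+, 1 GB RAM, Windows XP).
Version 1.6 added max-word segmentation on top of complex. "很好听" -> "很好 | 好听"; "中华人民共和国" -> "中华 | 华人 | 共和 | 国"; "中国人民银行" -> "中国 | 人民 | 银行".
In version 1.7-beta, complex runs at about 1200 KB/s and simple at about 1900 KB/s, but memory overhead is about 50 MB. The earlier versions all used about 10 MB.
Since version 1.8, a CutLetterDigitFilter has been added, which splits tokens that mix letters and digits. For example, mb991ch is split into "mb 991 ch".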
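
The letter/digit splitting described for version 1.8 can be sketched in a few lines of Ruby. This is only an illustration of the idea, not mmseg4j's implementation; the cut_letter_digit helper name is an assumption.

```ruby
# Split a token into alternating runs of ASCII letters and digits.
# Characters that are neither letters nor digits are dropped.
def cut_letter_digit(token)
  token.scan(/[a-zA-Z]+|\d+/)
end
```

For example, cut_letter_digit('mb991ch') returns ["mb", "991", "ch"], matching the split described above.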

We currently use mmseg4j 1.9.

Sunspot configuration parameters in detail

auto_commit_after_request?: commit after every HTTP request; defaults to true
data_path
hostname
log_file
log_level
path
port
solr_home (this parameter may be problematic: every time it was set, rake sunspot:solr:reindex failed with a 500 error)

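How real-time is Sunspot?

By default, a commit is issued after every HTTP request, so changes become searchable almost immediately and real-time behavior is quite good.

Deployment

Edit config/sunspot.yml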

production:
  solr:
    hostname: localhost
    port: 8983
    log_level: WARNING
    path: /solr/default

Note: use path: /solr/default instead of path: /solr/production.
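
As a rough sketch of how these settings fit together, hostname, port and path combine into the URL of the Solr core that the application talks to. The solr_url helper and the inline YAML string below are assumptions for illustration; sunspot_rails has its own configuration loader.

```ruby
require 'yaml'

# Same production settings as in config/sunspot.yml above.
SUNSPOT_YAML = <<~YAML
  production:
    solr:
      hostname: localhost
      port: 8983
      log_level: WARNING
      path: /solr/default
YAML

# Hypothetical helper: build the Solr URL for one environment.
def solr_url(config_text, env)
  solr = YAML.safe_load(config_text).fetch(env).fetch('solr')
  "http://#{solr.fetch('hostname')}:#{solr.fetch('port')}#{solr.fetch('path')}"
end
```

Calling solr_url(SUNSPOT_YAML, 'production') gives "http://localhost:8983/solr/default", which is a convenient address to check in a browser when a reindex fails.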

Solr can be deployed with Capistrano. The relevant script is as follows:

after "deploy:update_code", "solr:symlink"

namespace :solr do
  desc "start solr"
  task :start, :roles => :app, :except => { :no_release => true } do 
    run "cd #{application_path} && RAILS_ENV=#{rails_env} bundle exec rake sunspot:solr:start"
  end
  desc "stop solr"
  task :stop, :roles => :app, :except => { :no_release => true } do 
    run "cd #{application_path} && RAILS_ENV=#{rails_env} bundle exec rake sunspot:solr:stop"
  end
  desc "reindex the whole database"
  task :reindex, :roles => :app do
    run "cd #{application_path} && RAILS_ENV=#{rails_env} bundle exec rake sunspot:solr:reindex"
  end
  desc "Symlink in-progress deployment to a shared Solr index"
  task :symlink, :except => { :no_release => true } do
    # Create the directories Solr needs
    run "cd #{deploy_to} && mkdir -p #{shared_path}/solr/data"
    run "cd #{deploy_to} && mkdir -p #{shared_path}/solr/pids"

    run "ln -s #{shared_path}/solr/data/ #{release_path}/solr/data"
    run "ln -s #{shared_path}/solr/pids/ #{release_path}/solr/pids"
  end
end

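Comparison of Solr, Sphinx, ElasticSearch and other search engines

Solr vs. ElasticSearch:

Solr and ElasticSearch are both built on Lucene, and both support real-time index updates fairly easily.
Solr does not support Chinese word segmentation out of the box, but this can be solved by adding a segmentation plugin (for example mmseg4j, mentioned above, or IKAnalyzer).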
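ElasticSearch is not built on top of Solr. It is a separate engine built directly on Lucene and designed to be distributed, so running it as a cluster is straightforward, and it ships with its own analyzers for tokenization.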

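Solr vs. Sphinx:

Sphinx's strengths are fast indexing and fast searching; its weaknesses are relatively poor real-time support and a somewhat weaker query syntax.
Solr supports facets by default, whereas Sphinx needs some extra work to get the same result.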

Example:

s = Sunspot.search(Post) do
  with(:blog_id, 1)
  facet(:category_ids)
end

# facet asks Solr to include category_ids information for the posts whose blog_id is 1
puts s.results[0].category_ids
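
To clarify what a facet computes, here is a plain-Ruby sketch that counts category ids among the posts of one blog. It does not call Solr or Sunspot; SamplePost, SAMPLE_POSTS and facet_counts are hypothetical names with made-up data.

```ruby
# Hypothetical in-memory stand-in for indexed posts.
SamplePost = Struct.new(:blog_id, :category_ids)

SAMPLE_POSTS = [
  SamplePost.new(1, [10, 20]),
  SamplePost.new(1, [20]),
  SamplePost.new(2, [10])
].freeze

# Count how many posts of the given blog fall under each category id.
def facet_counts(posts, blog_id)
  posts.select { |post| post.blog_id == blog_id }
       .flat_map(&:category_ids)
       .each_with_object(Hash.new(0)) { |id, counts| counts[id] += 1 }
end
```

Here facet_counts(SAMPLE_POSTS, 1) returns {10=>1, 20=>2}. In Sunspot itself, the equivalent counts are exposed on the search object, for example through s.facet(:category_ids).rows, where each row has a value and a count.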