Search Engines: Is there a good way to do full-text search with MongoDB + Rails?

richarddong · April 18, 2012 · Last reply by test1ok on August 30, 2014 · 15083 views

Looking for guidance from the experts!

My environment:

Ruby 1.9.3p125
Rails 3.1.3
MongoDB 2.0.2
Mongoid 2.4.7
Unicorn + Nginx

I've heard of Solr, Sphinx, Sunspot, Elasticsearch, and so on.

How are they related, and how do I use them?

Also, it seems that all of the above require Java?

14 likes


Rei #0 April 18, 2012

You can look at how ruby-china implements it; there is a GitHub link in the page footer.

huacnlee #1 April 19, 2012

Sphinx is hard to get working; Solr is feasible.

lgn21st #2 April 19, 2012

I recommend Sunspot; it is currently the best search/full-text-search solution for Rails: http://railscasts.com/episodes/278-search-with-sunspot Sunspot is a Ruby wrapper around Solr, while Sphinx and Elasticsearch are two separate, alternative solutions.

1 like

huacnlee #3 April 19, 2012

http://ruby-china.org/topics/305

visionwang #4 April 19, 2012

Sunspot is more polished; keep the other two as fallbacks.

Reply #5 has been deleted

andrew_qx #6 April 19, 2012

Which option to choose still depends on your specific requirements.

richarddong #7 April 19, 2012

Thanks, everyone above! @Rei @huacnlee @lgn21st @visionwang @aNdReW_Qx I'll dig into this for a while and come back to you with any new questions.

richarddong #8 April 22, 2012

@Rei @huacnlee @lgn21st @visionwang @aNdReW_Qx

As far as I understand so far, there seem to be two common approaches:

(search engine -> search server -> client wrapper for the search server -> client)

Lucene -> Solr -> Sunspot -> Ruby (fairly mature)
Lucene -> ElasticSearch -> Tire -> Ruby (newer, but with a smaller community)

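Comparing the two: Solr seems to be good at simple searches once the index has been built, and it is fast; but when the index is modified frequently, Solr performs poorly. ES is slightly slower than Solr for ordinary searches, but much better than Solr when the index changes frequently. This is the real-time issue that someone mentioned earlier.

Also, both approaches are built on Lucene, and Lucene is written in Java. Its indexing performance is much worse than that of Sphinx, which is written in C++, and its search performance is somewhat worse as well. When the amount of data is not especially large, though, Sphinx's advantage is probably not noticeable.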
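To make the layering described in this post concrete: Solr and Elasticsearch are servers that are queried over HTTP, and gems such as Sunspot and Tire are Ruby clients on top of them. Below is a minimal sketch, using only the Ruby standard library, of the kind of request a client might send to Elasticsearch. The host, port, the `movies` index name and the `build_search_request` helper are assumptions made for illustration; this is not Tire's API.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a search request for an
# Elasticsearch index. Host, port and index name are assumptions.
def build_search_request(index, text, host: 'localhost', port: 9200)
  uri = URI::HTTP.build(host: host, port: port, path: "/#{index}/_search")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  # query_string is one of Elasticsearch's query types; it parses the text
  # using Lucene query syntax.
  request.body = JSON.generate({ query: { query_string: { query: text } } })
  [uri, request]
end

uri, request = build_search_request('movies', 'title:matrix')
puts "#{request.method} #{uri}"
puts request.body

# Sending it requires a running server, so it is left commented out:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# puts response.body
```

A client gem mostly adds a friendlier DSL and model integration on top of requests like this one.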

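If I wanted to use Sphinx, could I write something on the Rails side to act as the source for Sphinx's xmlpipe, and so tie Sphinx together with Rails and MongoDB? Would integrating at that level cause performance problems?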
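As a rough illustration of the xmlpipe idea: Sphinx's xmlpipe2 source type reads an XML document stream from the standard output of a command (configured with `xmlpipe_command`). The sketch below writes such a stream from plain Ruby hashes. The `write_xmlpipe2` helper, the field names and the sample data are assumptions for illustration; in a real setup the documents would come from a Mongoid query, and each document needs a non-zero integer ID.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escapes the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Writes a Sphinx xmlpipe2 document stream to `io`.
# `docs` is any enumerable of hashes with an integer :id plus the text fields
# named in `fields`.
def write_xmlpipe2(io, docs, fields)
  io.puts '<?xml version="1.0" encoding="utf-8"?>'
  io.puts '<sphinx:docset>'
  io.puts '<sphinx:schema>'
  fields.each { |field| io.puts %(<sphinx:field name="#{field}"/>) }
  io.puts '</sphinx:schema>'
  docs.each do |doc|
    io.puts %(<sphinx:document id="#{Integer(doc[:id])}">)
    fields.each do |field|
      io.puts "<#{field}>#{xml_escape(doc[field.to_sym])}</#{field}>"
    end
    io.puts '</sphinx:document>'
  end
  io.puts '</sphinx:docset>'
end

# Hypothetical sample data standing in for a Mongoid query result.
movies = [
  { id: 1, title: 'Crouching Tiger, Hidden Dragon', summary: 'Swords & rooftops' },
  { id: 2, title: 'Hero', summary: 'A nameless warrior <recounts> his story' }
]
write_xmlpipe2($stdout, movies, %w[title summary])
```

Because the whole collection is streamed on every indexer run, the cost grows with the size of the collection; whether that is acceptable depends on data volume and on how often the index is rebuilt.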

Is there anything wrong with my understanding above?

4 likes

watgon #9 August 12, 2012

@richarddong Solr performs worse than Sphinx? Is there a comparison?

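huacnlee #10 August 13, 2012

I looked into Sphinx a while ago. The main problem with integrating it with MongoDB is that the current version of Sphinx does not support string primary keys, so MongoDB's ObjectId cannot be used as the document ID. In my experience, Sphinx is the fastest of the three when compared with Solr and Elasticsearch.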
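One possible workaround for the ObjectId problem, sketched here under stated assumptions rather than as something tested against Sphinx: derive a non-zero unsigned 64-bit integer from the ObjectId's hex string and use that as the document ID. The `sphinx_doc_id` helper is hypothetical. Since a 96-bit ObjectId cannot fit into 64 bits, the sketch hashes it, which means collisions are extremely unlikely but not impossible, and search results would still need to be mapped back to the original ObjectId (for example by storing the hex string alongside the document).

```ruby
require 'digest'

# Maps a 24-character hex ObjectId string to a non-zero unsigned 64-bit
# integer. Assumption: hashing is acceptable, i.e. a (very unlikely)
# collision between two documents is tolerated or checked for elsewhere.
def sphinx_doc_id(object_id_hex)
  unless object_id_hex =~ /\A\h{24}\z/
    raise ArgumentError, "not an ObjectId: #{object_id_hex.inspect}"
  end
  # Interpret the first 8 bytes of the MD5 digest as a big-endian integer.
  id = Digest::MD5.digest(object_id_hex).unpack('Q>').first
  id.zero? ? 1 : id # document IDs must be non-zero
end

puts sphinx_doc_id('4f8e1c2a9b3d5e0001000001')
```

A sequence counter stored in MongoDB would avoid collisions entirely, at the cost of an extra field and an extra write per document.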

evan #11 August 13, 2012

So with MySQL, which is better, Sphinx or Sunspot?

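huacnlee #12 August 13, 2012

Floor #12 @evan Sunspot and Elasticsearch are easy to install and both support real-time indexing. Sphinx 1.1 (Coreseek, the version that currently supports Chinese word segmentation) cannot do real-time indexing for now. Of these, Elasticsearch is the easiest to install, and Chinese word segmentation is built in.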

richarddong #13 August 13, 2012

@watgon @huacnlee FYI, when we tried ElasticSearch, it did not seem to work well with Mongoid's inheritance (similar to Single Table Inheritance).

huacnlee #14 August 13, 2012

@richarddong Yes, I also had to write quite a few fixes by hand.

richarddong #15 September 9, 2012

@huacnlee Could you share how you fixed it? Full-text search is giving us a real headache here. We are using Tire.

huacnlee #16 September 9, 2012

Floor #16 @richarddong

Add a find_in_batches method:

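# Intended as a class-level method on the Mongoid model (it is called as
# Movie.find_in_batches in the rake task). Yields the records batch by batch.
def find_in_batches(opts = {})
  batch_size = opts[:batch_size] || 1000
  start = opts[:start].to_i # nil.to_i is 0, so a missing :start begins at the first record
  objects = self.limit(batch_size).skip(start)
  while objects.any?
    yield objects
    start += batch_size
    # Rails.logger.debug("processed #{start} records") if Rails.logger.debug?
    # A short batch means the end of the collection has been reached
    break if objects.size < batch_size
    objects = self.limit(batch_size).skip(start)
  end
end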
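To see how this kind of limit/skip loop terminates without needing a database, here is a small self-contained sketch. `FakeCollection` is a hypothetical stand-in for a Mongoid criteria (it is not part of Mongoid), and the loop is a variant of the method above that loads each batch into an array once before yielding it.

```ruby
# Hypothetical stand-in for a Mongoid criteria: supports chained limit/skip.
class FakeCollection
  include Enumerable

  def initialize(records, limit = nil, skip = 0)
    @records = records
    @limit = limit
    @skip = skip
  end

  def limit(count)
    FakeCollection.new(@records, count, @skip)
  end

  def skip(count)
    FakeCollection.new(@records, @limit, count)
  end

  def each(&block)
    window = @records.drop(@skip)
    window = window.first(@limit) if @limit
    window.each(&block)
  end

  # Variant of the batching loop: each batch is materialized exactly once.
  def find_in_batches(opts = {})
    batch_size = opts[:batch_size] || 1000
    start = opts[:start].to_i
    loop do
      batch = limit(batch_size).skip(start).to_a
      break if batch.empty?
      yield batch
      # A short batch means there is nothing left to fetch.
      break if batch.size < batch_size
      start += batch_size
    end
  end
end

movies = FakeCollection.new((1..7).to_a)
movies.find_in_batches(:batch_size => 3) { |batch| p batch }
```

With seven records and a batch size of 3, the block receives batches of three, three and one record. Note that skip-based paging gets slower as the offset grows, so for very large collections paging on an indexed field may be preferable.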

Add a tire.rake task:

# coding: utf-8
require 'benchmark'
namespace :tire do
  desc 'Create Tire Index'
  task :create_settings => :environment do
    # Create the 'movies' index using the settings declared on the model
    if Tire.index('movies').create Movie.settings
      puts Tire.index('movies').settings
    else
      puts "false"
    end
  end

  desc 'Destroy Tire Index'
  task :drop_settings => :environment do
    puts Tire.index('movies').delete
  end

  desc 'Regenerate Tire Index for all models'
  task :update_index => :environment do
    # Walk the collection in batches so that not all records are loaded at once
    Movie.find_in_batches(:batch_size => 1000) do |movies|
      movies.each do |movie|
        movie.update_index
        puts "#{movie.id} [Indexed]"
      end
    end
  end
end

3 likes

richarddong #17 September 24, 2012

@huacnlee Thx~ I'll look into it.

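test1ok #18 August 30, 2014

We currently use Sunspot, with click-based boosting of results plus manual tuning to improve them. It is laborious and does not scale. Is there any search code that offers more configurable features? For example: with multiple related tags, searching for one should also match the others; a semantic (synonym) library; demoting results that were shown but not clicked; that kind of thing.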
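The features asked about here can be prototyped in application code, independently of the search server. The sketch below is only an illustration under assumptions: `TAG_GROUPS`, `expand_tags`, `smoothed_ctr` and `rerank` are hypothetical names, the result hashes stand in for whatever the search engine returns, and the prior values are arbitrary. It expands a tag to its related tags before querying, and re-ranks results with a smoothed click-through rate so that items shown often but rarely clicked are demoted.

```ruby
# Hypothetical groups of related tags: searching for one also matches the others.
TAG_GROUPS = [
  %w[film movie cinema],
  %w[tv television]
].freeze

def expand_tags(tags)
  tags.flat_map { |tag| TAG_GROUPS.find { |group| group.include?(tag) } || [tag] }.uniq
end

# Smoothed click-through rate: items with few impressions stay close to the
# prior, items shown often but rarely clicked drift below it.
def smoothed_ctr(clicks, impressions, prior_ctr: 0.1, prior_weight: 20)
  (clicks + prior_ctr * prior_weight) / (impressions + prior_weight).to_f
end

# Re-ranks results by multiplying the engine's relevance score by the
# smoothed CTR relative to the prior (a boost above or below 1.0).
def rerank(results, prior_ctr: 0.1)
  results.sort_by do |result|
    ctr = smoothed_ctr(result[:clicks], result[:impressions], prior_ctr: prior_ctr)
    -(result[:score] * ctr / prior_ctr)
  end
end

p expand_tags(%w[movie ruby])

results = [
  { id: 'a', score: 1.0, clicks: 0,  impressions: 200 },
  { id: 'b', score: 0.8, clicks: 50, impressions: 200 }
]
p rerank(results).map { |result| result[:id] }
```

Solr also ships a synonym filter that can handle the related-tags case at index or query time, which may scale better than expanding tags in Ruby.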


19 replies in total
