新手问题 在 Rails 中使用 PostgreSQL 的全文搜索功能搜索百万条记录的表的正确姿势

baurine · 2019年02月26日 · 最后由 baurine 回复于 2019年02月27日 · 1980 次阅读

 请教大家一个问题,我现在的 rails 项目中,使用的是 PostgreSQL,有一张表,有一百万条记录,假设表名叫 movies,有 title 和 description 两列。现在我使用全文搜索功能,通过 title 或 description 搜索得到相应的记录。

第一步我使用 pg_search (https://github.com/Casecommons/pg_search) 这个 gem,给 Movie model 加上相应的 pg_search_scope:

include PgSearch
 pg_search_scope :search_by_title,
                 against: :title,
                 using: {
                   tsearch: {
                     prefix: true,
                   }
                 }
 pg_search_scope :search_by_desc,
                 against: :description,
                 using: {
                   tsearch: {
                     prefix: true,
                   }
                 }

使用 Movie.search_by_title(search_value) 或 Movie.search_by_desc(search_value),需要几十秒甚至数分钟钟才能得到结果,如下图所示:

我明白这是由于没有建立索引的缘故,于是我按照 pg_search wiki 上的指南 (https://github.com/Casecommons/pg_search/wiki/Building-indexes),为这两列建立全文搜索的列及索引。

def change
  add_column :movies, :tsv_title, :tsvector
  add_index  :movies, :tsv_title, using: 'gin'

  say_with_time("Adding trigger function on movies for updating tsv_title column") do
    sql = <<-MIGRATION
      CREATE TRIGGER tsv_for_ep_title BEFORE INSERT OR UPDATE
      ON movies FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(tsv_title, 'pg_catalog.simple', title);
    MIGRATION
    execute(sql)
  end

  add_column :movies, :tsv_description, :tsvector
  add_index  :movies, :tsv_description, using: 'gin'

  say_with_time("Adding trigger function on movies for updating tsv_description column") do
    sql = <<-MIGRATION
      CREATE TRIGGER tsv_for_ep_description BEFORE INSERT OR UPDATE
      ON movies FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(tsv_description, 'pg_catalog.simple', description);
    MIGRATION
    execute(sql)
  end
end

修改 pg_search_scope:

include PgSearch
pg_search_scope :search_by_title,
                against: :title,
                using: {
                  tsearch: {
                    prefix: true,
                    tsvector_column: 'tsv_title'
                  }
                }
pg_search_scope :search_by_desc,
                against: :description,
                using: {
                  tsearch: {
                    prefix: true,
                    tsvector_column: 'tsv_description'
                  }
                }

如此操作之后,搜索速度大为改进,一般情况下几秒钟可以返回结果,好的时候几百毫秒可以返回结果。如下图所示:

那么问题是什么呢,现在我要同时通过 title 和 description 搜索,按照 pg_search 的文档,pg_search_scope 是这么写的:

pg_search_scope :search_by_title_desc,
                against: [:title, :description],
                using: {
                  tsearch: {
                    prefix: true,
                    tsvector_column: %w(tsv_title tsv_description)
                  }
                }

特别之处在于 tsvector_column 时要声明 tsv_title 和 tsv_description 两列。

同时通过 title 和 description 搜索后,搜索速度又骤降,需要几十秒才能返回结果。如下图所示。

所以请求一下大家,针对这种百万级别的表,对多列同时进行全文搜索的正确方法是什么?预期是希望在数秒内得到结果。

在上例中,我是需要同时对这两列创建一个新的 tsvector 的列并加索引吗?

谢谢!

你要把 sql log 和 explain 发上来,web 请求的截图没用啊

你的需求其实直接搜 title || desc就可以了

@hooopo 谢谢!以下是 explain: (实际项目中表是 episodes)

  • 记录总数:774825
[1] pry(main)> Episode.count
   (307.6ms)  SELECT COUNT(*) FROM "episodes"
=> 774825
  • seach_by_title
[1] pry(main)> Episode.search_by_title('adventure').explain
  Episode Load (85.9ms)  SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_title"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_title") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
=> EXPLAIN for: SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_title"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_title") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Sort  (cost=4050.16..4051.01 rows=341 width=1513)
   Sort Key: (ts_rank(episodes_1.tsv_title, '''adventure'':*'::tsquery, 2)) DESC, episodes.id
   ->  Nested Loop  (cost=34.96..4035.81 rows=341 width=1513)
         ->  Bitmap Heap Scan on episodes episodes_1  (cost=34.54..1306.26 rows=327 width=103)
               Recheck Cond: (tsv_title @@ '''adventure'':*'::tsquery)
               ->  Bitmap Index Scan on index_episodes_on_tsv_title  (cost=0.00..34.45 rows=327 width=0)
                     Index Cond: (tsv_title @@ '''adventure'':*'::tsquery)
         ->  Index Scan using episodes_pkey on episodes  (cost=0.42..8.34 rows=1 width=1509)
               Index Cond: (id = episodes_1.id)
(9 rows)
  • search_by_desc
[2] pry(main)> Episode.search_by_desc('adventure').explain
  Episode Load (2663.1ms)  SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_description"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_description") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
=> EXPLAIN for: SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_description"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_description") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=24792.34..24997.50 rows=1784 width=1513)
   Workers Planned: 1
   ->  Sort  (cost=23792.33..23796.79 rows=1784 width=1513)
         Sort Key: (ts_rank(episodes_1.tsv_description, '''adventure'':*'::tsquery, 2)) DESC, episodes.id
         ->  Nested Loop  (cost=78.94..23695.99 rows=1784 width=1513)
               ->  Parallel Bitmap Heap Scan on episodes episodes_1  (cost=78.51..10578.73 rows=1709 width=201)
                     Recheck Cond: (tsv_description @@ '''adventure'':*'::tsquery)
                     ->  Bitmap Index Scan on index_episodes_on_tsv_description  (cost=0.00..77.79 rows=2905 width=0)
                           Index Cond: (tsv_description @@ '''adventure'':*'::tsquery)
               ->  Index Scan using episodes_pkey on episodes  (cost=0.42..7.67 rows=1 width=1509)
                     Index Cond: (id = episodes_1.id)
(11 rows)
  • search_by_title_desc
[3] pry(main)> Episode.search_by_title_desc('adventure').explain
  Episode Load (10290.7ms)  SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_title" || "episodes"."tsv_description"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_title" || "episodes"."tsv_description") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
=> EXPLAIN for: SELECT "episodes".* FROM "episodes" INNER JOIN (SELECT "episodes"."id" AS pg_search_id, (ts_rank(("episodes"."tsv_title" || "episodes"."tsv_description"), (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*')), 2)) AS rank FROM "episodes" WHERE ((("episodes"."tsv_title" || "episodes"."tsv_description") @@ (to_tsquery('simple', ''' ' || 'adventure' || ' ''' || ':*'))))) AS pg_search_a8ace1a76c218f36a59a58 ON "episodes"."id" = pg_search_a8ace1a76c218f36a59a58.pg_search_id ORDER BY pg_search_a8ace1a76c218f36a59a58.rank DESC, "episodes"."id" ASC
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=242738.45..244331.76 rows=13656 width=1513)
   Workers Planned: 2
   ->  Sort  (cost=241738.43..241755.50 rows=6828 width=1513)
         Sort Key: (ts_rank((episodes_1.tsv_title || episodes_1.tsv_description), '''adventure'':*'::tsquery, 2)) DESC, episodes.id
         ->  Nested Loop  (cost=0.42..236799.08 rows=6828 width=1513)
               ->  Parallel Seq Scan on episodes episodes_1  (cost=0.00..198085.56 rows=6542 width=300)
                     Filter: ((tsv_title || tsv_description) @@ '''adventure'':*'::tsquery)
               ->  Index Scan using episodes_pkey on episodes  (cost=0.42..5.91 rows=1 width=1509)
                     Index Cond: (id = episodes_1.id)
(9 rows)
baurine 回复

后面两个没利用到 idx 啊

在上例中,我是需要同时对这两列创建一个新的 tsvector 的列并加索引吗?

你的过滤条件是什么,就在上面加表达式索引就好了,其实没必要加一个 tsv 列,还需要去同步。

继续使用 pg_search 的话,其实你完全可以冗余一个列叫 title_and_desc,然后和单字段一样的做法。

@hooopo 好的,谢谢了,我也看到了你发到一篇关于 PostgreSQL 全文搜索的贴子 (https://ruby-china.org/topics/38153),我再研究一下。

需要 登录 后方可回复, 如果你还没有账号请 注册新账号