Rails Rails 性能优化 - 简易模糊查找

lanzhiheng · 2022年03月10日 · 最后由 hellorails 回复于 2023年03月05日 · 767 次阅读

无论是电商,博客系统还是 Sass 似乎都少不了搜索的功能。这篇文章简单来谈谈模糊查找在 Rails 中的实现,可能篇幅会稍微有一点点长。原文连接:https://lanzhiheng.com/posts/easy-fuzzy-searching-in-rails

模糊查找

一般提到 Rails 的搜索都是指模糊查找 (Fuzzy Search),实现方式也比较简单。

a. 原生的做法

有比较原生的做法,就是直接编写数据库查询语句

> Post.where("title LIKE '%分页%'").pluck('title')
   (0.8ms)  SELECT "posts"."title" FROM "posts" WHERE (title LIKE '%分页%')
=> ["谈谈Rails中的分页-简易版", "谈谈Rails中的分页-进阶版"]

主要是通过手写LIKE语句来实现模糊查找。

b. 插件

如果不想自己写数据库LIKE语句,其实也可以通过插件来实现。Rails 生态的 Gem 提供了一个名为Ransack的 Gem。能够很方便地做到这一点

> Post.ransack(title_cont: '分页').result.pluck('title')
   (2.2ms)  SELECT "posts"."title" FROM "posts" WHERE "posts"."title" ILIKE '%分页%'
=> ["谈谈Rails中的分页-简易版", "谈谈Rails中的分页-进阶版"]

也可以对多个字段进行组合

> Post.ransack(title_or_excerpt_cont: '分页').result.pluck('title')
   (1.1ms)  SELECT "posts"."title" FROM "posts" WHERE ("posts"."title" ILIKE '%分页%' OR "posts"."excerpt" ILIKE '%分页%')
=> ["谈谈Rails中的分页-简易版", "谈谈Rails中的分页-进阶版"]

这两种做法其实只是实现方式不同,效果是一样的。具体采用哪种还是看自己习惯吧。不过一旦开始做数据库方面的优化,总是难免要手写 SQL 语句

pg_trgm 插件

当数据量比较大,要从这大量数据中搜索出想要的结果,那可能会有点问题,我首先在本地环境构建一些数据。

> Post.count
   (38.6ms)  SELECT COUNT(*) FROM "posts"
=> 723457

如今posts表的数据量是 70W 条左右(其中绝大部分的标题都包含Rails)。我想从这大量数据里面查找出posts.title包含ActiveSupport的记录。那么在我这台 M1 的笔记本上搜索时间大概是

> Post.where("title ~ 'ActiveSupport'")
  Post Load (206.4ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'ActiveSupport') LIMIT $1  [["LIMIT", 11]]

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ 'ActiveSupport'").to_sql

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m }
   (204.3ms)  EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE (title ~ 'ActiveSupport')
{"QUERY PLAN"=>"Gather  (cost=1000.00..39892.68 rows=1 width=131) (actual time=0.704..203.531 rows=2 loops=1)"}
{"QUERY PLAN"=>"  Workers Planned: 2"}
{"QUERY PLAN"=>"  Workers Launched: 2"}
{"QUERY PLAN"=>"  ->  Parallel Seq Scan on posts  (cost=0.00..38892.58 rows=1 width=131) (actual time=124.158..189.914 rows=1 loops=3)"}
{"QUERY PLAN"=>"        Filter: ((title)::text ~ 'ActiveSupport'::text)"}
{"QUERY PLAN"=>"        Rows Removed by Filter: 241152"}
{"QUERY PLAN"=>"Planning Time: 0.195 ms"}
{"QUERY PLAN"=>"Execution Time: 203.560 ms"}

M1 速度算是比较快的了,也需要 200ms 左右。加个 Gin 索引确实可以优化一下这种现象,这里还要借助pg_trgm

class AddIndexForPostTitle < ActiveRecord::Migration[6.0]
  def change
    add_index :posts, :title, using: 'gin', opclass: :gin_trgm_ops
  end
end

一般一开始运行这个脚本会报错。

PG::UndefinedObject: ERROR:  operator class "gin_trgm_ops" does not exist for access method "gin"

这里表示我们的数据库缺少pg_trgm插件,简单登录一下对应的数据库,并运行这条命令

=# CREATE EXTENSION pg_trgm;

然后再通过migration过程去添加索引即可

bin/rails db:migrate

70W 条数据添加索引确实需要点时间,看看现在的效果

> Post.where("title ~ 'ActiveSupport'").order('created_at DESC')
  Post Load (2.0ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'ActiveSupport') LIMIT $1  [["LIMIT", 11]]

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ 'ActiveSupport'").order('created_at DESC').to_sql
=> "EXPLAIN ANALYZE SELECT \"posts\".* FROM \"posts\" WHERE (title ~ 'ActiveSupport')"

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m }
   (0.8ms)  EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE (title ~ 'ActiveSupport') ORDER BY created_at DESC
{"QUERY PLAN"=>"Sort  (cost=180.02..180.03 rows=1 width=128) (actual time=0.099..0.100 rows=2 loops=1)"}
{"QUERY PLAN"=>"  Sort Key: created_at DESC"}
{"QUERY PLAN"=>"  Sort Method: quicksort  Memory: 26kB"}
{"QUERY PLAN"=>"  ->  Bitmap Heap Scan on posts  (cost=176.00..180.01 rows=1 width=128) (actual time=0.085..0.092 rows=2 loops=1)"}
{"QUERY PLAN"=>"        Recheck Cond: ((title)::text ~ 'ActiveSupport'::text)"}
{"QUERY PLAN"=>"        Heap Blocks: exact=2"}
{"QUERY PLAN"=>"        ->  Bitmap Index Scan on index_posts_on_title  (cost=0.00..176.00 rows=1 width=0) (actual time=0.070..0.070 rows=2 loops=1)"}
{"QUERY PLAN"=>"              Index Cond: ((title)::text ~ 'ActiveSupport'::text)"}
{"QUERY PLAN"=>"Planning Time: 0.229 ms"}
{"QUERY PLAN"=>"Execution Time: 0.128 ms"}

可见index_posts_on_title索引被命中,现在从大量数据中搜索出想要的结果大概是 2ms 左右,可以算是比较大的提升了。只不过这种索引也不是万能的,在一些场景中可能会起不了作用。

索引不生效的情况

以下是笔者碰到的一些索引不生效的情况,仅供参考。

符合搜索条件的数据量太多的时候

前面有说过笔者的 70W 条数据里面绝大多数都包含了关键字Rails。从这么多文章里面搜索出仅有的几篇标题包含ActiveSupport关键字的文章,索引命中,并带来巨大的性能提升。然而如果我改为搜索Rails关键字,那效果就不一样了。

> Post.where("title ~ 'Rails'").count
   (160.6ms)  SELECT COUNT(*) FROM "posts" WHERE (title ~ 'Rails')
=> 723394

> Post.where("title ~ 'Rails'").order('created_at DESC')
  Post Load (155.7ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Rails') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ 'Rails'").order('created_at DESC').to_sql
=> "EXPLAIN ANALYZE SELECT \"posts\".* FROM \"posts\" WHERE (title ~ 'Rails')"

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m } # 索引似乎没有生效
   (406.6ms)  EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE (title ~ 'Rails') ORDER BY created_at DESC
{"QUERY PLAN"=>"Gather Merge  (cost=86888.92..157220.50 rows=602800 width=128) (actual time=238.422..386.229 rows=723394 loops=1)"}
{"QUERY PLAN"=>"  Workers Planned: 2"}
{"QUERY PLAN"=>"  Workers Launched: 2"}
{"QUERY PLAN"=>"  ->  Sort  (cost=85888.89..86642.39 rows=301400 width=128) (actual time=233.386..249.014 rows=241131 loops=3)"}
{"QUERY PLAN"=>"        Sort Key: created_at DESC"}
{"QUERY PLAN"=>"        Sort Method: external merge  Disk: 35472kB"}
{"QUERY PLAN"=>"        Worker 0:  Sort Method: external merge  Disk: 35304kB"}
{"QUERY PLAN"=>"        Worker 1:  Sort Method: external merge  Disk: 33560kB"}
{"QUERY PLAN"=>"        ->  Parallel Seq Scan on posts  (cost=0.00..38884.01 rows=301400 width=128) (actual time=15.882..141.991 rows=241131 loops=3)"}
{"QUERY PLAN"=>"              Filter: ((title)::text ~ 'Rails'::text)"}
{"QUERY PLAN"=>"              Rows Removed by Filter: 21"}
{"QUERY PLAN"=>"Planning Time: 0.213 ms"}
{"QUERY PLAN"=>"Execution Time: 405.900 ms"}
=> #<PG::Result:0x0000000124f49ca8 status=PGRES_TUPLES_OK ntuples=13 nfields=1 cmd_tuples=0>

符合条件的数据就有 70W 条,这时候索引没有命中,查找出这些数据也比较耗时。只不过在正常的业务场景里面这种情况应该也比较少。

不是命中了索引性能就好

查询字符较少的时候,索引效果不明显。对比以下Act以及Ac的查找效果

> Post.where("title ~ 'Ac'").order('created_at DESC')
  Post Load (238.9ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Ac') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

> Post.where("title ~ 'Act'").order('created_at DESC')
  Post Load (2.6ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Act') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

对这两条 SQL 查询进行Explain,你会发现其实他们都是命中了索引的了

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ 'Ac'").order('created_at DESC').to_sql
=> "EXPLAIN ANALYZE SELECT \"posts\".* FROM \"posts\" WHERE (title ~ 'Ac') ORDER BY created_at DESC"

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m } # 搜索 AC 索引命中
.....
{"QUERY PLAN"=>"        ->  Bitmap Index Scan on index_posts_on_title  (cost=0.00..3108.14 rows=19 width=0) (actual time=114.121..114.121 rows=723457 loops=1)"}
{"QUERY PLAN"=>"              Index Cond: ((title)::text ~ 'Ac'::text)"}
{"QUERY PLAN"=>"Planning Time: 0.371 ms"}
{"QUERY PLAN"=>"Execution Time: 249.191 ms"}

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ 'Act'").order('created_at DESC').to_sql
=> "EXPLAIN ANALYZE SELECT \"posts\".* FROM \"posts\" WHERE (title ~ 'Act') ORDER BY created_at DESC"

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m } # 搜索 ACt 索引命中
....
{"QUERY PLAN"=>"        ->  Bitmap Index Scan on index_posts_on_title  (cost=0.00..16.03 rows=4 width=0) (actual time=0.045..0.045 rows=6 loops=1)"}
{"QUERY PLAN"=>"              Index Cond: ((title)::text ~ 'Act'::text)"}
{"QUERY PLAN"=>"Planning Time: 0.269 ms"}
{"QUERY PLAN"=>"Execution Time: 0.125 ms"}
=> #<PG::Result:0x0000000132a66138 status=PGRES_TUPLES_OK ntuples=10 nfields=1 cmd_tuples=0>

由此可见,在搜索字符串的长度为 2 的时候,哪怕命中了索引,搜索效果也不太明显,当达到 3 个字符的时候,搜索效率就有质的变化。可能是跟pg_trgm这个插件有关?官网对它的描述是

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

A trigram is a group of three consecutive characters taken from a string.可能查询字符串真的要 3 个字符或以上才能发挥它最大的效果?

> Post.where("title ~ 'Acti'").order('created_at DESC')
  Post Load (2.3ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Acti') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

> Post.where("title ~ 'Activ'").order('created_at DESC')
  Post Load (2.2ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Activ') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

> Post.where("title ~ 'Active'").order('created_at DESC')
  Post Load (2.7ms)  SELECT "posts".* FROM "posts" WHERE (title ~ 'Active') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]
.....

包含中文的情况

模糊查找中查询字符串是中文的时候哪怕命中了索引性能也没有像预期那样提高。

> Post.where("title ~ '编程语言'").order('created_at DESC')
  Post Load (271.5ms)  SELECT "posts".* FROM "posts" WHERE (title ~ '编程语言') ORDER BY created_at DESC LIMIT $1  [["LIMIT", 11]]

> sql = "EXPLAIN ANALYZE " + Post.where("title ~ '编程语言'").order('created_at DESC').to_sql
=> "EXPLAIN ANALYZE SELECT \"posts\".* FROM \"posts\" WHERE (title ~ '编程语言') ORDER BY created_at DESC"

> ActiveRecord::Base.connection.execute(sql).each {|m| puts m }
....
{"QUERY PLAN"=>"        ->  Bitmap Index Scan on index_posts_on_title  (cost=0.00..3108.00 rows=1 width=0) (actual time=95.129..95.129 rows=723457 loops=1)"}
{"QUERY PLAN"=>"              Index Cond: ((title)::text ~ '编程语言'::text)"}
{"QUERY PLAN"=>"Planning Time: 0.267 ms"}
{"QUERY PLAN"=>"Execution Time: 259.774 ms"}
=> #<PG::Result:0x00000001338674a8 status=PGRES_TUPLES_OK ntuples=11 nfields=1 cmd_tuples=0>

可见index_posts_on_title索引依旧命中,只是没有带来预想中的效果。篇幅有限,这个问题放到后面再解决。

总结

这篇文章简单讲述了,Rails 搭配 PostgreSQL 的时候要如何实现模糊查找,并且提供了简单的优化建议,然而这种做法表现似乎不是太稳定。当搜索字符串是 Ascii 字符并且长度大于 2 的时候性能提升不错,然而当搜索字符串是中文,或者是长度小于等于 2 的时候索引带来的性能提升就不是很明显了。至于搜索中文关键字的性能问题如何解决,咱们放在下一篇文章再聊吧。

中文是另一个问题 估计是 locate 没配置对

hooopo 回复

是,我看改 locate 是一个方法。这篇文章还提供了另一个方法。就是用二进制来搜 后面总结一下。我还没试。 https://billtian.github.io/digoal.blog/2016/05/06/02.html

最近也看用 gin_trgm_ops,过程中发现它 LIKEILIKE 支持良好,但是不支持 =

我用的是 pg 13

搜了一下却只有少数人遇到这个问题:

https://stackoverflow.com/questions/70384167/how-to-make-postgresql-use-gin-trgm-ops-index-for-equality-comparison

需要 登录 后方可回复, 如果你还没有账号请 注册新账号