数据库 Postgres Fulltext Search (一)

hooopo · February 26, 2019 · Last by lanzhiheng replied at February 21, 2021 · 11062 hits

Topic has been selected as the excellent topic by the admin.

Postgres 提供了 LIKE、ILIKE、正则匹配等字符串匹配函数，对于一般的字符串匹配需求已经够用。然而，传统 LIKE 匹配存在几个问题：

性能低效，LIKE 匹配可以通过 pg_trgm extension 建立 gin 或 gist 索引缓解
没有相关度支持，不能解决匹配程度问题，其实可以使用fuzzywuzzy来解决，但是麻烦了一些
没有 stemming 等
多条件 JOIN 效率低，查询复杂

安装 zhparser

安装 SCWS

wget -q -O - https://github.com/hightman/scws/archive/1.2.2.tar.gz | tar xzf -

cd scws-1.2.2 ; ./configure ; make install

下载 zhparser 源码

git clone https://github.com/amutu/zhparser.git

编译和安装 zhparser

SCWS_HOME=/usr/local make && make install

如果你使用的是阿里云 postgres rds，已经有预装好的 zhparser 插件，可以使用下面语句验证：

 select * from pg_available_extensions where name = 'zhparser';
   name   | default_version | installed_version |                 comment
----------+-----------------+-------------------+------------------------------------------
 zhparser | 1.0             | 1.0               | a parser for full-text search of Chinese

另外，阿里云 rds 是支持自定义词典的。

准备数据

从 kaggle 下载测试数据，也可以通过链接直接下载，需要注册账号：

kaggle datasets download -d noxmoon/chinese-official-daily-news-since-2016
unzip chinese-official-daily-news-since-2016.zip

创建新闻表，字段有 date、tag、headline、content

create table news (date date, tag varchar, headline varchar, content text);

使用 COPY 命令导入新闻数据

COPY news FROM '/Users/hooopo/w/learn_pg/chinese_news.csv' CSV HEADER;

tsvector 和 tsquery

postgres 的全文检索功能，核心就是 tsvector 和 tsquery 这两个 pg 内置类型，

select headline::tsvector from news where length(headline) < 8 limit 10;

     headline
------------------
 '太行山上新农人'
 '【数据春运】'
 '春运暖心归乡路'
 '爱心温暖返程路'
 '黄静汶同志逝世'
 '北戴河机场启用'
 '关注叙利亚局势'
 '【凡人善举】'
 '【凡人善举】'
 '童心共筑中国梦'

由于默认的 default_text_search_config 是 english，中文分词并没有生效：

select name, setting from pg_settings where name ~ 'text_search';

            name            |      setting
----------------------------+--------------------
 default_text_search_config | pg_catalog.english

创建一个中文分词的 search config：

CREATE TEXT SEARCH CONFIGURATION zhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhcfg 
ADD MAPPING FOR a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z 
WITH simple;

使用 zhcfg 之后的 tsvector：

select headline,to_tsvector('zhcfg', headline) as tsv from news where length(headline) < 8 limit 10;

    headline    |                  tsv
----------------+----------------------------------------
 太行山上新农人 | '人':4 '太行':1 '山上':2 '新农':3
 【数据春运】   | '【':1 '】':4 '数据':2 '春运':3
 春运暖心归乡路 | '归乡':4 '心':3 '春运':1 '暖':2 '路':5
 爱心温暖返程路 | '温暖':2 '爱心':1 '路':4 '返程':3
 黄静汶同志逝世 | '同志':2 '逝世':3 '黄静汶':1
 北戴河机场启用 | '北戴河':1 '启用':3 '机场':2
 关注叙利亚局势 | '关注':1 '叙利亚':2 '局势':3
 【凡人善举】   | '【':1 '】':4 '凡人':2 '善举':3
 【凡人善举】   | '【':1 '】':4 '凡人':2 '善举':3
 童心共筑中国梦 | '中国':4 '共':2 '梦':5 '童心':1 '筑':3

其中，标点符号也被当成了lexemes，设置忽略标点符号：

set zhparser.punctuation_ignore TO t; /*本session生效，全局生效需要修改config文件*/
select headline,to_tsvector('zhcfg', headline) as tsv from news where length(headline) < 8 limit 10;

    headline    |                  tsv
----------------+----------------------------------------
 太行山上新农人 | '人':4 '太行':1 '山上':2 '新农':3
 【数据春运】   | '数据':1 '春运':2
 春运暖心归乡路 | '归乡':4 '心':3 '春运':1 '暖':2 '路':5
 爱心温暖返程路 | '温暖':2 '爱心':1 '路':4 '返程':3
 黄静汶同志逝世 | '同志':2 '逝世':3 '黄静汶':1
 北戴河机场启用 | '北戴河':1 '启用':3 '机场':2
 关注叙利亚局势 | '关注':1 '叙利亚':2 '局势':3
 【凡人善举】   | '凡人':1 '善举':2
 【凡人善举】   | '凡人':1 '善举':2
 童心共筑中国梦 | '中国':4 '共':2 '梦':5 '童心':1 '筑':3

使用 tsvector 进行一次搜索：

select headline,
       to_tsvector('zhcfg', headline) as tsv 
  from news 
 where length(headline) < 8 
       and to_tsvector('zhcfg', headline) @@ to_tsquery('zhcfg', '中 国') 
 limit 10;

    headline    |                  tsv
----------------+----------------------------------------
 童心共筑中国梦 | '中国':4 '共':2 '梦':5 '童心':1 '筑':3
 【中国有我】   | '中国':1 '我':3 '有':2
 四海共庆中国年 | '中国':3 '共庆':2 '四海':1 '年':4
 【点赞中国】   | '中国':3 '点':1 '赞':2
 【点赞中国】   | '中国':3 '点':1 '赞':2
 【点赞中国】   | '中国':3 '点':1 '赞':2
 【中国有我】   | '中国':1 '我':3 '有':2
 【大美中国】   | '中国':3 '大':1 '美':2

select COALESCE(headline, '') || ' '  || COALESCE(content, '') as doc 
  from news 
 where length(headline) < 8 
       and to_tsvector('zhcfg', COALESCE(headline, '') || ' ' || COALESCE(content, '')) @@ to_tsquery('zhcfg', '中国') 
 limit 1;

反向索引

使用反向索引来提高查询速度：

CREATE INDEX fulltext_idx 
ON news 
USING GIN (to_tsvector('zhcfg', COALESCE(headline, '') || ' '  || COALESCE(content, '')));

看一下查询计划，上面语句在加了 GIN 索引之后，使用了 Bitmap Index：

Limit  (cost=16.79..27.29 rows=1 width=32) (actual time=3.174..3.175 rows=1 loops=1)
  ->  Bitmap Heap Scan on news  (cost=16.79..384.33 rows=35 width=32) (actual time=3.173..3.173 rows=1 loops=1)
        Recheck Cond: (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery)
        Filter: (length((headline)::text) < 8)
        Rows Removed by Filter: 751
        Heap Blocks: exact=241
        ->  Bitmap Index Scan on fulltext_idx  (cost=0.00..16.78 rows=104 width=0) (actual time=2.027..2.027 rows=6272 loops=1)
              Index Cond: (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery)
Planning time: 0.168 ms
Execution time: 3.231 ms

btree_gin

对于等值过滤和全文检索复合查询的情况，还可以使用 pg 内置的 btree_gin 提高查询效率，例如：where tag = '详细全文' AND to_tsvector(xx) @@ to_tsquery(xx)。

explain analyze select count(*) from news where tag = '详细全文' and to_tsvector('zhcfg', COALESCE(headline, '') || ' ' || COALESCE(content, '')) @@ to_tsquery('zhcfg', '中国');
                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=384.05..384.06 rows=1 width=8) (actual time=18.457..18.457 rows=1 loops=1)
   ->  Bitmap Heap Scan on news  (cost=16.79..383.91 rows=58 width=0) (actual time=1.571..17.917 rows=4822 loops=1)
         Recheck Cond: (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery)
         Filter: ((tag)::text = '详细全文'::text)
         Rows Removed by Filter: 1450
         Heap Blocks: exact=1915
         ->  Bitmap Index Scan on fulltext_idx  (cost=0.00..16.78 rows=104 width=0) (actual time=1.300..1.300 rows=6272 loops=1)
               Index Cond: (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery)

上面的查询，先通过 fulltext_idx 这条索引扫描了 6272 条记录，然后再通过 filter(tag='详细全文') 过滤掉了 1450 条记录。也就是说，我们可以通过 btree_gin，让 gin 索引和 btree 索引组合起来，发挥最大威力。

create index fulltext_idx_1 on news using gin (tag, to_tsvector('zhcfg', COALESCE(headline, '') || ' '  || COALESCE(content, '')));

再看一下查询计划：

explain analyze select count(*) from news where tag = '详细全文' and to_tsvector('zhcfg', COALESCE(headline, '') || ' ' || COALESCE(content, '')) @@ to_tsquery('zhcfg', '中国');
                                                                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=246.82..246.83 rows=1 width=8) (actual time=7.540..7.540 rows=1 loops=1)
   ->  Bitmap Heap Scan on news  (cost=28.59..246.68 rows=58 width=0) (actual time=4.363..6.899 rows=4822 loops=1)
         Recheck Cond: (((tag)::text = '详细全文'::text) AND (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery))
         Heap Blocks: exact=1718
         ->  Bitmap Index Scan on fulltext_idx_1  (cost=0.00..28.58 rows=58 width=0) (actual time=3.870..3.870 rows=4822 loops=1)
               Index Cond: (((tag)::text = '详细全文'::text) AND (to_tsvector('zhcfg'::regconfig, (((COALESCE(headline, ''::character varying))::text || ' '::text) || COALESCE(content, ''::text))) @@ '''中国'''::tsquery))

直接通过 fulltext_idx_1 这条索引扫描了 4822 条记录，没有多余的 filter 过程，对于区分性高的过滤条件来说，会极大提升查询效率。

46 likes

Simple fuzzywuzzy with plsql

浅析 PostgreSQL + zhparser 进行中文搜索的分词与排序优化 Postgres Full Text Search with Docker Compose

jasl mark as excellent topic. 26 Feb 21:42

baurine in 在 Rails 中使用 PostgreSQL 的全文搜索功能搜索百万条记录的表的正确姿势 mention this topic. 27 Feb 11:28

rocLv #2 February 27, 2019

使用反向索引来提纲查询速度 应为 使用反向索引来提高查询速度

hooopo #3 February 27, 2019

Reply to

rocLv

fixed.

hfpp2012 #4 May 21, 2019

很赞

theblock24block #5 August 12, 2019

有第二篇吗

hooopo #6 August 12, 2019

Reply to

theblock24block

还有第三篇不过最近加班没时间写了

reducm #7 October 14, 2019

Reply to

hooopo

加完班了吗

hooopo #8 October 14, 2019

Reply to

reducm

不加了这周可以写完

novtopro #9 February 14, 2020

https://medium.com/datadriveninvestor/the-rum-index-and-full-text-search-using-postgresql-1c08fb3bf540 还可以看看 RUM 索引

1 likes

hooopo #10 February 14, 2020

Reply to

novtopro

用了 RUM 之后，基本上和 ES 差不多了，不过打分还是赶不上 ES

mengqing #11 October 23, 2020

找不到第二篇和第三篇

如果打分赶不上 ES，那用 Postgres 的 FTS 的好处在哪里呢？为什么不直接用 ES 呢？

hooopo #12 October 23, 2020

Reply to

mengqing

就是 tradeoff 用 es 有 es 的问题数据同步数据插入时间如果你的数据量上几十亿每次同步一次 es 可能需要几天更不要说 es 额外需要更多的机器成本了

lanzhiheng #13 February 21, 2021

好文，折腾了半天原来要先安装插件。不然就会报这个异常。

simple=# create index fulltext_idx_1 on news using GIN (tag, to_tsvector('zhcfg', COALESCE(headline, '') || ' '  || COALESCE(content, '')));
ERROR:  data type character varying has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

解决方案：

CREATE EXTENSION btree_gin;

simple=# select * from pg_available_extensions where name ~ 'tree';
    name    | default_version | installed_version |                     comment
------------+-----------------+-------------------+-------------------------------------------------
 btree_gin  | 1.3             | 1.3               | support for indexing common datatypes in GIN
 ltree      | 1.2             |                   | data type for hierarchical tree-like structures
 btree_gist | 1.5             |                   | support for indexing common datatypes in GiST
(3 rows)

simple=# create index fulltext_idx_1 on news using GIN (tag, to_tsvector('zhcfg', COALESCE(headline, '') || ' '  || COALESCE(content, '')));
CREATE INDEX

真的快了好多。

hiveer in Postgres Full Text Search with Docker Compose mention this topic. 02 Jan 20:06

SuMingXuan in 浅析 PostgreSQL + zhparser 进行中文搜索的分词与排序优化 mention this topic. 10 Nov 20:49

You need to Sign in before reply, if you don't have an account, please Sign up first.

46 likes

Total 12 replies

Reward

New Reply comming, click to load.