数据库 Simple fuzzywuzzy with plsql

hooopo · 2019年02月22日 · 最后由 hooopo 回复于 2019年02月28日 · 6983 次阅读

Python 有一个非常厉害的包叫 fuzzywuzy，用来做字符串模糊匹配的。

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

大概作用是，量化两个字符串的相似度。

>>> fuzz.ratio("this is a test", "this is a test!")
    97

ratio 的取值范围是 [0, 100]，相对其他算法的优点是可以直观判断两个字符串的相似程度，0 代表完全不匹配，100 代表完全匹配，通过分数立即可以感知匹配程度。不像 Levenshtein 只能用于比较。

fuzzywuzzy 打分算法的另一个好处是可以多个 field 累加，比如：

select name from products order by ratio(name, 'Ruby On Rails') + ratio(company_name, 'Ruby On Rails') desc limit 10;

不会像 Levenshtein 一样由于个别字段相差悬殊，对结果产生太大的干扰。

一个 plsql 的简单实现：

CREATE OR REPLACE FUNCTION ratio(l text, r text) RETURNS int
  LANGUAGE plpgsql IMMUTABLE
    AS $$
      DECLARE
        diff      int;  -- levenshtein编辑距离
        l_len     int := length(l);
        r_len     int := length(r);
        short_len int;
        long_len  int;
        match_len int;
        result    int;
      BEGIN
        IF (l IS NULL) OR (r IS NULL) OR (l = '') OR (r = '') THEN
          result := 0;
        ELSE
          SELECT levenshtein(l, r)      INTO diff;
          SELECT GREATEST(l_len, r_len) INTO long_len;
          SELECT LEAST(l_len, r_len)    INTO short_len;
          -- match长度为：最长字符串减去编辑距离
          match_len := long_len - diff;
          -- 基于fuzzywuzzy公式
          -- Return a measure of the sequences’ similarity as a float in the range [0, 100].
          -- Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. 
          -- Note that this is 100 if the sequences are identical, and 0 if they have nothing in common.
          -- https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio
          -- https://github.com/seatgeek/fuzzywuzzy
          result := ((2.0 * match_len * 100) / (long_len + short_len));
        END IF;
        RETURN(result);
      END;
    $$;

依赖 postgresql 内置 extension fuzzystrmatch：

create extension fuzzystrmatch;

调用：

select name from products order by ratio(name, 'Ruby On Rails') + ratio(company_name, 'Ruby On Rails') desc limit 10;

一些场景可以先用 LIKE 过滤掉一部分再排序，PS，postgresql 的 LIKE 和正则是可以利用索引的。

相关资源：

调试 plsql 可以使用 pgadmin 和 omnidb
https://github.com/pgcentral/pldebugger
https://omnidb.org/en/documentation-en/13-debug-plpgsql-functions-en
https://www.w3resource.com/PostgreSQL/pl-pgsql-control-structures.php

14 个赞

无引用文章

Postgres Fulltext Search (一)

stephen #0 2019年02月22日

老大写得营养太丰富，没人顶了

hooopo 在 Postgres Fulltext Search (一) 提及了此话题。 02月26日 18:24

hooopo #2 2019年02月27日

release as a pg extension: https://github.com/hooopo/pg-fuzzywuzzy

ksec #3 2019年02月27日

对

hooopo 回复

我看完之后第一个反应是，为什么这个功能不是 Default Extention ?

hooopo #4 2019年02月28日

对

ksec 回复

fuzzystrmatch 是 default，但和这个还是有点差异

hooopo 在如何实现一个信息架构友好的标签系统提及了此话题。 10月23日 14:28

需要登录后方可回复, 如果你还没有账号请注册新账号