Python 有一个非常厉害的包叫 fuzzywuzy,用来做字符串模糊匹配的。
Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
大概作用是,量化两个字符串的相似度。
>>> fuzz.ratio("this is a test", "this is a test!")
97
ratio 的取值范围是 [0, 100],相对其他算法的优点是可以直观判断两个字符串的相似程度,0 代表完全不匹配,100 代表完全匹配,通过分数立即可以感知匹配程度。不像 Levenshtein 只能用于比较。
fuzzywuzzy 打分算法的另一个好处是可以多个 field 累加,比如:
select name from products order by ratio(name, 'Ruby On Rails') + ratio(company_name, 'Ruby On Rails') desc limit 10;
不会像 Levenshtein 一样由于个别字段相差悬殊,对结果产生太大的干扰。
一个 plsql 的简单实现:
CREATE OR REPLACE FUNCTION ratio(l text, r text) RETURNS int
LANGUAGE plpgsql IMMUTABLE
AS $$
DECLARE
diff int; -- levenshtein编辑距离
l_len int := length(l);
r_len int := length(r);
short_len int;
long_len int;
match_len int;
result int;
BEGIN
IF (l IS NULL) OR (r IS NULL) OR (l = '') OR (r = '') THEN
result := 0;
ELSE
SELECT levenshtein(l, r) INTO diff;
SELECT GREATEST(l_len, r_len) INTO long_len;
SELECT LEAST(l_len, r_len) INTO short_len;
-- match长度为:最长字符串减去编辑距离
match_len := long_len - diff;
-- 基于fuzzywuzzy公式
-- Return a measure of the sequences’ similarity as a float in the range [0, 100].
-- Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.
-- Note that this is 100 if the sequences are identical, and 0 if they have nothing in common.
-- https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio
-- https://github.com/seatgeek/fuzzywuzzy
result := ((2.0 * match_len * 100) / (long_len + short_len));
END IF;
RETURN(result);
END;
$$;
依赖 postgresql 内置 extension fuzzystrmatch:
create extension fuzzystrmatch;
调用:
select name from products order by ratio(name, 'Ruby On Rails') + ratio(company_name, 'Ruby On Rails') desc limit 10;
一些场景可以先用 LIKE 过滤掉一部分再排序,PS,postgresql 的 LIKE 和正则是可以利用索引的。
相关资源: