- Mastering PostgreSQL 10
- Hans Jürgen Sch?nig
- 454字
- 2021-06-30 19:03:59
Taking advantage of pg_trgm
To do fuzzy searching with PostgreSQL, you can add the pg_trgm extension. To activate the extension, just run the following instruction:
test=# CREATE EXTENSION pg_trgm; CREATE EXTENSION
The pg_trgm extension is pretty powerful, and to show what it is capable of, I have compiled some sample data consisting of 2,354 names of villages and cities here in Austria, Europe.
Our sample data can be stored in a simple table:
test=# CREATE TABLE t_location (name text); CREATE TABLE
My company website has all the data and PostgreSQL enables you to load the data directly:
test=# COPY t_location FROM PROGRAM
'curl https://www.cybertec-postgresql.com/secret/orte.txt'; COPY 2354
Once the data has been loaded, it is possible to check out the content of the table:
test=# SELECT * FROM t_location LIMIT 4; name -------------------------------- Eisenstadt Rust Breitenbrunn am Neusiedler See Donnerskirchen (4 rows)
If German is not your mother tongue, it will be impossible to spell the names of those locations without severe mistakes.
The pg_trgm provides us with a distance operator that computes the distance between two strings:
test=# SELECT 'abcde' <-> 'abdeacb'; ?column? ---------- 0.833333
(1 row)
The distance is a number between zero and one. The lower the number, the more similar the two strings are.
How does this work? Trigrams take a string and dissect it into sequences of three characters each:
test=# SELECT show_trgm('abcdef'); show_trgm
-------------------------------------
{" a"," ab",abc,bcd,cde,def,"ef "}
(1 row)
These sequences will then be used to come up with the distance you have just seen. Of course, the distance operator can be used inside a query to find the closest match:
test=# SELECT *
FROM t_location
ORDER BY name <-> 'Kramertneusiedel'
LIMIT 3; name
-----------------
Gramatneusiedl
Klein-Neusiedl
Potzneusiedl
(3 rows)
Gramatneusiedl is pretty close to Kramertneusiedel. It sounds similar and using a K instead of a G is a pretty common mistake. On Google, you will sometimes see did you mean. It is quite likely that Google is using n-grams here to do that.
In PostgreSQL, it is possible to use GiST to index on text using trigrams:
test=# CREATE INDEX idx_trgm ON t_location
USING GiST(name GiST_trgm_ops); CREATE INDEX
pg_trgm provides us with the GiST_trgm_ops operator class designed to do similarity searches. The following listing shows that the index is used as expected:
test=# explain SELECT *
FROM t_location
ORDER BY name <-> 'Kramertneusiedel'
LIMIT 5; QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.14..0.58 rows=5 width=17)
-> Index Scan using idx_trgm on t_location
(cost=0.14..207.22 rows=2354 width=17)
Order By: (name <-> 'Kramertneusiedel'::text)
(3 rows)
- Ansible Configuration Management
- 現(xiàn)代測控電子技術(shù)
- 過程控制工程及仿真
- 最后一個人類
- Visual Basic從初學(xué)到精通
- 工業(yè)機器人運動仿真編程實踐:基于Android和OpenGL
- Excel 2007技巧大全
- Mastering Geospatial Analysis with Python
- 工業(yè)機器人集成應(yīng)用
- 運動控制系統(tǒng)(第2版)
- Getting Started with Tableau 2019.2
- 單片機硬件接口電路及實例解析
- SketchUp 2014 for Architectural Visualization(Second Edition)
- 多傳感器數(shù)據(jù)智能融合理論與應(yīng)用
- 計算機仿真技術(shù)