How many foreign tourists are there in your city? In mine there are only a few, but they tend to get lost in the middle of the street and keep repeating a single word: the name of whatever they are looking for. Passers-by try to explain on their fingers where to go, and when it comes down to "me no understand", they take the tourists by the hand and lead them to their destination. Surprisingly, the destination is usually within a five-minute walk, i.e. these tourists did have a rough idea of the city after all. Perhaps they were navigating by a paper map.
How often have you personally found yourself in such a situation, in an unfamiliar city in another country?
The arrival of smartphones and navigation apps has solved many problems. Hooray, you can see your geolocation, find where you need to go, estimate the direction, and even plot a route.
Only one problem remains: all the streets in the app are labeled in local hieroglyphs, in the local dialect. It is fine if the host country uses the Latin alphabet: every smartphone has a Latin keyboard and the world is used to it. Even so, I felt uneasy about the diacritics used in the Czech alphabet. I can only imagine the pain and suffering of foreigners who are faced with Cyrillic. Look at faux Cyrillic and you will understand. In their place, I would type names and addresses in Latin letters, trying to reproduce how they sound. That is phonetic search.
In this post I will describe how to implement the Soundex family of phonetic search algorithms in the Sphinx search engine. Transliteration alone is not enough here, although you cannot do without it either. The resulting configuration file is available as a GitHub Gist.
Introduction
, , -, , , Sphinx Search.
, , , .. , - Sphinx.
, , , , , . , , .
, . Soundex Metaphone, . Soundex , Metaphone .
, Sphinx Soundex, , . , , . .. . .
. , : « » – , , « », , . , , , , , .
, Soundex, , , NYSIIS, Daitch-Mokotoff.
SphinxQL, :
mysql -h 127.0.0.1 -P 9306 --default-character-set=utf8
Sphinx, , Sphinx Search, , , . .
Soundex
. , Sphinx Search, , , .. .
, : , – . .
– , Sphinx .
, , , , , : . – , - , , – . " ", . , , , .
regexp_filter = (|) => a
regexp_filter = (|) =>
, – , GitHub Gist.
soundex :
morphology = soundex
, , Sphinx Soundex.
, , Sphinx. -. - , , . . «», «», - , «Lenina», «ulitsa Lenina».
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lenin | l500 |
| 2 | lenina | l500 |
| 3 | lenina | l500 |
| 4 | lennina | l500 |
| 5 | lenin | l500 |
+------+-----------+------------+
, tokenized , . normalized, Sphinx , , morphology. 'Lenina' l500, '' l500, , - , . Lennina, Lenena, Lennona. , , .
, :
mysql> select * from STREETS where match('Lenena');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
Sphinx , . . , :
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+----------------+------------+
| qpos | tokenized | normalized |
+------+----------------+------------+
| 1 | plekhanovskaja | p42512 |
| 2 | plechanovskaya | p42512 |
| 3 | plehanovskaja | p4512 |
| 4 | plekhanovska | p42512 |
+------+----------------+------------+
plehanovskaja -
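The normalized values in the two tables above can be reproduced with a few lines of Python. This is only a sketch inferred from those outputs, not taken from the Sphinx source: it assumes the standard Soundex letter groups, skips vowels and H/W/Y without resetting the previous code, pads to four characters and never truncates. The name `soundex_like` is made up for illustration.

```python
# Standard Soundex consonant groups (assumption: Sphinx uses the same table).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_like(word):
    """Soundex variant that matches the keyword tables in this article."""
    word = word.lower()
    if not word:
        return word
    out = word[0]          # the first letter is kept as a letter
    last = ""              # last emitted digit; vowels do NOT reset it
    for ch in word[1:]:
        code = CODES.get(ch)
        if code is None:   # vowels, h, w, y and anything else are skipped
            continue
        if code != last:   # drop repeats, even when a vowel sits in between
            out += code
            last = code
    return out.ljust(4, "0")   # pad short codes with zeros, never truncate


for w in ("lenin", "lennina", "plekhanovskaja", "plehanovskaja"):
    print(w, soundex_like(w))
```

Under these assumptions `lenin` and `lennina` both map to `l500`, while `plehanovskaja` yields `p4512` instead of `p42512`, because the missing K drops one code.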
. Sphinx . , CALL QSUGGEST:
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
+----------------+----------+------+
| suggest | distance | docs |
+----------------+----------+------+
| plekhanovskaja | 1 | 1 |
| petrovskaja | 4 | 1 |
+----------------+----------+------+
, , . .. .
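The `distance` column in the QSUGGEST output above is consistent with plain Levenshtein edit distance. A minimal sketch, assuming that is indeed the metric (the function name `levenshtein` is just illustrative):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))      # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


print(levenshtein("plehanovskaja", "plekhanovskaja"))
```

One inserted K turns `plehanovskaja` into `plekhanovskaja`, which agrees with the distance of 1 reported in the first row.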
, :
min_infix_len = 2
suggest tokenized, .. , . , Soundex , QSUGGEST .
- :
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
mysql> select * from STREETS where match('30 ');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
, .
Soundex
. , , , .
.
Sphinx index
, , , . , Sphinx , . .. , regexp_filter
, regexp_filter
.
morphology = soundex
– , . , .
Sphinx , , ! . RE2.
, : regexp_filter = \A(A|a) => a
, 0.
regexp_filter = \B(A|a) => 0
regexp_filter = \B(Y|y) => 0
...
, regexp_filter = \B(Y|y) =>
, - . , «» «Veelkaseem» .
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v738 | v738 |
+------+-----------+------------+
- :
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v0730308 | v0730308 |
+------+-----------+------------+
, H W .
, , /, H W, . .
regexp_filter = 0+ => 0
regexp_filter = 1+ => 1
...
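To experiment with such a rule chain outside Sphinx, it can be mimicked with Python's `re` module. The sketch below is an assumption-heavy reconstruction: the letter groups are inferred from the keyword outputs in this article (they coincide with the commonly cited Refined Soundex table), and the actual `regexp_filter` list in the Gist may differ. `build_rules` and `encode` are hypothetical helper names.

```python
import re

# Consonant groups inferred from the outputs in this article (assumption).
GROUPS = {"bp": "1", "fv": "2", "cks": "3", "gj": "4", "qxz": "5",
          "dt": "6", "l": "7", "mn": "8", "r": "9"}


def build_rules(keep_vowels=False):
    """Ordered (pattern, replacement) pairs, analogous to regexp_filter lines."""
    rules = [(r"\B[hw]", "")]                                  # drop non-initial H and W
    rules.append((r"\B[aeiouy]", "0" if keep_vowels else ""))  # vowels: code 0 or removed
    rules += [(r"\B[%s]" % letters, digit) for letters, digit in GROUPS.items()]
    rules.append((r"(\d)\1+", r"\1"))                          # collapse runs of one digit
    return [(re.compile(p), r) for p, r in rules]


def encode(word, keep_vowels=False):
    word = word.lower()
    # \B never matches before the first letter, so that letter survives as-is.
    for pattern, repl in build_rules(keep_vowels):
        word = pattern.sub(repl, word)
    return word


print(encode("Veelkaseem"))                    # vowels removed
print(encode("Veelkaseem", keep_vowels=True))  # vowels kept as 0
```

With vowels removed the chain gives `v738` for `Veelkaseem`; keeping them as `0` gives `v0730308`, the two variants shown in the tables here.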
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | l8 | l8 |
| 2 | l8 | l8 |
| 3 | l8 | l8 |
| 4 | l8 | l8 |
| 5 | l8 | l8 |
+------+-----------+------------+
mysql> select * from STREETS where match('Lenina');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
, . , tokenized , soundex-. QSUGGEST . - , – . ngram_chars. .
:
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | p738234 | p738234 |
| 2 | p73823 | p73823 |
| 3 | p78234 | p78234 |
| 4 | p73823 | p73823 |
+------+-----------+------------+
, , QSUGGEST :
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p73823', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p78234', 'STREETS');
Empty set (0.00 sec)
, , , . , , . . , «30 »:
mysql> call keywords('30 let Podedy', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 30 | 30 |
| 2 | l6 | l6 |
| 3 | p6 | p6 |
+------+-----------+------------+
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
:
mysql> select * from STREETS where match('');
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
, , , .
NYSIIS
. «» - . «» , , - , .
(?i) .
, . :
regexp_filter = (?i)\b(mac) => mcc
regexp_filter = (?i)(ee)\b => y
: H, W
regexp_filter = (?i)(a|e|i|o|u|y)h => \1
regexp_filter = (?i)(a|e|i|o|u|y)w => \1a
regexp_filter = (?i)\B(e|i|o|u) => a
regexp_filter = (?i)\B(q) => g
S
regexp_filter = (?i)s\b =>
AY Y
A
, , !!!
, - , , , CALL QSUGGEST.
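The effect of these rules can be tried out in Python before touching the index. This sketch contains only the rules that are visible above, plus two standard NYSIIS steps that are assumed here because the keyword outputs in this article imply them: K becomes C, and a trailing A is dropped. The name `nysiis_like` is hypothetical, and the full rule list in the Gist is longer.

```python
import re

# (pattern, replacement) pairs applied in order, like regexp_filter lines.
RULES = [
    (r"\bmac", "mcc"),           # prefix rule from the text above
    (r"ee\b", "y"),              # suffix rule from the text above
    (r"([aeiouy])h", r"\1"),     # H after a vowel is dropped
    (r"([aeiouy])w", r"\1a"),    # W after a vowel becomes A
    (r"\B[eiou]", "a"),          # non-initial vowels collapse to A
    (r"\Bq", "g"),
    (r"\Bk", "c"),               # assumption: standard NYSIIS K -> C
    (r"s\b", ""),                # trailing S is dropped
    (r"\Ba\b", ""),              # assumption: trailing A is dropped
]
COMPILED = [(re.compile(p), r) for p, r in RULES]


def nysiis_like(word):
    word = word.lower()          # Sphinx would lowercase via charset_table
    for pattern, repl in COMPILED:
        word = pattern.sub(repl, word)
    return word


for w in ("Lenina", "Lennina", "Plekhanovskaja", "Plehanovskaja"):
    print(w, nysiis_like(w))
```

Unlike the Soundex variants, the result stays a readable pseudo-word (`lanan`, `plaanavscaj`), which is what makes suggestions by edit distance possible.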
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lanan | lanan |
| 2 | lanan | lanan |
| 3 | lanan | lanan |
| 4 | lannan | lannan |
| 5 | lanan | lanan |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+---------------+---------------+
| qpos | tokenized | normalized |
+------+---------------+---------------+
| 1 | plachanavscaj | plachanavscaj |
| 2 | plachanavscay | plachanavscay |
| 3 | plaanavscaj | plaanavscaj |
| 4 | plachanavsc | plachanavsc |
+------+---------------+---------------+
, CALL QSUGGEST Plehanovskaja, plaanavscaj:
mysql> CALL QSUGGEST('plaanavscaj', 'STREETS');
+---------------+----------+------+
| suggest | distance | docs |
+---------------+----------+------+
| paanarscaj | 2 | 1 |
| plachanavscaj | 2 | 1 |
| latavscaj | 3 | 1 |
| sladcavscaj | 3 | 1 |
| pacravscaj | 3 | 1 |
+---------------+----------+------+
. - .
paanarscaj →
plachanavscaj →
latavscaj →
sladcavscaj →
pacravscaj →
- , . - . , . , , .
Daitch-Mokotoff Soundex
, , Soundex.
. , « », , , - , , - .
, .
.
, .. :
regexp_filter = (?i)\b(au) => 0
regexp_filter = (?i)(a|e|i|o|u|y)(au) => \17
, \B ,
regexp_filter = (?i)au =>
– - :
regexp_filter = (?i)j => 1
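The three AU rules above depend on their order: the most specific context has to fire first, and the catch-all last. A small Python sketch of just these three rules, as an illustration rather than the full Daitch-Mokotoff table. One porting detail: as far as I know, RE2 rewrite strings only accept single-digit group references, so `\17` means group 1 followed by a literal 7; Python's `re` would read it as group 17, hence `\g<1>7` below. The name `code_au` is made up.

```python
import re

AU_RULES = [
    (re.compile(r"\bau", re.I), "0"),                # AU at the start of a word
    (re.compile(r"([aeiouy])au", re.I), r"\g<1>7"),  # AU right after a vowel
    (re.compile(r"au", re.I), ""),                   # any other AU is not coded
]


def code_au(word):
    """Apply the context-dependent AU rules in order."""
    for pattern, repl in AU_RULES:
        word = pattern.sub(repl, word)
    return word


for w in ("Augusta", "beau", "maus"):
    print(w, code_au(w))
```

Swapping the last rule to the front would delete every AU before the context-specific rules get a chance to match.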
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 866 | 866 |
| 2 | 866 | 866 |
| 3 | 866 | 866 |
| 4 | 8666 | 8666 |
| 5 | 866 | 866 |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 7856745 | 7856745 |
| 2 | 7856745 | 7856745 |
| 3 | 786745 | 786745 |
| 4 | 7856745 | 7856745 |
+------+-----------+------------+
, QSUGGEST . .
mysql> select * from STREETS where match('Veelkaseem'); show meta;
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
2 rows in set (0.00 sec)
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | 78546 |
| docs[0] | 2 |
| hits[0] | 2 |
+---------------+-------+
, , - .
Soundex, , Soundex NYSIIS, CALL QSUGGEST, Sphinx , NYSIIS -. Soundex Daitch-Mokotoff Soundex, , , , 1286 , , - . :
mysql> call keywords(' ', 'STREETS', 0);
+------+------------+------------+
| qpos | tokenized | normalized |
+------+------------+------------+
| 1 | vorovskogo | v612 |
| 2 | verbovaja | v612 |
+------+------------+------------+
Soundex, :
mysql> call keywords(' ', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v9234 | v9234 |
| 2 | v9124 | v9124 |
+------+-----------+------------+
, . , Soundex:
mysql> select * from STREETS where match('');
+------+--------------------------------------+-----------+--------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------------------+
| 12 | 0278d3ee-4e17-4347-b128-33f8f62c59e0 | | |
+------+--------------------------------------+-----------+--------------------------+
.
QSUGGEST, . , . , – .
, , : Soundex . - , , - , , Sphinx.
, , , Soundex Daitch-Mokotoff - , . NYSIIS , , , .
sphinx-3.3.1, 2.1.1-beta, . Manticore. Manticore Search, . , , .
, . , .
P.S.
, . Metaphone . , , . :
-
????
PROFIT