Hello, this is my third article on Habr. Earlier I wrote an article about the ALM language model. Now I would like to introduce ASC, a typo correction system (implemented on top of ALM).
Yes, there are plenty of typo correction systems, and each has its own strengths and weaknesses. Among the open-source systems I would single out JamSpell as one of the most promising, and that is the one we will compare against. There is also a similar system from DeepPavlov, which many readers may think of, but I never managed to get along with it.
Feature list:
- Correction of errors in words with a Levenshtein distance of up to 4.
- Correction of typos in words (insertion, deletion, substitution, transposition of characters).
- Yofication (restoring the letter «ё») taking the context into account.
- Setting the case of the first letter of a word (for proper names and titles) taking the context into account.
- Splitting merged words into separate words taking the context into account.
- Analysing a text without correcting the original text.
- Searching a text for the presence of errors, typos, and wrong context.
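To make the first two features concrete, here is a minimal sketch of an edit distance that counts exactly the four operations listed above (insertion, deletion, substitution, transposition of adjacent characters). It is the textbook "optimal string alignment" variant of the Damerau-Levenshtein distance and only illustrates the idea; it is not taken from the ASC sources. The function names `osa_distance` and `candidates` are made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def candidates(word, vocabulary, max_distance=4):
    """Vocabulary words within max_distance of word, closest first."""
    scored = [(osa_distance(word, w), w) for w in vocabulary]
    return [w for dist, w in sorted(scored) if dist <= max_distance]


print(candidates("helo", ["hello", "help", "world"], max_distance=1))
```

A real corrector cannot afford to compare a word with the whole vocabulary like this, and it ranks the candidates with a language model instead of by distance alone; the sketch only shows what "distance up to 4" means.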
Supported operating systems:
- Mac OS X
- FreeBSD
- Linux
The system is written in C++11, and there is a port for Python3.
Ready-made dictionaries
| Name | Size (GB) | RAM (GB) | N-gram size | Language |
|---|---|---|---|---|
| wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
| wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
| mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
| wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
| wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
Testing
The system was tested on data from the 2016 Dialog21 "typo correction" competition. A trained binary dictionary was used for testing: wittenbell-3-middle.asc
| Test performed | Precision | Recall | FMeasure |
|---|---|---|---|
| Typo correction mode | 76.97 | 62.71 | 69.11 |
| Error correction mode | 73.72 | 60.53 | 66.48 |
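As a quick sanity check on the table, the FMeasure column can be recomputed from the other two columns. The sketch below assumes that FMeasure is the balanced F1 score (the harmonic mean of precision and recall); the article does not say so explicitly, but the numbers in the table are consistent with that reading.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the two rows of the table above
print(round(f_measure(76.97, 62.71), 2))  # typo correction mode
print(round(f_measure(73.72, 60.53), 2))  # error correction mode
```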
I don't think it is necessary to add other data; anyone who wishes can repeat the test. All the materials used in testing are listed below.
Materials used in testing
- test.txt - text to be tested
- correct.txt - text with the correct variants
- evaluate.py - Python3 script for calculating the correction results
Now it is interesting to compare how the typo correction systems themselves perform under equal conditions. We will train two different typo correctors on the same text data and run the test.
For comparison, let's take the typo correction system mentioned above, JamSpell.
ASC vs JamSpell
Installation
ASC
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
Python3
import asc
asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
def statusArpa1(status):
    print("Build arpa", status)
def statusArpa2(status):
    print("Write arpa", status)
def statusVocab(status):
    print("Write vocab", status)
def statusIndex(text, status):
    print(text, status)
def status(text, status):
    print(text, status)
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
asc.buildArpa(statusArpa1)
asc.writeArpa("./train/lm.arpa", statusArpa2)
asc.writeVocab("./train/lm.vocab", statusVocab)
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc
asc.setAlmV2()
asc.setThreads(0)
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)
def status(text, status):
    print(text, status)
asc.loadIndex("./dictionary/3-middle.asc", "", status)
with open('./texts/test.txt') as f1, open('./texts/output.txt', 'w') as f2:
    for line in f1.readlines():
        res = asc.spell(line)
        f2.write("%s\n" % res[0])
JamSpell
- Python , C++
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <jamspell/spell_corrector.hpp>
// Use BOOST for the conversion when requested
#ifdef USE_BOOST_CONVERT
    #include <boost/locale/encoding_utf.hpp>
// Otherwise use the standard library
#else
    #include <codecvt>
#endif
using namespace std;
/**
 * convert Converts a wide string to a UTF-8 string
 * @param str wide string to convert
 * @return    UTF-8 encoded string
 */
const string convert(const wstring & str){
    // Conversion result
    string result = "";
    // Convert only non-empty strings
    if(!str.empty()){
// Conversion with BOOST
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert to UTF-8
        result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
// Conversion with the standard library
#else
        // Converter type for UTF-8
        using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
        wstring_convert <convert_type, wchar_t> conv;
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        // Convert to UTF-8
        result = conv.to_bytes(str);
#endif
    }
    return result;
}
/**
 * convert Converts a UTF-8 string to a wide string
 * @param str UTF-8 encoded string to convert
 * @return    wide string
 */
const wstring convert(const string & str){
    // Conversion result
    wstring result = L"";
    // Convert only non-empty strings
    if(!str.empty()){
// Conversion with BOOST
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert from UTF-8
        result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
// Conversion with the standard library
#else
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
        // Convert from UTF-8
        result = conv.from_bytes(str);
#endif
    }
    return result;
}
/**
 * safeGetline Reads one line, accepting \n, \r\n and \r as line endings
 * @param is input stream to read from
 * @param t  string that receives the line
 * @return   the input stream
 */
istream & safeGetline(istream & is, string & t){
    // Clear the output string
    t.clear();
    istream::sentry se(is, true);
    streambuf * sb = is.rdbuf();
    for(;;){
        int c = sb->sbumpc();
        switch(c){
            case '\n': return is;
            case '\r':
                if(sb->sgetc() == '\n') sb->sbumpc();
                return is;
            case streambuf::traits_type::eof():
                if(t.empty()) is.setstate(ios::eofbit);
                return is;
            default: t += (char) c;
        }
    }
}
/**
 * main Corrects test.txt line by line and writes the result to output.txt
 */
int main(){
    // Create the spell corrector
    NJamSpell::TSpellCorrector corrector;
    // Load the trained language model
    corrector.LoadLangModel("model.bin");
    // Open the file with the text to test
    ifstream file1("./test_data/test.txt", ios::in);
    if(file1.is_open()){
        string line = "", res = "";
        // Open the file for the corrected text
        ofstream file2("./test_data/output.txt", ios::out);
        if(file2.is_open()){
            // Read the input file line by line
            while(file1.good()){
                safeGetline(file1, line);
                // Skip empty lines
                if(!line.empty()){
                    // Correct the line
                    res = convert(corrector.FixFragment(convert(line)));
                    // Write the result if it is not empty
                    if(!res.empty()){
                        res.append("\n");
                        file2.write(res.c_str(), res.size());
                    }
                }
            }
            file2.close();
        }
        file1.close();
    }
    return 0;
}
$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test
$ ./bin/test
Results
Obtaining the results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
| Precision | Recall | FMeasure |
|---|---|---|
| 92.13 | 82.51 | 87.05 |
JamSpell
| Precision | Recall | FMeasure |
|---|---|---|
| 77.87 | 63.36 | 69.87 |
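For readers who want a feel for how precision and recall of a corrector can be counted, here is a simplified sketch. It is not the evaluate.py script used above: it assumes that the source text, the reference text and the corrector output are already split into the same number of words, aligned one to one, so it cannot handle merged or split words. The function name `evaluate` is chosen for this example only.

```python
def evaluate(source, reference, output):
    """Word-level precision and recall (in percent) over aligned word lists."""
    needed = made = good = 0
    for src, ref, out in zip(source, reference, output):
        if src != ref:
            needed += 1   # the source word contained an error
        if src != out:
            made += 1     # the corrector changed the word
            if out == ref:
                good += 1  # and the change matches the reference
    precision = 100.0 * good / made if made else 0.0
    recall = 100.0 * good / needed if needed else 0.0
    return precision, recall


source = "i have a dreem abot speling".split()
reference = "i have a dream about spelling".split()
output = "i has a dream abot spelling".split()
print(evaluate(source, reference, output))
```

In this toy example the corrector changed three words, two of them correctly, and three words needed fixing, so both precision and recall come out at two thirds.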
One of the main features of ASC is learning from dirty data. It is practically impossible to find text corpora without errors and typos in open access. Fixing terabytes of data by hand is not feasible, yet you have to work with it somehow.
The training principle I propose
- Build a language model from the dirty data
- Remove all rare words and N-grams from the built language model
- Add individual words so that the typo correction system works more correctly
- Build the binary dictionary
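The first three steps can be illustrated with a toy sketch. It counts N-grams in a noisy corpus, drops everything that occurs fewer than `min_count` times (on the assumption that a typo is rarer than the correct spelling), and then adds hand-picked words back. ALM itself prunes with its own threshold (`vprune-wltf` in the configuration below), so the raw count threshold and the function names here are stand-ins chosen for illustration.

```python
from collections import Counter


def count_ngrams(sentences, size=3):
    """Count all N-grams of order 1..size, with sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for n in range(1, size + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=2):
    """Keep only the N-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


def add_words(counts, words):
    """Make sure each given word is present as a unigram."""
    for word in words:
        counts.setdefault((word,), 1)


sentences = ["the cat sat", "the cat slept", "teh cat sat"]
counts = count_ngrams(sentences, size=3)
pruned = prune(counts, min_count=2)
add_words(pruned, ["slept"])
print(("teh",) in pruned, pruned[("the", "cat")], pruned[("slept",)])
```

The misspelling "teh" disappears because it is rare, while the rare but correct word "slept" is restored from the word list.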
Let's get started
Suppose we have several corpora on different topics. It is more logical to train on them separately and then merge them.
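The reason this works is that N-gram counts are additive: counts collected on separate corpora can simply be summed afterwards. A tiny sketch with made-up counts (the names `merge_counts`, `news` and `geography` are hypothetical; ALM performs the merge on its own map and vocabulary files, as shown further below):

```python
from collections import Counter


def merge_counts(*per_corpus):
    """Sum the N-gram counts collected on separate corpora."""
    total = Counter()
    for counts in per_corpus:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total


news = Counter({("bank",): 5, ("river",): 1})
geography = Counter({("river",): 4, ("bank",): 2})
merged = merge_counts(news, geography)
print(merged[("bank",)], merged[("river",)])
```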
Building the corpus with ALM
collect.json
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
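- size - N-gram size (3)
- debug - debug mode
- threads - number of threads
- ext - extension of the corpus text files
- allow-unk - allow the 〈unk〉 token
- mixed-dicts - enable the mixed dictionaries option
- only-token-words - build N-grams from word tokens only
- smoothing - smoothing algorithm (wittenbell)
- locale - locale
- w-abbr - file the collected suffixes are written to
- w-map - file the map is written to
- w-vocab - file the vocabulary is written to
- w-words - file the collected words are written to
- corpus - path to the text corpus
- abbrs - file with the list of abbreviations
- goodwords - file with the whitelist of words
- badwords - file with the blacklist of words
- mix-restwords - file with similar-looking letters
- alphabet - alphabet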
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Set the number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Enable the mixDicts option
alm.setOption(alm.options_t.mixDicts)
# Enable the tokenWords option
alm.setOption(alm.options_t.tokenWords)
# Initialise with the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the abbreviations
with open('./abbrs/abbrs.txt') as f:
    for abbr in f.readlines():
        abbr = abbr.replace("\n", "")
        alm.addAbbr(abbr)
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addGoodword(word)
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addBadword(word)
def status(text, status):
    print(text, status)
def statusWords(status):
    print("Write words", status)
def statusVocab(status):
    print("Write vocab", status)
def statusMap(status):
    print("Write map", status)
def statusSuffix(status):
    print("Write suffix", status)
# Collect the corpus data
alm.collectCorpus("./texts/corpus", status)
# Write the collected words
alm.writeWords("./output/words.txt", statusWords)
# Write the vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Write the map
alm.writeMap("./output/alm.map", statusMap)
# Write the suffixes
alm.writeSuffix("./output/alm.abbr", statusSuffix)
Pruning the built corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
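- size - N-gram size (3)
- debug - debug mode
- allow-unk - allow the 〈unk〉 token
- vprune-wltf - pruning threshold for the vocabulary (-15.0 here)
- locale - locale
- smoothing - smoothing algorithm (wittenbell)
- r-map - map file to read
- r-vocab - vocabulary file to read
- w-map - file the pruned map is written to
- w-vocab - file the pruned vocabulary is written to
- goodwords - file with the whitelist of words
- badwords - file with the blacklist of words
- alphabet - alphabet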
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Set the number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Initialise with the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addGoodword(word)
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addBadword(word)
def statusPrune(status):
    print("Prune data", status)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read the vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Read the map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune the vocabulary
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Write the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the pruned map
alm.writeMap("./output/alm.map", statusWriteMap)
Merging the collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": "true",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
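- size - N-gram size (3)
- debug - debug mode
- allow-unk - allow the 〈unk〉 token
- mixed-dicts - enable the mixed dictionaries option
- locale - locale
- smoothing - smoothing algorithm (wittenbell)
- r-words - path to the word lists to add
- r-map - directory with the map files to merge
- r-vocab - directory with the vocabulary files to merge
- w-map - file the merged map is written to
- w-vocab - file the merged vocabulary is written to
- goodwords - file with the whitelist of words
- badwords - file with the blacklist of words
- alphabet - alphabet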
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Set the number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Enable the mixDicts option
alm.setOption(alm.options_t.mixDicts)
# Initialise with the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addGoodword(word)
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addBadword(word)
# Load the individual words to add
with open('./texts/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        alm.addWord(word)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read the vocabulary files
alm.readVocab("./corpus1", statusReadVocab)
# Read the map files
alm.readMap("./corpus1", statusReadMap)
# Write the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the merged map
alm.writeMap("./output/alm.map", statusWriteMap)
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
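- size - N-gram size (3)
- debug - debug mode
- allow-unk - allow the 〈unk〉 token
- reset-unk - enable the reset option for the 〈unk〉 token
- interpolate - use interpolation
- locale - locale
- smoothing - smoothing algorithm (wittenbell)
- r-map - map file to read
- r-vocab - vocabulary file to read
- w-arpa - file the ARPA language model is written to
- w-words - file the words are written to
- alphabet - alphabet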
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Set the number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Enable the resetUnk option for the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Enable the mixDicts option
alm.setOption(alm.options_t.mixDicts)
# Enable interpolation
alm.setOption(alm.options_t.interpolate)
# Initialise with the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusBuildArpa(status):
    print("Build ARPA", status)
def statusWriteMap(status):
    print("Write map", status)
def statusWriteArpa(status):
    print("Write ARPA", status)
def statusWords(status):
    print("Write words", status)
# Read the vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Read the map
alm.readMap("./output/alm.map", statusReadMap)
# Build the language model
alm.buildArpa(statusBuildArpa)
# Write the ARPA file
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
# Write the words
alm.writeWords("./output/words.txt", statusWords)
Training the ASC spell checker
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
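- size - N-gram size (3)
- debug - debug mode
- threads - number of threads
- confidence - enable the confidence option used when loading the ARPA data
- mixed-dicts - enable the mixed dictionaries option
- alter - alternative letters
- locale - locale
- smoothing - smoothing algorithm (wittenbell)
- pilots - list of pilot words
- w-bin - file the binary dictionary is written to
- r-abbr - file with the suffixes collected by ALM
- r-vocab - vocabulary file collected by ALM
- r-arpa - ARPA file built by ALM
- abbrs - file with the list of abbreviations
- goodwords - file with the whitelist of words
- badwords - file with the blacklist of words
- alters - file with alternative word variants
- upwords - directory with the upwords word lists
- mix-restwords - file with similar-looking letters
- alphabet - alphabet
- bin-code - language code of the dictionary
- bin-name - name of the dictionary
- bin-author - author of the dictionary
- bin-copyright - copyright holder of the dictionary
- bin-contacts - contact details of the author
- bin-lictype - license type
- bin-lictext - license text
- embedding-size - embedding size
- embedding - embedding map (character to index)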
Python
import asc
# Set the N-gram size to 3
asc.setSize(3)
# Set the number of threads
asc.setThreads(0)
# Set the locale
asc.setLocale("en_US.UTF-8")
# Enable the uppers option
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token
asc.setOption(asc.options_t.allowUnk)
# Enable the resetUnk option for the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Enable the mixDicts option
asc.setOption(asc.options_t.mixDicts)
# Enable the confidence option used when loading the ARPA data
asc.setOption(asc.options_t.confidence)
# Set the alphabet
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the letter substitutes
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addGoodword(word)
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addBadword(word)
# Load the suffixes collected by ALM
with open('./output/alm.abbr') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addSuffix(word)
# Load the abbreviations
with open('./texts/abbrs/abbrs.txt') as f:
    for abbr in f.readlines():
        abbr = abbr.replace("\n", "")
        asc.addAbbr(abbr)
# Load the upwords word list
with open('./texts/words/upp/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addUWord(word)
# Add the alternative letter
asc.addAlt("", "")
# Load the alternative word variants (tab-separated pairs)
with open('./texts/alters/yoficator.txt') as f:
    for words in f.readlines():
        words = words.replace("\n", "")
        words = words.split('\t')
        asc.addAlt(words[0], words[1])
def statusIndex(text, status):
    print(text, status)
def statusBuildIndex(status):
    print("Build index", status)
def statusArpa(status):
    print("Read arpa", status)
def statusVocab(status):
    print("Read vocab", status)
# Read the ARPA file
asc.readArpa("./output/alm.arpa", statusArpa)
# Read the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)
# Set the dictionary language code
asc.setCode("RU")
# Set the license type
asc.setLictype("MIT")
# Set the dictionary name
asc.setName("Russian")
# Set the dictionary author
asc.setAuthor("You name")
# Set the copyright holder
asc.setCopyright("You company LLC")
# Set the license text
asc.setLictext("... License text ...")
# Set the contact details
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Set the embedding map and its size
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusBuildIndex)
# Save the index as a binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
I understand that not everyone will be able to train their own binary dictionary: it requires text corpora and substantial computing resources. Therefore ASC can also work with just an ARPA file as its main dictionary.
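For readers who have not met the format: an ARPA file is a plain-text language model with one section per N-gram order, each line holding a log10 probability, the N-gram itself and an optional back-off weight. The sketch below parses that general layout into a dictionary. It is only an illustration of the file format under these assumptions, not the loader that ASC uses (`asc.readArpa` in the example that follows), and `read_arpa` and `SAMPLE` are names invented here.

```python
import io


def read_arpa(stream):
    """Parse ARPA text into {order: {ngram_tuple: (log10_prob, backoff)}}."""
    model, order = {}, None
    for raw in stream:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and the header with N-gram totals
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # "\2-grams:" gives 2
            model[order] = {}
            continue
        if order is None:
            continue  # anything before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; treat a missing one as 0.0
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[order][words] = (logprob, backoff)
    return model


SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=2",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\thello\t-0.3",
    "-1.2\tworld",
    "",
    "\\2-grams:",
    "-0.4\thello world",
    "",
    "\\end\\",
])
model = read_arpa(io.StringIO(SAMPLE))
print(model[2][("hello", "world")])
```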
Usage example
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc
# Set the number of threads
asc.setThreads(0)
# Enable the correction options
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
# Enable the mixDicts option
asc.setOption(asc.options_t.mixDicts)
# Enable the confidence option used when loading the ARPA data
asc.setOption(asc.options_t.confidence)
# Set the alphabet
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the letter substitutes
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addGoodword(word)
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addBadword(word)
# Load the suffixes collected by ALM
with open('./output/alm.abbr') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addSuffix(word)
# Load the abbreviations
with open('./texts/abbrs/abbrs.txt') as f:
    for abbr in f.readlines():
        abbr = abbr.replace("\n", "")
        asc.addAbbr(abbr)
# Load the upwords word list
with open('./texts/words/upp/words.txt') as f:
    for word in f.readlines():
        word = word.replace("\n", "")
        asc.addUWord(word)
# Add the alternative letter
asc.addAlt("", "")
# Load the alternative word variants (tab-separated pairs)
with open('./texts/alters/yoficator.txt') as f:
    for words in f.readlines():
        words = words.replace("\n", "")
        words = words.split('\t')
        asc.addAlt(words[0], words[1])
def statusArpa(status):
    print("Read arpa", status)
def statusIndex(status):
    print("Build index", status)
# Read the ARPA file
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# Set the cw and ad values (38120 and 13)
asc.setAdCw(38120, 13)
# Set the embedding map and its size
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusIndex)
# Correct the test text line by line
with open('./texts/test.txt') as f1, open('./texts/output.txt', 'w') as f2:
    for line in f1.readlines():
        res = asc.spell(line)
        f2.write("%s\n" % res[0])
P.S. For those who do not want to build and train anything at all, I have launched a web version of ASC. Keep in mind that a typo correction system is not all-knowing, and it is impossible to feed the entire Russian language into it. ASC will not correct arbitrary texts; it has to be trained separately for each subject area.