ANYKS RechtschreibprĂĽfung

Bild



Hallo, dies ist mein dritter Artikel über Habré. Zuvor habe ich einen Artikel über das ALM-Sprachmodell geschrieben . Jetzt möchte ich Ihnen das ASC- Tippfehlerkorrektursystem vorstellen (implementiert auf Basis von ALM ).



Ja, es gibt eine große Anzahl von Systemen zur Korrektur von Tippfehlern, alle haben ihre eigenen Stärken und Schwächen. Aus offenen Systemen kann ich eines der vielversprechendsten JamSpell herausgreifen , und wir werden es vergleichen. Es gibt auch ein ähnliches System von DeepPavlov , über das viele nachdenken könnten, aber ich habe mich nie damit angefreundet.



Funktionsliste:



  1. Korrektur von Wortfehlern mit einer Differenz von bis zu 4 Levenshtein-Abständen.
  2. Korrektur von Tippfehlern in Wörtern (Einfügen, Löschen, Ersetzen, Neuanordnen) von Zeichen.
  3. Ication fikation angesichts des Kontextes.
  4. Setzen Sie den Fall des ersten Buchstabens des Wortes fĂĽr (Eigennamen und Titel) unter BerĂĽcksichtigung des Kontexts.
  5. Aufteilen der kombinierten Wörter in separate Wörter unter Berücksichtigung des Kontexts.
  6. FĂĽhrt eine Textanalyse durch, ohne den Originaltext zu korrigieren.
  7. Suchen Sie im Text nach Präsenz (Fehler, Tippfehler, falscher Kontext).




UnterstĂĽtzte Betriebssysteme:



  • Mac OS X
  • FreeBSD
  • Linux


Das System ist in C ++ 11 geschrieben, es gibt einen Port fĂĽr Python3



Fertige Wörterbücher



Name Größe (GB) RAM (GB) Größe N-Gramm Sprache
wittenbell-3-big.asc 1,97 15.6 3 RU
wittenbell-3-middle.asc 1.24 9.7 3 RU
mkneserney-3-middle.asc 1,33 9.7 3 RU
wittenbell-3-single.asc 0,772 5.14 3 RU
wittenbell-5-single.asc 1,37 10.7 fĂĽnf RU


Testen



Zum Testen des Systems wurden Daten aus dem Dialog21- Wettbewerb "Tippfehlerkorrektur" 2016 verwendet  . Zum Testen wurde ein trainiertes Binärwörterbuch verwendet:  wittenbell-3-middle.asc

Test durchgeführt Präzision Erinnern FMeasure
Tippfehlerkorrekturmodus 76,97 62,71 69.11
Fehlerkorrekturmodus 73,72 60,53 66,48


Ich denke, es ist nicht notwendig, andere Daten hinzuzufĂĽgen, falls gewĂĽnscht, kann jeder den Test wiederholen. Ich fĂĽge alle Materialien hinzu, die beim Testen unten verwendet werden.



Beim Testen verwendete Materialien







Nun ist es interessant, die Funktionsweise der Systeme zur Korrektur von Tippfehlern selbst unter gleichen Bedingungen zu vergleichen. Wir werden zwei verschiedene Tippfehler mit denselben Textdaten trainieren und einen Test durchfĂĽhren.



Nehmen wir zum Vergleich das oben erwähnte Tippfehlerkorrektursystem JamSpell .



ASC gegen JamSpell



Installation
ASC



$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make


JamSpell



$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make




Ausbildung
ASC



train.json



{
  "ext": "txt",
  "size": 3,
  "alter": {"":""},
  "debug": 1,
  "threads": 0,
  "method": "train",
  "allow-unk": true,
  "reset-unk": true,
  "confidence": true,
  "interpolate": true,
  "mixed-dicts": true,
  "only-token-words": true,
  "locale": "en_US.UTF-8",
  "smoothing": "wittenbell",
  "pilots": ["","","","","","","","","","","a","i","o","e","g"],
  "corpus": "./texts/correct.txt",
  "w-bin": "./dictionary/3-middle.asc",
  "w-vocab": "./train/lm.vocab",
  "w-arpa": "./train/lm.arpa",
  "mix-restwords": "./similars/letters.txt",
  "alphabet": "abcdefghijklmnopqrstuvwxyz",
  "bin-code": "ru",
  "bin-name": "Russian",
  "bin-author": "You name",
  "bin-copyright": "You company LLC",
  "bin-contacts": "site: https://example.com, e-mail: info@example.com",
  "bin-lictype": "MIT",
  "bin-lictext": "... License text ...",
  "embedding-size": 28,
  "embedding": {
      "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
      "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
      "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
      "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
      "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
      "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
      "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
      "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
      "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
      "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
      "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
      "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
      "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
      "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
  }
}


$ ./asc -r-json ./train.json


Python3



import asc

asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)

asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")

asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)

asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)

asc.buildArpa(statusArpa1)

asc.writeArpa("./train/lm.arpa", statusArpa2)

asc.writeVocab("./train/lm.vocab", statusVocab)

asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")

asc.setEmbedding({
     "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
     "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
     "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
     "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
     "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
     "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
     "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
     "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
     "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
     "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
     "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
     "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
     "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
     "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)


JamSpell



$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin




Testen
ASC



spell.json



{
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "spell-verbose": true,
    "confidence": true,
    "mixed-dicts": true,
    "asc-split": true,
    "asc-alter": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "asc-wordrep": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "r-bin": "./dictionary/3-middle.asc"
}


$ ./asc -r-json ./spell.json


Python3



import asc

asc.setAlmV2()
asc.setThreads(0)

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)

def status(text, status):
    print(text, status)

asc.loadIndex("./dictionary/3-middle.asc", "", status)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()


JamSpell



- Python , C++



#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>

//   BOOST
#ifdef USE_BOOST_CONVERT
	#include <boost/locale/encoding_utf.hpp>
//     
#else
	#include <codecvt>
#endif

using namespace std;

/**
 * convert    utf-8  
 * @param  str  utf-8  
 * @return      
 */
const string convert(const wstring & str){
	//   
	string result = "";
	//   
	if(!str.empty()){
//   BOOST
#ifdef USE_BOOST_CONVERT
		//  
		using boost::locale::conv::utf_to_utf;
		//    utf-8 
		result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
//     
#else
		//     UTF-8
		using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
		//  
		wstring_convert <convert_type, wchar_t> conv;
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		//    utf-8 
		result = conv.to_bytes(str);
#endif
	}
	//  
	return result;
}

/**
 * convert      utf-8
 * @param  str   
 * @return       utf-8
 */
const wstring convert(const string & str){
	//   
	wstring result = L"";
	//   
	if(!str.empty()){
//   BOOST
#ifdef USE_BOOST_CONVERT
		//  
		using boost::locale::conv::utf_to_utf;
		//    utf-8 
		result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
//     
#else
		//  
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
		//    utf-8 
		result = conv.from_bytes(str);
#endif
	}
	//  
	return result;
}

/**
 * safeGetline     
 * @param  is  
 * @param  t     
 * @return     
 */
istream & safeGetline(istream & is, string & t){
	//  
	t.clear();

	istream::sentry se(is, true);
	streambuf * sb = is.rdbuf();

	for(;;){
		int c = sb->sbumpc();
		switch(c){
 			case '\n': return is;
			case '\r':
				if(sb->sgetc() == '\n') sb->sbumpc();
				return is;
			case streambuf::traits_type::eof():
				if(t.empty()) is.setstate(ios::eofbit);
				return is;
			default: t += (char) c;
		}
	}
}

/**
* main   
*/
int main(){
	//  
	NJamSpell::TSpellCorrector corrector;
	//   
	corrector.LoadLangModel("model.bin");
	//    
	ifstream file1("./test_data/test.txt", ios::in);
	//   
	if(file1.is_open()){
		//    
		string line = "", res = "";
		//    
		ofstream file2("./test_data/output.txt", ios::out);
		//   
		if(file2.is_open()){
			//       
			while(file1.good()){
				//    
				safeGetline(file1, line);
				//   ,  
				if(!line.empty()){
					//   
					res = convert(corrector.FixFragment(convert(line)));
					//   ,    
					if(!res.empty()){
						//   
						res.append("\n");
						//    
						file2.write(res.c_str(), res.size());
					}
				}
			}
			//  
			file2.close();
		}
		//  
		file1.close();
	}
    return 0;
}






$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test

$ ./bin/test




Ergebnisse



Ergebnisse erzielen
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt




ASC

Präzision Erinnern FMeasure
92.13 82,51 87.05


JamSpell

Präzision Erinnern FMeasure
77,87 63,36 69,87


Eines der Hauptmerkmale von ASC ist das Lernen aus schmutzigen Daten. Es ist praktisch unmöglich, Textkorpora ohne Fehler und Tippfehler im Open Access zu finden. Es reicht nicht aus, um Terabytes an Daten von Hand zu reparieren, aber Sie müssen irgendwie damit arbeiten.



Das Lehrprinzip, das ich anbiete



  1. Zusammenstellen eines Sprachmodells unter Verwendung schmutziger Daten
  2. Wir entfernen alle seltenen Wörter und N-Gramm im zusammengesetzten Sprachmodell
  3. Wir fügen einzelne Wörter für eine korrektere Bedienung des Tippfehlerkorrektursystems hinzu.
  4. Ein binäres Wörterbuch zusammenstellen


Lass uns anfangen



Angenommen, wir haben mehrere Korpusse verschiedener Fächer. Es ist logischer, sie separat zu trainieren und dann zu kombinieren.



Zusammenbau des Chassis mit ALM
collect.json



{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"ext": "txt",
	"method": "train",
	"allow-unk": true,
	"mixed-dicts": true,
	"only-token-words": true,
	"smoothing": "wittenbell",
	"locale": "en_US.UTF-8",
	"w-abbr": "./output/alm.abbr",
	"w-map": "./output/alm.map",
	"w-vocab": "./output/alm.vocab",
	"w-words": "./output/words.txt",
	"corpus": "./texts/corpus",
	"abbrs": "./abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./collect.json


  • size — N- 3
  • debug —
  • threads —
  • ext —
  • allow-unk — 〈unk〉
  • mixed-dicts —
  • only-token-words — N- —
  • smoothing — wittenbell ( , - )
  • locale — ( )
  • w-abbr —
  • w-map —
  • w-vocab —
  • w-words — ( )
  • corpus —
  • abbrs — , , (, , ...)
  • goodwords —
  • badwords —
  • mix-restwords —
  • alphabet — ( )


Python

import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#         
alm.setOption(alm.options_t.mixDicts)
#    N- —       
alm.setOption(alm.options_t.tokenWords)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#   ,  ,   (, ,  ...)
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

#    
alm.collectCorpus("./texts/corpus", status)
#      
alm.writeWords("./output/words.txt", statusWords)
#   
alm.writeVocab("./output/alm.vocab", statusVocab)
#    
alm.writeMap("./output/alm.map", statusMap)
#      
alm.writeSuffix("./output/alm.abbr", statusSuffix)


,



Zusammengebauter Rumpfschnitt mit ALM
prune.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "vprune",
    "vprune-wltf": -15.0,
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./corpus1/alm.map",
    "r-vocab": "./corpus1/alm.vocab",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./prune.json


  • size — N- 3
  • debug —
  • allow-unk — 〈unk〉
  • vprune-wltf — - (, — )
  • locale — ( )
  • smoothing — wittenbell ( , - )
  • r-map —
  • r-vocab —
  • w-map —
  • w-vocab —
  • goodwords —
  • badwords —
  • alphabet — ( )


Python



import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")

#    <unk>   
alm.setOption(alm.options_t.allowUnk)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

#  
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
#    
alm.readMap("./corpus1/alm.map", statusReadMap)
#   
alm.pruneVocab(-15.0, 0, 0, statusPrune)
#   
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#    
alm.writeMap("./output/alm.map", statusWriteMap)




Kombinierte Daten mit ALM kombinieren
merge.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "merge",
    "mixed-dicts": "true",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-words": "./texts/words",
    "r-map": "./corpus1",
    "r-vocab": "./corpus1",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "mix-restwords": "./texts/similars/letters.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./merge.json


  • size — N- 3
  • debug —
  • allow-unk — 〈unk〉
  • mixed-dicts —
  • locale — ( )
  • smoothing — wittenbell ( , - )
  • r-words —
  • r-map — ,
  • r-vocab — ,
  • w-map —
  • w-vocab —
  • goodwords —
  • badwords —
  • alphabet — ( )


Python



import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#         
alm.setOption(alm.options_t.mixDicts)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

#         
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

#   
alm.readVocab("./corpus1", statusReadVocab)
#    
alm.readMap("./corpus1", statusReadMap)
#   
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#    
alm.writeMap("./output/alm.map", statusWriteMap)




Mit ALM ein Sprachmodell lernen
train.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "reset-unk": true,
    "interpolate": true,
    "method": "train",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./output/alm.map",
    "r-vocab": "./output/alm.vocab",
    "w-arpa": "./output/alm.arpa",
    "w-words": "./output/words.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./train.json


  • size — N- 3
  • debug —
  • allow-unk — 〈unk〉
  • reset-unk — , 〈unk〉
  • interpolate —
  • locale — ( )
  • smoothing — wittenbell
  • r-map — ,
  • r-vocab — ,
  • w-arpa — ARPA,
  • w-words — , ( )
  • alphabet — ( )


Python



import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#      <unk>   
alm.setOption(alm.options_t.resetUnk)
#         
alm.setOption(alm.options_t.mixDicts)
#     
alm.setOption(alm.options_t.interpolate)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

#   
alm.readVocab("./output/alm.vocab", statusReadVocab)
#    
alm.readMap("./output/alm.map", statusReadMap)

#     
alm.buildArpa(statusBuildArpa)

#       ARPA
alm.writeArpa("./output/alm.arpa", statusWriteArpa)

#   
alm.writeWords("./output/words.txt", statusWords)




RechtschreibprĂĽfung ASC-Training
train.json



{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"confidence": true,
	"mixed-dicts": true,
	"method": "train",
	"alter": {"":""},
	"locale": "en_US.UTF-8",
	"smoothing": "wittenbell",
	"pilots": ["","","","","","","","","","","a","i","o","e","g"],
	"w-bin": "./dictionary/3-single.asc",
	"r-abbr": "./output/alm.abbr",
	"r-vocab": "./output/alm.vocab",
	"r-arpa": "./output/alm.arpa",
	"abbrs": "./texts/abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"alters": "./texts/alters/yoficator.txt",
	"upwords": "./texts/words/upp",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz",
	"bin-code": "ru",
	"bin-name": "Russian",
	"bin-author": "You name",
	"bin-copyright": "You company LLC",
	"bin-contacts": "site: https://example.com, e-mail: info@example.com",
	"bin-lictype": "MIT",
	"bin-lictext": "... License text ...",
	"embedding-size": 28,
	"embedding": {
	    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
	    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
	    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
	    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
	    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
	    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
	    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
	    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
	    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
	    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
	    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
	    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
	    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
	    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
	}
}


$ ./asc -r-json ./train.json


  • size — N- 3
  • debug —
  • threads —
  • confidence — ARPA - ,
  • mixed-dicts —
  • alter — ( , , — «»)
  • locale — ( )
  • smoothing — wittenbell ( , - )
  • pilots — ( )
  • w-bin —
  • r-abbr — ,
  • r-vocab — ,
  • r-arpa — ARPA,
  • abbrs — , , (, , ...)
  • goodwords —
  • badwords —
  • alters — , ( )
  • upwords — , (, , ...)
  • mix-restwords —
  • alphabet — ( )
  • bin-code —
  • bin-name —
  • bin-author —
  • bin-copyright —
  • bin-contacts —
  • bin-lictype —
  • bin-lictext —
  • embedding-size —
  • embedding — ( , )


Python



import asc

#   N-  3
asc.setSize(3)
#      
asc.setThreads(0)
#    (  )
asc.setLocale("en_US.UTF-8")

#        
asc.setOption(asc.options_t.uppers)
#    <unk>   
asc.setOption(asc.options_t.allowUnk)
#      <unk>   
asc.setOption(asc.options_t.resetUnk)
#         
asc.setOption(asc.options_t.mixDicts)
#     ARPA -  ,  
asc.setOption(asc.options_t.confidence)

#      (        )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     (     )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#     
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#       
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

#       
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

#     
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

#    ,   (, ,  ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

#     ,       (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

#   
asc.addAlt("", "")

#       ,     (        )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

#        ARPA
asc.readArpa("./output/alm.arpa", statusArpa)
#   
asc.readVocab("./output/alm.vocab", statusVocab)

#     
asc.setCode("RU")
#    
asc.setLictype("MIT")
#   
asc.setName("Russian")
#    
asc.setAuthor("You name")
#   
asc.setCopyright("You company LLC")
#    
asc.setLictext("... License text ...")
#     
asc.setContacts("site: https://example.com, e-mail: info@example.com")

#      ( ,     )
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

#     
asc.buildIndex(statusBuildIndex)

#     
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)




Ich verstehe, dass nicht jede Person in der Lage sein wird, ihr eigenes binäres Vokabular zu trainieren. Dies erfordert Textkorpora und erhebliche Rechenressourcen. Daher kann der ASC nur mit einer ARPA- Datei als Hauptwörterbuch arbeiten.



Beispiel der Arbeit
spell.json



{
    "ad": 13,
    "cw": 38120,
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "alter": {"":""},
    "asc-split": true,
    "asc-alter": true,
    "confidence": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "mixed-dicts": true,
    "asc-wordrep": true,
    "spell-verbose": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "upwords": "./texts/words/upp",
    "r-arpa": "./dictionary/alm.arpa",
    "r-abbr": "./dictionary/alm.abbr",
    "abbrs": "./texts/abbrs/abbrs.txt",
    "alters": "./texts/alters/yoficator.txt",
    "mix-restwords": "./similars/letters.txt",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "pilots": ["","","","","","","","","","","a","i","o","e","g"],
    "alphabet": "abcdefghijklmnopqrstuvwxyz",
    "embedding-size": 28,
    "embedding": {
        "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
        "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
        "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
        "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
        "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
        "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
        "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
        "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
        "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
        "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
        "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
        "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
        "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
        "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
    }
}


$ ./asc -r-json ./spell.json


Python



import asc

#      
asc.setThreads(0)
#        
asc.setOption(asc.options_t.uppers)
#   
asc.setOption(asc.options_t.ascSplit)
#   
asc.setOption(asc.options_t.ascAlter)
#      
asc.setOption(asc.options_t.ascESplit)
#      
asc.setOption(asc.options_t.ascRSplit)
#     
asc.setOption(asc.options_t.ascUppers)
#     
asc.setOption(asc.options_t.ascHyphen)
#    
asc.setOption(asc.options_t.ascWordRep)
#         
asc.setOption(asc.options_t.mixDicts)
#     ARPA -  ,  
asc.setOption(asc.options_t.confidence)

#      (        )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     (     )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#     
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#       
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

#       
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

#     
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

#    ,   (, ,  ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

#     ,       (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

#   
asc.addAlt("", "")

#       ,     (        )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

#        ARPA
asc.readArpa("./dictionary/alm.arpa", statusArpa)

#    (38120      13    )
asc.setAdCw(38120, 13)

#      ( ,     )
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

#     
asc.buildIndex(statusIndex)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()




PS Für diejenigen, die überhaupt nichts sammeln und trainieren möchten, habe ich die Webversion von ASC aufgerufen . Es sollte auch berücksichtigt werden, dass das System zur Korrektur von Tippfehlern kein allwissendes System ist und es unmöglich ist, die gesamte russische Sprache dort zu füttern. ASC korrigiert keine Texte, es ist notwendig, für jedes Thema separat zu trainieren.



All Articles