Hello Habr! Today's post is the final part of the series on clustering and classifying big text data with machine learning in Java. It continues the first and second articles.
This article describes the system architecture, the algorithm, and the visual results. All details of the theory and the algorithms can be found in the first two articles.
The system architecture can be divided into two main parts: the web application and the data clustering and classification software.
The algorithm of the machine learning software consists of 3 main parts:
Natural language processing:
    tokenization;
    lemmatization;
    stop listing;
    word frequency;
Clustering methods:
    TF-IDF;
    SVD;
    finding cluster groups;
Classification methods - Aylien API.
Natural language processing
The algorithm starts by reading the text data. Because our system is an electronic library, most of the books are in PDF format. The implementation and details of the NLP processing are described here.
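As a rough illustration of the preprocessing steps listed in the outline (tokenization, stop listing, word frequency), here is a minimal sketch in plain Java. It is not the code from the earlier articles: the class name `TextPreprocessingSketch`, the tiny stop list, and the rule of dropping one-letter tokens are all assumptions made for this example, and lemmatization is left out because it needs an NLP library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not the article's implementation. */
final class TextPreprocessingSketch {

    // Tiny illustrative stop list (assumption); a real list would be much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "and", "is", "to", "in");

    /** Lower-cases the text, splits on non-letters, drops stop words and 1-letter tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (raw.length() > 1 && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Counts how often each token occurs, keeping first-seen order. */
    static Map<String, Integer> wordFrequencies(List<String> tokens) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String token : tokens) {
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }
}
```

Calling `TextPreprocessingSketch.wordFrequencies(TextPreprocessingSketch.tokenize(text))` on the text extracted from a PDF would give the per-document term counts that the later TF-IDF step builds on.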
:
: 4173415 : 88547 : 82294
Sample terms after lemmatization:
characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
The same terms after stemming:
character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
tf-idf . HashMap, - , - -.
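For readers who want to see how TF-IDF weights can be kept in a `HashMap`, here is a minimal sketch. It assumes the common definitions tf = (term count) / (document length) and idf = ln(N / df); the article's own weighting variant may differ, and the class name `TfIdfSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; the weighting scheme is an assumption. */
final class TfIdfSketch {

    /** Returns one term -> weight map per document, in the same order as the input. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents does each term appear?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts for this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

With this definition, a term that occurs in every document gets weight 0, while a term concentrated in a few documents gets a higher weight.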
-:
, , tf-idf. :
-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997
SVD .
, . – , . OrientDB , OrientDB . OrientDB , , , . . .
, .
– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .
max(D) ‒ , . n -
, . – , –
, . 4-. ( > nt)
N‒ - , S ‒ .
, .
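The clustering step mentions DBSCAN with a radius of r=0.007. A minimal sketch of plain DBSCAN over points such as the three-column rows shown earlier might look like the following. Assumptions are labeled: the class name `DbscanSketch` is hypothetical, the Euclidean distance and the `minPts` parameter are choices made for this example, and this is textbook DBSCAN rather than the modified variant the article appears to describe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper for illustration only; textbook DBSCAN, not the article's variant. */
final class DbscanSketch {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Returns a cluster id (0, 1, ...) per point, or NOISE for outliers. */
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> neighbours = regionQuery(points, i, eps);
            if (neighbours.size() < minPts) {   // not a core point (count includes the point itself)
                labels[i] = NOISE;
                continue;
            }
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbours);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = clusterId;  // border point joins the cluster
                if (labels[j] != UNVISITED) continue;
                labels[j] = clusterId;
                List<Integer> next = regionQuery(points, j, eps);
                if (next.size() >= minPts) queue.addAll(next);   // expand only from core points
            }
            clusterId++;
        }
        return labels;
    }

    /** Indices of all points within eps of points[idx], including idx itself. */
    static List<Integer> regionQuery(double[][] points, int idx, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < points.length; k++) {
            if (distance(points[idx], points[k]) <= eps) result.add(k);
        }
        return result;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

A call such as `DbscanSketch.cluster(points, 0.007, 3)` uses the radius from the text; the value 3 for `minPts` is only a placeholder. The brute-force neighbour search is quadratic in the number of points, so a spatial index would be needed for large collections.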
Classification methods - Aylien API
Aylien API . API json , . API . 9 , . POST API:
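// Assumes `database` (OrientDB session), `client` (Aylien TextAPIClient), `cluster`,
// `keywords` (a collection of String) and `clusterUpdate` are initialised elsewhere.
// Needs com.orientechnologies.orient.core.sql.executor.OResult / OResultSet
// and the com.aylien.textapi classes used below.

// Fetch the text of every document in the cluster; the positional parameter
// avoids building the SQL string by concatenation.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Read the property directly instead of parsing result.toString().
        String textDoc = result.getProperty("DocText");
        if (textDoc != null) {
            keywords.add(textDoc.toLowerCase().replaceAll("\\n", ""));
        }
    }
}

// Send the combined cluster text to Aylien and classify it against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder classifyByTaxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
classifyByTaxonomyBuilder.setText(String.join(" ", keywords));
classifyByTaxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomyBuilder.build());

// Keep the returned category labels as the labels of the cluster.
for (TaxonomyCategory c : response.getCategories()) {
    clusterUpdate.add(c.getLabel());
}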
GET, :
. .
. . , . . , . , :
Web application
- – . , . - , . Vaadin Flow:
:
, .
.
-.
, , , , -.
.
“Technology & Computing”:
:
:
, . . , , . . . . : .
, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..
, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .
Aylien API, . , 100 . , , , k-, . , .