Clustering and Classification of Big Text Data with Machine Learning in Java. Article 3 - Architecture / Results

Hello, Habr! Today brings the final part of the series on clustering and classifying big text data with machine learning in Java. This article continues the first and second articles.









This article describes the system architecture, the algorithm, and the visual results. All details of the theory and the algorithms are covered in the first two articles.









The system architecture can be divided into two main parts: the web application, and the data clustering and classification software.









The algorithm of the machine learning software consists of three main parts:





  1. Natural language processing;
    1. Tokenization;
    2. Lemmatization;
    3. Stop listing;
    4. Word frequency;
  2. Clustering methods;
    1. TF-IDF;
    2. SVD;
    3. Finding cluster groups;
  3. Classification methods - Aylien API.





Natural language processing

The algorithm starts by reading the text data. Since our system is a digital library, most of the books are in PDF format. The implementation and the details of the NLP processing can be read here.
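As a rough illustration of the tokenization, stop-listing and word-frequency steps, here is a minimal sketch in plain Java. It is not the implementation from the earlier articles: the class name, the tiny stop list and the letter-based splitting rule are all assumptions made for the example, and lemmatization is left out because it needs a dictionary or an NLP library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class TextPreprocessingSketch {

    // Tiny illustrative stop list; a real one would be far larger.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "is", "to", "in");

    // Tokenization: lower-case the text and split on anything that is not a letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stop listing: drop every token that appears in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Word frequency: count how often each remaining token occurs.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> frequency = new LinkedHashMap<>();
        for (String token : tokens) {
            frequency.merge(token, 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        String text = "The face of a robot is critical to the design of the robot.";
        System.out.println(wordFrequency(removeStopWords(tokenize(text))));
    }
}
```

In a fuller pipeline a lemmatizer or stemmer would sit between the stop-list step and the frequency count, so that forms such as "render" and "rendered" are counted together.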





:





  : 4173415
    : 88547
    : 82294
      
      











, , , . , :





characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
      
      



, :





character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
      
      











tf-idf . HashMap, - , - -.
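To make the HashMap-based tf-idf step more concrete, here is a minimal sketch that stores word weights in a map keyed by word. It uses the textbook definition (term frequency times the natural logarithm of the inverse document frequency); the exact variant used in the earlier articles may differ, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class TfIdfSketch {

    // Term frequency: occurrences of a term divided by the document length.
    static Map<String, Double> termFrequency(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        tf.replaceAll((term, count) -> count / doc.size());
        return tf;
    }

    // Inverse document frequency: ln(N / number of documents containing the term).
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> docs) {
        Map<String, Integer> docCount = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                docCount.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        docCount.forEach((term, n) -> idf.put(term, Math.log((double) docs.size() / n)));
        return idf;
    }

    // tf-idf weight per word of one document: key = word, value = weight.
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> weights = termFrequency(doc);
        weights.replaceAll((term, tf) -> tf * idf.getOrDefault(term, 0.0));
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("robot", "face", "design"),
                List.of("robot", "survey"));
        Map<String, Double> idf = inverseDocumentFrequency(docs);
        System.out.println(tfIdf(docs.get(0), idf));
    }
}
```

A word that occurs in every document, such as "robot" in this toy corpus, gets a weight of zero, which is the behaviour that makes tf-idf useful for separating documents by topic.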





-:





tf-idf:









, , tf-idf. :





-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997

      
      











SVD   .





, .  – , . OrientDB , OrientDB . OrientDB , , , . . .





, .









– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .





r = max (D) / n









   max(D)  ‒ , . n -
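The radius formula r = max(D) / n can be sketched in code, with two loud assumptions, because the definitions of D and n did not survive in this text: here D is taken to be the set of pairwise Euclidean distances between the reduced document vectors, and n is simply a positive divisor supplied by the caller. The grouping step is a simplified, DBSCAN-style stand-in (points within r of each other are joined transitively) and is not claimed to be the author's exact procedure. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical helper class, for illustration only.
final class RadiusGroupingSketch {

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // r = max(D) / n, with D ASSUMED to be all pairwise distances between points.
    static double radius(double[][] points, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        double max = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                max = Math.max(max, distance(points[i], points[j]));
            }
        }
        return max / n;
    }

    // Labels points so that points within distance r of each other
    // end up, transitively, in the same group (breadth-first expansion).
    static int[] group(double[][] points, double r) {
        int[] label = new int[points.length];
        Arrays.fill(label, -1);
        int next = 0;
        for (int start = 0; start < points.length; start++) {
            if (label[start] != -1) {
                continue; // already assigned to a group
            }
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            label[start] = next;
            while (!queue.isEmpty()) {
                int p = queue.poll();
                for (int q = 0; q < points.length; q++) {
                    if (label[q] == -1 && distance(points[p], points[q]) <= r) {
                        label[q] = next;
                        queue.add(q);
                    }
                }
            }
            next++;
        }
        return label;
    }

    public static void main(String[] args) {
        // First three rows of the reduced matrix shown earlier; n = 3 is an arbitrary choice.
        double[][] points = {
                {-0.0031139399383999997, 0.023330604746, -1.3650204652799997E-4},
                {-0.038380206566, 0.00104373247064, 0.056140327901},
                {-0.006980774822399999, 0.073057418689, -0.0035209342337999996}
        };
        double r = radius(points, 3);
        System.out.println("r = " + r + ", labels = " + Arrays.toString(group(points, r)));
    }
}
```

The pairwise scan is quadratic in the number of documents, so for a collection of realistic size a spatial index or a database-side neighbourhood query would be needed instead.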













, . – , –









, . 4-. ( > nt)





nt = N / S.

N‒ - , S ‒ .









, .





Classification methods – Aylien API





Aylien API . API json , . API . 9 , . POST API:





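   // Imports this fragment relies on:
   //   import com.orientechnologies.orient.core.sql.executor.OResult;
   //   import com.orientechnologies.orient.core.sql.executor.OResultSet;
   //   import com.aylien.textapi.parameters.*;
   //   import com.aylien.textapi.responses.*;
   // database, cluster, keywords, client and clusterUpdate are declared elsewhere.

   // Collect the text of every document in the given cluster.
   // A positional parameter replaces string concatenation in the query.
   String queryText = "select DocText from documents where clusters = ?";
   try (OResultSet resultSet = database.query(queryText, cluster)) {
       while (resultSet.hasNext()) {
           OResult result = resultSet.next();
           // Read the property directly instead of stripping braces from toString();
           // this assumes DocText is stored as a string.
           String textDoc = result.getProperty("DocText");
           if (textDoc != null) {
               keywords.add(textDoc.toLowerCase().replaceAll("\\n", ""));
           }
       }
   }

   // Classify the combined cluster text against the IAB QAG taxonomy.
   ClassifyByTaxonomyParams.Builder classifyByTaxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
   classifyByTaxonomyBuilder.setText(keywords.toString());
   classifyByTaxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
   TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomyBuilder.build());
   // Store every returned category label as a label of the cluster.
   for (TaxonomyCategory c : response.getCategories()) {
       clusterUpdate.add(c.getLabel());
   }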

      
      







GET, :









. .













. . , . . , . , :









-





- – . , . - , . Vaadin Flow:









:





  • , .





  • .





  • -.





  • , , , , -.





  • .













“Technology & Computing”:









:









:









, . . , , . . . . : .





, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..





, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .





Aylien API, . , 100 . , , , k-, . , .







