Demystifying Joins in Apache Spark

Hello, Habr. This translation was prepared for prospective students of the course "Ecosystem Hadoop, Spark, Hive".



We also invite everyone to the webinar "Testing Spark Applications". In this open lesson we will look at the problems of testing Spark applications: statistical data, partial verification, and starting and stopping heavyweight systems. We will study libraries that address these problems and write tests.






This article is devoted exclusively to the Join operation in Apache Spark and gives an overview of the foundations on which Spark's Join technology is built.

Joins are commonly used in typical data mining flows to correlate two data sets. Apache Spark, as a unified analytics engine, also provides a solid foundation for executing a wide variety of Join scenarios.





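At a high level, a Join operates on two input data sets by matching each record of one data set against every record of the other. On a match or a non-match (according to a given condition), the Join emits either an individual record from one of the two data sets or a Joined record. A Joined record is the combination of the individual matched records from both data sets.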





Important aspects of the Join operation:

Let us look at three aspects that affect how a Join is executed in Apache Spark:





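1) Size of the input data sets: The size of the input data sets directly affects the efficiency and reliability of the Join. The relative sizes of the inputs also influence which Join mechanism is selected, which in turn affects the efficiency and reliability of the Join.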





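2) Join condition: The condition, or clause, on which the input data sets are joined is called the Join Condition. It usually consists of one or more logical comparisons between attributes of the input data sets. By condition, Joins fall into two broad categories: Equi Joins and Non-Equi Joins.

An Equi Join involves one equality condition, or several equality conditions that must all hold at the same time. Each equality condition compares attributes of the two input data sets. For example, (A.x == B.x) and ((A.x == B.x) and (A.y == B.y)) are Equi Join conditions on the attributes x, y of the input data sets A and B taking part in the Join.

A Non-Equi Join is not limited to equality conditions. It may use an inequality, or several equality conditions that do not all have to hold at the same time. For example, (A.x < B.x) and ((A.x == B.x) or (A.y == B.y)) are Non-Equi Join conditions on the attributes x, y of the input data sets A and B taking part in the Join.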
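The four example conditions can be written as plain predicates to make the distinction concrete. This is a minimal sketch in pure Python, not Spark code; the sample records and the helper `matching_pairs` are hypothetical and exist only for illustration.

```python
# Hypothetical records of the two input data sets A and B.
A = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
B = [{"x": 1, "y": 10}, {"x": 2, "y": 99}, {"x": 5, "y": 30}]


# Equi Join conditions: equality checks only, all of which must hold.
def equi_single(a, b):
    return a["x"] == b["x"]


def equi_multi(a, b):
    return a["x"] == b["x"] and a["y"] == b["y"]


# Non-Equi Join conditions: an inequality, or equalities combined with OR.
def non_equi_less(a, b):
    return a["x"] < b["x"]


def non_equi_or(a, b):
    return a["x"] == b["x"] or a["y"] == b["y"]


def matching_pairs(condition):
    """Return (index in A, index in B) for every pair that satisfies the condition."""
    return [(i, j)
            for i, a in enumerate(A)
            for j, b in enumerate(B)
            if condition(a, b)]
```

An equality-only condition lets an engine look up matches by key, whereas the Non-Equi predicates here can only be evaluated by comparing pairs of records.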





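3) Join type: The Join type determines the result of the Join once the Join condition has been applied to the records of the input data sets. The main Join types are:

Inner Join: An Inner Join outputs only the Joined records that matched (on the Join condition) from the input data sets.

Outer Join: An Outer Join outputs the non-matched records in addition to the matched Joined records. Depending on which input data set(s) contribute the non-matched records, it is further divided into left, right and full Outer Joins.

Semi Join: A Semi Join outputs individual records from only one of the two input data sets, either on a match or on a non-match. When the record is output on a non-match, the Semi Join is also called an Anti Join.

Cross Join: A Cross Join outputs every Joined record that can be formed by combining each record of one input data set with every record of the other.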
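The Join types can be illustrated on two tiny in-memory data sets. The sketch below is pure Python rather than Spark; the function names and sample data are hypothetical, and only the left-hand variants of the Outer, Semi and Anti Joins are shown.

```python
# Hypothetical (key, value) records.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]


def inner_join(l, r):
    # Only matched records, combined into Joined records.
    return [(lk, lv, rv) for lk, lv in l for rk, rv in r if lk == rk]


def left_outer_join(l, r):
    # Matched Joined records plus non-matched left records padded with None.
    out = []
    for lk, lv in l:
        matches = [rv for rk, rv in r if rk == lk]
        if matches:
            out.extend((lk, lv, rv) for rv in matches)
        else:
            out.append((lk, lv, None))
    return out


def left_semi_join(l, r):
    # Individual left records that have at least one match on the right.
    right_keys = {rk for rk, _ in r}
    return [(lk, lv) for lk, lv in l if lk in right_keys]


def left_anti_join(l, r):
    # Individual left records that have no match on the right.
    right_keys = {rk for rk, _ in r}
    return [(lk, lv) for lk, lv in l if lk not in right_keys]


def cross_join(l, r):
    # Every left record paired with every right record.
    return [(lrec, rrec) for lrec in l for rrec in r]
```

Note that the Semi and Anti Joins return records of the left data set only, while the other types return combined records.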





Based on these three aspects of a Join, Apache Spark selects a suitable mechanism to execute it.





Join execution mechanisms

Apache Spark provides five mechanisms for executing a Join:





  • Shuffle Hash Join
  • Broadcast Hash Join
  • Sort Merge Join
  • Cartesian Join
  • Broadcast Nested Loop Join





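Broadcast Hash Join: In the 'Broadcast Hash Join' mechanism, one of the two input data sets (taking part in the Join) is broadcast to all executors. Each executor builds a hash table from the broadcast data set, and every partition of the non-broadcast data set is then joined independently against that local hash table.

'Broadcast Hash Join' involves no sort operation, which is one of the reasons it is the fastest Join mechanism. It is suitable, however, only when one of the input data sets is small enough to be broadcast. With a large broadcast data set, Spark pays a heavy network cost and may run out of memory.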
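The build-then-probe idea behind this mechanism can be sketched in a few lines. This is a single-process illustration in pure Python, assuming an inner equi join on (key, value) records; `build_hash_table`, `broadcast_hash_join` and the sample data are hypothetical names, not part of any Spark API.

```python
from collections import defaultdict


def build_hash_table(small_dataset):
    """Build a key -> list of values table from the small (broadcast) data set."""
    table = defaultdict(list)
    for key, value in small_dataset:
        table[key].append(value)
    return table


def broadcast_hash_join(large_partitions, small_dataset):
    """Inner equi join: each partition probes its own copy of the hash table."""
    joined_partitions = []
    for partition in large_partitions:
        # In Spark every executor would receive the broadcast data set;
        # rebuilding the table per partition mimics that independence.
        table = build_hash_table(small_dataset)
        joined = [(key, value, match)
                  for key, value in partition
                  for match in table.get(key, [])]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: a large side in two partitions and a small side.
orders = [[(1, "order-a"), (2, "order-b")], [(2, "order-c"), (3, "order-d")]]
customers = [(1, "alice"), (2, "bob")]
result = broadcast_hash_join(orders, customers)
```

No partition needs data from any other partition of the large side, which is why this approach avoids both a shuffle and a sort.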





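Shuffle Hash Join: In the 'Shuffle Hash Join' mechanism, both input data sets are first aligned to a chosen output partitioning scheme (for more on how the output partitioning is chosen, see "Guide to Spark Partitioning"). If one or both input data sets do not conform to that partitioning, a shuffle is executed before the Join to bring them into line.

Once both input data sets conform to the desired output partitioning, Shuffle Hash Join performs a standard Hash Join for each output partition. In a standard Hash Join, a hash table is built from the partition of the smaller data set, and the partition of the larger data set is then joined against that hash table.

'Shuffle Hash Join' is more expensive than 'Broadcast Hash Join', because it involves shuffling as well as hashing. Building and holding the hash tables also costs memory and computation.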
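The two phases, shuffling by key and then hash-joining partition by partition, can be sketched as follows. This is a pure-Python illustration assuming an inner equi join on (key, value) records with hashable keys; `shuffle`, `hash_join` and `shuffle_hash_join` are hypothetical helper names.

```python
from collections import defaultdict


def shuffle(dataset, num_partitions):
    """Route each record to a partition chosen by the hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in dataset:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def hash_join(build_side, probe_side):
    """Standard hash join; returns (key, probe value, build value) tuples."""
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    return [(key, value, match)
            for key, value in probe_side
            for match in table.get(key, [])]


def shuffle_hash_join(left, right, num_partitions=3):
    """Inner equi join returning (key, left value, right value) tuples."""
    out = []
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    for lp, rp in zip(left_parts, right_parts):
        if len(lp) <= len(rp):
            # Build from the smaller left partition, then restore column order.
            out.extend((k, lv, rv) for k, rv, lv in hash_join(lp, rp))
        else:
            out.extend(hash_join(rp, lp))
    return out
```

Because both sides use the same partitioning function, records with equal keys always land in partitions with the same index, so each pair of partitions can be joined in isolation.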





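Sort Merge Join: 'Sort Merge Join' starts out like 'Shuffle Hash Join'. Both input data sets are first aligned to the chosen output partitioning scheme. If one or both of them do not conform to it, a shuffle is executed before the Join to bring them into line.

Once both input data sets conform to the desired output partitioning, Sort Merge Join sorts each partition of both data sets on the Join keys and then performs the merge Join.

'Sort Merge Join' is computationally less efficient than 'Shuffle Hash Join' and 'Broadcast Hash Join', but its memory requirements are lower than those of 'Shuffle Hash' and 'Broadcast Hash'. As with 'Shuffle Hash Join', if the input data sets do not conform to the desired output partitioning, the shuffle adds to the cost of executing a 'Sort Merge Join'.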
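The sort and merge steps for a single pair of partitions can be sketched like this. It is a pure-Python illustration assuming an inner equi join on (key, value) records with sortable keys; `sort_merge_join` is a hypothetical name and the shuffle phase is omitted.

```python
def sort_merge_join(left, right):
    """Inner equi join of two partitions: sort both on the key, then merge."""
    left = sorted(left, key=lambda rec: rec[0])
    right = sorted(right, key=lambda rec: rec[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit every combination within the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lkey, lv, rv))
            i, j = i_end, j_end
    return out
```

The merge only ever looks at the current run of equal keys on each side, which is why it needs no hash table in memory.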





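Cartesian Join: Cartesian Join is used exclusively to perform a Cross Join between two input data sets. The number of output partitions always equals the product of the partition counts of the two inputs. Each output partition is mapped to a unique pair of partitions, one from each input data set. For each output partition, the Cartesian product of that pair is computed.

Cartesian Join is very expensive, because the number of output records is the product of the record counts of the two input data sets.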
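The mapping of output partitions to pairs of input partitions can be sketched as follows. This is a pure-Python illustration; `cartesian_join` and the sample partitions are hypothetical.

```python
from itertools import product


def cartesian_join(left_partitions, right_partitions):
    """Return one output partition per (left partition, right partition) pair."""
    output_partitions = []
    for lp, rp in product(left_partitions, right_partitions):
        # Each output partition holds the Cartesian product of its pair.
        output_partitions.append(list(product(lp, rp)))
    return output_partitions


# Hypothetical inputs: 2 left partitions and 3 right partitions.
left_parts = [[1, 2], [3]]
right_parts = [["a"], ["b", "c"], ["d"]]
out_parts = cartesian_join(left_parts, right_parts)
```

With 2 and 3 input partitions the sketch yields 6 output partitions, and with 3 and 4 input records it yields 12 output records in total.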





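Broadcast Nested Loop Join: In the 'Broadcast Nested Loop Join' mechanism, one of the input data sets is broadcast to all executors. Each partition of the non-broadcast data set is then joined with the broadcast data set using the standard Nested Loop Join procedure.

'Broadcast Nested Loop Join' is the slowest Join mechanism, because every record of a partition is compared with every record of the broadcast data set. Since it also relies on a broadcast, it is network-intensive and can cause out-of-memory errors or slowdowns when the broadcast data set is large.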
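A nested loop accepts any condition, which is what makes this mechanism so general. The sketch below is pure Python, not Spark; `broadcast_nested_loop_join` and the sample data are hypothetical, and only the inner variant is shown.

```python
def broadcast_nested_loop_join(partitions, broadcast_dataset, condition):
    """Inner join of each partition with the broadcast data set via nested loops."""
    joined_partitions = []
    for partition in partitions:
        joined = [(rec, other)
                  for rec in partition              # outer loop: partition records
                  for other in broadcast_dataset    # inner loop: broadcast records
                  if condition(rec, other)]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data joined on a Non-Equi condition (a < b).
parts = [[1, 5], [8]]
small = [3, 6]
bnl_result = broadcast_nested_loop_join(parts, small, lambda a, b: a < b)
```

The cost is one condition check per pair of records, regardless of how selective the condition is.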





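How does Spark select a Join mechanism?

Spark selects the mechanism for executing a Join based on the following factors:

  • Configuration parameters
  • Join hints
  • Size of the input data sets
  • Join type
  • Equi or Non-Equi Join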





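The Spark API also lets you attach Join hints to the input data sets of a Join. The available hints are 'broadcast', 'merge', 'shuffle_hash' and 'shuffle_replicate_nl', and each one asks Spark to prefer the corresponding Join mechanism.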





Here is how Spark selects each Join mechanism:





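'Broadcast Hash Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions
  • Not applicable to the 'Full Outer' Join type

In addition to the mandatory conditions, one of the following must hold:

  • A 'broadcast' hint is given on the left input data set, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
  • No hint is given, but the left input data set is small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold (default 10 MB)', and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
  • A 'broadcast' hint is given on the right input data set, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
  • No hint is given, but the right input data set is small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold (default 10 MB)', and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
  • A 'broadcast' hint is given on both input data sets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.
  • No hint is given, but both input data sets are small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold (default 10 MB)', and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.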
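The rules above can be read as a small decision function. The sketch below is a simplified, literal reading of the list in pure Python, not Spark's actual planner code; `broadcast_candidates` and its parameters are hypothetical, and it assumes that once any hint is given, only hinted sides are considered.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, in bytes.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

# Join types for which the left / right input may be the broadcast side.
LEFT_BROADCAST_TYPES = {"Right Outer", "Right Semi", "Inner"}
RIGHT_BROADCAST_TYPES = {"Left Outer", "Left Semi", "Inner"}


def broadcast_candidates(join_type, is_equi, left_size, right_size,
                         left_hint=False, right_hint=False,
                         threshold=AUTO_BROADCAST_THRESHOLD):
    """Return the sides ('left', 'right') that may be broadcast, possibly none."""
    # Mandatory conditions: Equi Join only, never 'Full Outer'.
    if not is_equi or join_type == "Full Outer":
        return []
    left_allowed = join_type in LEFT_BROADCAST_TYPES
    right_allowed = join_type in RIGHT_BROADCAST_TYPES
    if left_hint or right_hint:
        # Assumption: with hints present, sizes are not consulted.
        left_ok = left_hint and left_allowed
        right_ok = right_hint and right_allowed
    else:
        left_ok = left_allowed and left_size <= threshold
        right_ok = right_allowed and right_size <= threshold
    return [side for side, ok in (("left", left_ok), ("right", right_ok)) if ok]
```

For example, with a 5 MB left input and a 500 MB right input, an 'Inner' Join yields `["left"]`, while a 'Left Outer' Join yields no candidate because only the right side may be broadcast for that type.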





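'Shuffle Hash Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions
  • Not applicable to the 'Full Outer' Join type
  • 'spark.sql.join.preferSortMergeJoin (default true)' is set to false

In addition to the mandatory conditions, one of the following must hold:

  • A 'shuffle_hash' hint is given on the left input data set, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
  • No hint is given, but the left input data set is considerably smaller than the right one, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
  • A 'shuffle_hash' hint is given on the right input data set, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
  • No hint is given, but the right input data set is considerably smaller than the left one, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
  • A 'shuffle_hash' hint is given on both input data sets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.
  • No hint is given, but both input data sets are considerably small, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.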





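'Sort Merge Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions
  • The Join keys used in the Equi Join condition are sortable
  • 'spark.sql.join.preferSortMergeJoin (default true)' is set to true

In addition to the mandatory conditions, one of the following must hold:

  • A 'merge' hint is given on either input data set, and the Join type can be any.
  • No hint is given, and the Join type can be any.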





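'Cartesian Join'

Mandatory conditions:

  • The Join type is 'Inner'

In addition to the mandatory condition, one of the following must hold:

  • A 'shuffle_replicate_nl' hint is given on either input data set, and the Join condition can be Equi or Non-Equi.
  • No hint is given, and the Join condition can be Equi or Non-Equi.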





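'Broadcast Nested Loop Join'

'Broadcast Nested Loop Join' is the default Join mechanism; when no other mechanism can be selected, 'Broadcast Nested Loop Join' is chosen, and it can execute any Join type with any Join condition.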





When more than one Join mechanism is applicable, Spark prefers them in this order: 'Broadcast Hash Join', 'Sort Merge Join', 'Shuffle Hash Join', 'Cartesian Join'.
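The preference order amounts to a first-match lookup. The following pure-Python sketch assumes the set of applicable mechanisms has already been worked out; `pick_mechanism` is a hypothetical helper that models only the stated order and the default, not any further tie-breaking between Cartesian and Broadcast Nested Loop Join.

```python
# Preference order as listed in the text.
PRIORITY = [
    "Broadcast Hash Join",
    "Sort Merge Join",
    "Shuffle Hash Join",
    "Cartesian Join",
]

# Fallback when none of the mechanisms above is applicable.
DEFAULT = "Broadcast Nested Loop Join"


def pick_mechanism(applicable):
    """Return the most preferred applicable mechanism, or the default."""
    for mechanism in PRIORITY:
        if mechanism in applicable:
            return mechanism
    return DEFAULT
```

For instance, when both 'Shuffle Hash Join' and 'Sort Merge Join' qualify, the function returns 'Sort Merge Join'.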





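When the choice is between Cartesian and Broadcast Nested Loop Join, Broadcast Nested Loop is preferred for Inner, Non-Equi Joins if one of the input data sets can be broadcast; otherwise Cartesian Join is used.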





This article has covered the key aspects of how Join operations are executed in Apache Spark.





