On June 18, the development team of the distributed processing framework Apache Spark announced the latest major release, "Apache Spark 3.0.0".
Apache Spark is an analytics engine for large-scale data processing. It provides libraries for SQL and DataFrames, MLlib for machine learning, and GraphX for graph processing. Users can build parallel applications in languages such as Java, Scala, Python, R, and SQL. Spark can run in standalone mode or on platforms such as Apache Hadoop, Apache Mesos, and Kubernetes. Spark started as a project at UC Berkeley's AMPLab and was later donated to the Apache Software Foundation (ASF). The project is now in its tenth year.
Apache Spark 3 is a major release following the Apache Spark 2 series, which debuted in 2016. Developed as part of Project Hydrogen, this release adds accelerator-aware (GPU) scheduling, with improvements to both the cluster manager and the scheduler.
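As a rough sketch, accelerator-aware scheduling is driven by resource configuration properties set when the application is launched. The property names below match the Spark 3.0 resource-scheduling configuration; the discovery-script path is a placeholder that would point at a site-specific script printing the GPU addresses visible to an executor:

```
# spark-defaults.conf — sketch of GPU-aware scheduling settings (Spark 3.0)
# Request one GPU per executor and one GPU per task:
spark.executor.resource.gpu.amount           1
spark.task.resource.gpu.amount               1
# Placeholder path: a script that reports which GPUs an executor can use
spark.executor.resource.gpu.discoveryScript  /path/to/getGpus.sh
```

With these set, the scheduler only places a task on an executor that has an unassigned GPU, and the task can query its assigned GPU addresses at runtime.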
There is an additional optimization layer called Adaptive Query Execution (AQE). It sits on top of the Spark Catalyst optimizer and modifies the query plan on the fly, using statistics collected at runtime. This release also introduces dynamic partition pruning: when a query joins a partitioned fact table with a filtered dimension table, Spark derives a pruning filter from the dimension-table filter and skips partitions that cannot match.
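Both features are controlled by SQL configuration flags. A minimal sketch, using the flag names from the Spark 3.0 configuration reference (in 3.0, AQE is off by default while dynamic partition pruning is on by default):

```
-- Spark SQL session settings, a sketch (Spark 3.0)
SET spark.sql.adaptive.enabled=true;                           -- enable AQE
SET spark.sql.optimizer.dynamicPartitionPruning.enabled=true;  -- already on by default
```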
Thanks to these improvements, Spark 3.0 is nearly twice as fast as Spark 2.4 on the TPC-DS 30 TB benchmark.
Spark SQL is the most active component in this release, and SQL compatibility has been enhanced. It supports the ANSI SQL FILTER clause, the ANSI SQL OVERLAY function, ANSI SQL LIKE … ESCAPE syntax, and ANSI SQL Boolean-predicate syntax. It also introduces Spark's own datetime pattern definition and the ANSI store assignment policy for table insertions.
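A hedged sketch of the newly supported syntax, using hypothetical table and column names (`sales`, `amount`, `name`):

```sql
-- FILTER clause: conditional aggregation
SELECT count(*) FILTER (WHERE amount > 100) AS big_sales FROM sales;

-- OVERLAY: replace part of a string starting at a position
SELECT overlay('Spark SQL' PLACING '_' FROM 6);  -- e.g. 'Spark_SQL'

-- LIKE ... ESCAPE: match a literal underscore rather than a wildcard
SELECT * FROM sales WHERE name LIKE '%\_%' ESCAPE '\';

-- Boolean predicate: three-valued logic test (NULL amounts evaluate to false)
SELECT * FROM sales WHERE (amount > 100) IS TRUE;
```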
Many improvements have also been made to structured streaming, graph processing, and machine learning.