We first heard of Spark towards the end of 2013. Spark is written in Scala. Today, it is being adopted by major players like eBay, Amazon and Yahoo, and many organizations run Spark on clusters with thousands of nodes.
Spark is an Apache project, advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.
Spark provides a faster and more general data processing platform than Hadoop: it lets you run programs up to 100x faster in memory, or 10x faster on disk.
Another important aspect of learning Apache Spark is the interactive shell (REPL) that it provides out of the box. Using the REPL, one can test the outcome of each line of code without first needing to write and execute the entire job. The path to working code thus becomes much shorter, and ad-hoc data analysis becomes possible.
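As a quick illustration, a session in the Scala shell (spark-shell) might look like the sketch below; the input path is only a placeholder, and `sc` is the SparkContext the shell creates for you.

```scala
// Launched with ./bin/spark-shell; `sc` is pre-created by the shell.
val lines = sc.textFile("data/sample.txt")     // hypothetical input file
lines.count()                                  // how many lines are there?
lines.filter(_.contains("error")).count()      // refine the question interactively
```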
Additional features of Spark
- Currently provides APIs in Scala, Java, Python and R
- Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
- Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run as a standalone application
The Spark core is complemented by a set of powerful, higher-level libraries which can be used in the same application seamlessly. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Some more Spark libraries and extensions are currently under development as well.
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
- memory management and fault recovery
- scheduling, distributing and monitoring jobs on a cluster
- interacting with storage systems
Spark introduced the concept of an RDD (Resilient Distributed Dataset), an immutable fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.
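As a rough sketch (the app name and input paths are placeholders), the two ways of creating an RDD look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// 1. Distribute a collection from the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Load an external dataset (local files, HDFS, S3, ...)
val lines = sc.textFile("hdfs:///data/input.txt")
```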
An RDD supports two types of operations:
- Transformations, which are performed on an RDD and yield a new RDD containing the result (e.g. map, filter, join)
- Actions, which run a computation on the RDD and return a value to the driver program (e.g. count, first, reduce)
Transformations in Spark are ‘lazy’, meaning that they do not compute their results right away. Instead, they just ‘remember’ the operation to be performed and the dataset (e.g. a file) on which it is to be performed. The transformations are only actually computed when an action is called and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file is transformed in various ways and only its first() action is called, Spark will process and return the result for the first line only, rather than doing the work for the entire file.
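A minimal sketch of this laziness, reusing the `sc` from the previous snippet (the log file name is made up):

```scala
val logs   = sc.textFile("data/big.log")          // nothing is read yet
val errors = logs.filter(_.contains("ERROR"))     // still nothing computed
val firstError = errors.first()                   // action: Spark now scans only far
                                                  // enough to find one matching line
```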
SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
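For illustration, here is a sketch using the Spark 2.x SparkSession API; the JSON file, table and column names are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = spark.read.json("data/people.json")   // hypothetical JSON data source
people.createOrReplaceTempView("people")

// Weave a SQL query together with ordinary code transformations
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter($"age" < 65).show()
```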
The Spark Streaming API closely matches that of the Spark Core, making it easy for programmers to work in the worlds of both batch and streaming data.
Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g. via Apache Flume and HDFS/S3), social media sites like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
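A minimal sketch of this micro-batch model, counting words from a socket source (the host and port stand in for a real Flume or Kafka feed):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped
```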
MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (check out Toptal’s article on machine learning for more information on this topic). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (with more on the way). Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces on Spark MLlib.
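As a small sketch of the RDD-based MLlib API, here is k-means clustering on a handful of made-up points, reusing an existing SparkContext `sc`:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
))

val model = KMeans.train(points, 2, 20)            // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)              // the learned centroids
println(model.predict(Vectors.dense(8.9, 9.0)))    // cluster for a new point
```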
GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis and iterative graph computations. Apart from built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
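A tiny sketch of GraphX with its built-in PageRank (the vertices and edges are invented, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.0001).vertices        // run PageRank to a tolerance of 0.0001
ranks.join(vertices).collect().foreach(println)    // (vertexId, (rank, name))
```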
Courses at Koenig Solutions:
- Hadoop Developer with Spark
- Data science with SparkML