Getting Started With Spark

Notes on Spark in Action, Second Edition (Manning)

basic concept

1.1 what is spark

1.2 what is big data

Big data is the collection of datasets, available everywhere in the enterprise and aggregated in a single location, on which you can run anything from basic analytics to more
advanced analytics, such as machine learning and deep learning. Those bigger datasets can
become the basis for artificial intelligence (AI). Technologies, size, and number of computers are irrelevant to this concept.

1.3 dataframe

1.3.1 Java perspective

A dataframe is much like a JDBC ResultSet: it contains data and comes with an API.

In Java, a dataframe is implemented as a Dataset<Row> (pronounced “a dataset of rows”).

Differences from a ResultSet:
- You do not browse through it with a next() method.
- Its API is extensible through user-defined functions (UDFs). You can write or wrap existing code and add it to Spark. This code will then be accessible in a distributed mode. You will study UDFs in chapter 16.
- If you want to access the data, you first get the Row and then go through the columns of the row with getters (similar to a ResultSet).
- Metadata is fairly basic, as there are no primary or foreign keys or indexes in Spark.
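
The Row-and-getter access pattern described above can be sketched as follows. This is only a minimal illustration, assuming Spark SQL is on the classpath (for example as a Maven dependency) and that Spark runs in local mode. The class name, column names, and sample values are made up for these notes and do not come from the book.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowAccessSketch {

    // Builds a tiny in-memory dataframe (illustrative data) and reads it
    // back row by row, using getters on each Row.
    static List<String> describeRows(SparkSession spark) {
        StructType schema = new StructType()
            .add("title", DataTypes.StringType)
            .add("year", DataTypes.IntegerType);

        List<Row> data = Arrays.asList(
            RowFactory.create("Sample Book A", 2018),
            RowFactory.create("Sample Book B", 2020));

        Dataset<Row> df = spark.createDataFrame(data, schema);

        List<String> lines = new ArrayList<>();
        // There is no next(): collect the rows to the driver, then iterate.
        for (Row row : df.collectAsList()) {
            String title = row.getString(0);                // getter by position
            int year = row.getInt(row.fieldIndex("year"));  // getter by column name
            lines.add(title + " (" + year + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("Row access sketch")
            .master("local[*]")
            .getOrCreate();
        describeRows(spark).forEach(System.out::println);
        spark.stop();
    }
}
```

Collecting to the driver is reasonable only for small results like this one; on large dataframes you would normally keep the work distributed and avoid collectAsList().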

1.3.2 RDBMS perspective

A dataframe is much like a table: it has columns and rows.

Differences from a table:
- Data can be nested, as in a JSON or XML document. Chapter 7 describes ingestion of those documents, and you will use those nested constructs in chapter 13.
- You don’t update or delete entire rows; you create new dataframes.
- You can easily add or remove columns.
- There are no constraints, indices, primary or foreign keys, or triggers on the dataframe.
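
To make the "you create new dataframes" point concrete, here is a small sketch, assuming Spark SQL is on the classpath and Spark runs in local mode. The class name, columns, and sample values are invented for these notes. Each operation returns a new Dataset<Row> and leaves the original untouched.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ImmutableDataframeSketch {

    // Illustrative two-row dataframe with columns (title, year).
    static Dataset<Row> sample(SparkSession spark) {
        StructType schema = new StructType()
            .add("title", DataTypes.StringType)
            .add("year", DataTypes.IntegerType);
        List<Row> data = Arrays.asList(
            RowFactory.create("Sample Book A", 2018),
            RowFactory.create("Sample Book B", 2020));
        return spark.createDataFrame(data, schema);
    }

    // Adding a column yields a new dataframe with one more column.
    static Dataset<Row> addColumn(Dataset<Row> df) {
        return df.withColumn("source", lit("notes"));
    }

    // Removing a column yields a new dataframe without it.
    static Dataset<Row> removeColumn(Dataset<Row> df) {
        return df.drop("year");
    }

    // "Deleting" rows means building a new dataframe that filters them out.
    static Dataset<Row> keepRecent(Dataset<Row> df) {
        return df.filter(col("year").geq(2020));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("Immutable dataframe sketch")
            .master("local[*]")
            .getOrCreate();
        Dataset<Row> df = sample(spark);
        System.out.println(Arrays.toString(addColumn(df).columns()));
        System.out.println(Arrays.toString(removeColumn(df).columns()));
        System.out.println(keepRecent(df).count());
        // The original dataframe still has its two columns and two rows.
        System.out.println(Arrays.toString(df.columns()) + " " + df.count());
        spark.stop();
    }
}
```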

1.4 summary

- Spark is an analytics operating system; you can use it to process workloads and algorithms in a distributed way. And it’s not only good for analytics: you can use Spark for data transfer, massive data transformation, log analysis, and more.
- Spark supports SQL, Java, Scala, R, and Python as a programming interface, but in this book, we focus on Java (and sometimes Python).
- Spark’s internal main data storage is the dataframe. The dataframe combines storage capacity with an API.
- If you have experience with JDBC development, you will find similarities with a JDBC ResultSet.
- If you have experience with relational database development, you can compare a dataframe to a table with less metadata.
- In Java, a dataframe is implemented as a Dataset<Row>.
- You can quickly set up Spark with Maven and Eclipse. Spark does not need to be installed.
- Spark is not limited to the MapReduce algorithm: its API allows a lot of algorithms to be applied to data.
- Streaming is used more and more frequently in enterprises, as businesses want access to real-time analytics. Spark supports streaming.
- Analytics have evolved from simple joins and aggregations. Enterprises want computers to think for us; hence Spark supports machine learning and deep learning.
- Graphs are a special use case of analytics, but nevertheless, Spark supports them.

2.1 mental model

The application follows a simple process in three steps: reading the CSV file, performing a simple
concatenation operation, and saving the resulting data in the database.
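
A rough sketch of these three steps is shown below. It is an illustration for these notes, not the book's listing. It assumes Spark SQL and a JDBC driver are on the classpath, and it assumes a CSV file with a header row and columns named lname and fname; those column names, the table name "authors", the PostgreSQL driver class, and the command-line arguments are all placeholders.

```java
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToDbSketch {

    // Step 1: read a CSV file (first line is the header) into a dataframe.
    static Dataset<Row> readCsv(SparkSession spark, String csvPath) {
        return spark.read()
            .format("csv")
            .option("header", "true")
            .load(csvPath);
    }

    // Step 2: add a "name" column built as "<lname>, <fname>".
    // The lname/fname column names are assumptions about the input file.
    static Dataset<Row> addName(Dataset<Row> df) {
        return df.withColumn(
            "name", concat(df.col("lname"), lit(", "), df.col("fname")));
    }

    // Step 3: save the dataframe to a database table over JDBC,
    // replacing the table if it already exists.
    static void saveToDb(Dataset<Row> df, String jdbcUrl, String table,
                         Properties connectionProperties) {
        df.write()
            .mode(SaveMode.Overwrite)
            .jdbc(jdbcUrl, table, connectionProperties);
    }

    // Usage: CsvToDbSketch <csvPath> <jdbcUrl> <user> <password>
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CSV to DB sketch")
            .master("local[*]")
            .getOrCreate();

        Dataset<Row> df = addName(readCsv(spark, args[0]));

        Properties props = new Properties();
        props.setProperty("driver", "org.postgresql.Driver"); // assumed driver
        props.setProperty("user", args[2]);
        props.setProperty("password", args[3]);
        saveToDb(df, args[1], "authors", props);              // assumed table name

        spark.stop();
    }
}
```

Keeping the three steps in separate methods makes the read and transform parts easy to try locally without a database; only saveToDb needs a reachable JDBC endpoint.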