Big data is the collection of datasets from across the enterprise, aggregated in a single location, on which you can run anything from basic analytics to more advanced analytics such as machine learning and deep learning. Those bigger datasets can become the basis for artificial intelligence (AI). Technologies, size, and number of computers are irrelevant to this concept.
1.3 Dataframe
1.3.1 Java perspective
From a Java perspective, a dataframe is just like a JDBC ResultSet: it contains data and comes with an API.
In Java, a dataframe is implemented as a Dataset<Row> (pronounced “a dataset of rows”).
Differences from a ResultSet: You do not browse through a dataframe with a next() method. Its API is extensible through user-defined functions (UDFs): you can write or wrap existing code and add it to Spark, and this code will then be accessible in a distributed mode. You will study UDFs in chapter 16. If you want to access the data, you first get the Row and then go through the columns of the row with getters (similar to a ResultSet). Metadata is fairly basic, as there are no primary or foreign keys or indexes in Spark.
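The two access patterns above (getters on a Row, and extending the API with a UDF) can be sketched as follows. This is a minimal illustration, not the book's code: it assumes Spark SQL is on the classpath and runs with a local master, and the class name, the column names (name, pages), the sample values, and the UDF name toUpper are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class DataframeAccessSketch {

  // Builds a tiny in-memory dataframe; column names and values are made up.
  static Dataset<Row> sampleDataframe(SparkSession spark) {
    StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("pages", DataTypes.IntegerType);
    List<Row> rows = Arrays.asList(
        RowFactory.create("alpha", 100),
        RowFactory.create("beta", 250));
    return spark.createDataFrame(rows, schema);
  }

  // Registers a hypothetical UDF named "toUpper" and applies it to a column.
  static Dataset<Row> withUpperName(SparkSession spark, Dataset<Row> df) {
    spark.udf().register(
        "toUpper",
        (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
        DataTypes.StringType);
    return df.withColumn("upper_name", callUDF("toUpper", col("name")));
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Dataframe access sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = withUpperName(spark, sampleDataframe(spark));

    // No next(): collect the rows to the driver, then read columns with getters.
    // collectAsList() is only reasonable for small results.
    for (Row row : df.collectAsList()) {
      String name = row.getString(0);          // by position
      int pages = row.getInt(1);               // by position
      String upper = row.getAs("upper_name");  // by column name
      System.out.println(name + " / " + pages + " / " + upper);
    }

    spark.stop();
  }
}
```

Collecting to the driver is used here only to show the getters; on a real dataset you would normally keep the work inside Spark's distributed operations.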
1.3.2 RDBMS perspective
From an RDBMS perspective, a dataframe is just like a table: it has columns and rows.
Differences from a table: Data can be nested, as in a JSON or XML document. Chapter 7 describes ingestion of those documents, and you will use those nested constructs in chapter 13. You don’t update or delete entire rows; you create new dataframes. You can easily add or remove columns. There are no constraints, indexes, primary or foreign keys, or triggers on the dataframe.
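To make the "you create new dataframes" point concrete, here is a small sketch, assuming Spark SQL is on the classpath. The class, method, and column names are hypothetical. Each transformation returns a new Dataset<Row> and leaves the one it was called on unchanged.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class ImmutableDataframeSketch {

  // Tiny in-memory dataframe with made-up columns: name, pages.
  static Dataset<Row> sample(SparkSession spark) {
    StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("pages", DataTypes.IntegerType);
    List<Row> rows = Arrays.asList(
        RowFactory.create("alpha", 100),
        RowFactory.create("beta", 250));
    return spark.createDataFrame(rows, schema);
  }

  // Adds one column and removes another; the result is a new dataframe.
  static Dataset<Row> reshape(Dataset<Row> df) {
    return df.withColumn("source", lit("demo")).drop("pages");
  }

  // "Deleting" rows means deriving a new dataframe that filters them out.
  static Dataset<Row> longBooks(Dataset<Row> df) {
    return df.filter(col("pages").gt(100));
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Immutable dataframe sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> original = sample(spark);
    Dataset<Row> reshaped = reshape(original);
    Dataset<Row> filtered = longBooks(original);

    // The original still has its two columns and both rows.
    System.out.println(Arrays.toString(original.columns()));
    System.out.println(Arrays.toString(reshaped.columns()));
    System.out.println(original.count() + " rows originally, "
        + filtered.count() + " after filtering");

    spark.stop();
  }
}
```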
Spark is an analytics operating system; you can use it to process workloads and algorithms in a distributed way. And it’s not only good for analytics: you can use Spark for data transfer, massive data transformation, log analysis, and more.
Spark supports SQL, Java, Scala, R, and Python as a programming interface, but in this book, we focus on Java (and sometimes Python).
Spark’s internal main data storage is the dataframe. The dataframe combines storage capacity with an API. If you have experience with JDBC development, you will find similarities with a JDBC ResultSet. If you have experience with relational database development, you can compare a dataframe to a table with less metadata. In Java, a dataframe is implemented as a Dataset<Row>.
You can quickly set up Spark with Maven and Eclipse. Spark does not need to be installed.
Spark is not limited to the MapReduce algorithm: its API allows a lot of algorithms to be applied to data.
Streaming is used more and more frequently in enterprises, as businesses want access to real-time analytics. Spark supports streaming.
Analytics have evolved from simple joins and aggregations. Enterprises want computers to think for us; hence Spark supports machine learning and deep learning.
Graphs are a special use case of analytics, but nevertheless, Spark supports them.
2.1 Mental model
The mental model is a simple process in three steps: reading the CSV file, performing a simple concatenation operation, and saving the resulting data in the database.
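The three steps could look roughly like the sketch below. It is written under several assumptions that are not in these notes: the CSV has a header row with columns lname and fname, the target table is called authors, and the JDBC URL, user, and password are passed on the command line. The save step also needs a matching JDBC driver on the classpath and a reachable database, so main only runs it when connection arguments are supplied.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

public class CsvToDatabaseSketch {

  // Steps 1 and 2: read the CSV file, then add a column built by concatenation.
  // The column names lname and fname are hypothetical.
  static Dataset<Row> readAndConcat(SparkSession spark, String csvPath) {
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load(csvPath);
    return df.withColumn("name", concat(col("lname"), lit(", "), col("fname")));
  }

  // Step 3: save the result to a database table over JDBC.
  static void save(Dataset<Row> df, String jdbcUrl, String user, String password) {
    Properties props = new Properties();
    props.setProperty("user", user);
    props.setProperty("password", password);
    // "authors" is a placeholder table name; Overwrite replaces existing data.
    df.write().mode(SaveMode.Overwrite).jdbc(jdbcUrl, "authors", props);
  }

  public static void main(String[] args) throws Exception {
    // A throwaway CSV file so the sketch runs without external data.
    Path csv = Files.createTempFile("authors", ".csv");
    Files.write(csv, Arrays.asList("lname,fname", "Doe,Jane", "Smith,Sam"));

    SparkSession spark = SparkSession.builder()
        .appName("CSV to database sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = readAndConcat(spark, csv.toString());
    df.show();

    // Only attempt the database write when URL, user, and password are given.
    if (args.length == 3) {
      save(df, args[0], args[1], args[2]);
    }

    spark.stop();
  }
}
```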