
Spark - Part 4 - Structured API Overview

How is data structured in Apache Spark? What data types does Spark support? How does Spark execute the code we write on the cluster? We answer these questions in this article.
Before jumping into this, you might want to review some basic Spark concepts covered here:
https://data-mindset.fr/spark-pour-les-nuls-partie-1/

DataFrames and Datasets

Big data is commonly classified into three categories: unstructured, semi-structured and structured data. Apache Spark has tools for manipulating all three kinds: log files, CSV files, Parquet files, text files, videos, photos, etc.

Apache Spark's structured APIs are mainly DataFrames and Datasets: collections organized into rows and columns, like tables.

=> A column can hold an integer, a string, an array, a map or a null value.

=> A row is a record of data: one line of our data, like in SQL.
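To make this concrete, here is a minimal sketch in Scala (the column names, values and local master setting are illustrative, not from the article) that builds a small DataFrame whose columns hold integers, strings, arrays, maps and a null value:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; on a real cluster the master is set by the deployment.
val spark = SparkSession.builder()
  .appName("structured-api-overview")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Each tuple becomes a row; each position becomes a named, typed column.
val df = Seq(
  (1, "Alice", Seq("spark", "scala"), Map("team" -> "data")),
  (2, null,    Seq("pandas"),         Map("team" -> "ml"))    // a column value can be null
).toDF("id", "name", "tags", "attributes")

df.printSchema()   // the schema: name and type of each column
df.show()
```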

DataFrames should be familiar to you if you know Python and pandas. They basically give a schema view of the data. A schema is composed of the name and the type of each column of our data. For DataFrames, Spark only checks that the data conforms to the schema at runtime: that is why the DataFrame API is said to be untyped. Datasets, on the other hand, are typed: Spark checks the types at compile time.

In fact, a DataFrame is simply a Dataset of generic Row objects: DataFrame = Dataset[Row].

Here is a quick comparison between the two major APIs:

=> DataFrame: untyped rows of generic Row objects, schema checked at runtime, available in Scala, Java, Python and R.

=> Dataset: typed records, checked at compile time, available only in JVM languages (Scala and Java).
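As an illustration, here is a short sketch (reusing the `spark` session and `import spark.implicits._` from the previous example; `Person`, `people` and `peopleDf` are illustrative names) showing where each API catches a mistake:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

case class Person(name: String, age: Int)

// Typed API: a Dataset[Person]; field names and types are checked at compile time.
val people: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
val agesNextYear = people.map(p => p.age + 1)   // compiles: `age` is a known Int field
// people.map(p => p.agee)                      // would not compile: the typo is caught by the compiler

// Untyped API: a DataFrame is just a Dataset[Row]; column errors only appear at runtime.
val peopleDf: DataFrame = people.toDF()
peopleDf.select("age")                          // fine
// peopleDf.select("agee")                      // compiles, but fails at runtime with an AnalysisException
```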

How does Spark execute structured API data flows on a cluster?

Spark follows three main steps to execute user code on the cluster. These steps are explained below:

1. Logical planning: does not depend on the physical architecture of the cluster.

2. Physical planning: depends on the physical architecture of the cluster.

3. Execution: Spark executes the best physical plan over RDDs.

The final result is then sent back to the user.
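You can peek at these plans yourself. Here is a minimal sketch, reusing the illustrative `peopleDf` DataFrame and the imports from the examples above:

```scala
// Build a small query on the DataFrame.
val adults = peopleDf.filter($"age" >= 18).groupBy($"name").count()

// explain(true) prints the parsed, analyzed and optimized logical plans,
// followed by the physical plan that will actually be executed over RDDs.
adults.explain(true)
```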

In the next article, we will go through some of the operations we can perform on DataFrames.