Big Data using Apache Spark with Scala and AWS
About
This course offers you hands-on knowledge of creating Apache Spark applications using the Scala programming language, in a completely case-study-based approach.
Apache Spark is a fast, general-purpose distributed computing system. It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs (DAGs). It also supports a rich set of higher-level APIs and tools, including DataFrames for structured data processing using object-oriented programming and SQL, Structured Streaming for real-time stream processing, MLlib for machine learning and GraphX for graph processing.
Note: This is not an introductory or theory-based course, though we'll discuss a basic comparison among Hadoop, Spark and Flink. We'll also build some POCs around Hive, Pig, Presto, AWS Athena and Redshift Spectrum.
Learning Outcomes. By the end of this course:
- You will be able to identify the type of data given to you (structured, semi-structured or unstructured) and choose which Spark data abstraction to use (RDD or DataFrame/Dataset).
- Based on the complexity and nature of the data pipeline you are working with, you'll be able to assess which techniques to use: the DataFrame DSL or SQL, Structured Streaming, etc.
- Based on the nature of the data (confidential/PII or not), you'll be able to decide whether to go with an on-premise or a cloud-based solution. You'll also be able to estimate the computational resources required for a given data volume.
- You will be able to comfortably set up a Spark development environment on your local system and start working on any given big data application, and later productionalize it on the AWS cloud using an AWS EMR cluster and AWS Lambda scripts.
- You'll have enough hands-on Spark experience that you'll feel as if you have more than 3 years of experience in "Developing Data Pipelines using Big Data Technologies".
Recommended background
You should have at least some programming experience in any language; a basic level suffices, e.g., variable declarations, conditional expressions (if-else, switch statements), control statements, collections, etc.
PART – 1: Getting Started with Spark – Programming RDD using Databricks Notebook
Setting up your 1st Spark cluster on Databricks Community Edition, a free cloud platform. Introduction to Spark RDDs, transformations and actions, and distributed key-value pairs (pair RDDs).
- Writing your 1st Spark program using a Databricks notebook – the Word Count example (see the sketch after this list)
- Creating RDDs from different file formats (text file, sequence file, object file, etc.) and from Scala collections
- Optimization using correct partitioning
- Transformation and Actions
- Understanding Cluster Topology
- Pair RDDs. Transformations and Actions on Pair RDDs – groupByKey(), reduceByKey(), aggregateByKey(), cogroup(), foldByKey(), join(), etc.
- Narrow vs. wide transformations
- Spark job execution model: application_id → jobs → stages → tasks
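To make the above concrete, here is a minimal word-count sketch using the RDD API, exercising both plain and pair-RDD transformations. It assumes a Databricks notebook or spark-shell session where sc (the SparkContext) is predefined; the input path is a placeholder.

```scala
// A minimal word count using the RDD API. The input path is a placeholder --
// point it at any text file you have.
val lines = sc.textFile("/path/to/input.txt")

val counts = lines
  .flatMap(_.split("\\s+"))   // narrow transformation: line -> words
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // build a pair RDD of (word, 1)
  .reduceByKey(_ + _)         // wide transformation: shuffles by key

// Nothing has executed yet -- transformations are lazy. take() is an action,
// so it triggers the job (visible in the Spark UI as jobs -> stages -> tasks).
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```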
PART – 2: Setting up your local environment for Spark, Programming Scala and Completing RDD
- Scala basics – conditional statements, iteration, user-defined functions and higher-order functions. OOP concepts – class, object and trait. Scala collections.
- Introduction to Scala – variable declarations, control statements, loops, pattern matching, higher-order functions, function currying, etc.
- Object Orientation – class, object and trait. Constructor, Inheritance
- Companion class/object, case class/object & singleton class
- Scala Collections – mutable & immutable collections, higher-order functions
- Handling null values in Scala using Option, Some and None; tuples (see the sketch after this list)
- Exception Handling
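A minimal sketch of null-safe handling with Option, Some and None; the map contents and names are invented for illustration.

```scala
// Map#get returns Option[Int]: Some(age) if the key is present, None otherwise.
val ages: Map[String, Int] = Map("alice" -> 34, "bob" -> 28)

def describe(name: String): String =
  ages.get(name) match {
    case Some(age) => s"$name is $age years old"
    case None      => s"$name is unknown"
  }

println(describe("alice"))   // alice is 34 years old
println(describe("carol"))   // carol is unknown

// Option also composes with higher-order functions:
val doubled: Option[Int] = ages.get("bob").map(_ * 2)      // Some(56)
val fallback: Int = ages.get("carol").getOrElse(0)         // 0
```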
PART – 3: Working with Structured Data: DataFrame/Dataset. Productionalizing Spark applications on the AWS cloud.
- DataFrames:
- Converting RDDs to DataFrames through implicit schema inference or by specifying an explicit schema
- Reading structured data using the sparkSession.read.fileFormat() and sparkSession.read.format("fileFormat") families of functions
- Querying files directly, e.g. sparkSession.sql("select * from parquet.`path`")
- Applying transformations on DataFrames using:
- Domain Specific Language (DSL), e.g. df.select(), df.groupBy($"col1").sum("col2"), etc.
- Spark SQL, e.g. sparkSession.sql("select * from temp_vw")
- Applying window/analytic functions using the DataFrame DSL and Spark SQL, e.g. rank(), dense_rank(), lead(), lag(), etc. (a sketch follows these bullets)
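A hedged sketch tying these together: creating a DataFrame from an RDD through implicit schema inference, the same aggregation in DSL and SQL form, and a window function. All column, table and app names below are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// RDD -> DataFrame through implicit schema inference (column names supplied).
val salesRdd = spark.sparkContext.parallelize(Seq(
  ("north", "tv", 1200.0), ("north", "phone", 800.0),
  ("south", "tv", 1500.0), ("south", "laptop", 2100.0)))
val sales = salesRdd.toDF("region", "product", "amount")

// The same aggregation, once in DSL form and once in SQL form.
val dslTotals = sales.groupBy($"region").sum("amount")
sales.createOrReplaceTempView("sales")
val sqlTotals = spark.sql("select region, sum(amount) from sales group by region")

// Window/analytic function: rank products by amount within each region.
val byRegion = Window.partitionBy($"region").orderBy($"amount".desc)
sales.withColumn("rnk", rank().over(byRegion)).show()
```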
- Dataset (a typed DataFrame)
- Addressing DataFrame's limitations: compile-time type safety, domain-object information and functional programming
- typed and untyped transformations
- Encoders
- Interoperability – converting RDDs to Datasets and vice versa, and DataFrames to Datasets and vice versa (see the sketch below)
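A short sketch of Dataset interoperability under the same caveat; the Person case class and values are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-sketch").master("local[*]").getOrCreate()
import spark.implicits._   // brings in Encoders for case classes and primitives

// DataFrame -> Dataset: .as[T] attaches the Person encoder, giving typed rows.
val df: DataFrame = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
val ds: Dataset[Person] = df.as[Person]

// Typed, functional transformations -- checked at compile time.
val adults: Dataset[Person] = ds.filter(_.age >= 18).map(p => p.copy(name = p.name.capitalize))

// Dataset -> DataFrame / RDD, and RDD -> Dataset again.
val backToDf: DataFrame = adults.toDF()
val asRdd = adults.rdd
val fromRdd: Dataset[Person] = spark.createDataset(asRdd)
```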
- Deploying and monitoring Spark applications using AWS EMR, Lambda, Step Functions and CloudWatch
PART – 4: Structured Streaming:
- Understanding the high-level streaming API in Spark 2.x
- Triggers and Output modes
- Unified APIs for Batch and Streaming
- Building Advanced Streaming Pipelines Using Structured Streaming
- Stateful window operations
- Tumbling and Sliding windows
- Watermarks and late data
- Windowed joins
- Integrating Apache Kafka with Structured Streaming (a sketch follows this list)
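Pulling the pieces above together, here is a hedged sketch of a windowed streaming count over Kafka with a watermark. The broker address, topic and checkpoint path are placeholders, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("stream-sketch").getOrCreate()
import spark.implicits._

// Each Kafka record arrives with a binary `value` and an event `timestamp`.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "words")                       // placeholder topic
  .load()
  .select($"timestamp", $"value".cast("string").as("word"))

// Tumbling 10-minute windows; the 5-minute watermark bounds how late data
// may arrive before its window's state is dropped.
val counts = events
  .withWatermark("timestamp", "5 minutes")
  .groupBy(window($"timestamp", "10 minutes"), $"word")
  .count()

// "update" output mode emits only the windows changed in each trigger.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/word-counts")  // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()
```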
PART – 5: Real-time case study
To make things fun and interesting, we will introduce multiple datasets coming from disparate data sources – SFTP, MS SQL Server, Amazon S3 and Google Analytics – and create an industry-standard ETL pipeline to populate a data mart implemented on Amazon Redshift (a condensed sketch follows the list below).
(Architecture diagram: Spark_Usecase_Architecture)
- Set up an Amazon EMR (Elastic MapReduce) cluster and start a Zeppelin notebook
- Set up Amazon Redshift database and its client
- Create a Spark DataFrame from files on a remote SFTP server
- Read data from Amazon S3 in Spark
- Create a Spark DataFrame from MS SQL Server tables
- Read Google Analytics data and process it in Spark
- Schedule the Spark job with AWS Data Pipeline
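A condensed sketch of the pipeline's data-access layer, under stated assumptions: every path, hostname, credential and table name below is a placeholder; the hadoop-aws module is assumed for S3 access, and the mssql-jdbc and Redshift JDBC drivers for the database reads and writes.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

// Read CSV files landed on S3.
val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://my-bucket/landing/orders/")            // placeholder bucket

// Read a dimension table from MS SQL Server over JDBC.
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=crm") // placeholder
  .option("dbtable", "dbo.customers")
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

// Join and load the result into the Redshift data mart over JDBC.
// (In production, a Redshift-specific connector that stages via S3 and
// issues COPY is usually faster than plain JDBC inserts.)
val mart = orders.join(customers, Seq("customer_id"))  // placeholder join key
mart.write.format("jdbc")
  .option("url", "jdbc:redshift://cluster:5439/dev")   // placeholder
  .option("dbtable", "mart.customer_orders")
  .option("user", "rs_user")
  .option("password", sys.env.getOrElse("RS_PASSWORD", ""))
  .mode("append")
  .save()
```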
Regards,
Data Science Monks
Mob +91 9148426497
