Big Data using Apache Spark with Scala and AWS
About
This course offers you hands-on knowledge of creating Apache Spark applications using the Scala programming language, in a completely case-study-based approach.
Apache Spark is a fast, general-purpose distributed computing system. It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs (DAGs). It also supports a rich set of higher-level APIs and tools, including DataFrames for structured data processing using object-oriented programming and SQL, Structured Streaming for real-time stream processing, MLlib for machine learning and GraphX for graph processing.
Note: This is not an introductory or theory-based course, though we'll discuss a basic comparison among Hadoop, Spark and Flink. We'll also build some POCs around Hive, Pig, Presto, AWS Athena and Redshift Spectrum.
Learning Outcomes. By the end of this course:
- You will be able to identify the type of data given to you (structured, semi-structured or unstructured) and choose which Spark data abstraction to use (RDD or DataFrame/Dataset).
- Based on the complexity and nature of the data pipeline you are working with, you'll be able to assess which techniques to use: the DataFrame DSL or SQL, Structured Streaming, etc.
- Based on the nature of the data (confidential/PII or not), you'll be able to decide whether to go with an on-premise or a cloud-based solution. You'll also be able to estimate the computational resources required for a given data volume.
- You will be able to comfortably set up a Spark development environment on your local system and start working on any given big data application, and later productionalize it on the AWS cloud using an AWS EMR cluster and AWS Lambda scripts.
- You'll have enough hands-on Spark experience that you'll feel as if you have more than 3 years of experience in "Developing Data Pipelines using Big Data Technologies".
Recommended background
You should have at least some programming experience in any language; a basic level suffices, e.g., variable declarations, conditional expressions (if-else, switch statements), control statements, collections, etc.
PART – 1: Getting Started with Spark – Programming RDD using Databricks Notebook
Setting up your 1st Spark cluster on Databricks Community Edition, a free cloud platform. Introduction to Spark RDDs, transformations and actions, and distributed key-value pairs (pair RDDs).
- Writing your 1st Spark program using a Databricks notebook – the Word Count example (see the sketch after this list)
- Creating RDDs from different file formats (text file, sequence file, object file, etc.) and from Scala collections
- Optimization using correct partitioning
- Transformation and Actions
- Understanding Cluster Topology
- Pair RDDs. Transformations and Actions on Pair RDDs – groupByKey(), reduceByKey(), aggregateByKey(), cogroup(), foldByKey(), join(), etc.
- Narrow vs. wide transformations
- Spark job execution model: application_id → jobs → stages → tasks
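To make the above concrete, here is a minimal word-count sketch using the RDD API, exercising both plain and pair-RDD transformations. It assumes a Databricks notebook or spark-shell session where sc (the SparkContext) is predefined; the input path is a placeholder.

```scala
// A minimal word count using the RDD API. The input path is a placeholder --
// point it at any text file you have.
val lines = sc.textFile("/path/to/input.txt")

val counts = lines
  .flatMap(_.split("\\s+"))   // narrow transformation: line -> words
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // build a pair RDD of (word, 1)
  .reduceByKey(_ + _)         // wide transformation: shuffles by key

// Nothing has executed yet -- transformations are lazy. take() is an action,
// so it triggers the job (visible in the Spark UI as jobs -> stages -> tasks).
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```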
PART – 2: Setting up your local environment for Spark, Programming Scala and Completing RDD
- Scala basics – conditional statements, iteration, user-defined functions and higher-order functions. OOP concepts – class, object and trait. Scala collections.
- Introduction to Scala – variable declarations, control statements, loops, pattern matching, higher-order functions, function currying, etc.
- Object Orientation – class, object and trait. Constructor, Inheritance
- Companion class/object, case class/object & singleton class
- Scala Collections – mutable & immutable collections, higher-order functions
- Handling null values in Scala using Option, Some and None; tuples (see the sketch after this list)
- Exception Handling
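A minimal sketch of null-safe handling with Option, Some and None; the map contents and names are invented for illustration.

```scala
// Map#get returns Option[Int]: Some(age) if the key is present, None otherwise.
val ages: Map[String, Int] = Map("alice" -> 34, "bob" -> 28)

def describe(name: String): String =
  ages.get(name) match {
    case Some(age) => s"$name is $age years old"
    case None      => s"$name is unknown"
  }

println(describe("alice"))   // alice is 34 years old
println(describe("carol"))   // carol is unknown

// Option also composes with higher-order functions:
val doubled: Option[Int] = ages.get("bob").map(_ * 2)      // Some(56)
val fallback: Int = ages.get("carol").getOrElse(0)         // 0
```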
PART – 3: Working with Structured Data: DataFrame/Dataset. Productionalizing Spark applications on the AWS cloud.
- DataFrames:
- Converting RDDs to DataFrames through implicit schema inference or by specifying an explicit schema
- Reading structured data using the sparkSession.read.fileFormat() and sparkSession.read.format("fileFormat") families of functions
- Querying files directly, e.g. sparkSession.sql("select * from parquet.`path`")
- Applying transformations on DataFrames using:
- Domain Specific Language (DSL), e.g. df.select(), df.groupBy($"col1").sum("col2"), etc.
- Spark SQL, e.g. sparkSession.sql("select * from temp_vw")
- Applying window/analytic functions using the DataFrame DSL and Spark SQL, e.g. rank(), dense_rank(), lead(), lag(), etc. (a sketch follows these bullets)
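A hedged sketch tying these together: creating a DataFrame from an RDD through implicit schema inference, the same aggregation in DSL and SQL form, and a window function. All column, table and app names below are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// RDD -> DataFrame through implicit schema inference (column names supplied).
val salesRdd = spark.sparkContext.parallelize(Seq(
  ("north", "tv", 1200.0), ("north", "phone", 800.0),
  ("south", "tv", 1500.0), ("south", "laptop", 2100.0)))
val sales = salesRdd.toDF("region", "product", "amount")

// The same aggregation, once in DSL form and once in SQL form.
val dslTotals = sales.groupBy($"region").sum("amount")
sales.createOrReplaceTempView("sales")
val sqlTotals = spark.sql("select region, sum(amount) from sales group by region")

// Window/analytic function: rank products by amount within each region.
val byRegion = Window.partitionBy($"region").orderBy($"amount".desc)
sales.withColumn("rnk", rank().over(byRegion)).show()
```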
- Dataset (a typed DataFrame)
- Addressing DataFrame's limitations: compile-time type safety, domain-object information and functional programming
- typed and untyped transformations
- Encoders
- Interoperability – converting RDDs to Datasets and vice versa, and DataFrames to Datasets and vice versa (see the sketch below)
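A short sketch of Dataset interoperability under the same caveat; the Person case class and values are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-sketch").master("local[*]").getOrCreate()
import spark.implicits._   // brings in Encoders for case classes and primitives

// DataFrame -> Dataset: .as[T] attaches the Person encoder, giving typed rows.
val df: DataFrame = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
val ds: Dataset[Person] = df.as[Person]

// Typed, functional transformations -- checked at compile time.
val adults: Dataset[Person] = ds.filter(_.age >= 18).map(p => p.copy(name = p.name.capitalize))

// Dataset -> DataFrame / RDD, and RDD -> Dataset again.
val backToDf: DataFrame = adults.toDF()
val asRdd = adults.rdd
val fromRdd: Dataset[Person] = spark.createDataset(asRdd)
```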
- Deploying and monitoring Spark applications using AWS EMR, Lambda, Step Functions and CloudWatch
PART – 4: Structured Streaming:
- Understanding the high-level streaming API in Spark 2.x
- Triggers and Output modes
- Unified APIs for Batch and Streaming
- Building Advanced Streaming Pipelines Using Structured Streaming
- Stateful window operations
- Tumbling and Sliding windows
- Watermarks and late data
- Windowed joins
- Integrating Apache Kafka with Structured Streaming (a sketch follows this list)
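Pulling the pieces above together, here is a hedged sketch of a windowed streaming count over Kafka with a watermark. The broker address, topic and checkpoint path are placeholders, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("stream-sketch").getOrCreate()
import spark.implicits._

// Each Kafka record arrives with a binary `value` and an event `timestamp`.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "words")                       // placeholder topic
  .load()
  .select($"timestamp", $"value".cast("string").as("word"))

// Tumbling 10-minute windows; the 5-minute watermark bounds how late data
// may arrive before its window's state is dropped.
val counts = events
  .withWatermark("timestamp", "5 minutes")
  .groupBy(window($"timestamp", "10 minutes"), $"word")
  .count()

// "update" output mode emits only the windows changed in each trigger.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/word-counts")  // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()
```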
PART – 5: Real-time case study
To make things fun and interesting, we will introduce multiple datasets coming from disparate data sources – SFTP, MS SQL Server, Amazon S3 and Google Analytics – and create an industry-standard ETL pipeline to populate a data mart implemented on Amazon Redshift (a condensed sketch follows the list below).
(Architecture diagram: Spark_Usecase_Architecture)
- Set up an Amazon EMR (Elastic MapReduce) cluster and start a Zeppelin notebook
- Set up Amazon Redshift database and its client
- Create a Spark DataFrame from files on a remote SFTP server
- Read data from Amazon S3 in Spark
- Create a Spark DataFrame from MS SQL Server tables
- Read Google Analytics data and process it in Spark
- Schedule the Spark job with AWS Data Pipeline
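A condensed sketch of the pipeline's data-access layer, under stated assumptions: every path, hostname, credential and table name below is a placeholder; the hadoop-aws module is assumed for S3 access, and the mssql-jdbc and Redshift JDBC drivers for the database reads and writes.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

// Read CSV files landed on S3.
val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://my-bucket/landing/orders/")            // placeholder bucket

// Read a dimension table from MS SQL Server over JDBC.
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=crm") // placeholder
  .option("dbtable", "dbo.customers")
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

// Join and load the result into the Redshift data mart over JDBC.
// (In production, a Redshift-specific connector that stages via S3 and
// issues COPY is usually faster than plain JDBC inserts.)
val mart = orders.join(customers, Seq("customer_id"))  // placeholder join key
mart.write.format("jdbc")
  .option("url", "jdbc:redshift://cluster:5439/dev")   // placeholder
  .option("dbtable", "mart.customer_orders")
  .option("user", "rs_user")
  .option("password", sys.env.getOrElse("RS_PASSWORD", ""))
  .mode("append")
  .save()
```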
Regards,
Data Science Monks
Mob +91 9148426497
