Apache Spark For Java Developers
Posted by Superadmin on November 16 2020 16:44:55

with Richard Chesterwood, Matt Greencroft, Virtual Pair Programmers


1. Welcome



Get started with the amazing Apache Spark parallel computing framework – this course is designed especially for Java Developers. If you’re new to Data Science and want to find out how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy-to-follow examples. You’ll be able to follow along with all of the examples and run them on your own local development computer.
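To give a flavour of those examples, here is a minimal sketch of a self-contained Spark Core job in Java that runs entirely on a local machine. The class name and data are illustrative only, not taken from the course materials:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class FirstSparkJob {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM, using all available cores
        SparkConf conf = new SparkConf().setAppName("firstJob").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Double> values = sc.parallelize(Arrays.asList(3.5, 12.0, 90.2, 20.3));
            // reduce is an action: it triggers execution and returns a single value
            double total = values.reduce((a, b) -> a + b);
            System.out.println("Total: " + total);
        }
    }
}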

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!
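For a taste of what SparkML code looks like, here is a rough linear-regression sketch. The CSV file and its "size", "age" and "price" columns are invented for illustration:

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LinearRegressionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mlSketch").master("local[*]").getOrCreate();

        // Hypothetical CSV with numeric "size" and "age" features and a "price" label
        Dataset<Row> input = spark.read()
                .option("header", true).option("inferSchema", true)
                .csv("src/main/resources/houses.csv");

        // SparkML expects all features gathered into a single vector column
        Dataset<Row> withFeatures = new VectorAssembler()
                .setInputCols(new String[] {"size", "age"})
                .setOutputCol("features")
                .transform(input);

        LinearRegressionModel model = new LinearRegression()
                .setLabelCol("price")
                .fit(withFeatures);

        System.out.println("Intercept: " + model.intercept()
                + ", coefficients: " + model.coefficients());
        spark.close();
    }
}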
And finally, there’s a full three-hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.
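As an illustration of the Structured Streaming half of that module, reading from a Kafka topic looks roughly like this. It assumes a broker on localhost:9092, the spark-sql-kafka connector on the classpath, and a made-up topic name:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("streamingSketch").master("local[*]").getOrCreate();

        // Subscribe to a hypothetical "viewrecords" topic on a local broker
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "viewrecords")
                .load();

        // Kafka rows arrive as binary key/value pairs; cast to strings for display
        StreamingQuery query = stream
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                .writeStream()
                .format("console")
                .outputMode("append")
                .start();

        query.awaitTermination();
    }
}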
Optionally, if you have an AWS account, you’ll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you’re not familiar with AWS you don’t need to follow along with the coding in this part, but the video is still worthwhile to watch.
You’ll be going deep into the internals of Spark and you’ll find out how it optimizes your execution plans. We’ll compare the performance of RDDs vs SparkSQL, and you’ll learn about the major performance pitfalls that could cost live projects a lot of money.
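A simple tool used in that kind of investigation is Spark's own explain output. Here is a sketch against an invented log file and temporary view:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explainSketch").master("local[*]").getOrCreate();

        // Hypothetical CSV of log messages with a "level" column
        Dataset<Row> logs = spark.read()
                .option("header", true)
                .csv("src/main/resources/biglog.csv");
        logs.createOrReplaceTempView("logging_table");

        Dataset<Row> results = spark.sql(
                "select level, count(1) as total from logging_table group by level");

        // Prints the parsed, analyzed, optimized and physical plans that the
        // Catalyst optimizer produced for this query
        results.explain(true);
        spark.close();
    }
}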
Throughout the course, you’ll be getting plenty of practice with Java 8 lambdas – a great way to learn functional-style Java if you’re new to it.

What you’ll learn
  • Use functional-style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use an SQL-style syntax to produce reports against Big Data sets (see the sketch after this list)
  • Use Machine Learning algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process streams of Big Data
  • See how Structured Streaming can be used to build pipelines with Kafka
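To make the SQL-syntax and DataFrame bullets concrete, here is a sketch of the same grouping written both ways. The exams.csv file and its "subject" and "score" columns are invented for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SqlSyntaxSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sqlSketch").master("local[*]").getOrCreate();

        Dataset<Row> students = spark.read()
                .option("header", true).option("inferSchema", true)
                .csv("src/main/resources/exams.csv");

        // Option 1: full SQL syntax against a temporary view
        students.createOrReplaceTempView("students");
        Dataset<Row> viaSql = spark.sql(
                "select subject, max(score) as top_score from students group by subject");

        // Option 2: the equivalent query through the DataFrame API
        Dataset<Row> viaApi = students.groupBy(col("subject")).max("score");

        viaSql.show();
        viaApi.show();
        spark.close();
    }
}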
Requirements
  • Java 8 is required for the course. Spark does not currently support Java 9+, and you need Java 8 for the functional lambda syntax
  • Previous knowledge of Java is assumed, but anything beyond the basics is explained
  • Some previous SQL experience will be useful for part of the course, but if you’ve never used it before, this will be a good first experience

Course Contents
01. Introduction
02. Getting Started
03. Reduces on RDDs
04. Mapping and Outputting
05. Tuples
06. PairRDDs
07. FlatMaps and Filters
08. Reading from Disk
09. Keyword Ranking Practical
10. Sorts and Coalesce
11. Deploying to AWS EMR (Optional)
12. Joins
13. Big Data Big Exercise
14. RDD Performance
15. Module 2 - Chapter 1 SparkSQL Introduction
16. SparkSQL Getting Started
17. Datasets
18. The Full SQL Syntax
19. In Memory Data
20. Groupings and Aggregations
21. Date Formatting
22. Multiple Groupings
23. Ordering
24. DataFrames API
25. Pivot Tables
26. More Aggregations
27. Practical Exercise
28. User Defined Functions
29. SparkSQL Performance
30. HashAggregation
31. SparkSQL Performance vs RDDs
32. Module 3 - SparkML for Machine Learning
33. Linear Regression Models
34. Training Data
35. Model Fitting Parameters
36. Feature Selection
37. Non-Numeric Data
38. Pipelines
39. Case Study
40. Logistic Regression
41. Decision Trees
42. K Means Clustering
43. Recommender Systems
44. Module 4 - Spark Streaming and Structured Streaming with Kafka
45. Streaming Chapter 2 - Streaming with Apache Kafka
46. Streaming Chapter 3 - Structured Streaming

Lesson Videos and Resources
2. Downloading the Code.html

3. Module 1 - Introduction

3.1 Practicals.zip

4. Spark Architecture and RDDs

1. Warning - Java 9, 10, 11 is not supported by Spark.html




2. Installing Spark

1. Reduces on RDDs

1. Mapping Operations
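(The lesson video is not reproduced here, but as a rough sketch of the topic: map transforms every element of an RDD through a lambda, and collecting the results brings them back to the driver for printing. The data is invented for illustration.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class MappingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("mapping").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 4, 9, 16));
            // map is a transformation: each input element produces one output element
            JavaRDD<Double> squareRoots = numbers.map(n -> Math.sqrt(n));
            // collect brings the results back to the driver so they can be printed
            squareRoots.collect().forEach(System.out::println);
        }
    }
}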




2. Outputting Results to the Console

3. Counting Big Data Items

4. If you've had a NotSerializableException in Spark




1. RDDs of Objects

2. Tuples and RDDs
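(As a rough sketch of the topic: Java has no built-in pair type, so Spark's Java API borrows Scala's Tuple2 to keep two related values together in each RDD element. The data is invented for illustration.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class TuplesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("tuples").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(9, 2, 5));
            // Each element becomes a (number, squareRoot) pair held in a Tuple2
            JavaRDD<Tuple2<Integer, Double>> withSqrts =
                    numbers.map(n -> new Tuple2<>(n, Math.sqrt(n)));
            withSqrts.collect().forEach(t ->
                    System.out.println(t._1() + " -> " + t._2()));
        }
    }
}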




1. Overview of PairRDDs

Apache Spark For Java Developers

with Richard Chesterwood, Matt Greencroft, Virtual Pair Programmers


2. Building a PairRDD
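
A minimal sketch of the idea behind this lesson, using the standard Spark Java API (the sample data is invented here: log lines of the form "LEVEL: message"). mapToPair turns an ordinary JavaRDD into a JavaPairRDD of key/value Tuple2s, which unlocks the by-key operations used in the rest of the chapter:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class PairRddSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("pairRdd").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> logs = sc.parallelize(Arrays.asList(
                    "WARN: disk nearly full", "ERROR: out of memory", "WARN: slow response"));

            // mapToPair builds the PairRDD: key = log level, value = 1, ready for counting
            JavaPairRDD<String, Long> pairs =
                    logs.mapToPair(line -> new Tuple2<>(line.split(":")[0], 1L));
        }
    }

The later snippets in this outline continue this sketch and assume the same imports and JavaSparkContext.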



3. Coding a ReduceByKey
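
Continuing the sketch above: reduceByKey merges the values for each key with a function you supply. The function must be associative, because Spark combines partial results within each partition before shuffling them together:

    // Sum the 1s per key to count occurrences of each log level
    JavaPairRDD<String, Long> counts = pairs.reduceByKey((a, b) -> a + b);
    counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));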



4. Using the Fluent API
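
Because each transformation returns a new RDD, the whole job can be written as one fluent chain instead of a series of named intermediate variables. The same count expressed fluently:

    sc.parallelize(Arrays.asList("WARN: a", "ERROR: b", "WARN: c"))
      .mapToPair(line -> new Tuple2<>(line.split(":")[0], 1L))
      .reduceByKey((a, b) -> a + b)
      .collect()
      .forEach(t -> System.out.println(t._1() + " appears " + t._2() + " times"));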



5. Grouping By Key
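
groupByKey gathers every value for a key into a single Iterable. A sketch, with the standard caveat: all of a key's values are shuffled to one executor, so for simple aggregations reduceByKey is usually the safer, faster choice:

    // Each key maps to an Iterable of all its values
    pairs.groupByKey()
         .collect()
         .forEach(t -> System.out.println(t._1() + " -> " + t._2()));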



1. FlatMaps
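
Where map produces exactly one output element per input, flatMap can produce zero or many: in the Java API the lambda returns an Iterator. The classic example is fanning lines out into words (input strings invented for illustration):

    JavaRDD<String> sentences = sc.parallelize(Arrays.asList("spark is fast", "java is verbose"));

    // One input line becomes many output words
    JavaRDD<String> words = sentences.flatMap(line -> Arrays.asList(line.split(" ")).iterator());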



2. Filters
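
filter keeps only the elements for which a predicate returns true, for example dropping short, uninteresting words from the word list built above:

    // Keep only words long enough to be interesting
    JavaRDD<String> longWords = words.filter(word -> word.length() > 4);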



1. Reading from Disk
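
sc.textFile reads a file into an RDD with one element per line. Two points worth knowing: the path below is a placeholder, and the read is lazy, so nothing is actually loaded until an action (such as count or collect) runs:

    // Hypothetical path; loading happens lazily when an action is triggered
    JavaRDD<String> lines = sc.textFile("src/main/resources/subtitles/input.txt");
    System.out.println(lines.count());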



1. Practical Requirements



2. Worked Solution
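
The exact practical spec isn't reproduced on this page, but a keyword-ranking job of the kind this chapter describes generally has the following shape (the input path and the crude "boring word" filter are assumptions for illustration):

    JavaPairRDD<String, Long> keywordCounts =
            sc.textFile("src/main/resources/subtitles/input.txt")     // hypothetical input
              .flatMap(line -> Arrays.asList(line.toLowerCase().split("[^a-z]+")).iterator())
              .filter(word -> word.length() > 3)                      // crude stop-word filter
              .mapToPair(word -> new Tuple2<>(word, 1L))
              .reduceByKey((a, b) -> a + b);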



3. Worked Solution (continued) with Sorting
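
To rank the keywords, a common trick is to flip each tuple so the count becomes the key, then use sortByKey. Continuing the sketch:

    JavaPairRDD<Long, String> ranked =
            keywordCounts.mapToPair(t -> new Tuple2<>(t._2(), t._1()))
                         .sortByKey(false);   // false = descending

    // take(10) returns an ordered List on the driver
    ranked.take(10).forEach(t -> System.out.println(t._2() + ": " + t._1()));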



1. Why do sorts not work with foreach in Spark
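
The short answer: a sorted RDD is ordered across its partitions, but foreach runs as one task per partition, in parallel, so the printed output from the tasks interleaves and the ordering appears lost (on a real cluster, executor println output never reaches the driver's console at all). A sketch of the contrast:

    // Runs in parallel, one task per partition - output order is not guaranteed
    ranked.foreach(t -> System.out.println(t));

    // take pulls results back to the driver in sorted order, so this prints correctly
    ranked.take(10).forEach(System.out::println);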



2. Why Coalesce is the Wrong Solution
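
The tempting "fix" is to squash the RDD into a single partition so that foreach runs as one task and prints in order. It works on a toy dataset, but it funnels the entire dataset through a single task, throwing away the parallelism that is the whole point of Spark:

    // Anti-pattern: one partition means one task doing all the work
    ranked.coalesce(1).foreach(t -> System.out.println(t));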



3. What is Coalesce used for in Spark
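
coalesce does have a legitimate job: shrinking the partition count after an operation such as a drastic filter has left most partitions nearly empty, so later stages don't schedule swarms of tiny tasks. Unlike repartition, coalesce merges existing partitions without a full shuffle. A sketch (path invented):

    JavaRDD<String> big = sc.textFile("src/main/resources/logs/huge-logfile.txt");
    JavaRDD<String> errorsOnly = big.filter(line -> line.startsWith("ERROR"));

    // Merge the mostly-empty partitions down without a full shuffle
    JavaRDD<String> compact = errorsOnly.coalesce(8);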



1. How to start an EMR Spark Cluster



2. Packing a Spark Jar for EMR
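
Beyond building a jar that bundles your dependencies (a build-tool detail not shown here), the usual Java-side change when preparing a job for a real cluster is to stop hard-coding the master URL, so that spark-submit on EMR can supply it:

    // Local development: master is hard-coded
    SparkConf devConf = new SparkConf().setAppName("keywordRanking").setMaster("local[*]");

    // Cluster build: omit setMaster and let spark-submit decide
    SparkConf clusterConf = new SparkConf().setAppName("keywordRanking");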



3. Running a Spark Job on EMR
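
As a taste of what this lecture covers, here is a minimal sketch (not the course's exact code) of the shape a Spark job typically takes before it can run on EMR: the hard-coded local master is removed so that spark-submit on the cluster can supply one, and input is read from S3 rather than local disk. The bucket path is a placeholder, not a real course resource.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EmrJobSketch {
    public static void main(String[] args) {
        // No .setMaster("local[*]") here -- on EMR, spark-submit supplies the
        // cluster's master, so hard-coding a local one would break deployment.
        SparkConf conf = new SparkConf().setAppName("emrJob");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder S3 path -- substitute your own bucket and file.
            long lines = sc.textFile("s3://your-bucket/input.txt").count();
            System.out.println("Line count: " + lines);
        }
    }
}
```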



4. Understanding the Job Progress Output



5. Calculating EMR costs and Terminating the cluster



1. Inner Joins
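
To preview the API this chapter works with, here is a minimal, self-contained sketch (toy data, not the course exercise) of an inner join between two JavaPairRDDs keyed by user id. Only keys present in both RDDs survive the join.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class InnerJoinSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("innerJoin").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // (userId, visitCount) and (userId, userName)
            JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(4, 18), new Tuple2<>(6, 4), new Tuple2<>(10, 9)));
            JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "John"), new Tuple2<>(4, "Bob"), new Tuple2<>(6, "Raquel")));

            // Inner join: only keys 4 and 6 appear in both RDDs, so only they survive.
            JavaPairRDD<Integer, Tuple2<Integer, String>> joined = visits.join(users);
            joined.collect().forEach(System.out::println);
        }
    }
}
```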



2. Left Outer Joins and Optionals
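
A quick illustrative sketch (again toy data) of the two ideas in this lecture's title: leftOuterJoin keeps every key from the left-hand RDD, and the right-hand value comes back wrapped in Spark's own Optional class, which is empty when there was no match.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;

import scala.Tuple2;

public class LeftOuterJoinSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("leftOuterJoin").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(4, 18), new Tuple2<>(6, 4), new Tuple2<>(10, 9)));
            JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "John"), new Tuple2<>(4, "Bob"), new Tuple2<>(6, "Raquel")));

            // Every key from the left RDD (visits) is kept; the user name is an
            // Optional, empty for key 10 because no user matches it.
            JavaPairRDD<Integer, Tuple2<Integer, Optional<String>>> joined =
                    visits.leftOuterJoin(users);

            joined.collect().forEach(row -> {
                Optional<String> maybeName = row._2()._2();
                String name = maybeName.isPresent() ? maybeName.get() : "unknown";
                System.out.println(row._1() + ": " + name);
            });
        }
    }
}
```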



3. Right Outer Joins
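
The mirror image of the previous sketch: rightOuterJoin keeps every key from the right-hand RDD, and it is now the left-hand value that becomes an Optional. Same illustrative data as before, not the course's.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;

import scala.Tuple2;

public class RightOuterJoinSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rightOuterJoin").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(4, 18), new Tuple2<>(6, 4), new Tuple2<>(10, 9)));
            JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "John"), new Tuple2<>(4, "Bob"), new Tuple2<>(6, "Raquel")));

            // Every key from the right RDD (users) is kept; the visit count is an
            // Optional, empty for key 1 because John has no recorded visits.
            JavaPairRDD<Integer, Tuple2<Optional<Integer>, String>> joined =
                    visits.rightOuterJoin(users);

            joined.collect().forEach(row -> {
                int visitCount = row._2()._1().isPresent() ? row._2()._1().get() : 0;
                System.out.println(row._2()._2() + " has " + visitCount + " visits");
            });
        }
    }
}
```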



4. Full Joins and Cartesians
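
A final illustrative sketch for this chapter: fullOuterJoin keeps every key from either side (so both halves of the value pair are Optionals), while cartesian pairs every element of one RDD with every element of the other. Toy data again; note that cartesian output grows multiplicatively, so it can be very expensive on real datasets.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;

import scala.Tuple2;

public class FullJoinCartesianSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("fullJoin").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(4, 18), new Tuple2<>(10, 9)));
            JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "John"), new Tuple2<>(4, "Bob")));

            // Full outer join: keys 1, 4 and 10 all appear; any missing side
            // shows up as an empty Optional.
            JavaPairRDD<Integer, Tuple2<Optional<Integer>, Optional<String>>> full =
                    visits.fullOuterJoin(users);
            full.collect().forEach(System.out::println);

            // Cartesian: every element paired with every element -- 3 x 2 = 6 pairs.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3));
            JavaRDD<String> letters = sc.parallelize(Arrays.asList("a", "b"));
            JavaPairRDD<Integer, String> allPairs = numbers.cartesian(letters);
            allPairs.collect().forEach(System.out::println);
        }
    }
}
```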



1. Introducing the Requirements



1.1 Practical Guide.pdf



2. Warmup



3. Main Exercise Requirements



4. Walkthrough - Step 2



5. Walkthrough - Step 3



6. Walkthrough - Step 4



7. Walkthrough - Step 5



8. Walkthrough - Step 6



9. Walkthrough - Step 7



10. Walkthrough - Step 8



11. Walkthrough - Step 9, adding titles and using the Big Data file



1. Transformations and Actions
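
The key idea in this chapter is that transformations are lazy and only actions trigger work on the cluster. Here's a minimal, hedged sketch using the Spark Java API (the class name, app name and data are placeholders, not the course's actual code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class TransformationsVsActions {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // map is a transformation: it only extends the execution plan, nothing runs yet.
            JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

            // count is an action: it triggers the whole plan and returns a result to the driver.
            System.out.println("count = " + doubled.count());

            sc.close();
        }
    }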



2. The DAG and SparkUI
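
As a hedged illustration of the tooling this chapter uses: toDebugString() prints an RDD's lineage, and a local application normally serves the Spark UI on http://localhost:4040 while it runs. Blocking on input at the end is just a trick to keep the UI alive long enough to look at it; all names here are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;
    import java.util.Scanner;

    public class DagDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

            // Print the lineage (the logical DAG) that the action below will execute.
            System.out.println(doubled.toDebugString());
            doubled.count();

            // Keep the driver alive so the Spark UI can be inspected in a browser.
            new Scanner(System.in).nextLine();
            sc.close();
        }
    }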



3. Narrow vs Wide Transformations
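
A minimal sketch of the distinction this chapter draws (data and names are illustrative): narrow transformations stay within a partition, while wide ones must bring matching keys together, forcing a shuffle and a new stage.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class NarrowVsWide {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a", "c", "b", "a"));

            // Narrow: each output partition depends on exactly one input
            // partition, so no data moves between executors.
            JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));

            // Wide: equal keys must meet, which forces a shuffle across the
            // network and starts a new stage in the DAG.
            JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

            counts.collect().forEach(System.out::println);
            sc.close();
        }
    }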



4. Shuffles
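
A small, hedged sketch of how partition counts and shuffles interact (the sizes and counts are arbitrary): repartition always performs a full shuffle of every row, while coalesce can shrink the partition count by merging locally.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.ArrayList;
    import java.util.List;

    public class ShuffleDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) data.add(i);

            JavaRDD<Integer> rdd = sc.parallelize(data);
            System.out.println("before: " + rdd.getNumPartitions());

            // repartition always shuffles every row across the cluster.
            JavaRDD<Integer> reshuffled = rdd.repartition(8);

            // coalesce merges partitions on the same executor, avoiding a full shuffle.
            JavaRDD<Integer> narrowed = reshuffled.coalesce(2);

            System.out.println("after:  " + narrowed.getNumPartitions());
            sc.close();
        }
    }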



5. Dealing with Key Skews
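
One standard remedy for a skewed key is "salting" it across several buckets. A minimal sketch, assuming a salt range of 8 and a '#' separator (both arbitrary illustrative choices, not necessarily the course's approach):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;
    import java.util.concurrent.ThreadLocalRandom;

    public class SaltingDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("hot", 1), new Tuple2<>("hot", 1), new Tuple2<>("rare", 1)));

            // Spread a skewed key over several buckets by appending a random salt.
            JavaPairRDD<String, Integer> salted = pairs.mapToPair(t ->
                    new Tuple2<>(t._1() + "#" + ThreadLocalRandom.current().nextInt(8), t._2()));

            // The first reduce works on many small buckets instead of one huge one...
            JavaPairRDD<String, Integer> partial = salted.reduceByKey(Integer::sum);

            // ...then strip the salt and combine the (now tiny) partial totals.
            JavaPairRDD<String, Integer> totals = partial
                    .mapToPair(t -> new Tuple2<>(t._1().substring(0, t._1().lastIndexOf('#')), t._2()))
                    .reduceByKey(Integer::sum);

            totals.collect().forEach(System.out::println);
            sc.close();
        }
    }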



6. Avoiding groupByKey and using map-side-reduces instead
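
A hedged sketch of the trade-off in this chapter's title (data and names are illustrative): reduceByKey combines values on the map side before the shuffle, whereas groupByKey ships every raw value across the network first.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class MapSideReduceDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("WARN", 1), new Tuple2<>("ERROR", 1), new Tuple2<>("WARN", 1)));

            // groupByKey ships every single value across the network before
            // anything is combined - a skewed key can overwhelm one executor.
            // pairs.groupByKey();

            // reduceByKey combines values on the map side first, so only one
            // partial total per key per partition is shuffled.
            JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

            counts.collect().forEach(System.out::println);
            sc.close();
        }
    }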



7. Caching and Persistence
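
A minimal sketch of reusing an RDD across several actions (names and data are placeholders): without persistence, each action recomputes the full lineage from scratch.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;
    import java.util.Arrays;

    public class CachingDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> expensive = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                    .map(n -> n * n); // stand-in for a costly pipeline

            // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY());
            // MEMORY_AND_DISK spills to disk if the RDD doesn't fit in memory.
            expensive.persist(StorageLevel.MEMORY_AND_DISK());

            System.out.println(expensive.count());            // first action materialises and stores it
            System.out.println(expensive.reduce(Integer::sum)); // second action reads the cached copy

            sc.close();
        }
    }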



1. Code for SQLDataFrames Section.html



1.1 biglog.txt



1.2 Code.zip



2. Introducing SparkSQL
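
As a taster, a hedged sketch of the SparkSQL entry point: a SparkSession replaces the JavaSparkContext, and reads produce a schema-carrying Dataset<Row>, i.e. a DataFrame (the CSV path is a placeholder, not necessarily the course's file):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlIntro {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("demo")
                    .master("local[*]")
                    .getOrCreate();

            // Unlike a raw RDD, a Dataset<Row> knows its column names and types.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv("src/main/resources/students.csv"); // hypothetical path

            df.printSchema();
            df.show(5);
            spark.close();
        }
    }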



Get started with the amazing Apache Spark parallel computing framework – this course is designed especially for Java Developers. If you’re new to Data Science and want to find out about how massive datasets are processed in parallel, then the Java API for spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy to follow examples. You’ll be able to follow along with all of the examples, and run them on your own local development computer.

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!
And finally, there’s a full 3 hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.
Optionally, if you have an AWS account, you’ll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you’re not familiar with AWS you can skip this video, but it’s still worthwhile to watch rather than following along with the coding.
You’ll be going deep into the internals of Spark and you’ll find out how it optimizes your execution plans. We’ll be comparing the performance of RDDs vs SparkSQL, and you’ll learn about the major performance pitfalls which could save a lot of money for live projects.
Throughout the course, you’ll be getting some great practice with Java 8 Lambdas – a great way to learn functional-style Java if you’re new to it.

What you’ll learn
  • Use functional style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use an SQL style syntax to produce reports against Big Data sets
  • Use Machine Learning Algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process Streams of Big Data.
  • See how Structured Streaming can be used to build pipelines with Kafka
Requirements
  • Java 8 is required for the course. Spark does not currently support Java9+, and you need Java 8 for the functional Lambda syntax
  • Previous knowledge of Java is assumed, but anything above the basics is explained
  • Some previous SQL will be useful for part of the course, but if you’ve never used it before this will be a good first experience

      
Course Contents
01 Introduction 02 Getting Started 03 Reduces on RDDs 04 Mapping and Outputting 05 Tuples 06 PairRDDs 07. FlatMaps and Filters 8. Reading from Disk 9. Keyword Ranking Practical 10. Sorts and Coalesce 11. Deploying to AWS EMR (Optional) 12. Joins 13. Big Data Big Exercise 14. RDD Performance 15. Module 2 - Chapter 1 SparkSQL Introduction 16. SparkSQL Getting Started 17. Datasets 18. The Full SQL Syntax 19. In Memory Data 20. Groupings and Aggregations 21. Date Formatting 22. Multiple Groupings 23. Ordering 24. DataFrames API 25. Pivot Tables 26. More Aggregations 27. Practical Exercise 28. User Defined Functions 29. SparkSQL Performance 30. HashAggregation 31. SparkSQL Performance vs RDDs 32. Module 3 - SparkML for Machine Learning 33. Linear Regression Models 34. Training Data 35. Model Fitting Parameters 36. Feature Selection 37. Non-Numeric Data 38. Pipelines 39. Case Study 40. Logistic Regression 41. Decision Trees 42. K Means Clustering 43. Recommender Systems 44. Module 4 -Spark Streaming and Structured Streaming with Kafka 45. Streaming Chapter 2 - Streaming with Apache Kafka 46. Streaming Chapter 3- Structured Streaming



1. SparkSQL Getting Started
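
As a rough companion to this lesson, here's a minimal sketch of starting a SparkSession and loading a CSV file through the Java API. The class name, file path and column layout are illustrative rather than taken from the lesson's own code.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlGettingStarted {
        public static void main(String[] args) {
            // A SparkSession is the entry point for SparkSQL, much as
            // JavaSparkContext is for the RDD API.
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSQL Getting Started")
                    .master("local[*]")   // run locally, using all available cores
                    .getOrCreate();

            // Illustrative path; any CSV with a header row will do.
            Dataset<Row> dataset = spark.read()
                    .option("header", true)
                    .csv("src/main/resources/exams/students.csv");

            dataset.show();   // prints the first 20 rows
            System.out.println("There are " + dataset.count() + " records");

            spark.close();
        }
    }

The later sketches in these notes reuse the spark and dataset variables set up here rather than repeating the boilerplate.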






1. Dataset Basics
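
Loosely, this lesson covers what a Dataset<Row> gives you out of the box. A small sketch, reusing the spark and dataset variables from the first sketch; the column names are illustrative:

    long numberOfRows = dataset.count();   // how many records were loaded
    Row firstRow = dataset.first();        // pull out a single Row

    // Values can be read from a Row by column name or by position.
    String subject = firstRow.getAs("subject").toString();
    int year = Integer.parseInt(firstRow.getAs("year"));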






2. Filters using Expressions
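
One way to filter a Dataset is with a SQL-style expression string. A minimal sketch, under the same illustrative exam-results data as above:

    // filter() accepts a SQL-like expression evaluated against each row.
    Dataset<Row> modernArtResults =
            dataset.filter("subject = 'Modern Art' AND year >= 2007");
    modernArtResults.show();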






3. Filters using Lambdas
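
The same filter can be written as a Java 8 lambda. The cast to FilterFunction<Row> is needed so the compiler can pick the right filter() overload; column names are again illustrative:

    import org.apache.spark.api.java.function.FilterFunction;

    Dataset<Row> modernArtResults = dataset.filter(
            (FilterFunction<Row>) row ->
                    row.getAs("subject").equals("Modern Art")
                            && Integer.parseInt(row.getAs("year")) >= 2007);
    modernArtResults.show();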






4. Filters using Columns
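
A third option is the Column API, built with the static functions.col helper. A sketch under the same assumptions:

    import org.apache.spark.sql.Column;
    import static org.apache.spark.sql.functions.col;

    Column subjectColumn = col("subject");
    Column yearColumn = col("year");

    // Column objects compose with methods such as equalTo, and, geq.
    Dataset<Row> modernArtResults = dataset.filter(
            subjectColumn.equalTo("Modern Art")
                    .and(yearColumn.geq(2007)));
    modernArtResults.show();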






1. Using a Spark Temporary View for SQL
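
Registering a Dataset as a temporary view lets you query it with full SQL text. A minimal sketch; the view name and columns are illustrative:

    // The view name is how the Dataset will be referred to in SQL.
    dataset.createOrReplaceTempView("my_students_table");

    Dataset<Row> results = spark.sql(
            "select distinct(year) from my_students_table order by year desc");
    results.show();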






1. In Memory Data
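
Rather than reading from disk, a Dataset can be built from an in-memory List of Rows plus an explicit schema. A sketch with two made-up log records; the remaining sketches below assume this imagined logging dataset with level and datetime columns:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    List<Row> inMemory = new ArrayList<>();
    inMemory.add(RowFactory.create("WARN", "2016-12-31 04:19:32"));
    inMemory.add(RowFactory.create("FATAL", "2016-12-31 03:22:34"));

    // The schema names and types the two fields of each Row above.
    StructType schema = new StructType(new StructField[]{
            new StructField("level", DataTypes.StringType, false, Metadata.empty()),
            new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
    });

    Dataset<Row> dataset = spark.createDataFrame(inMemory, schema);
    dataset.createOrReplaceTempView("logging_table");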






1. Groupings and Aggregations
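
With a view in place, a group-by with an aggregation looks just like ordinary SQL. A sketch, assuming the logging_table view built above:

    // One output row per log level, with a count of messages at that level.
    Dataset<Row> results = spark.sql(
            "select level, count(datetime) as total "
                    + "from logging_table group by level");
    results.show();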






1. Date Formatting
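
SparkSQL has a built-in date_format function whose pattern letters follow java.text.SimpleDateFormat. A sketch against the same logging view:

    // 'MMMM' yields the full month name, e.g. "December".
    Dataset<Row> results = spark.sql(
            "select level, date_format(datetime, 'MMMM') as month "
                    + "from logging_table");
    results.show(100);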






1. Multiple Groupings
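
Grouping by more than one expression is a matter of listing each in the group by clause. A sketch under the same assumptions (Spark SQL accepts a select alias such as month in group by):

    Dataset<Row> results = spark.sql(
            "select level, date_format(datetime, 'MMMM') as month, "
                    + "count(1) as total "
                    + "from logging_table group by level, month");
    results.show(100);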






1. Ordering
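
Ordering month names alphabetically is rarely what you want; one common trick is to carry a numeric month in a helper column, order by it, then drop it. A sketch of that idea, not necessarily the lesson's exact query:

    Dataset<Row> results = spark.sql(
            "select level, date_format(datetime, 'MMMM') as month, count(1) as total, "
                    + "first(cast(date_format(datetime, 'M') as int)) as monthnum "
                    + "from logging_table "
                    + "group by level, month "
                    + "order by monthnum, level");

    results = results.drop("monthnum");   // the helper column isn't for display
    results.show(100);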






1. SQL vs DataFrames
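
The same report can be produced without any SQL text at all, using the DataFrame (Dataset<Row>) API directly. A sketch of the SQL-free equivalent of the level count:

    import static org.apache.spark.sql.functions.col;

    Dataset<Row> results = dataset
            .select(col("level"))
            .groupBy(col("level"))
            .count();
    results.show();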






2. DataFrame Grouping
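
Multi-column grouping in the DataFrame API combines select, a derived column and groupBy. A sketch mirroring the SQL version above:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.date_format;

    Dataset<Row> results = dataset
            .select(col("level"),
                    date_format(col("datetime"), "MMMM").alias("month"))
            .groupBy(col("level"), col("month"))
            .count();
    results.show(100);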






1. How does a Pivot Table work?
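
Loosely, a pivot table keeps one grouping column down the side, spreads a second grouping column across the top, and fills each cell with an aggregation. A toy illustration with made-up values:

    raw rows (level, month)        pivoted: count of rows per cell
    -----------------------        level    January   February
    WARN   January                 WARN           2          1
    WARN   January                 ERROR          1          0
    ERROR  January
    WARN   February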






2. Coding a Pivot Table in Spark
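
In code, groupBy supplies the row axis, pivot the column axis, and the aggregation fills the cells. A sketch against the logging data assumed earlier:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.date_format;

    Dataset<Row> withMonth = dataset.select(
            col("level"),
            date_format(col("datetime"), "MMMM").alias("month"));

    // Rows are levels, columns are months, cells are counts.
    Dataset<Row> pivoted = withMonth.groupBy("level").pivot("month").count();
    pivoted.show(100);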




Apache Spark For Java Developers

with Richard Chesterwood, Matt Greencroft, Virtual Pair Programmers


1. How to use the agg method in Spark
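A minimal sketch of groupBy followed by agg in the Java API. The exam-style data and column names (subject, year, score) are invented for illustration; any Column-based aggregate from org.apache.spark.sql.functions can be passed to agg in the same way.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import static org.apache.spark.sql.functions.*;

    public class AggSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("agg sketch").master("local[*]").getOrCreate();

            // Invented exam results: subject, year, score.
            StructType schema = new StructType()
                    .add("subject", DataTypes.StringType)
                    .add("year", DataTypes.IntegerType)
                    .add("score", DataTypes.IntegerType);
            List<Row> rows = Arrays.asList(
                    RowFactory.create("Maths", 2005, 64),
                    RowFactory.create("Maths", 2006, 98),
                    RowFactory.create("French", 2005, 81));
            Dataset<Row> results = spark.createDataFrame(rows, schema);

            // agg takes one or more Column expressions and runs them per group.
            results.groupBy(col("subject"))
                   .agg(max(col("score")).alias("max score"),
                        min(col("score")).alias("min score"))
                   .show();

            spark.close();
        }
    }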



1. Building a Pivot Table with Multiple Aggregations
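Pivoting in the DataFrame API: groupBy fixes the rows, pivot spreads one column's values across the columns, and agg can take several aggregations at once, giving one output column per pivot value per aggregation. The exam data below is again invented for the sketch.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import static org.apache.spark.sql.functions.*;

    public class PivotSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("pivot sketch").master("local[*]").getOrCreate();

            StructType schema = new StructType()
                    .add("subject", DataTypes.StringType)
                    .add("year", DataTypes.IntegerType)
                    .add("score", DataTypes.IntegerType);
            List<Row> rows = Arrays.asList(
                    RowFactory.create("Maths", 2005, 64),
                    RowFactory.create("Maths", 2006, 98),
                    RowFactory.create("Maths", 2006, 76),
                    RowFactory.create("French", 2005, 81));
            Dataset<Row> results = spark.createDataFrame(rows, schema);

            // Rows = subject, columns = one per year, cells = two aggregations each.
            results.groupBy("subject")
                   .pivot("year")
                   .agg(round(avg(col("score")), 2).alias("avg"),
                        round(max(col("score")), 2).alias("max"))
                   .show();

            spark.close();
        }
    }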



1. How to use a Lambda to write a UDF in Spark
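Registering a UDF with a Java 8 lambda, as a sketch. Because udf().register is heavily overloaded, the lambda needs a cast to the UDF1 interface so the compiler picks the right overload; the grade-checking rule here is just an invented example.

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.*;

    public class UdfSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("udf sketch").master("local[*]").getOrCreate();

            // The cast to UDF1 tells the compiler which register overload we mean.
            spark.udf().register("hasPassed",
                    (UDF1<String, Boolean>) grade -> grade.startsWith("A"),
                    DataTypes.BooleanType);

            Dataset<Row> students = spark
                    .createDataset(Arrays.asList("A+", "B", "C"), Encoders.STRING())
                    .toDF("grade");

            // callUDF invokes the registered function by name from the DataFrame API.
            students.withColumn("pass", callUDF("hasPassed", col("grade"))).show();

            spark.close();
        }
    }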



2. Using more than one input parameter in Spark UDF
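For two input parameters you implement UDF2 instead (the interfaces run up to UDF22); everything else stays the same. The pass-mark rule below is invented purely for illustration.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import static org.apache.spark.sql.functions.*;

    public class Udf2Sketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("udf2 sketch").master("local[*]").getOrCreate();

            // Two inputs -> UDF2. Invented rule: Biology needs an A, anything else a B.
            spark.udf().register("hasPassed",
                    (UDF2<String, String, Boolean>) (grade, subject) ->
                            "Biology".equals(subject)
                                    ? grade.startsWith("A")
                                    : grade.startsWith("A") || grade.startsWith("B"),
                    DataTypes.BooleanType);

            StructType schema = new StructType()
                    .add("grade", DataTypes.StringType)
                    .add("subject", DataTypes.StringType);
            List<Row> rows = Arrays.asList(
                    RowFactory.create("A", "Biology"),
                    RowFactory.create("B", "Biology"),
                    RowFactory.create("B", "French"));
            Dataset<Row> students = spark.createDataFrame(rows, schema);

            students.withColumn("pass",
                    callUDF("hasPassed", col("grade"), col("subject"))).show();

            spark.close();
        }
    }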



3. Using a UDF in Spark SQL
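Once registered, a UDF can also be called by name inside a SQL string: register a temp view and refer to the function just like a built-in. A sketch, reusing the invented hasPassed function from above:

    import java.util.Arrays;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    public class UdfInSqlSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("udf in sql").master("local[*]").getOrCreate();

            spark.udf().register("hasPassed",
                    (UDF1<String, Boolean>) grade -> grade.startsWith("A"),
                    DataTypes.BooleanType);

            spark.createDataset(Arrays.asList("A+", "B", "C"), Encoders.STRING())
                 .toDF("grade")
                 .createOrReplaceTempView("students");

            // The registered UDF is available to SparkSQL by name.
            spark.sql("select grade, hasPassed(grade) as pass from students").show();

            spark.close();
        }
    }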



1. Understand the SparkUI for SparkSQL
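The Spark UI only lives as long as the SparkSession, so a handy trick when exploring SparkSQL jobs locally is to block the driver before closing, then browse http://localhost:4040 (the default UI port) to inspect the jobs, stages and SQL tabs. A sketch:

    import java.util.Scanner;
    import org.apache.spark.sql.SparkSession;

    public class SparkUiSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("spark ui sketch").master("local[*]").getOrCreate();

            // Any SparkSQL job will do - we just need something in the SQL tab.
            spark.range(1_000_000)
                 .selectExpr("id % 10 as key")
                 .groupBy("key").count().show();

            // The UI dies with the session, so block until you've had a look.
            System.out.println("Open http://localhost:4040, then press enter to finish");
            new Scanner(System.in).nextLine();

            spark.close();
        }
    }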



2. How does SQL and DataFrame performance compare
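Both routes go through the same Catalyst optimizer, so in principle equivalent queries should perform alike; a naive way to check on your own data is wall-clock timing of each version of the same query. A rough sketch (invented data, and System.currentTimeMillis is crude - run it several times and ignore the warm-up run):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SqlVsDataFrameSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sql vs dataframe").master("local[*]").getOrCreate();

            Dataset<Row> logs = spark.range(5_000_000).selectExpr(
                    "case when id % 2 = 0 then 'WARN' else 'ERROR' end as level");
            logs.createOrReplaceTempView("logs");

            long start = System.currentTimeMillis();
            spark.sql("select level, count(*) from logs group by level").show();
            System.out.println("SQL: " + (System.currentTimeMillis() - start) + " ms");

            start = System.currentTimeMillis();
            logs.groupBy("level").count().show();
            System.out.println("DataFrame: " + (System.currentTimeMillis() - start) + " ms");

            spark.close();
        }
    }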



3. Update - Setting spark.sql.shuffle.partitions
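spark.sql.shuffle.partitions controls how many partitions SparkSQL produces after a shuffle (a groupBy or a join); the default of 200 is often far too many for a small local run. It can be set when the session is built or changed at runtime - a sketch:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ShufflePartitionsSketch {
        public static void main(String[] args) {
            // Option 1: set it when the session is built (default is 200).
            SparkSession spark = SparkSession.builder()
                    .appName("shuffle partitions").master("local[*]")
                    .config("spark.sql.shuffle.partitions", "12")
                    .getOrCreate();

            // Option 2: change it at runtime; applies to subsequent shuffles.
            spark.conf().set("spark.sql.shuffle.partitions", "8");

            Dataset<Row> counts = spark.range(1_000_000)
                    .selectExpr("id % 100 as key")
                    .groupBy("key").count();
            System.out.println("partitions after the shuffle: "
                    + counts.rdd().getNumPartitions());

            spark.close();
        }
    }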



1. Explaining Execution Plans
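You can ask any Dataset for its execution plan with explain(); passing true also prints the parsed, analysed and optimised logical plans rather than just the physical one. Sketch:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ExplainSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("explain sketch").master("local[*]").getOrCreate();

            Dataset<Row> results = spark.range(1_000_000)
                    .selectExpr("case when id % 2 = 0 then 'WARN' else 'ERROR' end as level")
                    .groupBy("level").count();

            // Physical plan only:
            results.explain();
            // Parsed, analysed, optimised and physical plans:
            results.explain(true);

            spark.close();
        }
    }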



2. How does HashAggregation work
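Whether the physical plan shows HashAggregate or SortAggregate comes down to the aggregation buffer: HashAggregation keeps a mutable, fixed-width buffer per group, so it is only available when the aggregated values are mutable primitive types; otherwise Spark falls back to sorting by the grouping key. A sketch that should show HashAggregate, since sum over a long keeps a mutable numeric buffer:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.*;

    public class HashAggregationSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hash aggregation").master("local[*]").getOrCreate();

            Dataset<Row> df = spark.range(1_000_000)
                    .selectExpr("id % 10 as key", "id as amount");

            // sum keeps a mutable numeric buffer per key, so the physical plan
            // should show HashAggregate rather than SortAggregate.
            df.groupBy("key").agg(sum(col("amount"))).explain();

            spark.close();
        }
    }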



3. How can I force Spark to use HashAggregation
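If an aggregation over a string column is falling back to SortAggregate, one way to get HashAggregation back is to aggregate something fixed-width instead - for example, turning a string timestamp into a numeric value before taking the max. A sketch (the datetime format, which unix_timestamp parses by default, is an assumption):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.*;

    public class ForceHashAggregationSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("force hash aggregation").master("local[*]").getOrCreate();

            Dataset<Row> logs = spark.range(100_000).selectExpr(
                    "case when id % 2 = 0 then 'WARN' else 'ERROR' end as level",
                    "'2020-11-16 16:44:55' as datetime");

            // String buffer -> expect SortAggregate in the physical plan.
            logs.groupBy("level").agg(max(col("datetime"))).explain();

            // Long buffer -> expect HashAggregate instead.
            logs.groupBy("level")
                .agg(max(unix_timestamp(col("datetime")))).explain();

            spark.close();
        }
    }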



4. SQL vs DataFrames Performance Results



1. SparkSQL Performance vs RDDs
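The comparison is easiest to feel with the same job written both ways: a count-by-key as a classic RDD mapToPair/reduceByKey against a one-line DataFrame groupBy. A sketch with invented data:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    public class RddVsSqlSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("rdd vs sparksql").master("local[*]").getOrCreate();
            JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

            // RDD version: explicit key-value pairs and a reduce.
            sc.parallelize(Arrays.asList("WARN", "ERROR", "WARN"))
              .mapToPair(level -> new Tuple2<>(level, 1L))
              .reduceByKey(Long::sum)
              .collect()
              .forEach(System.out::println);

            // SparkSQL version: the same result, optimised by Catalyst.
            Dataset<Row> logs = spark.createDataset(
                    Arrays.asList("WARN", "ERROR", "WARN"), Encoders.STRING())
                    .toDF("level");
            logs.groupBy("level").count().show();

            spark.close();
        }
    }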



1. Introducing Linear Regression
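SparkML's LinearRegression expects a single vector column of features plus a label column; VectorAssembler does the gluing. A minimal end-to-end sketch on invented house-price-style numbers (column names and figures are made up for illustration):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class LinearRegressionSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("linear regression sketch").master("local[*]").getOrCreate();

            StructType schema = new StructType()
                    .add("label", DataTypes.DoubleType)
                    .add("sqFeet", DataTypes.DoubleType)
                    .add("bedrooms", DataTypes.DoubleType);
            List<Row> rows = Arrays.asList(
                    RowFactory.create(240_000.0, 1400.0, 3.0),
                    RowFactory.create(310_000.0, 1800.0, 4.0),
                    RowFactory.create(199_000.0, 1100.0, 2.0));
            Dataset<Row> houses = spark.createDataFrame(rows, schema);

            // Pack the input columns into the single "features" vector column.
            Dataset<Row> input = new VectorAssembler()
                    .setInputCols(new String[]{"sqFeet", "bedrooms"})
                    .setOutputCol("features")
                    .transform(houses)
                    .select("label", "features");

            LinearRegressionModel model = new LinearRegression().fit(input);
            System.out.println("intercept: " + model.intercept()
                    + ", coefficients: " + model.coefficients());

            spark.close();
        }
    }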



1. Welcome to Module 3



1.1 MLCode.zip



2. What is Machine Learning



3. Coming up in this Module - and introducing Kaggle



4. Supervised vs Unsupervised Learning



5. The Model Building Process
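
Every SparkML model is built through the same estimator pattern: prepare a DataFrame holding a "features" column and a "label" column, call fit() on an estimator to produce a model, then use the model to transform new data into predictions. A minimal sketch of that flow using linear regression (the file path is a placeholder, and we assume the data has already been prepared; building the features column is covered in the next chapter):

    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("ModelBuildingProcess").master("local[*]").getOrCreate();

    // Assumed placeholder: a DataFrame that already has "features" (Vector)
    // and "label" (double) columns.
    Dataset<Row> input = spark.read().parquet("prepared-training-data.parquet");

    LinearRegression lr = new LinearRegression();   // the estimator
    LinearRegressionModel model = lr.fit(input);    // fitting produces a model
    model.transform(input).show();                  // the model appends a "prediction" column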



1. Introducing Linear Regression
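
The underlying model is nothing more exotic than a weighted sum of the inputs. For features x1 through xn, the prediction is

    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

where β0 is the intercept and the remaining β values are the coefficients. Fitting the model simply means finding the β values that minimise the squared error between the predictions and the known labels in the training data.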



2. Beginning Coding Linear Regressions
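
In code, the starting point is the LinearRegression estimator from the org.apache.spark.ml.regression package. Out of the box it looks for a Vector column called "features" and a numeric column called "label", though both can be pointed at columns of your own naming. A minimal sketch:

    import org.apache.spark.ml.regression.LinearRegression;

    LinearRegression lr = new LinearRegression()
            .setFeaturesCol("features")   // single Vector column of inputs
            .setLabelCol("label");        // the value we are trying to predict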



3. Assembling a Vector of Features
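
SparkML estimators do not read one column per feature: they expect every input packed into a single Vector column, and VectorAssembler is the transformer that does the packing. A sketch, assuming a csvData DataFrame with invented column names (substitute the columns from your own dataset):

    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[] {"sqft", "bedrooms", "age"})
            .setOutputCol("features");

    // transform() appends the "features" Vector column; keep just what the model needs.
    Dataset<Row> modelInput = assembler.transform(csvData)
            .select("price", "features")
            .withColumnRenamed("price", "label");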



4. Model Fitting
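
Fitting is then a single call: fit() runs over the prepared DataFrame and returns a LinearRegressionModel holding the learned coefficients. A sketch, continuing from the assumed modelInput DataFrame assembled above:

    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;

    LinearRegressionModel model = new LinearRegression().fit(modelInput);

    System.out.println("Coefficients: " + model.coefficients()); // one per input feature
    System.out.println("Intercept: " + model.intercept());

    model.transform(modelInput).show();  // appends a "prediction" column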



1. Training vs Test and Holdout Data



2. Using data from Kaggle
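
Kaggle datasets typically arrive as CSV files with a header row, which Spark can load directly: the header option turns the first row into column names, and inferSchema gives you numeric types rather than strings. A sketch (the file name is a placeholder for whichever dataset you download):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("KaggleData").master("local[*]").getOrCreate();

    Dataset<Row> csvData = spark.read()
            .option("header", true)        // first row holds the column names
            .option("inferSchema", true)   // detect numeric column types
            .csv("src/main/resources/kaggle-dataset.csv");

    csvData.printSchema();  // check the inferred types before going further
    csvData.show(10);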



3. Practical Walkthrough



4. Splitting Training Data with Random Splits
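
randomSplit does the partitioning in one call: pass the proportions you want and it returns an array of DataFrames. A sketch of an 80/20 training/test split over the assumed modelInput DataFrame (the seed argument is optional, but makes runs repeatable while you experiment):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row>[] splits = modelInput.randomSplit(new double[] {0.8, 0.2}, 42);
    Dataset<Row> trainingData = splits[0];  // used to fit the model
    Dataset<Row> testData = splits[1];      // held back for assessing accuracy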



5. Assessing Model Accuracy with R2 and RMSE
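
A fitted model can score any labelled DataFrame, so accuracy is assessed against the held-back test data. R2 is the proportion of variance in the label that the model explains (closer to 1 is better); RMSE is the typical size of a prediction error, in the same units as the label (closer to 0 is better). A sketch, continuing from the assumed trainingData/testData split above; a large gap between the training and test figures is the classic sign of overfitting:

    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;

    LinearRegressionModel model = new LinearRegression().fit(trainingData);

    System.out.println("Training R2: " + model.summary().r2());
    System.out.println("Training RMSE: " + model.summary().rootMeanSquaredError());

    System.out.println("Test R2: " + model.evaluate(testData).r2());
    System.out.println("Test RMSE: " + model.evaluate(testData).rootMeanSquaredError());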



1. Setting Linear Regression Parameters
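
Fitting parameters are set on the estimator before calling fit(). The three you will meet first are the iteration cap and the two regularization controls (the values below are arbitrary examples, not recommendations):

    import org.apache.spark.ml.regression.LinearRegression;

    LinearRegression lr = new LinearRegression()
            .setMaxIter(10)            // upper limit on optimizer iterations
            .setRegParam(0.3)          // regularization strength; 0 means none
            .setElasticNetParam(0.8);  // 0 = ridge (L2), 1 = lasso (L1), in between = a mix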



2. Training, Test and Holdout Data
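
To make the idea concrete, here is a minimal sketch using the Java API. The file path and column names are placeholders rather than the course's actual materials; randomSplit divides a Dataset by the given weights.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TrainingTestHoldout {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("trainingTestHoldout")
                    .master("local[*]")
                    .getOrCreate();

            // Hypothetical CSV input - substitute your own data source.
            Dataset<Row> data = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("src/main/resources/exams.csv");

            // 80/10/10 split; the fixed seed keeps the split reproducible.
            Dataset<Row>[] splits = data.randomSplit(new double[]{0.8, 0.1, 0.1}, 42);
            Dataset<Row> trainingData = splits[0]; // fits the model
            Dataset<Row> testData     = splits[1]; // tunes the fitting parameters
            Dataset<Row> holdoutData  = splits[2]; // touched once, for the final check

            spark.close();
        }
    }

The point of the third set is discipline: the holdout data plays no part in fitting or tuning, so the score it produces is an honest estimate of how the model behaves on genuinely unseen data.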



1. Describing the Features
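
Continuing inside the main method of the sketch above, describe() gives a quick statistical summary (count, mean, stddev, min, max) of candidate feature columns; the column names here are hypothetical.

    // Eyeball each candidate feature before deciding whether to use it.
    data.describe("score", "hoursOfStudy", "previousExams").show();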



2. Correlation of Features
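
Reusing the hypothetical data from the earlier sketch, the stat() functions on a Dataset expose Pearson correlation directly; values near +1 or -1 suggest a strong linear relationship with the label, values near 0 a weak one.

    // Correlation between one candidate feature and the value we want to predict.
    double correlation = data.stat().corr("hoursOfStudy", "score");
    System.out.println("hoursOfStudy vs score: " + correlation);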



3. Identifying and Eliminating Duplicated Features
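
If two candidate features correlate almost perfectly with each other they carry the same signal, and keeping both only adds noise and training time. A small sketch, again with made-up column names:

    // A pair of columns suspected of duplicating each other.
    double r = data.stat().corr("lengthMetres", "lengthFeet");
    if (Math.abs(r) > 0.95) {
        // Near-perfect correlation: keep one column, drop the other.
        data = data.drop("lengthFeet");
    }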



4. Data Preparation
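
A typical preparation step, sketched with the same hypothetical columns: drop rows that are missing values we need, and rename the target column to "label", the name the SparkML algorithms expect by default.

    import static org.apache.spark.sql.functions.col;

    Dataset<Row> prepared = data
            // Rows missing either value would poison the model fit.
            .filter(col("hoursOfStudy").isNotNull().and(col("score").isNotNull()))
            // SparkML looks for a column called "label" unless told otherwise.
            .withColumnRenamed("score", "label");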



1. Using OneHotEncoding
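
A sketch in the Spark 2.x style this course targets: a StringIndexer first turns category strings into numeric indexes, and a OneHotEncoder then turns each index into a sparse 0/1 vector, so the model cannot mistake the arbitrary index ordering for a real ranking. (In Spark 3, OneHotEncoder became an estimator, so you would call fit() before transform().) Column names are hypothetical.

    import org.apache.spark.ml.feature.OneHotEncoder;
    import org.apache.spark.ml.feature.StringIndexer;

    StringIndexer gradeIndexer = new StringIndexer()
            .setInputCol("grade")
            .setOutputCol("gradeIndex");
    Dataset<Row> indexed = gradeIndexer.fit(prepared).transform(prepared);

    OneHotEncoder encoder = new OneHotEncoder()
            .setInputCol("gradeIndex")
            .setOutputCol("gradeVector");
    Dataset<Row> encoded = encoder.transform(indexed);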



2. Understanding Vectors
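
SparkML models consume a single column of type Vector, conventionally called "features"; VectorAssembler packs the chosen columns into that one vector. A sketch reusing the names from the encoding example above:

    import org.apache.spark.ml.feature.VectorAssembler;

    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[]{"hoursOfStudy", "previousExams", "gradeVector"})
            .setOutputCol("features");
    Dataset<Row> withFeatures = assembler.transform(encoded);

    // Each row now carries one (possibly sparse) vector of its feature values.
    withFeatures.select("features", "label").show(5, false);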



1. Pipelines
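
A minimal sketch chaining the hypothetical stages from the previous examples into a single Pipeline. The win is consistency: fitting the pipeline fits every stage in order on the training data, and the resulting PipelineModel replays exactly the same transformations on any data you later feed it.

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.regression.LinearRegression;

    Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{
            gradeIndexer, encoder, assembler, new LinearRegression()
    });
    PipelineModel model = pipeline.fit(trainingData);
    Dataset<Row> predictions = model.transform(testData);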



1. Requirements



2. Case Study - Walkthrough Part 1



3. Case Study - Walkthrough Part 2



1. Code for chapters 9-12.html



1.1 MLCodeChapters9-12.zip



2. True/False Negatives and Positives
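
One cheap way to see all four counts is to group the test predictions by the actual and the predicted label. A sketch, assuming a predictions Dataset with 0/1 labels like the one a fitted classifier produces:

    // label=1/prediction=1 rows are true positives, label=0/prediction=1
    // are false positives, and so on: four rows, one per quadrant.
    predictions.groupBy("label", "prediction").count().show();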



3. Coding a Logistic Regression
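
A minimal sketch, assuming the "features" and 0/1 "label" columns prepared as in the earlier examples (the summary accuracy call needs Spark 2.3 or later):

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;

    LogisticRegressionModel lrModel = new LogisticRegression().fit(trainingData);
    System.out.println("training accuracy: " + lrModel.summary().accuracy());

    // "probability" holds the raw likelihoods behind each 0/1 prediction.
    lrModel.transform(holdoutData)
            .select("label", "prediction", "probability")
            .show(5, false);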



1. Overview of Decision Trees



2. Building the Model
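
The lesson's working code ships in the course downloads; the snippet below is only an illustrative sketch of the DataFrame-based spark.ml decision-tree API it builds on. The CSV path and the column names (age, income, label) are hypothetical stand-ins.

    import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
    import org.apache.spark.ml.classification.DecisionTreeClassifier;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DecisionTreeSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("DecisionTreeSketch")
                    .master("local[*]")
                    .getOrCreate();

            // Hypothetical input: a CSV of numeric columns plus a 0/1 label.
            Dataset<Row> raw = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("src/main/resources/customers.csv");

            // spark.ml models want all inputs packed into one vector column.
            Dataset<Row> data = new VectorAssembler()
                    .setInputCols(new String[] {"age", "income"})
                    .setOutputCol("features")
                    .transform(raw);

            // Hold back 20% of the rows for testing the fitted model.
            Dataset<Row>[] splits = data.randomSplit(new double[] {0.8, 0.2});

            DecisionTreeClassificationModel model = new DecisionTreeClassifier()
                    .setLabelCol("label")
                    .setFeaturesCol("features")
                    .fit(splits[0]);

            model.transform(splits[1]).show(); // adds prediction columns
            spark.close();
        }
    }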



3. Interpreting a Decision Tree
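
Spark can print a fitted tree as nested if/else rules, which is the usual way to read one. A minimal sketch, assuming a DecisionTreeClassificationModel like the one fitted in the previous sketch:

    import org.apache.spark.ml.classification.DecisionTreeClassificationModel;

    public class TreeInspection {

        // Print the fitted tree as nested if/else splits, e.g.
        // "If (feature 0 <= 35.5) ... Predict: 1.0", then the per-feature
        // importances (indices follow the VectorAssembler input order).
        static void describe(DecisionTreeClassificationModel model) {
            System.out.println(model.toDebugString());
            System.out.println(model.featureImportances());
        }
    }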



4. Random Forests
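
A random forest is trained with almost the same code as a single tree; the main extra knob is how many trees to grow. A sketch under the same assumptions as above (a "features" vector column and a "label" column):

    import org.apache.spark.ml.classification.RandomForestClassificationModel;
    import org.apache.spark.ml.classification.RandomForestClassifier;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class RandomForestSketch {

        // Each of the 20 trees trains on a random slice of the data;
        // classification is by majority vote across the trees.
        static RandomForestClassificationModel train(Dataset<Row> training) {
            return new RandomForestClassifier()
                    .setLabelCol("label")
                    .setFeaturesCol("features")
                    .setNumTrees(20)
                    .fit(training);
        }
    }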



1. K Means Clustering
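
K-Means is unsupervised, so a sketch of it only needs a features column and a choice of k; there is no label involved. The column name and the value of k here are illustrative:

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.linalg.Vector;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class KMeansSketch {

        // k is chosen by you, not learned from the data.
        static void cluster(Dataset<Row> data) {
            KMeansModel model = new KMeans()
                    .setK(3)
                    .setSeed(42L)          // fixed seed => repeatable runs
                    .setFeaturesCol("features")
                    .fit(data);

            for (Vector centre : model.clusterCenters()) {
                System.out.println(centre);
            }
            model.transform(data).show(); // adds a "prediction" (cluster id) column
        }
    }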



1. Overview and Matrix Factorisation
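
Matrix factorisation approximates the sparse user x item ratings matrix as the product of two low-rank factor matrices, one per user and one per item; predicted ratings for unseen pairs fall out of that product. In Spark this is the ALS estimator. A configuration sketch (the column names and hyperparameter values are hypothetical):

    import org.apache.spark.ml.recommendation.ALS;

    public class AlsSetupSketch {

        // ALS learns two dense low-rank matrices (user factors and item
        // factors) whose product approximates the known ratings and fills
        // in the missing ones.
        static ALS configure() {
            return new ALS()
                    .setUserCol("userId")     // hypothetical column names
                    .setItemCol("courseId")
                    .setRatingCol("rating")
                    .setRank(10)              // number of latent factors
                    .setMaxIter(10)
                    .setRegParam(0.1);
        }
    }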



2. Building the Model
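
Fitting the ALS estimator and asking the resulting model for recommendations is then short. A self-contained sketch, again with hypothetical column names:

    import org.apache.spark.ml.recommendation.ALS;
    import org.apache.spark.ml.recommendation.ALSModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class AlsTrainSketch {

        // Fit ALS on (userId, courseId, rating) rows, then ask the model
        // for the top 5 items per user.
        static void trainAndRecommend(Dataset<Row> ratings) {
            ALSModel model = new ALS()
                    .setUserCol("userId")
                    .setItemCol("courseId")
                    .setRatingCol("rating")
                    .fit(ratings);

            Dataset<Row> topFive = model.recommendForAllUsers(5);
            topFive.show(false);
        }
    }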



1. Welcome to Module 4 - Spark Streaming



1.1 Code.zip



2. Streaming Chapter 1 - Introduction to Streaming



3. DStreams
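
A DStream is a sequence of RDDs, one per micro-batch. A minimal runnable sketch that prints whatever arrives on a socket; the host and port are hypothetical (the course's own log source is the LoggingServer.zip download below):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class DStreamSketch {
        public static void main(String[] args) throws InterruptedException {
            // local[*]: the socket receiver permanently occupies one thread,
            // so a streaming job needs more than one.
            SparkConf conf = new SparkConf()
                    .setAppName("DStreamSketch")
                    .setMaster("local[*]");

            // Everything received in each 2-second interval becomes one RDD
            // in the DStream (a micro-batch).
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.seconds(2));

            JavaDStream<String> lines = jssc.socketTextStream("localhost", 8989);
            lines.print();

            jssc.start();            // nothing runs until this point
            jssc.awaitTermination();
        }
    }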



3.1 LoggingServer.zip



4. Starting a Streaming Job
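
Nothing in a DStream pipeline runs until the context is started: transformations declared beforehand only build the plan. A lifecycle sketch (the helper method names are ours, not Spark's):

    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingLifecycle {

        // The DStream graph is frozen at start(): declare every
        // transformation first, then start, then block the driver thread.
        static void run(JavaStreamingContext jssc) throws InterruptedException {
            jssc.start();
            jssc.awaitTermination();   // or awaitTerminationOrTimeout(millis)
        }

        // Graceful shutdown: let in-flight batches finish (second flag) and
        // also stop the underlying SparkContext (first flag).
        static void shutdown(JavaStreamingContext jssc) {
            jssc.stop(true, true);
        }
    }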



5. Streaming Transformations
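
DStream transformations mirror the familiar RDD operations and run on every micro-batch as it arrives. A sketch that keys log lines by their level; the comma-separated line format is an assumption:

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    public class TransformationSketch {

        // Assumes lines like "WARN,some message": drop empties, then pair
        // each line's level with a count of 1, ready for aggregation.
        static JavaPairDStream<String, Long> byLogLevel(JavaDStream<String> lines) {
            return lines
                    .filter(line -> !line.isEmpty())
                    .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1L));
        }
    }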



6. Streaming Aggregations
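
Aggregations in streaming usually run over a sliding window rather than a single micro-batch. A sketch counting per key over the last 10 minutes, recomputed every 30 seconds; the window sizes are illustrative, and note the variant of reduceByKeyAndWindow that also takes an inverse function additionally needs checkpointing enabled:

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    public class AggregationSketch {

        // Each emitted result covers a sliding window spanning many
        // micro-batches, not just the most recent one.
        static JavaPairDStream<String, Long> windowedCounts(
                JavaPairDStream<String, Long> pairs) {
            return pairs.reduceByKeyAndWindow(
                    (a, b) -> a + b,
                    Durations.minutes(10),    // window length
                    Durations.seconds(30));   // slide interval
        }
    }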



7. SparkUI for Streaming Jobs
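
As a taster for this lesson: while a streaming job runs, the driver serves the SparkUI, which gains a Streaming tab showing input rate, scheduling delay and batch processing times. A minimal sketch, assuming a local socket source on port 8989 (the port setting and the source are illustrative, not fixed by the course):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkUiDemo {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("SparkUiDemo")
                    .setMaster("local[*]")
                    .set("spark.ui.port", "4040"); // 4040 is the default; set explicitly for clarity

            JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(2));
            sc.socketTextStream("localhost", 8989).print(); // any source keeps the job alive

            // While this runs, browse to http://localhost:4040 and open the Streaming tab.
            sc.start();
            sc.awaitTermination();
        }
    }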



8. Windowing Batches
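
The idea in this lesson: a window gathers several consecutive batches into a single RDD, so each output covers a longer span of data than one batch interval. A hedged sketch reusing the assumed socket source from above (the window length must be a multiple of the batch interval):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class WindowingDemo {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("WindowingDemo").setMaster("local[*]");
            JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(2));

            JavaDStream<String> lines = sc.socketTextStream("localhost", 8989);

            // Each result now covers the last 30 seconds of data (15 two-second batches),
            // recomputed on every batch interval.
            lines.window(Durations.seconds(30)).count().print();

            sc.start();
            sc.awaitTermination();
        }
    }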



1. Overview of Kafka



2. Installing Kafka



3. Using a Kafka Event Simulator



3.1 viewing-figures-generation.zip



4. Integrating Kafka with Spark



5. Using KafkaUtils to access a DStream
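
For a flavour of this lesson, a minimal sketch of opening a DStream over Kafka with the spark-streaming-kafka-0-10 integration. The broker address, group id and the topic name "viewrecords" are illustrative assumptions:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class KafkaDStreamDemo {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("KafkaDStreamDemo").setMaster("local[*]");
            JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(1));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "spark-course-group");      // assumed group id
            kafkaParams.put("auto.offset.reset", "latest");

            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    sc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Arrays.asList("viewrecords"), kafkaParams));

            // Pull just the message values out of the Kafka ConsumerRecords.
            stream.map(ConsumerRecord::value).print();

            sc.start();
            sc.awaitTermination();
        }
    }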



6. Writing a Kafka Aggregation
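
Building on the DStream above, an aggregation is an ordinary pair-RDD operation applied batch by batch. A sketch that slots into KafkaDStreamDemo's main method before sc.start() (extra imports needed: scala.Tuple2 and org.apache.spark.streaming.api.java.JavaPairDStream):

    // Count the occurrences of each message value within the current batch.
    JavaPairDStream<String, Long> counts = stream
            .mapToPair(record -> new Tuple2<>(record.value(), 1L))
            .reduceByKey((a, b) -> a + b);

    counts.print();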



7. Adding a Window
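
The windowed version of the same aggregation: reduceByKeyAndWindow replaces reduceByKey, so each result covers the last minute of batches rather than a single batch. A sketch under the same assumptions as above:

    // Aggregate over the last 60 seconds of data rather than one batch interval.
    JavaPairDStream<String, Long> windowedCounts = stream
            .mapToPair(record -> new Tuple2<>(record.value(), 1L))
            .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(60));

    windowedCounts.print();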



8. Adding a Slide Interval
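
Adding a slide interval decouples how often the window is recalculated from the batch interval. Continuing the sketch above (both durations must be multiples of the batch interval):

    // Same 60-second window, but only recomputed every 10 seconds
    // instead of on every batch.
    JavaPairDStream<String, Long> sliding = stream
            .mapToPair(record -> new Tuple2<>(record.value(), 1L))
            .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(60), Durations.seconds(10));

    sliding.print();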



1. Structured Streaming Overview
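
In Structured Streaming the same Kafka feed appears as an unbounded DataFrame, queried with ordinary SparkSQL operations. A minimal sketch (the broker address and the "viewrecords" topic are again assumptions):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredStreamingDemo {
        public static void main(String[] args) throws Exception {
            SparkSession session = SparkSession.builder()
                    .master("local[*]")
                    .appName("StructuredStreamingDemo")
                    .getOrCreate();

            // A streaming DataFrame backed by Kafka; nothing runs until start() is called.
            Dataset<Row> df = session.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "viewrecords")
                    .load();

            // The familiar DataFrame API, applied incrementally to the stream.
            Dataset<Row> results = df.selectExpr("CAST(value AS STRING) AS course")
                    .groupBy("course")
                    .count();

            StreamingQuery query = results.writeStream()
                    .format("console")
                    .outputMode("complete")
                    .start();
            query.awaitTermination();
        }
    }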



2. Data Sinks
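
The sink chosen in writeStream determines where each micro-batch of results lands. Two common options, reusing df and results from the sketch above (the output and checkpoint paths are made-up placeholders):

    // Console sink - handy during development.
    results.writeStream().format("console").outputMode("complete").start();

    // File sink - writes Parquet; it only supports append mode, so it suits
    // non-aggregated queries, and it requires a checkpoint location.
    df.selectExpr("CAST(value AS STRING)")
      .writeStream()
      .format("parquet")
      .option("path", "/tmp/streaming-output")          // assumed path
      .option("checkpointLocation", "/tmp/checkpoints") // assumed path
      .outputMode("append")
      .start();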



3. Structured Streaming Output Modes
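
The output mode controls how much of the result table is emitted on each trigger. A short sketch against the results aggregation above:

    // append:   emit only rows added since the last trigger (the default;
    //           for aggregations it needs a watermark).
    // complete: re-emit the whole result table every trigger (aggregations only).
    // update:   emit only rows whose value changed since the last trigger.
    results.writeStream()
           .format("console")
           .outputMode("update")
           .start();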



4. Windows and Watermarks
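
Windows in Structured Streaming are expressed as a grouping column, and a watermark bounds how much state Spark keeps for late events. A sketch over the Kafka DataFrame df from earlier, using the timestamp column the Kafka source provides (static imports needed: org.apache.spark.sql.functions.col and org.apache.spark.sql.functions.window):

    // 10-minute windows sliding every 5 minutes; events arriving more than
    // 2 minutes late are dropped and their window's state is released.
    Dataset<Row> windowed = df
            .selectExpr("CAST(value AS STRING) AS course", "timestamp")
            .withWatermark("timestamp", "2 minutes")
            .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("course"))
            .count();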



5. What is the Batch Size in Structured Streaming
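In Structured Streaming there is no fixed batch size to configure: by default, Spark starts a new micro-batch as soon as the previous one finishes, and you control the cadence with a trigger instead. As a rough sketch of the kind of code this chapter works with (an illustration rather than the course's own example: the socket source, host and port below are assumptions, using the Spark 2.x Java API):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;

    public class TriggerIntervalSketch {

        public static void main(String[] args) throws Exception {
            SparkSession session = SparkSession.builder()
                    .master("local[*]")
                    .appName("triggerIntervalSketch")
                    .getOrCreate();

            // Read a stream of text lines from a local socket
            // (start one first with e.g. `nc -lk 9999`).
            Dataset<Row> lines = session.readStream()
                    .format("socket")
                    .option("host", "localhost")
                    .option("port", 9999)
                    .load();

            // By default a new micro-batch starts as soon as the previous one
            // finishes; a ProcessingTime trigger fixes the interval instead.
            StreamingQuery query = lines.writeStream()
                    .format("console")
                    .outputMode("append")
                    .trigger(Trigger.ProcessingTime("30 seconds"))
                    .start();

            query.awaitTermination();
        }
    }

With the ProcessingTime trigger above, a new batch begins every 30 seconds; remove the trigger(...) line and Spark reverts to its default of running micro-batches back to back.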



6. Kafka Structured Streaming Pipelines
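This chapter connects Structured Streaming to Apache Kafka. As a minimal sketch of such a pipeline (again an illustration rather than the course's own code: it assumes the spark-sql-kafka-0-10 dependency is on the classpath, and the broker address and topic name are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class KafkaPipelineSketch {

        public static void main(String[] args) throws Exception {
            SparkSession session = SparkSession.builder()
                    .master("local[*]")
                    .appName("kafkaPipelineSketch")
                    .getOrCreate();

            // Subscribe to a Kafka topic; the broker address and topic name
            // are placeholders for whatever a local Kafka installation uses.
            Dataset<Row> raw = session.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "viewrecords")
                    .load();

            // Kafka delivers keys and values as binary, so cast the payload
            // to a string, then query the live stream with ordinary SparkSQL.
            raw.selectExpr("CAST(value AS STRING) AS course_name")
               .createOrReplaceTempView("viewing_figures");

            Dataset<Row> results = session.sql(
                    "SELECT course_name, COUNT(1) AS hits "
                  + "FROM viewing_figures GROUP BY course_name");

            // An aggregation over the whole stream needs the complete (or
            // update) output mode; append would be rejected for this query.
            StreamingQuery query = results.writeStream()
                    .format("console")
                    .outputMode("complete")
                    .start();

            query.awaitTermination();
        }
    }

Because the query aggregates across the entire stream, the console sink runs in complete output mode and reprints the full result table after each micro-batch.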


