Spark / R Programming for Data Scientists and Analysts

Course Details:

Spark is a highly optimized Data Science environment running on Hadoop YARN, with support for Machine Learning through MLlib and Mahout, SQL, DataFrames, and Streaming. In this course, you’ll dive into the details of practical data science on the Spark platform, including real-world interaction with other systems in modern Data Science environments.

No classes are currently scheduled for this course.

Call (919) 283-1674 to get a class scheduled online or in your area!

Getting Started - Overview

Our Data and our problem set
Accessing the cluster, the data, and the tools
The Continuous Workshop approach
"Let's build a model together"
Focus on analysis, exploration, data munging, algorithms
Tooling and fundamentals as necessary to get the job done

Spark Introduction

Data Science: The State of the Art
Hadoop, YARN, and Spark
Architectural Overview
MLlib Overview
Accessing HDFS data
Lab Focus
Working with HDFS data
Distributed vs. Local Run Modes
Spark vs. Other tools (when is Spark the right tool for the job?)
Spark vs. SAS
Spark Languages (Java, R, Python, and Scala)
Hello, Spark

Spark Overview

Spark Core
Spark SQL
Spark and Hive
Lab
MLlib
Spark Streaming
Spark API

DataFrames

DataFrames and Resilient Distributed Datasets (RDDs)
Partitions
Adding variables to a DataFrame
DataFrame Types
DataFrame Operations
Dependent vs. Independent variables
Map/Reduce with DataFrames

Spark SQL

Spark SQL Overview
Data stores: HDFS, Cassandra, HBase, Hive, and S3
Table Definitions
Queries

Spark MLlib

MLlib overview
MLlib Algorithms Overview
Classification Algorithms
Regression Algorithms
Lab Focus
Brief Comparison to SAS
Here's your split, how to tune regression
Decision Trees and forests
Lab Focus
Brief Comparison to SAS
Stepwise approach to Decision Trees
Working with Exit Criteria
Recommendation with ALS
Clustering Algorithms
Lab Focus
Key Clustering Algorithms
Choosing Clustering Algorithms
Working with key algorithms
Machine Learning Pipelines
Linear Algebra (SVD, PCA)
Statistics in MLlib

Spark Streaming

Streaming overview

Streaming with Kafka

Kafka overview
Kafka and Spark Streaming

Data Flow with NiFi

Apache NiFi overview
NiFi data flows with Spark/R

Cluster Mode

Standalone Cluster
Masters and Workers

Spark - the Big Picture

Spark in Real-Time and near-Real-Time Decision Support Systems
Spark in the Enterprise
Best Practices

*Please Note: Course Outline is subject to change without notice. Exact course outline will be provided at time of registration.

Join an engaging hands-on learning environment, where you’ll learn:

The essentials of Spark architecture and applications
How to execute Spark Programs
How to create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
How to integrate machine learning into Spark applications
How to use Spark Streaming

This course has a 40% hands-on labs to 60% lecture ratio with engaging instruction, demos, group discussions, labs, and project work.

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core R programming and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course students will be led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Before attending this course, you should have:

Basic R programming experience
Basic knowledge of Statistics and Probability
Data Science background

Data Scientists and Data Analysts
