Apache Spark for Data Scientists Course Details:

Apache Spark is a powerful, open-source engine for processing data in Hadoop clusters, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex iterative algorithms, and some in-memory workloads can run up to 100x faster than equivalent Hadoop MapReduce programs. With Spark, you can write sophisticated applications that support faster decisions and real-time actions across a wide variety of use cases, architectures, and industries.

This hands-on course explores using Spark for common data-related activities from a data science perspective. You will learn to build unified big data applications combining batch, streaming, and interactive analytics on your data.

    No classes are currently scheduled for this course.

    Call (919) 283-1653 to get a class scheduled online or in your area!


  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • Spark and Storm
  • MLlib and Mahout
  • Distributed vs. Local Run Modes
  • Hello, Spark

Spark Overview

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • MLlib
  • Mahout
  • Spark Streaming
  • Spark API


  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • DataFrame Types
  • DataFrame Operations
  • Map/Reduce with DataFrames

Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • ETL in Spark
  • Queries

Spark MLlib

  • MLlib Overview
  • MLlib Algorithms Overview

Spark Streaming

  • Streaming overview
  • Real-time data ingestion
  • State
  • Window Operations

Spark GraphX

  • GraphX overview
  • ETL with GraphX
  • Graph computation

Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory Management

Cluster Mode

  • Standalone Cluster
  • Masters and Workers
  • Configurations
  • Working with large data sets

*Please Note: Course Outline is subject to change without notice. Exact course outline will be provided at time of registration.

Join an engaging hands-on learning environment, where you’ll learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming

This course is 50% hands-on labs and 50% lecture, with engaging instruction, demos, group discussions, labs, and project work.

If you’re looking to explore Spark from a developer perspective, consider JumpStart to Developing in Apache Spark (9360) or Developing with Spark for Big Data (8750).

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core Spark and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course, students are led through a series of progressively advanced topics, each consisting of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Before attending this course, you should have:

  • Introduction to Java Programming (at least exposure to basic Java syntax)
  • Introduction to SQL (familiarity with SQL basics)
  • Basic knowledge of Statistics and Probability
  • Data Science background

This course is intended for Data Scientists, System Administrators, Testers, and other technical business professionals who seek to use Spark for data processing and analysis.
