JumpStart to Developing in Apache Spark

JumpStart to Developing in Apache Spark Course Details:

Apache Spark is an important component in the Hadoop Ecosystem as a cluster computing engine used for Big Data. Building on top of the Hadoop YARN and HDFS ecosystem, Spark offers faster in-memory processing for computing tasks when compared to Map/Reduce. It can be programmed in Java, Scala, Python, and R along with SQL-based front-ends.

With advanced libraries such as Mahout and MLlib for machine learning and GraphX or Neo4j for rich graph data processing, as well as access to other NoSQL data stores, rule engines, and components, Spark is a linchpin in modern Big Data and Data Science computing.

This course introduces you to enterprise-grade Spark programming and the components needed to craft complete data science solutions. It is a fast-paced course that gives topical overviews and shows the “big-picture” interactions while providing hands-on experience. The course is offered in Java and, with some alterations, in Python, Scala, and R.

No classes are currently scheduled for this course.

Call (919) 283-1674 to get a class scheduled online or in your area!

Overview of Spark

Hadoop Ecosystem
Hadoop YARN vs. Mesos
Spark vs. Map/Reduce
Spark: Lambda Architecture
Spark in the Enterprise Data Science Architecture

Spark Component Overview

Spark Shell
RDDs: Resilient Distributed Datasets
DataFrames
Spark 2 Unified DataFrames
Spark Sessions
Functional Programming
Spark SQL
MLlib
Structured Streaming
Spark R
Spark and Python

RDDs: Resilient Distributed Datasets

Coding with RDDs
Transformations
Actions
Lazy Evaluation and Optimization
RDDs in Map/Reduce

DataFrames

RDDs vs. DataFrames
Unified DataFrames (UDF) in Spark 2.x
Partitioning

DataFrame Persistence

RDD Persistence
DataFrame and Unified DataFrame Persistence
Distributed Persistence

Accessing NoSQL Data

Ingesting data
Relational Databases and Sqoop
Interacting with Hive
Graph Data
Accessing Cassandra Data

Spark SQL

Spark SQL
SQL and DataFrames
Spark SQL and Hive
Spark SQL and JDBC

Machine Learning

MLlib
Mahout

Spark Streaming

Streaming Overview
Streams
Structured Streaming
Lambda Streaming
Spark and Kafka

*Please Note: Course Outline is subject to change without notice. Exact course outline will be provided at time of registration.

Join an engaging hands-on learning environment, where you’ll learn:

The essentials of Spark architecture and applications
How to execute Spark Programs
How to create and manipulate both RDDs (Resilient Distributed Datasets) and UDFs (Unified DataFrames)
How to persist and restore DataFrames
Essential NoSQL access
How to integrate machine learning into Spark applications
How to use Spark Streaming and Kafka to create streaming applications

This course is about 50% hands-on labs and 50% lecture, with engaging instruction, demos, group discussions, labs, and project work.

If you’re looking to explore Spark and Hadoop in additional depth, consider Developing with Spark for Big Data (8750).

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core Spark programming and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course, students are led through a series of progressively advanced topics, each consisting of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Before attending this course, you should have:

Java programming experience
Python programming experience
Basic understanding of SQL
Comfort with navigating the Linux command line
Basic knowledge of Linux editors (such as vi or nano) for editing code

This course is intended for experienced developers and architects who seek proficiency in working with Apache Spark in an enterprise data environment.
