Course PySpark for Big Data

  • Course PySpark for Big Data : Content

    In the course PySpark for Big Data, participants learn to use Apache Spark from Python. Apache Spark is a framework for parallel processing of big data. PySpark makes Apache Spark available from the Python language.

    Spark Architecture

    The course PySpark for Big Data discusses the architecture of Spark, the Spark Cluster Manager and the difference between Batch and Stream Processing.


    The course PySpark for Big Data first discusses the Hadoop Distributed File System and parallel operations, and then covers working with RDDs (Resilient Distributed Datasets). The configuration of PySpark applications via SparkConf and SparkContext is also explained.
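    As a rough illustration of configuring an application via SparkConf and SparkContext, the sketch below builds a context for a local run. The function name `build_context`, the application name and the memory value are assumptions chosen for the example, not part of the course material; it also assumes that the `pyspark` package and a Java runtime are installed.

```python
def build_context(app_name="pyspark-course-demo", master="local[2]"):
    """Create a SparkContext from an explicit SparkConf (requires pyspark)."""
    # Imported inside the function so that defining it does not require Spark.
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName(app_name)                 # name shown in the Spark UI
        .setMaster(master)                    # "local[2]" runs locally with 2 threads
        .set("spark.executor.memory", "1g")   # example property, arbitrary value
    )
    return SparkContext(conf=conf)
```

    A typical use would be `sc = build_context()` at the start of a script and `sc.stop()` at the end. On a real cluster the master URL would normally be supplied by the cluster manager rather than hard-coded.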

    MapReduce and SQL

    Extensive attention is given to the possible operations on RDDs, including map and reduce. The use of SQL in Spark, the GraphX library and DataFrames are also discussed, as are iterative algorithms.
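    To give an impression of what map and reduce mean on partitioned data, here is a minimal plain-Python sketch. It does not use Spark; the helper names `split_into_partitions` and `map_reduce` are made up for this illustration. In PySpark the comparable pipeline would be roughly `sc.parallelize(numbers, 3).map(...).reduce(...)`.

```python
from functools import reduce
from operator import add


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_reduce(data, map_fn, reduce_fn, num_partitions=3):
    """Apply map_fn to every element, then combine the results with reduce_fn."""
    partitions = split_into_partitions(data, num_partitions)
    # Each non-empty partition is mapped and reduced on its own
    # (in Spark this work would happen on the worker nodes).
    partials = [
        reduce(reduce_fn, map(map_fn, part)) for part in partitions if part
    ]
    # The per-partition results are then combined into one final value.
    return reduce(reduce_fn, partials)


numbers = list(range(1, 11))
sum_of_squares = map_reduce(numbers, lambda x: x * x, add)
```

    Because the partial results are combined in a second step, the reduce function should be associative and commutative; addition satisfies both, so the result does not depend on how the data is partitioned.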

    MLlib library

    Finally, the course PySpark for Big Data covers machine learning with the MLlib library.

  • Course PySpark for Big Data : Training

    Audience PySpark for Big Data

    The course PySpark for Big Data is intended for developers and aspiring data analysts who want to learn how to use Apache Spark from Python.

    Prerequisites training PySpark for Big Data

    Some programming experience is helpful for following this course. Prior knowledge of Python or of big data handling with Apache Spark is not required.

    Realization course PySpark for Big Data

    The theory is covered by means of presentations. Illustrative demos clarify the concepts discussed. Theory and practice alternate, and there is ample opportunity to practice. The course times are from 9.30 am to 4.30 pm.

    Certification course PySpark for Big Data

    Participants receive an official PySpark for Big Data certificate after successful completion of the course.

  • Course PySpark for Big Data : Modules

    Module 1 : Python Primer

    Python Syntax
    Python Data Types
    List, Tuples, Dictionaries
    Python Control Flow
    Functions and Parameters
    Modules and Packages
    Iterators and Generators
    Python Classes
    Anaconda Environment
    Jupyter Notebooks

    Module 2 : Spark Intro

    What is Apache Spark?
    Spark and Python
    Py4j Library
    Data Driven Documents
    Real Time Processing
    Apache Hadoop MapReduce
    Cluster Manager
    Batch versus Stream Processing
    PySpark Shell

    Module 3 : HDFS

    Hadoop Environment
    Environment Setup
    Hadoop Stack
    Hadoop Yarn
    Hadoop Distributed File System
    HDFS Architecture
    Parallel Operations
    Working with Partitions
    RDD Partitions
    HDFS Data Locality
    DAG (Directed Acyclic Graph)

    Module 4 : SparkConf

    SparkConf Object
    Setting Configuration Properties
    Uploading Files
    Logging Configuration
    Storage Levels
    Serialize RDD
    Replicate RDD partitions

    Module 5 : SparkContext

    Main Entry Point
    Worker Nodes
    SparkContext Parameters
    RDD serializer
    JavaSparkContext instance

    Module 6 : RDDs

    Resilient Distributed Datasets
    Key-Value pair RDDs
    Parallel Processing
    Immutability and Fault Tolerance
    Transformation Operations
    Filter, groupBy and Map
    Action Operations
    Caching and persistence
    PySpark RDD Class
    count, collect, foreach, filter
    map, reduce, join, cache
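
    The RDD topics distinguish transformations (such as map and filter), which only describe a new dataset, from actions (such as collect and count), which actually compute a result. The toy class below mimics that behaviour on a single machine. `MiniRDD` is a hypothetical stand-in written for this illustration, not a PySpark class, and it leaves out partitioning and fault tolerance altogether.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD; not part of PySpark."""

    def __init__(self, source, steps=()):
        self._source = source   # original data, never modified
        self._steps = steps     # tuple of recorded transformations

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Chain the recorded steps lazily over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a result.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []


def traced_double(x):
    calls.append(x)   # record every element that is actually processed
    return 2 * x


base = MiniRDD([1, 2, 3, 4, 5])
evens_doubled = base.filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(calls)   # no element has been processed yet
result = evens_doubled.collect()   # the action triggers the computation
```

    Each transformation returns a new object and leaves `base` untouched, which mirrors the immutability of RDDs. Nothing is computed until `collect` is called.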

    Module 7 : Spark Processing

    SQL support in Spark
    Spark 2.0 Dataframes
    Defining tables
    Importing datasets
    Querying data frames using SQL
    Storage formats
    JSON / Parquet
    GraphX library overview
    GraphX APIs

    Module 8 : Broadcast and Accumulator

    Performance Tuning
    Network Traffic
    Disk Persistence
    Data Type Support
    Python’s Pickle Serializer

    Module 9 : Algorithms

    Sliding Window Operations
    Multi Batch and State Operations
    Iterative Algorithms
    Graph Analysis
    Machine Learning API
    Random Forest
    Naive Bayes
    Decision Tree

  • Course PySpark for Big Data : General

    Course Forms

    All our courses are classroom courses in which the students are guided through the material by an experienced trainer with in-depth knowledge of the subject. Theory is always interspersed with exercises.


    We also offer custom classes, in which we adjust the course content to your wishes. On request we will also discuss your practical cases.

    Course times

    The course times are from 9.30 am to 4.30 pm, but we are flexible in this. Some participants, for example, have to take children to daycare, so other times are more convenient for them. In consultation we can then agree on different course times.


    We provide the computers on which the course is held. The software required for the course is already installed on these computers, so you do not have to bring a laptop to participate. If you prefer to work on your own laptop, you are welcome to bring it. The required software is then installed at the start of the course.


    Our courses are generally given with Open Source software such as Eclipse, IntelliJ, Tomcat, PyCharm, Anaconda and NetBeans. You will receive the digital course material to take home after the course.


    The course includes lunch, which we have at a restaurant within walking distance of the course room.


    The courses are planned at various places in the country. A course takes place at a location if at least 3 people register for that location. If there are registrations for different locations, the course will take place at our main location in Houten, just south of Utrecht. A course at our main location also takes place with 2 registrations, and regularly with 1 registration. We also give courses at the customer’s location on request.


    At the end of each course, participants are requested to evaluate the course in terms of course content, course material, trainer and location. The evaluation form can be found at https://www.klantenvertellen.nl/reviews/1039545/spiraltrain?lang=en. The evaluations of previous participants and previous courses can also be found there.


    The intellectual property rights of the published course content, also referred to as an information sheet, belong to SpiralTrain. Publishing the course information or the information sheet, in written or digital form, is not allowed without the explicit permission of SpiralTrain. The course content is to be understood as the description of the course in sentences as well as the division of the course into modules and the topics within those modules.
