KSU Big Data on-line course

 Comments: DaveTurner@ksu.edu

K-State has remotely hosted a Big Data workshop taught by the Pittsburgh Supercomputing Center on their Bridges supercomputer. This course is entirely based on videos from their two-day workshop taught April 7-8, 2020. Anyone taking this on-line course can go through the videos at their own pace, perform the exercises and homework assignments, and even test themselves using the quizzes at the end.

Everything needed has been adapted and tested for the Beocat cluster at Kansas State University and the BeoShock cluster at Wichita State University. Each user can run their jobs interactively or submit them through the Slurm batch scheduler.

For each section, there is a video to listen to and some PDF slides that you can follow along with, plus directions here on how to do the same work on Beocat/BeoShock. The > sign at the start of lines below represents the command line prompt on Beocat/BeoShock, and >>> represents the prompt you'll get when you start pyspark or python if you run interactively.

Welcome

ssh into Beocat from your computer and copy the workshop data to your home directory.

> cp -rp ~daveturner/public_html/bigdata_workshop .
> cd bigdata_workshop

Welcome video 1 - start --> 27:30 mark

Intro to Big Data

Intro to Big Data video 1 - 0:38:50 mark --> 1:11:30 mark

Hadoop

Intro to Big Data video 1 - 1:11:30 mark --> 1:21:30 mark

We do not have Hadoop on Beocat so the commands they cover will not work locally. Hadoop is somewhat deprecated, with Spark taking over much of the work.

Introduction to Spark

Introduction to Spark video 1 - 1:21:30 mark --> 2:00:00 mark

The link below shows how to load the Spark and Python modules on Beocat, set up the Python virtual environment, and run Spark code interactively or through the Slurm scheduler.

https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark

At the end of video 1 there is a hands-on exercise with 5 parts (1:53:30 mark). You should start by doing these interactively, then at the end you can put them all into a job script to submit to Slurm if you'd like. The first 3 steps of the exercise are fairly straightforward, followed by 2 more challenging steps that are designed to teach you some fundamental lessons about working with Spark.

Request 1 core on an Elf node on Beocat for interactive use, then load the modules (on BeoShock leave off the '-C elves' constraint).


  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash

  > module purge
  > module load Spark
  > module load Python

  > source ~/virtualenvs/spark-test/bin/activate

  > pyspark
  >>>
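
Once you are at the >>> prompt you have an interactive PySpark session, with the SparkContext available as sc. As a quick warm-up for the exercise, here is a minimal word-count sketch (the file name is an assumption; use whichever text file came with the workshop data):

  >>> # Read the text file into an RDD of lines
  >>> lines = sc.textFile('Complete_Shakespeare.txt')
  >>> lines.count()
  >>> # Split each line into words, then count occurrences of each word
  >>> words = lines.flatMap(lambda line: line.split())
  >>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
  >>> # Show the 5 most common words
  >>> counts.takeOrdered(5, key=lambda kv: -kv[1])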

There are also 3 homework assignments that can be done in addition to the 5 exercises.

Shakespeare Homework Problems video 1a - 5.5 minutes

If you are doing this workshop as part of a class, your instructor will give you directions on how to turn in the exercise and homework answers. If you are taking this on your own, answers are available on request.

Spark - afternoon

Spark - afternoon session video 2 - 1 hour 25 minutes plus 35 minutes of questions

Clustering docs

KMeans docs
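
If you want to try the clustering material yourself, here is a minimal KMeans sketch to run at the pyspark >>> prompt using pyspark.ml (the 2-D points are made up for illustration; the workshop's own demos may use a different API):

  >>> from pyspark.ml.clustering import KMeans
  >>> from pyspark.ml.linalg import Vectors
  >>> # Two well-separated clusters of made-up 2-D points
  >>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
  ...         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
  >>> df = spark.createDataFrame(data, ['features'])
  >>> model = KMeans(k=2, seed=1).fit(df)
  >>> model.clusterCenters()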

Machine Learning: Recommender System for Spark

Recommender system video 3 - 1 hour 57 minutes

If you want to run demos and exercises interactively, request 1 core on an Elf node for interactive use, then load the modules and activate your Python virtual environment (on BeoShock leave off the '-C elves' constraint, as before).


  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash

  > module purge
  > module load Spark
  > module load Python

  > source ~/virtualenvs/spark-test/bin/activate    (your python environment may have a different name)
 

Watch the video 'Machine Learning Recommender System With Spark - Big Data Video 4' (slides are A_Recommender_System.pdf). Do the 3 exercises at the 1:06 mark in the video and email Dan your answers and a summary of how you did on your own.
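
One common way to build a recommender in Spark is the alternating least squares (ALS) algorithm in pyspark.ml; whether or not recommender.py takes exactly this approach, a minimal sketch with made-up ratings gives the flavor:

  >>> from pyspark.ml.recommendation import ALS
  >>> # Made-up (user, movie, rating) triples for illustration
  >>> ratings = spark.createDataFrame(
  ...     [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0),
  ...      (1, 2, 5.0), (2, 1, 1.0), (2, 2, 4.0)],
  ...     ['userId', 'movieId', 'rating'])
  >>> als = ALS(userCol='userId', itemCol='movieId', ratingCol='rating',
  ...           rank=5, maxIter=5, seed=1)
  >>> model = als.fit(ratings)
  >>> # Predicted ratings for the pairs we trained on
  >>> model.transform(ratings).show()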


  Demos and exercises can be run on the node you're on using spark-submit

  > spark-submit recommender.py

  You can also start pyspark and use it interactively

  > pyspark
  >>>

  Or the recommender.py script can be run using the job script sb.recommender

  > sbatch sb.recommender
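
  If you are curious what goes into such a job script, here is a minimal
  sketch (assumed contents -- the actual sb.recommender in the workshop
  data may differ in its resource requests):

  #!/bin/bash
  #SBATCH -J recommender
  #SBATCH -n 1
  #SBATCH -t 1:00:00
  #SBATCH --mem=10G

  # Load the same modules and virtual environment used interactively
  module purge
  module load Spark
  module load Python
  source ~/virtualenvs/spark-test/bin/activate

  spark-submit recommender.py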
 

Deep Learning with TensorFlow

TensorFlow video 4 - 3 hours 12 minutes

You can do the demos on Beocat if you want. There is a warning that the MNIST data will be deprecated in the future.


  > module purge
  > module load TensorFlow

  > source ~/virtualenvs/spark-test/bin/activate

  > python
  >>>
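
Once you are at the python >>> prompt, a minimal tf.keras sketch on the MNIST digits might look like this (an illustration, not the workshop's demo code):

  >>> import tensorflow as tf
  >>> # Load the MNIST digit images (28x28 grayscale) and their labels
  >>> (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
  >>> # A small fully-connected classifier
  >>> model = tf.keras.Sequential([
  ...     tf.keras.layers.Flatten(input_shape=(28, 28)),
  ...     tf.keras.layers.Dense(128, activation='relu'),
  ...     tf.keras.layers.Dense(10, activation='softmax')])
  >>> model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
  ...               metrics=['accuracy'])
  >>> model.fit(x_train / 255.0, y_train, epochs=1)
  >>> model.evaluate(x_test / 255.0, y_test)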
 

Bridges

TensorFlow Adjourn - ? minutes