Introduction to Spark

Video:

Introduction to Spark (KSU Big Data course - video 1, from 1:21:30 to end at 2:00:42)

The following link shows how to load the Spark and Python modules on Beocat, set up the Python virtual environment, and run Spark code interactively or through the Slurm scheduler: https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark

Exercise:

At the end of video 1 there is a hands-on exercise with 5 parts (1:53:30 mark). Start by doing these interactively; at the end you can put them all into a job script to submit to Slurm if you’d like. The first 3 steps of the exercise are fairly straightforward, followed by 2 more challenging steps designed to teach you some fundamental lessons about working with Spark. Request 1 core on Beocat (or Beoshock) for interactive use, then load the modules.

> srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G --pty bash
> module purge
> module load Spark
> source ~/.virtualenvs/python-3.7.4/bin/activate

Demos and exercises can be run on the node you’re on using spark-submit:

> spark-submit shakespeare.py
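If you later want to run everything as a batch job rather than interactively, a Slurm script along the following lines could be submitted with sbatch. This is a sketch: the resource requests mirror the srun line above, and shakespeare.py and the python-3.7.4 virtualenv path are just the examples used on this page.

```shell
#!/bin/bash
#SBATCH -J spark-exercises     # job name (your choice)
#SBATCH -N 1                   # 1 node
#SBATCH -n 1                   # 1 core
#SBATCH -t 24:00:00            # 24-hour time limit
#SBATCH --mem=10G              # 10 GB of memory

# same environment setup as the interactive session
module purge
module load Spark
source ~/.virtualenvs/python-3.7.4/bin/activate

spark-submit shakespeare.py
```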

You can also start pyspark and use it interactively:

> pyspark
>>>
Homework:
If you are doing this workshop as part of a class, follow the instructions below on how to turn in the exercise and homework answers. If you are taking this on your own, answers are available on request.

Shakespeare Homework Problems video 1a - 5.5 minutes

There are five problems titled “exercise” and three titled “homework.” The 5 Shakespeare exercises are on page 16 of intro_To_Spark.pdf; the 3 Shakespeare homework problems are on page 19 of the pdf.

You will need to start by getting an input file unique to your username, so your results may not be the same as other students'. The commands below show how to use the bigdata command to get these input files and how to use it to check your results. Since this is an auto-checker routine, you need to be very precise when following these directions.

You will also need to set up a virtual environment as described in the Beocat documentation, such as mine named ‘python-3.7.4’ below, and install the numpy and nltk libraries.

module purge
module load Spark
source ~/.virtualenvs/python-3.7.4/bin/activate
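The numpy and nltk installs mentioned above can be done once, inside the activated virtual environment. Note that nltk's English stopword list, used in the homework, also needs a one-time corpus download (these are the standard pip and nltk commands; adjust if your environment differs):

```shell
# one-time setup inside the activated virtualenv
pip install numpy nltk
# fetch the stopwords corpus needed for the stopword-removal homework
python -c "import nltk; nltk.download('stopwords')"
```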

/homes/dan/625/BigData/bigdata --help
/homes/dan/625/BigData/bigdata --inputfile shakespeare

/homes/dan/625/BigData/bigdata --check shakespeare_username.py shakespeare_username.results

Note: if working on Beoshock, replace /homes/dan/625/BigData/bigdata with /home/c297w489/BigData/bigdata

The first command provides usage info; the second returns a shakespeare_username.input file to use for these exercises and homework problems, where ‘username’ is your username; and the third automatically logs your code and checks the results you submit. Both you and Dan will get a report of whether your exercises are correct. You may continue to submit until you get the results correct if you’d like. Before you submit your code for checking, you must have the Spark module loaded and your virtual environment sourced, as in the example above.

The information you provide in shakespeare_username.results should contain 6 lines, with no extraneous information. After exercise 1, do homework 1: remove all punctuation (use a Python regex and leave just letters and spaces). After exercise 2, remove the stopwords using the nltk stopwords list (pyspark's StopWordsRemover will give you a different result). Then do the stemming before proceeding to exercise 3.

  • Exercise 1) The number of lines
  • Exercise 2) The number of words
  • Homework 2) The number of words after removing stopwords
  • Exercise 3) The number of distinct words after removing stopwords
  • Homework 3) The number of distinct words after applying stemming
  • Exercise 5) The top 5 most frequent words using .top(5) (cut and paste the output of .top(5) into your results file)
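The cleaning steps described above can be sketched as plain Python helper functions of the kind you would pass to Spark's map/flatMap/filter. This is illustrative only: the homework requires nltk's actual English stopword list (nltk.corpus.stopwords) and an nltk stemmer such as PorterStemmer, so the tiny stopword set below is a stand-in, not the real list.

```python
import re

# Stand-in stopword set for illustration only -- the homework must use
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"the", "a", "and", "to", "of", "is"}

def clean_line(line):
    """Lowercase and strip everything except letters and spaces (homework 1)."""
    return re.sub(r"[^a-z ]", "", line.lower())

def split_words(line):
    """Clean a line and split it into words."""
    return clean_line(line).split()

def remove_stopwords(words):
    """Drop stopwords from a list of words (homework 2 uses nltk's list)."""
    return [w for w in words if w not in STOPWORDS]

# In Spark these helpers would be applied per record, e.g.:
#   words = lines.flatMap(split_words)
#   kept  = words.filter(lambda w: w not in STOPWORDS)

demo = "To be, or not to be: that is the question!"
print(remove_stopwords(split_words(demo)))
# -> ['be', 'or', 'not', 'be', 'that', 'question']
```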
Slides:

intro_To_Spark.pdf