Submitting Jobs to the Stanford Cluster via SLURM


I do a lot of coding at Stanford on my laptop. It's a wonderful machine, but it simply cannot handle huge amounts of number-crunching. Most research organizations provide computing clusters fully equipped with state-of-the-art software in order to help out folks like me.

Clusters are helpful not only when you have one giant job, but also when you have hundreds of small jobs that you want to run in parallel rather than in series on your own machine.

Today I'm going to share how to submit large number-crunching jobs to Stanford's computing cluster. Many of the same steps apply to similar clusters elsewhere.

Logging In

The first step is to log into the computer cluster. At the time of this writing, the old FarmShare cluster is being phased out and is being replaced by FarmShare2. Log into one of the rice nodes with the following command. (Note that in this tutorial, anywhere that you see jamwheel, you should replace that with your username.)

ssh jamwheel@rice.stanford.edu

You will be prompted for your password, and then to complete two-factor authentication if you're not already on a trusted domain. Your prompt should now read something like jamwheel@rice08.
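If you log in often, you can optionally add a shortcut to ~/.ssh/config on your laptop so you don't have to type the full address each time (the alias rice here is just a suggestion):

Host rice
    HostName rice.stanford.edu
    User jamwheel

After that, ssh rice is all you need.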

Setting up the Environment

The cluster comes with a vanilla version of python3 installed. You can tell from the welcome text that appears when you type python3 in the terminal. The cluster also has some scientific python packages, which are available through anaconda3. To access these, type

module load anaconda3

You can test that it worked by typing python3. You should see that you are running in an environment that has access to packages like numpy and pandas.
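For example, the following imports should succeed and report whichever versions the cluster's anaconda3 build ships (the exact numbers will vary):

python3
>>> import numpy, pandas
>>> numpy.__version__
>>> pandas.__version__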

You may also want to install your own site packages. We'll cover how to do this in a moment, but we should first tell python to look for packages in these locations. One of the recommended locations is /home/jamwheel/.local/lib/python3.5/site-packages. We first have to create this directory by typing

mkdir -p /home/jamwheel/.local/lib/python3.5/site-packages

Next we tell python to look in this directory for packages. We can do this by updating the environment variable PYTHONPATH

export PYTHONPATH=/home/jamwheel/.local/lib/python3.5/site-packages:$PYTHONPATH
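You can confirm that python picked up the new directory by inspecting its search path:

python3
>>> import sys
>>> '/home/jamwheel/.local/lib/python3.5/site-packages' in sys.path
True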

Typing these commands at every login may get tedious, so we can store them in a file that gets executed with every new session. Open up ~/.bashrc using your favorite editor (like vi or nano) and append the following lines:

module load anaconda3
export PYTHONPATH=/home/jamwheel/.local/lib/python3.5/site-packages:$PYTHONPATH
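These lines will now run automatically at every login. To apply them to your current session without logging out, run

source ~/.bashrc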

Installing Your Code

A lot of my simulations rely on custom packages that I write. I like to keep all of my code in git, which makes it really handy to push code from machine to machine. If you want to be able to deploy libraries of code via a git push, here are some quick instructions to set it up. Otherwise, you can skip this section.

These commands create a folder that will hold development code (in this case, a repository called models). Inside of this new folder, we set up an empty git repository, and tell it to update the working tree whenever we push to it.

mkdir -p ~/dev/models
cd ~/dev/models
git init
git config receive.denyCurrentBranch updateInstead

On your local machine, navigate to wherever your git repository is, and issue the following commands

git remote add rice jamwheel@rice.stanford.edu:/home/jamwheel/dev/models
git push rice

If all goes smoothly, you should see the repo reflected on the rice server.
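Back on rice, you can verify that the push updated the working tree:

cd ~/dev/models
git log --oneline -1

You should see the commit you just pushed.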

There is one final thing we need to do in order for python3 to find your code. We need to make a symbolic link from the site-packages folder to your dev folder. My convention is to have the repository root live in ~/dev/models and have the actual package in ~/dev/models/models, which allows for any accompanying documentation to live in ~/dev/models/docs. With this structure, we can make the following symbolic link:

ln -s /home/jamwheel/dev/models/models /home/jamwheel/.local/lib/python3.5/site-packages/models

To test this, we can type the following commands and check for errors

python3
>>> import models
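Assuming models is a regular package with an __init__.py, you can also confirm where the import resolved:

>>> models.__file__

This should print a path under the site-packages directory, reaching your dev folder through the symlink.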

Submitting a Job to the Cluster

For simulations, I like to keep a folder called sims that lives under my home directory. I keep files related to each project in subdirectories. Let's create a test project folder by typing

mkdir -p ~/sims/test
cd ~/sims/test
mkdir logs
touch test.py
chmod +x test.py

Then, using your favorite text editor, let's fill test.py with the following contents

#!/usr/bin/env python3

#SBATCH --job-name="test"
#SBATCH -o logs/%j.out
#SBATCH -e logs/%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jamwheel@stanford.edu

import os
import sys
import pandas as pd  # comes from the anaconda3 module; importing it verifies the environment

# SLURM sets SLURM_JOBID for scheduled jobs; fall back to 'manual' for manual runs
job_id = os.getenv('SLURM_JOBID', 'manual')

print(job_id)
sys.stdout.flush()  # push buffered output into the log file immediately

The first line tells the interpreter to use python3. The second block contains options that are passed to SLURM when we schedule the job on the cluster. The next block shows how to access the id of the SLURM job. Note that if you run this file manually, SLURM_JOBID will not be set, and the fallback value (manual) will be returned instead.

The final block gives an example of how to write output into the log file. We've set SBATCH to echo all output and errors into a file in logs that's identified by the job number at runtime. However, the environment will buffer the output unless you flush it. If you want to see your output appear in the log files in real time as your program runs, put a sys.stdout.flush() after any important print statements (in Python 3, passing flush=True to print does the same thing).
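With everything in place, submit the job from ~/sims/test so that the relative logs/ path resolves correctly:

cd ~/sims/test
sbatch test.py

You can check on the job with squeue -u jamwheel, and once it completes you should find the job id printed in logs/<jobid>.out.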