Kubernetes Jobs on GCP 25 Jan 2020
Disclaimer: these instructions were tested on Ubuntu 16.04 and Ubuntu 18.04. Specific commands might not generalize to other operating systems.
Google Cloud Platform (GCP) provides powerful computational resources for big data analysis. This tutorial explains how to run a batch of compute jobs on GCP using kubernetes, docker, and gcsfuse.
For the purposes of this tutorial, we will perform the task of running the SPAdes genomic assembler on a number of genomic (fastq) source files.
Install the needed prerequisites:
Install gcloud and gsutil
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
Authenticate with gcloud
gcloud auth login
You might need to configure your default GCP project. First, check the configuration settings:
gcloud config list
If the proper project is not selected, you can change it:
gcloud projects list
gcloud config set project "my-project"
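To double-check which project is now active, you can print the current setting:
gcloud config get-value project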
Install kubectl
It is recommended to install kubectl through gcloud if you will use it with GCP:
gcloud components install kubectl
A stand-alone installer is also available, but not recommended for GCP clusters: https://kubernetes.io/docs/tasks/tools/install-kubectl/
Install gcsfuse for mounting GCP buckets
export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s)
sudo apt-get update && sudo apt-get install --yes --no-install-recommends \
ca-certificates \
curl \
gnupg
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse -y
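If the install succeeded, gcsfuse should now be on your PATH; a quick check:
which gcsfuse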
Create a service account for mounting the bucket at https://console.cloud.google.com/iam-admin/serviceaccounts?project=[add your project id here] (replace the placeholder with your project id), and download a key for it in JSON format.
Install docker
sudo apt install docker.io
sudo systemctl start docker
sudo systemctl enable docker
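To verify that docker is working, you can run the hello-world image. If you would like to run docker without sudo (as in the later commands in this tutorial), add your user to the docker group and log out and back in:
sudo docker run hello-world
sudo usermod -aG docker $USER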
Create your kubernetes cluster
You have now installed all of the tools we will use in this tutorial – on to kubernetes-ing.
gcloud container clusters create \
--machine-type n1-highmem-32 \
--num-nodes 1 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=60 \
--zone us-central1-b \
--cluster-version latest \
--disk-size=20GB \
spades
Let’s examine what each of these flags does:
- machine-type: determines the available cores and RAM. The available machine types are listed here: https://cloud.google.com/compute/docs/machine-types
- num-nodes: the initial number of nodes to create in our cluster (within the default node pool).
- enable-autoscaling: allows the cluster to grow and shrink in number of nodes based on compute demands
- min-nodes: the minimum number of nodes the cluster can reach when autoscaling is enabled (I like to set this to 0 to avoid wasting resources).
- max-nodes: the maximum number of compute nodes that can be created in the cluster.
- zone: geographic zone in which the cluster is created (generally this only matters for compatibility with other resources such as storage buckets)
- disk-size: the boot disk size of each node in the cluster. Since we will be running our jobs inside docker images and mounting data from buckets, this value can be relatively small.
The name of the cluster we created is “spades”.
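To confirm that the cluster came up (and to see its status, size, and version), you can list your clusters:
gcloud container clusters list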
Set your GCP account as an admin account on the cluster:
kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole=cluster-admin \
--user=annashch@stanford.edu
Make sure kubectl is pointing to the cluster you just created:
gcloud container clusters get-credentials spades --zone us-central1-b
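You can double-check which cluster kubectl is pointing at, and that its node is visible:
kubectl config current-context
kubectl get nodes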
Create and populate GCP storage bucket
gsutil mb -l us-central1 gs://keratinocytes/
This creates a bucket called “keratinocytes” in the region “us-central1”. Bucket locations are regions rather than zones, so make sure the region matches the zone you set in the cluster specification above (the us-central1-b zone is in the us-central1 region).
Copy local files into your bucket:
gsutil cp -r [local_files_or_folder] gs://keratinocytes/
Make sure the files were copied:
gsutil ls gs://keratinocytes/
Verify that the bucket can be mounted locally with gcsfuse
First, obtain a json key file for mounting the bucket (see above)
mkdir /mnt/data
gcsfuse --key-file key.json --implicit-dirs -o allow_other keratinocytes /mnt/data
ls /mnt/data
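If you also want to confirm that writes propagate back to the bucket, a quick round-trip test looks like this (the test filename is arbitrary; this assumes the service account has write access to the bucket):
touch /mnt/data/gcsfuse-test.txt
gsutil ls gs://keratinocytes/
rm /mnt/data/gcsfuse-test.txt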
Note: somewhat unintuitively, the bucket name is not prefixed with “gs://” in the gcsfuse mount command. To unmount the bucket, run:
fusermount -u /mnt/data
Create an account on dockerhub
Refer to the instructions here: https://hub.docker.com/signup (Dockerhub accounts are free).
Generate a dockerfile with your compute image
This is an example for our specific task. In this Dockerfile we:
1) Start with a base ubuntu bionic image.
2) Install the gcsfuse software for mounting GCP buckets. (We already did this on our local machine; the same series of commands can be added to the Dockerfile to install gcsfuse within the docker image.)
3) Copy the key.json file for mounting gcsfuse buckets to the image.
4) Copy additional scripts for mounting the gs://keratinocytes bucket, unmounting the bucket, and running our script of interest (the SPAdes assembler in this case).
5) Install the SPAdes assembler.
Dockerfile:
FROM ubuntu:bionic
ENV GCSFUSE_REPO=gcsfuse-bionic
RUN apt-get update && apt-get install --yes --no-install-recommends \
ca-certificates \
curl \
gnupg \
wget \
python \
python-dev \
&& echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&& apt-get update \
&& apt-get install gcsfuse -y \
&& apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN mkdir /mnt/data
COPY key.json /etc/key.json
COPY mount.sh /opt/mount.sh
COPY umount.sh /opt/umount.sh
#install spades
WORKDIR /opt
RUN wget http://cab.spbu.ru/files/release3.14.0/SPAdes-3.14.0-Linux.tar.gz
RUN tar -xzvf SPAdes-3.14.0-Linux.tar.gz
ENV PATH="/opt/SPAdes-3.14.0-Linux/bin:${PATH}"
COPY script.sh /opt/script.sh
RUN chmod 777 /opt/*sh
CMD ["sleep", "360"]
In this Dockerfile, we copy “key.json”, “mount.sh”, “umount.sh”, and “script.sh” to the docker image. The key file is the same service account key you downloaded earlier.
The contents of the other files are as follows:
mount.sh:
#!/bin/bash
gcsfuse --key-file /etc/key.json --implicit-dirs -o allow_other keratinocytes /mnt/data
umount.sh:
#!/bin/bash
fusermount -u /mnt/data
script.sh:
#!/bin/bash
echo "running mount"
bash /opt/mount.sh
echo "mount successful"
echo "running spades"
spades.py --pe1-1 /mnt/data/$1.end1.trimmed.fastq.gz --pe1-2 /mnt/data/$1.end2.trimmed.fastq.gz -t 20 -o /mnt/data/SPAdes-3.14-$1
echo "ran ls"
bash /opt/umount.sh
echo "unmounting"
Now, build and push your image to dockerhub
cd <directory-where-Dockerfile-is-stored>
docker build -t spades_gcp .
Use the “docker images” command to view your newly created image:
docker images
The output will look similar to this:
REPOSITORY TAG IMAGE ID CREATED SIZE
kundajelab/spades_gcp latest eb64bbfe1edb 16 hours ago 288MB
Tag your docker image and push it to dockerhub
docker tag eb64bbfe1edb kundajelab/spades_gcp:eb64bbfe1edb
The syntax above indicates that the image with id eb64bbfe1edb will be pushed to the kundajelab dockerhub organization (you can use your own user account in place of kundajelab), to a repository called spades_gcp, with the unique tag eb64bbfe1edb. The tag can be whatever you wish; a common practice is to use the word “latest”, but my preference is to have each remote tag be a unique identifier. Whichever you choose, make sure the image tag referenced in the kubernetes yaml files below matches a tag you actually pushed.
Now, login to dockerhub and push your built image:
docker login
You will be prompted to provide the username and password you created for dockerhub above.
docker push kundajelab/spades_gcp:eb64bbfe1edb
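Optionally, you can smoke-test the image locally before handing it to the cluster. This is just a sketch: --privileged is needed so that gcsfuse can mount the bucket inside the container, the sample name is the same one passed to the kubernetes job below, and this will actually launch the assembler, so you may want to point it at a small test sample:
docker run --rm --privileged kundajelab/spades_gcp:eb64bbfe1edb /opt/script.sh keratinocytes-0.5d-rep1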
Create kubernetes yaml files to submit your job – If you only need to run a handful of jobs
spades_job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: spades1
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: spades
        image: kundajelab/spades_gcp:latest
        resources:
          requests:
            memory: 50Gi
            cpu: 15
          limits:
            memory: 100Gi
            cpu: 15
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        command: ["/opt/script.sh"]
        args: ["keratinocytes-0.5d-rep1"]
      restartPolicy: OnFailure
  backoffLimit: 1
Let’s break down what we have specified in the yaml file:
- metadata.name – the name of the job (spades1)
- spec (outer) – the specification of the Job:
  - ttlSecondsAfterFinished – cleans up resources for finished Jobs within 30 seconds of their completion
  - template – template for the Pod that will execute the Job
  - backoffLimit – number of times to try running the Job before giving up
- Pod template:
  - spec (inner) – the specification of the Pod:
    - restartPolicy – the Pod will be restarted if it fails (as long as the backoffLimit has not yet been exceeded)
    - containers – the container that runs the job:
      - image – the docker image we pushed to dockerhub (make sure this image is set to Public on Dockerhub)
      - resources – requested and maximum CPU and RAM for the job
      - securityContext – run the job with admin privileges in the docker container (this is needed for mounting buckets with gcsfuse)
      - command – command to execute in the Job
      - args – arguments passed to the command
Submit the job as follows:
kubectl apply -f spades_job.yaml
Monitor job execution
kubectl get all
gives:
NAME READY STATUS RESTARTS AGE
pod/spades-hsrx5 1/1 Running 0 14h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.35.240.1 <none> 443/TCP 15h
NAME COMPLETIONS DURATION AGE
job.batch/spades 0/1 14h 14h
Use the “describe” command to check the status of a job or pod.
kubectl describe job.batch/spades
tells us the job is running:
Name: spades
Namespace: default
Selector: controller-uid=059cc91e-0e02-4ecd-89aa-3d347c55579e
Labels: controller-uid=059cc91e-0e02-4ecd-89aa-3d347c55579e
job-name=spades
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"spades","namespace":"default"},"spec":{"backoffLimit":1,"templa...
Parallelism: 1
Completions: 1
Start Time: Sat, 25 Jan 2020 03:16:44 -0800
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
...
And same for a pod:
kubectl describe pod/spades-hsrx5
Name: spades-hsrx5
Namespace: default
Priority: 0
Node: gke-spades-default-pool-eeea0ebb-3144/10.128.0.42
Start Time: Sat, 25 Jan 2020 03:16:44 -0800
Labels: controller-uid=059cc91e-0e02-4ecd-89aa-3d347c55579e
job-name=spades
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container spades
Status: Running
...
Use the “logs” command to view a pod’s logs.
kubectl logs pod/spades-hsrx5
running mount
Using mount point: /mnt/data
Opening GCS connection...
Opening bucket...
Mounting file system...
File system has been successfully mounted.
mount successful
running spades
...
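Two other commands that come in handy are streaming a pod’s logs as the job runs, and deleting a job (along with its pod) if something goes wrong:
kubectl logs -f pod/spades-hsrx5
kubectl delete job.batch/spades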
Submitting a large number of jobs
This gets tricky… GCP and AWS functionality is largely analogous, with the exception of batch job submission, as documented in this service comparison released by GCP in November 2018.
There are tools that can help achieve the SLURM-like batch functionality we have come to know and love. These are useful for production-quality kubernetes clusters and worth learning – but the learning curve can be a bit steep. (Comment below if you’ve had success running GCP batch jobs with these tools.)
If you’re a grad student trying to process some data or train some machine learning models as painlessly as possible, a “hack” is as follows:
1) Create a template yaml file with placeholders for the job name and the script arguments, as follows:
spades_job.template.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: NAME
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: spades
        image: kundajelab/spades_gcp:latest
        resources:
          requests:
            memory: 50Gi
            cpu: 15
          limits:
            memory: 100Gi
            cpu: 15
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        command: ["/opt/script.sh"]
        args: ["ARGS"]
      restartPolicy: OnFailure
  backoffLimit: 1
Notice that this template is the same as “spades_job.yaml” above, except that “NAME” and “ARGS” stand in for the actual job name and arguments.
We can write a batch script that iterates through all samples to be processed and uses the “sed” command to replace these placeholders with actual job and sample names:
make_job_yamls.sh
#!/bin/bash
num_tasks=`cat tasks | wc -l`
echo $num_tasks
for i in `seq 1 $num_tasks`
do
sample=`head -n $i tasks | tail -n1`
task_name=spades$i
echo $sample
echo $task_name
sed "s/NAME/$task_name/g" spades_job.template.yaml > tmp
sed "s/ARGS/$sample/g" tmp > job_$task_name.yaml
rm tmp
done
Where the tasks file is a list of sample names to analyze:
head tasks
keratinocytes-0.5d-rep1
keratinocytes-0.5d-rep2
keratinocytes-0h-rep1
keratinocytes-0h-rep2
This will produce a yaml file for each job with a unique job name.
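The generated yaml files can then be submitted in a loop, for example (assuming they all match the job_spades*.yaml pattern produced by make_job_yamls.sh):
for f in job_spades*.yaml
do
kubectl apply -f $f
done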