Genomic pipelines in Kundaje lab

Managing multiple pipelines

./utils/bds_scr is a BASH script to create a detached screen for a BDS script and redirect stdout/stderr to a log file [LOG_FILE_NAME]. If a log file already exists, stdout/stderr will be appended to it. Monitor a pipeline with tail -f [LOG_FILE_NAME].

The only difference between bds_scr and bds is that [SCR_NAME] [LOG_FILE_NAME] goes between the bds command and its parameters (or the BDS script name). If you omit [LOG_FILE_NAME], a log file named [SCR_NAME].log is created in the working directory. You can also add any BDS parameters (like -dryRun, -d and -s). The third example below runs a pipeline on Sun Grid Engine. Use bds_scr_5min instead of bds_scr to avoid running multiple pipelines on the same data set and output directory: bds_scr_5min does not start a screen if the log file was modified within the last 5 minutes.

$ bds_scr [SCR_NAME] [LOG_FILE_NAME] [PIPELINE.BDS] ...
$ bds_scr [SCR_NAME] [PIPELINE.BDS] ...
$ bds_scr [SCR_NAME] [LOG_FILE_NAME] -s sge [PIPELINE.BDS] ...
$ bds_scr_5min [SCR_NAME] [PIPELINE.BDS] ...
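The freshness check behind bds_scr_5min can be approximated in a few lines of shell. This is only a sketch of the idea, not the actual source of the script: the function names log_is_fresh and start_if_idle are hypothetical, and the last line only prints the bds_scr command instead of running it.

```shell
# Hypothetical helper: succeed if the log file exists and was modified
# less than N minutes ago (N defaults to 5).
log_is_fresh() {
  # $1 = log file, $2 = freshness window in minutes
  [ -f "$1" ] && [ -n "$(find "$1" -mmin "-${2:-5}" 2>/dev/null)" ]
}

# Hypothetical wrapper: refuse to start when the log still looks active.
start_if_idle() {
  # $1 = screen name, $2 = log file (default: [SCR_NAME].log)
  log="${2:-$1.log}"
  if log_is_fresh "$log"; then
    echo "skip: $log was modified within the last 5 minutes"
    return 1
  fi
  # A real script would create the detached screen here.
  echo "start: bds_scr $1 $log ..."
}
```

A log file that is still being written to is a cheap signal that another pipeline is using the same output directory, which is why the modification time is used here rather than a lock file.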

Once the pipeline run is done, the screen will be automatically closed. To kill a pipeline manually while it’s running:

$ kill_scr [SCR_NAME]
$ screen -X -S [SCR_NAME] quit

Specifying a cluster engine

You can run a BDS pipeline on a specific cluster engine. Choose your system with -s (local: UNIX threads, sge: Sun Grid Engine, slurm: SLURM, …).

$ bds -s [SYSTEM] [PIPELINE.BDS] ...

Modify $HOME/.bds/bds.config or ./default.env to change your default system. The following example sets Sun Grid Engine (sge) as the default system, so you no longer need to add -s sge to the command line.

#system = local
system = sge
#system = slurm
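To check which system is currently active in a configuration file, you can look for the uncommented system line. The helper below is a minimal sketch: the name active_system is hypothetical, and it assumes that the last uncommented definition is the one that counts.

```shell
# Hypothetical helper: print the value of the uncommented "system" key.
active_system() {
  # $1 = path to a config file such as $HOME/.bds/bds.config
  # Lines starting with "#" do not match, so commented-out systems are ignored.
  sed -n 's/^[[:space:]]*system[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" | tail -n 1
}
```

For the three lines above, active_system prints sge.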

Additional changes to bds.config are needed to configure your cluster engine correctly. Read more here.

On the Kundaje lab, SCG and Sherlock clusters, Sun Grid Engine and SLURM are already set up. For AQUAS pipeline modules, SLURM is implemented as a generic cluster (system = generic).

Resource settings

Most clusters enforce resource limits and reject jobs submitted without them. By default, walltime is 23 hours and max. memory is 12GB. To change them, add the following parameters to the command line. -mem does not apply to jobs that have their own max. memory parameters (e.g. -mem_spp for spp, -mem_bwa for bwa, …).

-wt [WALLTIME; examples: 5:50:00, 10h20m, 7200] -mem [MAX_MEMORY; examples: 5G, 2000K]
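The three walltime notations in the examples can be compared by converting them to seconds. The function below is a sketch for illustration only: walltime_to_seconds is a hypothetical name, it is not how the pipeline itself parses -wt, and it assumes that a plain number means seconds.

```shell
# Hypothetical helper: convert H:MM:SS, NhNmNs or plain seconds to seconds.
walltime_to_seconds() {
  printf '%s\n' "$1" | awk '
    # H:MM:SS notation, e.g. 5:50:00
    /^[0-9]+:[0-9]+:[0-9]+$/ {
      split($0, t, ":"); print t[1] * 3600 + t[2] * 60 + t[3]; ok = 1; exit
    }
    # Unit notation, e.g. 10h20m (each unit is optional, at least one is required)
    /^([0-9]+h)?([0-9]+m)?([0-9]+s)?$/ && /[hms]$/ {
      h = m = s = 0
      if (match($0, /[0-9]+h/)) h = substr($0, RSTART, RLENGTH - 1)
      if (match($0, /[0-9]+m/)) m = substr($0, RSTART, RLENGTH - 1)
      if (match($0, /[0-9]+s/)) s = substr($0, RSTART, RLENGTH - 1)
      print h * 3600 + m * 60 + s; ok = 1; exit
    }
    # Plain number, assumed to be seconds
    /^[0-9]+$/ { print $0 + 0; ok = 1; exit }
    END { exit !ok }
  ' || {
    echo "unrecognized walltime: $1" >&2
    return 1
  }
}
```

With this sketch, 5:50:00 gives 21000, 10h20m gives 37200 and 7200 stays 7200.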

You can also specify walltime and max. memory for a specific job (with -wt_[APP_NAME] [WALLTIME] and -mem_[APP_NAME] [MAX_MEM]). To see which jobs have their own resource settings, run the pipeline without parameters ($ bds [PIPELINE_BDS]); it displays all parameters, including resource settings, with help text. The following line is an example that increases walltime and max. memory for MACS2 peak calling.

-wt_macs2 10h30m -mem_macs2 15G

Note that max. memory defined with -mem_XXX is NOT PER CPU! If your system (either local or cluster engine) doesn’t limit walltime and max. memory for jobs, add the following to the command line. Pipeline jobs will run without resource restriction.

-unlimited_mem_wt

Debugging BDS pipelines

1. Take a look at the HTML report, which contains STDERR/STDOUT for all jobs in the pipeline. Find the stage that failed and read its system messages (STDERR and STDOUT) carefully. The report is generated automatically by BDS and is located in the working folder with the name [PIPELINE_NAME]_[TIMESTAMP]_report.html.

2. Correct errors.

2.1. Lack of memory: increase memory for all jobs (e.g. add -mem 20G) or for a specific problematic job (e.g. add -mem_macs2 20G).

2.2. Timeout: increase walltime for all jobs (e.g. add -wt 24h) or for a specific long job (e.g. add -wt_macs2 200h). (Warning! Most clusters limit walltime. Keep it as short as you can so that your queued jobs are executed quickly.)

2.3. Wrong input: check that all input files are available.

2.4. Software error: use the recommended software versions.

3. Resume the pipeline with the same command line that you used to start it. Previously successful stages will be skipped automatically.

# make BDS verbose
$ bds -v [PIPELINE.BDS] ...

# display debugging information
$ bds -d [PIPELINE.BDS] ...

# test run (this actually does nothing) to check input/output file names and commands
$ bds -dryRun [PIPELINE.BDS] ...
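Because each run writes its own [PIPELINE_NAME]_[TIMESTAMP]_report.html, a working folder can collect many reports. The helper below is a small sketch for picking the most recent one; latest_report is a hypothetical name and it relies on file modification times, not on the timestamp in the file name.

```shell
# Hypothetical helper: print the most recently modified BDS HTML report.
latest_report() {
  # $1 = working directory (default: current directory)
  # ls -t sorts newest first; nothing is printed if no report exists.
  ls -t "${1:-.}"/*_report.html 2>/dev/null | head -n 1
}
```

For example, latest_report . prints the path of the newest report in the current folder, which you can then open in a browser.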

Species file

There are many species-specific parameters, such as indices (bwa, bowtie, …), chromosome sizes and sequence files (chr*.fa). If you run multiple pipelines, it is inconvenient to define all of them on the command line for each run. If you instead define them once in a species file, you need fewer parameters on the command line and can share the species file across pipelines.

Add the following to the command line to specify species and species file.

-species [SPECIES; hg19, mm9, ...] -species_file [SPECIES_FILE]

You can override any parameter defined in the species file by adding it to the command line or the configuration file. For example, to override the BWA index and the chromosome sizes file:

-species hg19 -species_file my_species.conf -bwa_idx [YOUR_OWN_BWA_IDX] -chrsz [YOUR_OWN_CHR_SIZES_FILE]

An example species file looks like the following. You can define your own species.

[hg19]
chrsz   = /mnt/data/annotations/by_release/hg19.GRCh37/hg19.chrom.sizes  # chromosome sizes
seq     = /mnt/data/ENCODE/sequence/encodeHg19Male                       # genome reference sequence
gensz   = hs                                                             # genome size: hs for human, mm for mouse
umap    = /mnt/data/ENCODE/umap/encodeHg19Male/globalmap_k1tok1000       # uniq. mappability tracks
bwa_idx = /mnt/data/annotations/indexes/bwa_indexes/encodeHg19Male/v0.7.10/encodeHg19Male_bwa-0.7.10.fa
blacklist = /mnt/data/ENCODE/blacklists/wgEncodeDacMapabilityConsensusExcludable.bed.gz

# added for atac-seq pipeline (atac_dnase_pipelines/atac.bds)
bwt2_idx = /mnt/data/annotations/indexes/bowtie2_indexes/bowtie2/ENCODEHg19_male

# added for ATAQC
tss_enrich = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/hg19_RefSeq_stranded.bed.gz
ref_fa  = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/encodeHg19Male.fa  # genome reference fasta
dnase = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/reg2map_honeybadger2_dnase_all_p10_ucsc.bed.gz
prom = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/reg2map_honeybadger2_dnase_prom_p2.bed.gz
enh = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/reg2map_honeybadger2_dnase_enh_p2.bed.gz
reg2map = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/dnase_avgs_reg2map_p10_merged_named.pvals.gz
roadmap_meta = /mnt/lab_data/kundaje/users/dskim89/ataqc/annotations/hg19/eid_to_mnemonic.txt

# your own definition for species

[hg19_custom]
chrsz   = ...
seq     = ...
...

[mm9]
...

[mm10]
...

Description for parameters in a species file.

chrsz               : Chromosome sizes file path (use fetchChromSizes from UCSC tools).
seq                 : Reference genome sequence directory path (where chr*.fa exist).
gensz               : Genome size; hs for human, mm for mouse (default: hs).
umap                : Unique mappability tracks directory path.
bwa_idx             : BWA index (full path prefix of [].bwt file).

# for atac-seq pipeline (atac.bds)
bwt2_idx            : Bowtie2 index (full path prefix of [].1.bt2 file).

# for ATAQC
tss_enrich          : TSS enrichment bed.
ref_fa              : Reference genome sequence fasta.
blacklist           : Blacklist bed for ataqc.
dnase               : DNase bed for ataqc.
prom                : Promoter bed for ataqc.
enh                 : Enhancer bed for ataqc.
reg2map             : Reg2map for ataqc.
roadmap_meta        : Roadmap metadata for ataqc.

Unique mappability tracks are available here. Blacklists are available here.

$ bds [PIPELINE_BDS] -species [SPECIES; hg19, mm9, ...] -species_file [SPECIES_FILE]
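Before starting a long run, it can help to check what a species file defines for a given species. The function below is a minimal sketch of such a lookup: species_get is a hypothetical name, and it assumes the layout shown above ([SPECIES] section headers, key = value lines and # comments). It is not the parser used by the pipeline.

```shell
# Hypothetical helper: print the value of KEY in section [SPECIES].
species_get() {
  # $1 = species file, $2 = species (section name), $3 = key
  awk -v sec="[$2]" -v key="$3" '
    # Drop trailing "# ..." comments and trailing whitespace.
    { sub(/[ \t]*#.*$/, ""); sub(/[ \t]+$/, "") }
    # A section header switches the lookup on or off.
    /^\[.*\]$/ { in_sec = ($0 == sec); next }
    in_sec {
      eq = index($0, "=")
      if (eq == 0) next
      k = substr($0, 1, eq - 1); v = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k); gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key) { print v; found = 1; exit }
    }
    # Exit status 1 when the key is not defined for this species.
    END { exit !found }
  ' "$1"
}
```

For example, species_get my_species.conf hg19 chrsz prints the chromosome sizes path, and the non-zero exit status for a missing key makes it easy to stop a wrapper script early.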

If you want to skip -species_file parameter, define it in the default environment file ./default.env.

[your_hostname] # get it with 'hostname -f'

conda_env     = [CONDA_ENV_NAME; bds_atac for atac, aquas_chipseq for chipseq]
conda_env_py3 = [CONDA_ENV_PY3_NAME; bds_atac_py3 for atac, aquas_chipseq_py3 for chipseq]

species_file = [SPECIES_FILE]

Setting up shell environment

Ignore this section if you have installed dependencies with ./install_dependencies.sh.

It is important to define environment variables (like $PATH) so that the bioinformatics software in the pipeline works properly. mod, shcmd and addpath are three convenient ways to define environment variables. Environment variables defined with them are preloaded for all tasks in the pipeline. For example, if you define environment variables for bwa/0.7.3 with mod, bwa version 0.7.3 will be used throughout the whole pipeline (including bwa aln, bwa samse and bwa sampe).

1) mod

There are different versions of bioinformatics software (e.g. samtools, bedtools and bwa), and Environment Modules is the best way to manage environment variables for them. For example, to add environment variables for bwa 0.7.3 with Environment Modules, you can simply type $ module add bwa/0.7.3. The equivalent setting in the pipeline configuration file looks like:

   mod= bwa/0.7.3;

You can have multiple lines for mod since any suffix is allowed. Use a space (` `) as a delimiter.

   mod_bio= bwa/0.7.3 bedtools/2.x.x samtools/1.2
   mod_lang= r/3.2.2 java/latest

2) shcmd

If you have software installed locally in your home directory, you may need to add it to environment variables like $PATH, $LD_LIBRARY_PATH and so on. IMPORTANT! Any pre-defined environment variable (like $PATH) must be written with curly brackets, like ${PATH}, because BDS uses the curly brackets ${} to distinguish environment variables from BDS variables.

   shcmd= export PATH=${PATH}:path_to_your_program

You can have multiple lines for shcmd since any suffix is allowed. Use ; as a delimiter.

   shcmd_R= export PATH=${PATH}:/home/userid/R-3.2.2;
   shcmd_lib= export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HOME}/R-3.2.2/lib

shcmd is not just for adding environment variables. It can execute any bash shell commands prior to any job in the pipeline. For example, to give all jobs a peaceful 10 seconds before running:

   shcmd_SLEEP_TEN_SECS_FOR_ALL_JOBS= echo "I am sleeping..."; sleep 10

3) addpath

If you just want to add something to your $PATH, use addpath instead of shcmd. It’s much simpler. Use : or ; as a delimiter.

   addpath= ${HOME}/program1/bin:${HOME}/program1/bin:${HOME}/program2/bin:/usr/bin/test

4) conda_env and conda_env_py3

You can also use Anaconda virtual environments in the pipeline. BDS pipelines usually take two conda environments because software based on python3 conflicts with software based on python2. Make sure that the environment given as conda_env has python2 installed and the one given as conda_env_py3 has python3 installed.

   conda_env= [CONDA_ENV_NAME]		# python2 must be installed in this virt. env.
   conda_env_py3= [CONDA_ENV_NAME_FOR_PY3] 	# python3 must be installed in this virt. env.

-mod, -shcmd and -addpath are the command line versions of mod, shcmd and addpath. However, NO SUFFIX is allowed. For example,

$ bds [PIPELINE_BDS] -mod 'bwa/0.7.3; samtools/1.2' -shcmd 'export PATH=${PATH}:/home/userid/R-3.2.2' -addpath '${HOME}/program1/bin' -conda_env my_env -conda_env_py3 my_env_py3
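Whichever mechanism you use, a quick way to confirm that the shell environment is set up is to check that the required programs are found on $PATH. The function below is a sketch: missing_tools is a hypothetical name, and the tool names in the usage line are only examples taken from this section.

```shell
# Hypothetical helper: print every tool from the arguments that is not on $PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, missing_tools bwa samtools bedtools prints nothing when all three are available and prints one name per line otherwise.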

Environment file

It is more convenient to have a separate file that defines your shell environments and cluster resources per hostname. You can also define any other parameters (like the bwa index, number of threads for tasks, fastqs and so on) in the environment file. If no environment file is specified, ./default.env is used.

$ bds [PIPELINE_BDS] ... -env [ENV_FILE]

Shell environment settings can be set up per HOSTNAME ($ hostname -f). This means that you can keep environment configurations for all your clusters in one environment file. One or more hostnames are written in square brackets ([SECTION_NAME]). You can also define a group of hostnames, and you can use a wildcard (only one asterisk!) in a hostname. An example structure looks like the following:

[hostname1]
... parameters ...

[hostname2]
... parameters ...

[hostname3, hostname4]
... parameters ...


[hostname5 : group1]
[hostname6, hostname7: group2]

[group1]
... parameters ...

[group2]
... parameters ...

An example environment file looks like the following. Take a look at ./default.env.

[sherlock*.stanford.edu, sh-*.local] 	# your hostname

mod_any_suffix = bwa/0.7.3 samtools/1.2
addpath_any_suffix = ${HOME}/program1/bin
shcmd_any_suffix = export R_PATH=/home/userid/R-3.2.2

species_file = /path/to/your/species.conf

mem_spp = 4G
wt_spp  = 10:00:00
nice 	= 19 		# sub tasks will have low priority (nice==19).
system 	= slurm 	# SLURM

conda_env = my_conda_env_py2
conda_env_py3 = my_conda_env_py3
...

[other_host_name_or_group]
...
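To see which section would apply on the machine you are logged in to, you can compare the output of hostname -f with the patterns in the section headers. The function below is a sketch that uses shell pattern matching as a stand-in for the wildcard rule; host_matches is a hypothetical name, and the matching in the pipeline itself may differ in details.

```shell
# Hypothetical helper: succeed if the hostname matches any of the patterns.
host_matches() {
  # $1 = hostname, remaining arguments = patterns from one section header
  host=$1; shift
  for pattern in "$@"; do
    case "$host" in
      # $pattern is unquoted on purpose so that "*" acts as a wildcard.
      $pattern) return 0 ;;
    esac
  done
  return 1
}
```

For example, host_matches "$(hostname -f)" 'sherlock*.stanford.edu' 'sh-*.local' && echo "section applies" reports whether the first section of the example file would be used.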

Parameters can be defined in 1) the environment file, 2) the configuration file and 3) command line arguments. They are overridden in the order 1) < 2) < 3), so command line arguments take the highest priority.
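The override order can be pictured as picking the last source that defines a value. The function below is only an illustration of that rule: resolve_param is a hypothetical name, and an empty string stands for "not defined in this source".

```shell
# Hypothetical helper: later sources override earlier ones.
resolve_param() {
  # $1 = value from the environment file
  # $2 = value from the configuration file
  # $3 = value from the command line
  value=$1
  if [ -n "$2" ]; then value=$2; fi
  if [ -n "$3" ]; then value=$3; fi
  echo "$value"
}
```

For example, with mem_spp = 4G in the environment file and -mem_spp 20G on the command line, resolve_param 4G "" 20G prints 20G.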

Setting up Sun Grid Engine

Add the following to grid engine configuration.

$ sudo qconf -mconf
...
execd_params                 ENABLE_ADDGRP_KILL=true
...

Add a parallel environment shm to grid engine configuration. If you already have your own parallel environment, skip this.

$ sudo qconf -ap

pe_name            shm
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
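If you are not sure whether a parallel environment already exists, you can list the configured ones with qconf -spl and look for the name. The helper below is a small sketch that reads such a list from standard input; pe_in_list is a hypothetical name.

```shell
# Hypothetical helper: succeed if the name appears as a whole line on stdin.
pe_in_list() {
  # $1 = parallel environment name, e.g. shm
  # -x matches whole lines only, -F treats the name as a fixed string.
  grep -qxF -- "$1"
}
```

For example, qconf -spl | pe_in_list shm && echo "shm already exists" tells you whether the step above can be skipped.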

Add your parallel environment (shm by default) to your queue and set your shell as bash.

$ sudo qconf -mq [YOUR_MAIN_QUEUE]
...
pe_list               make [YOUR_PE_NAME]
shell                 /bin/bash
...

Correctly configure ./bds.config and copy it to $HOME/.bds/:

sge.pe = [YOUR_PE_NAME] # shm by default
sge.mem = [MAX_MEMORY_TYPE] # h_vmem by default
sge.timeout = [HARD_WALLTIME_TYPE] # h_rt by default
sge.timeout2 = [SOFT_WALLTIME_TYPE] # s_rt by default

More information is available here.

BASH completion for UNIX screens

For automatic BASH completion for screens (http://www.commandlinefu.com/commands/view/12160/bash-auto-complete-your-screen-sessions), add the following to your $HOME/.bashrc:

complete -C "perl -e '@w=split(/ /,\$ENV{COMP_LINE},-1);\$w=pop(@w);for(qx(screen -ls)){print qq/\$1\n/ if (/^\s*\$w/&&/(\d+\.\w+)/||/\d+\.(\$w\w*)/)}'" screen