LeafCutterMD has two main components:
junc
files (which can be obtained easily from .bam
files)R
code to
Note, for illustrative purposes (and reduced example file sizes) we use the same data as in the differential splicing example, but this data is not a rare disease cohort and thus the final results generated in this vignette are not expected to be indicative of disease causality.
Of course the real “Step 0” is running QC on your RNA-seq samples using e.g. MultiQC. Assuming they look good you need to align reads. For the analysis in the LeafCutter paper we used either OLego, which is designed to be particularly sensitive for finding de novo junctions, or STAR, which is fast as long as you have enough RAM.
For OLego
we used the command
olego -j hg19.intron.hmr.brainmicro.bed -e 6 hg19.fa
where -j
provides a custom junction file, and -e
specifies the required number of nt the read must extend into the exon to be quantified. For more details on the junction file we used see Li et al. 2005.
The STAR
index was generated as
STAR --runMode genomeGenerate --genomeDir hg19index/ --genomeFastaFiles hg19.fa --sjdbGTFfile gencode_v19.gtf --sjdbOverhang 100
(alternatively use one of the prebuilt indices ) and alignment itself was run (with STAR v2.5.2a) as
STAR --genomeDir hg19index/ --twopassMode --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM
As of STAR v2.5.3a you may need to do
STAR --genomeDir hg19index/ --twopassMode Basic --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM Unsorted
We chose 6nt as the default overhang required by LeafCutter. By chance we would expect one match every 46bp, or 4096bp, which appears to be quite likely for any given intron. However, RNA-seq mappers already deal with this problem by 1) assuring that the junction has already been previously annotated or is supported by reads with longer overhang (e.g. in STAR two-pass mode) 2) penalizing non-canonical junctions (i.e. non GT-AG junctions). The effect of the latter is that we would only expect one match every 48bp, or 65,536bp (just one or two every 100kb, the max size allowed for our introns). However, our most restrictive filter is the requirement that reads considered be uniquely mapped. Therefore, even when the overhang is just 6bp, there is no ambiguity in mapping. Moreover, junctions are rarely only supported by reads that have an overhang of 6, when the size of the overhang goes up to 7, 8, or 9nt, the probability that we see these by chance goes down to one in over 4 million bp (for 9nt).
bam
s to junc
sWe provide a helper script scripts/bam2junc.sh
to (you guessed it) convert bam
files to junc
files. This step uses the CIGAR strings in the bam
to quantify the usage of each intron. LeafCutter considers a read “problematic” if its mapped cigar string does not follow the pattern ‘xMyNzM’.
example_data/worked_example.sh
gives you an example of how to do this in batch, assuming your data is in example_geuvadis/
for bamfile in `ls example_geuvadis/*.bam`
do
echo Converting $bamfile to $bamfile.junc
sh ../scripts/bam2junc.sh $bamfile $bamfile.junc
echo $bamfile.junc >> test_juncfiles.txt
done
This step is pretty fast (e.g. a couple of minutes per bam) but if you have samples numbering in the 100s you might want to do this on a cluster. Note that we also make a list of the generated junc
files in test_juncfiles.txt
.
Next we need to define intron clusters using the leafcutter_cluster.py
script. For example:
python ../clustering/leafcutter_cluster.py -j test_juncfiles.txt -m 50 -o testYRIvsEU -l 500000
This will cluster together the introns fond in the junc
files listed in test_juncfiles.txt
, requiring 50 split reads supporting each cluster and allowing introns of up to 500kb. The prefix testYRIvsEU
means the output will be called testYRIvsEU_perind_numers.counts.gz
(perind meaning these are the per individual counts).
You can quickly check what’s in that file with
zcat testYRIvsEU_perind_numers.counts.gz | more
which should look something like this:
RNA.NA06986_CEU.chr1.bam RNA.NA06994_CEU.chr1.bam RNA.NA18486_YRI.chr1.bam RNA.NA06985_CEU.chr1.bam RNA.NA18487_YRI.chr1.bam RNA.NA06989_CEU.chr1.bam RNA.NA06984_CEU.chr1.bam RNA.NA18488_YRI.chr1.bam RNA.NA18489_YRI.chr1.bam RNA.NA18498_YRI.chr1.bam
chr1:17055:17233:clu_1 21 13 18 20 17 12 11 8 15 25
chr1:17055:17606:clu_1 4 11 12 7 2 0 5 2 4 4
chr1:17368:17606:clu_1 127 132 128 55 93 90 68 43 112 137
chr1:668593:668687:clu_2 3 11 1 3 4 4 8 1 5 16
chr1:668593:672093:clu_2 11 16 23 10 3 20 9 6 23 31
Each column corresponds to a different sample (original bam file) and each row to an intron, which are identified as chromosome:intron_start:intron_end:cluster_id.
We can now use our intron count file to do outlier splicing analysis (this assumes you have successfully installed the leafcutter
R package as described under Installation above)
../scripts/leafcutterMD.R --num_threads 8 ../example_data/testYRIvsEU_perind_numers.counts.gz
Running ../scripts/leafcutterMD.R -h
will give usage info for this script.
Three tab-separated text files are output:
leafcutter_outlier_pVals.txt
. This file has introns as rows and samples as columns with entries that are p
-values (unadjusted) for there being outlier intron excision in the corresponding sample/intron.
leafcutter_outlier_clusterPvals.txt
. This file has clusters as rows and samples as columns with entries that are p
-values (unadjusted) for there being outlier intron excision in the corresponding sample/cluster.
leafcutter_outlier_cluster_effSize.txt
. This file has introns as rows and samples as columns with entries that are the effect sizes (i.e., estimated difference in fractional usage of the intron compared to the average in the population) in the corresponding sample/intron.