Tuesday, November 19, 2013

BEDTools Documentation

BEDTools provides a fast and flexible way of comparing large datasets of genomic features. The BEDTools utilities let one address common genomics tasks such as finding feature overlaps and computing coverage. The name is historical: nowadays the tools can process the most commonly used feature file formats, including BED, GFF, VCF, and SAM. The following are examples of common questions that one can address with BEDTools:
  • Which SNPs are in a coding region?
  • Which are the exonic and intronic coverages?
  • How many positions have a coverage greater than 8?
  • Which SNPs are shared by two predictions done by two different SNP callers?
The following notes are partially taken from the BEDTools manual. It is important to read that manual before applying BEDTools to a real problem.

Summary of the tools

Some of the tools included are:
intersectBed
Returns overlapping features between two BED/GFF/VCF/BAM files.
windowBed
Returns overlapping features between two BED/GFF/VCF/BAM files within a “window”.
closestBed
Returns the closest feature to each entry in a BED/GFF/VCF file.
coverageBed
Summarizes the depth and breadth of coverage of features in one BED/GFF/BAM file (e.g., aligned reads) relative to another (e.g., user-defined windows).
genomeCoverageBed
Histogram or a “per base” report of genome coverage.
subtractBed
Removes the portion of an interval that is overlapped by another feature.
mergeBed
Merges overlapping features into a single feature.
fastaFromBed
Creates FASTA sequences from BED/GFF intervals.
maskFastaFromBed
Masks a FASTA file based upon BED/GFF coordinates.
overlap
Computes the amount of overlap (positive values) or distance (negative values) between genome features.


Friday, November 15, 2013

GATK Unified Genotyper Parallel Running

How to generate commands for running UG on a list of BAM files chromosome-wise, so that we can reduce the time it takes to run on the cluster.

Usual command.


The above command will generate a VCF file containing calls on chr1 for 120 samples.
The following will generate commands for all the chromosomes:


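As a rough sketch, a loop like the following can emit one UnifiedGenotyper command per chromosome; the jar path, reference, memory setting, and the bam.list filename here are placeholders, not the original values:

```shell
# Sketch: generate one UnifiedGenotyper command line per chromosome
# (jar path, reference, and bam.list are assumed names, not the originals).
for i in $(seq 1 22) X Y M; do
  echo "java -Xmx2g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ucsc.hg19.fasta -I bam.list -L chr$i -o UnifiedGenotyper.chr$i.vcf"
done > UG_commands.txt
```

Each line of UG_commands.txt can then be submitted as an independent cluster job.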

On a 1 GB, 1-core node the job ran out of memory; after bumping the memory up, it ran comfortably on 2 GB, 1-core nodes.

Here is how to combine the VCF files:
for i in `seq 1 22` X Y M; do cat UnifiedGenotyper.chr$i.vcf; done | grep -v "^#" > vcfData

Add the header:
grep "^#" UnifiedGenotyper.chr1.vcf > head; cat head vcfData > UnifiedGenotyper.vcf; rm -f head vcfData

Index it using igvtools and you are ready to go.

Tuesday, November 12, 2013

PBS Message in case of running out of memory

==================================================================================================
   ||   NOTE: this job was likely deleted by the batch system due to exceeding available memory.   ||
==================================================================================================


If you are not sure whether the job ran successfully, it is always better to grep for the above string in the .o file to make sure it did not run out of memory.
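A quick way to do that check (the .o filename below is a demo file created for illustration; point grep at your real PBS output files):

```shell
# Simulate a PBS .o file containing the out-of-memory note (demo file only).
printf 'normal job output\nNOTE: this job was likely deleted by the batch system due to exceeding available memory.\n' > demo_job.o12345

# grep -l prints only the names of files containing the telltale string.
grep -l "exceeding available memory" demo_job.o12345
```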


Friday, November 8, 2013

my .bashrc file

Must Have


# Files/ directory/ File Size

Found this on stackoverflow.
FAT32:
  • Maximum number of files: 268,435,437
  • Maximum file size: 4GB
  • Maximum number of files per directory: up to 65,535, or fewer depending on file names
NTFS:
  • Maximum number of files: 4,294,967,295
  • Maximum file size: 16TB currently (16EB theoretically)
Ext2:
  • Maximum number of files: 10¹⁸
  • Maximum file size: 2TB
  • Theoretical files-per-directory limit: 1.3 × 10²⁰ files
Ext3:
  • Maximum number of files: number of bytes in volume/2¹³.
  • Maximum file size: 16GB (1KB block) to 2TB (4KB block)
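To see which filesystem you are actually on and what limits apply locally, standard tools can be queried directly (output varies by system):

```shell
# Query the filesystem holding the current directory and its local limits.
df -T .                 # filesystem type and usage for the current directory
getconf NAME_MAX .      # maximum filename length on this filesystem
getconf PATH_MAX .      # maximum relative pathname length here
```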


Parallelism in Haplotype Caller GATK

Based on the scatter and gather concept:

If we are using HaplotypeCaller for SNP calling on exome data, we can restrict the caller to a particular region using the -L option. We also have a BED file containing the regions we want to call in; based on it, the two following lines of code will produce and submit ~300k command lines to the cluster, which we can gather later to get the final VCF file.


awk 'BEGIN{OFS="\t"}; {print "java64 -Xmx4g -jar /usr/local/apps/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T HaplotypeCaller -R /data/khanlab/ref/GATK/hg19/ucsc.hg19.fasta -minPruning 5 -I "$1".list --dbsnp /data/khanlab/ref/GATK/hg19/dbsnp_137.hg19.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -L "$1":"$2"-"$3" -o "$1"_"$2"_"$3".vcf"}' MY.hg19.bed >CMD_FILE

I submitted this CMD_FILE to the cluster on nodes requesting 8gb memory per job, using the following command:

swarm -f CMD_FILE -q ccr --singleout --jobarray -g 4 -N HapCall -b 200

It took ~80 hrs to finish all the jobs, and if we bundle them into fewer than 200 per bundle and can get more nodes, we can finish even within 24 hrs. If you instead run this job one chromosome at a time acquiring 25 nodes, chr1 alone is going to take around 500 hrs, which is unacceptable in this era.

One last thing we have to do once the jobs finish is to concatenate the VCF files in the same order as the regions appear in the BED file.

awk 'BEGIN{OFS="\t"}; {printf("%s ", $1"_"$2"_"$3".vcf");}' SeqCap_EZ_Exome_v3_capture_trimed.hg19.bed >CMD_FILE

The above command will generate a space-separated list of all the VCF files we created. Open CMD_FILE, write cat at the beginning of it, and run it. Add the header from any of the VCF files to the output, and your VCF file is ready for VQSR or any other post-processing.
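The gather step can also be scripted directly from the BED file. A sketch, with tiny demo inputs standing in for the real BED file and per-region VCFs:

```shell
# Demo inputs; replace with your real bed file and per-region VCFs.
printf 'chr1\t100\t200\nchr1\t300\t400\n' > regions.bed
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\nchr1\t150\n' > chr1_100_200.vcf
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\nchr1\t350\n' > chr1_300_400.vcf

# Header from the first region's vcf, then body lines from each region in bed order.
{
  grep "^#" chr1_100_200.vcf
  while read -r chrom start end _; do
    awk '!/^#/' "${chrom}_${start}_${end}.vcf"
  done < regions.bed
} > combined.vcf
```

This keeps the records in BED order without any manual editing of CMD_FILE.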


Thursday, November 7, 2013

One Liners

Numerical sort using Perl.


Print HASH


Count number of columns in a file
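One common way (a sketch; the input file is a demo) is awk's NF, which holds the number of fields in the current line:

```shell
# Count columns in the first line of a whitespace-delimited file.
printf 'a b c\n1 2 3\n' > demo_cols.txt
awk '{ print NF; exit }' demo_cols.txt
```

For tab-delimited files, add -F'\t' so awk splits on tabs only.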