Thursday, December 19, 2013

Browser based Shell connection

Most of the time I need to log in to the server to submit jobs, check the status of running jobs, or do some programming; at my desk I have MTPuTTY for this purpose. But what if you are on vacation and want a quick look at the jobs you are running, or want to submit a new one? I started googling and found exactly what I was looking for: "A web browser based secure shell connection extension for Google Chrome".


Install it and give it a shot; it's pretty impressive. You can customize it (Tools -> Extensions -> Options in the Secure Shell extension).

Monday, December 2, 2013

File checking in Perl

A list of checks you can perform on a file in Perl.
Commonly used checks are in bold font:
-r      File is readable by effective uid/gid.
-w      File is writable by effective uid/gid.
-x      File is executable by effective uid/gid.
-o      File is owned by effective uid.
-R      File is readable by real uid/gid.
-W      File is writable by real uid/gid.
-X      File is executable by real uid/gid.
-O      File is owned by real uid.
-e      File exists.
-z      File has zero size (is empty).
-s      File has nonzero size (returns size in bytes).
-f      File is a plain file.
-d      File is a directory.
-l      File is a symbolic link.
-p      File is a named pipe (FIFO), or Filehandle is a pipe.
-S      File is a socket.
-b      File is a block special file.
-c      File is a character special file.
-t      Filehandle is opened to a tty.
-u      File has setuid bit set.
-g      File has setgid bit set.
-k      File has sticky bit set.
-T      File is an ASCII text file (heuristic guess).
-B      File is a "binary" file (opposite of -T).
-M      Script start time minus file modification time, in days.
-A      Same for access time.
-C      Same for inode change time (Unix, may differ for other platforms)

Split Bam files

In the GATK local realignment (LR) process it is recommended to use as many BAMs as possible so that the samples help each other during the process. At the same time, MuTect requires two separate BAM files for blood and tumor. This dilemma requires splitting the BAM file generated after the LR/BQSR process into sample-level BAM files.

The following Python code takes a BAM file, splits it based on the sample read-group ID, and indexes the newly generated BAM files.


To denote that the new BAM files are already locally realigned and base-quality score recalibrated, I add "BQSR.LR" to the names of the new BAMs I generate, which can be removed/replaced if you don't need it.
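The Python script itself did not survive in this post; as a sketch of the same idea, the snippet below just generates per-sample samtools commands from a list of read-group sample IDs (samples.txt, input.bam, and the sample names are illustrative; samtools view -r selects reads by read group):

```shell
# Hypothetical list of read-group sample IDs to split on
printf 'Blood\nTumor\n' > samples.txt

# Generate one "extract + index" command per sample, tagged BQSR.LR
awk '{print "samtools view -bh -r " $1 " input.bam -o " $1 ".BQSR.LR.bam && samtools index " $1 ".BQSR.LR.bam"}' samples.txt > split_cmds.sh
cat split_cmds.sh
```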

Tuesday, November 19, 2013

BEDTools Documentation

BEDTools allows a fast and flexible way of comparing large datasets of genomic features. The BEDTools utilities let one address common genomics tasks such as finding feature overlaps and computing coverage. The name is historical: nowadays the tools can process the most commonly used feature file formats, such as BED, GFF, VCF, and SAM. The following are examples of common questions that one can address with BEDTools:
  • Which SNPs are in a coding region?
  • Which are the exonic and intronic coverages?
  • How many positions have a coverage greater than 8?
  • Which SNPs are shared by two predictions done by two different SNP callers?
The following notes are partially taken from the BEDTools manual. It is important to read the manual before applying BEDTools to a real problem.

Summary of the tools

Some of the tools included are:
intersectBed
Returns overlapping features between two BED/GFF/VCF/BAM files.
windowBed
Returns overlapping features between two BED/GFF/VCF/BAM files within a “window”.
closestBed
Returns the closest feature to each entry in a BED/GFF/VCF file.
coverageBed
Summarizes the depth and breadth of coverage of features in one BED/GFF/BAM file (e.g., aligned reads) relative to another (e.g., user-defined windows).
genomeCoverageBed
Histogram or a “per base” report of genome coverage.
subtractBed
Removes the portion of an interval that is overlapped by another feature.
mergeBed
Merges overlapping features into a single feature.
fastaFromBed
Creates FASTA sequences from BED/GFF intervals.
maskFastaFromBed
Masks a FASTA file based upon BED/GFF coordinates.
overlap
Computes the amount of overlap (positive values) or distance (negative values) between genome features.

Examples are here.
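As a toy illustration of what intersectBed computes (this is NOT BEDTools itself, just a naive awk sketch that ignores the chromosome column and assumes half-open BED coordinates):

```shell
# Two tiny illustrative BED files
printf 'chr1\t100\t101\tsnp1\nchr1\t250\t251\tsnp2\n' > snps.bed
printf 'chr1\t90\t200\texon1\n' > exons.bed

# Report each SNP that overlaps an exon (naive O(n*m) scan, single chromosome)
awk 'NR==FNR { s[NR]=$2; e[NR]=$3; name[NR]=$4; n=NR; next }
     { for (i=1; i<=n; i++) if ($2 < e[i] && $3 > s[i]) print $4, "overlaps", name[i] }' \
    exons.bed snps.bed > overlaps.txt
cat overlaps.txt
# -> snp1 overlaps exon1
```

With BEDTools installed, the real command would be along the lines of `intersectBed -a snps.bed -b exons.bed`.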

Friday, November 15, 2013

GATK Unified Genotyper Parallel Running

How to generate commands for running UG on a list of BAM files chromosome by chromosome, so that we can reduce the time it takes to run on the cluster.

Usual command.


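The command itself did not survive in this post; a representative invocation would look like the following (the jar path, reference, BAM list, and thresholds are illustrative, written to a script file here so the example stays self-contained):

```shell
# Write the illustrative chr1 command to a script (paths are placeholders)
cat > ug_chr1.sh <<'EOF'
java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
  -R ucsc.hg19.fasta -I bams.list --dbsnp dbsnp_137.hg19.vcf \
  -stand_call_conf 50.0 -stand_emit_conf 10.0 \
  -L chr1 -o UnifiedGenotyper.chr1.vcf
EOF
cat ug_chr1.sh
```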
The above command will generate a VCF file containing calls on chr1 across 120 samples.
The following will generate commands for all the chromosomes:
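The generating loop is not preserved above; a minimal sketch (jar path and inputs are illustrative):

```shell
# One UnifiedGenotyper command per chromosome: chr1-22, X, Y, M (25 total)
for c in $(seq 1 22) X Y M; do
  echo "java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ucsc.hg19.fasta -I bams.list --dbsnp dbsnp_137.hg19.vcf -L chr$c -o UnifiedGenotyper.chr$c.vcf"
done > CMD_FILE
wc -l CMD_FILE
```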



After running on a 1 GB, 1-core node the job ran out of memory; after bumping the memory up, it ran comfortably on 2 GB, 1-core nodes.

Here is how to combine the VCF files:
for i in `seq 1 22` X Y M; do cat UnifiedGenotyper.chr$i.vcf; done | grep -v "^#" > vcfData

Add the header:
grep "^#" UnifiedGenotyper.chr1.vcf > header; cat header vcfData > UnifiedGenotyper.vcf; rm -f header vcfData

Index it using igvtools and you are ready to go.

Tuesday, November 12, 2013

PBS Message in case of running out of memory

==================================================================================================
||  NOTE: this job was likely deleted by the batch system due to exceeding available memory.   ||
==================================================================================================


If you are not sure whether the job ran successfully, it is always better to grep for the above string in the .o file to make sure it did not run out of memory.
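For example (job.o12345 is a made-up PBS output file):

```shell
# Fake a PBS .o file containing the out-of-memory notice
echo "NOTE: this job was likely deleted by the batch system due to exceeding available memory." > job.o12345

# List every .o file whose job was killed for exceeding memory
grep -l "exceeding available memory" *.o*
```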


Friday, November 8, 2013

my .bashrc file

Must Have
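The actual file is not reproduced here; a minimal sketch of the kind of entries I consider must-haves (adjust to taste):

```shell
# History settings: keep a long, duplicate-free history
export HISTSIZE=10000
export HISTCONTROL=ignoredups

# Convenience aliases
alias ll='ls -lh'               # long listing, human-readable sizes
alias grep='grep --color=auto'  # highlight matches
alias ..='cd ..'                # jump to the parent directory
```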


Files / Directories / File Size Limits

Found this on stackoverflow.
FAT32:
  • Maximum number of files: 268,435,437
  • Maximum file size: 4GB
  • Maximum number of files per directory: up to 65,535, or fewer depending on file names
NTFS:
  • Maximum number of files: 4,294,967,295
  • Maximum file size: 16TB currently (16EB theoretically)
Ext2:
  • Maximum number of files: 10¹⁸
  • Maximum file size: 2TB
  • theoretical file per directory limit: 1.3 × 10²⁰ files
Ext3:
  • Maximum number of files: number of bytes in volume/2¹³.
  • Maximum file size: 16GB (1KB block) to 2TB (4KB block)


Parallelism in Haplotype Caller GATK

Based on the scatter and gather concept:

If we are using HaplotypeCaller for SNP calling on exome data, we can restrict the caller to a particular region using the -L option. We also have a BED file containing the regions we want to call on; based on it, the following line of code will produce ~300k command lines to submit to the cluster, which we can gather later to get the final VCF file.


awk 'BEGIN{OFS="\t"}; {print "java64 -Xmx4g -jar /usr/local/apps/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T HaplotypeCaller -R /data/khanlab/ref/GATK/hg19/ucsc.hg19.fasta -minPruning 5 -I "$1".list --dbsnp /data/khanlab/ref/GATK/hg19/dbsnp_137.hg19.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -L "$1":"$2"-"$3" -o "$1"_"$2"_"$3".vcf"}' MY.hg19.bed >CMD_FILE

I submitted this CMD_FILE to the cluster on nodes requesting 8gb memory per job, using following command:

swarm -f CMD_FILE -q ccr --singleout --jobarray -g 4 -N HapCall -b 200

It took ~80 hrs to finish all the jobs; if we bundle fewer than 200 commands per job and can get more nodes, we could finish within 24 hrs. If you run this job one chromosome at a time using 25 nodes, chr1 alone would take around 500 hrs, which is unacceptable in this era.

One last thing we have to do once this job finishes is to concatenate the vcf files in the same order  as the bed regions appear in the bed file.

awk 'BEGIN{OFS="\t"}; {printf("%s ", $1"_"$2"_"$3".vcf");}' SeqCap_EZ_Exome_v3_capture_trimed.hg19.bed >CMD_FILE

The above command will generate a space-separated list of all the VCF files we created. Open CMD_FILE, write "cat " at the beginning of it, and run it. Add the header from any of the VCF files to the output, and your VCF file is ready for VQSR or any post-processing.
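Putting the concatenation together end to end (the BED and VCF files below are tiny stand-ins for the real per-region outputs):

```shell
# Stand-in BED regions and their per-region VCFs
printf 'chr1\t10\t20\nchr1\t30\t40\n' > regions.bed
printf '#CHROM\tPOS\nchr1\t15\n' > chr1_10_20.vcf
printf '#CHROM\tPOS\nchr1\t35\n' > chr1_30_40.vcf

# Header from the first VCF, then bodies in BED order
grep "^#" chr1_10_20.vcf > combined.vcf
awk '{print $1"_"$2"_"$3".vcf"}' regions.bed | xargs grep -hv "^#" >> combined.vcf
cat combined.vcf
```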


Thursday, November 7, 2013

One Liners

Numerical sort using Perl.


Print HASH


Count number of columns in a file
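A common awk idiom for this (assumes whitespace/tab-delimited fields; use -F',' for CSV):

```shell
printf 'a\tb\tc\nd\te\tf\n' > table.txt
awk '{print NF; exit}' table.txt   # NF = number of fields on the first line -> 3
```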