Based on the scatter and gather concept:
If we are using HaplotypeCaller for SNP calling on exome data, we can restrict the caller to a particular region with the -L option. We also have a BED file containing the regions we want to call; based on it, the following line of code will produce ~300k command lines to submit to the cluster, whose outputs we can gather later into the final VCF file.
awk 'BEGIN{OFS="\t"}; {print "java64 -Xmx4g -jar /usr/local/apps/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T HaplotypeCaller -R /data/khanlab/ref/GATK/hg19/ucsc.hg19.fasta -minPruning 5 -I "$1".list --dbsnp /data/khanlab/ref/GATK/hg19/dbsnp_137.hg19.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -L "$1":"$2"-"$3" -o "$1"_"$2"_"$3".vcf"}' MY.hg19.bed >CMD_FILE
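Before submitting ~300k jobs it is worth sanity-checking the generated CMD_FILE. The sketch below uses a toy two-region BED and a trimmed-down version of the awk one-liner above (echo stands in for the full java64 invocation, which is region-independent); MY.hg19.bed is assumed to be tab-separated chrom/start/end, as in the scatter command:

```shell
set -eu
cd "$(mktemp -d)"

# Toy stand-in for MY.hg19.bed: two capture regions.
printf 'chr1\t100\t200\nchr2\t50\t80\n' > MY.hg19.bed

# Same scatter idea as above, shortened to the region-specific parts
# (-L interval and per-region -o output name).
awk '{print "echo -L "$1":"$2"-"$3" -o "$1"_"$2"_"$3".vcf"}' MY.hg19.bed > CMD_FILE

wc -l < CMD_FILE   # should equal the number of BED regions
head -1 CMD_FILE   # eyeball the first command before submitting
```

The line count of CMD_FILE should always match the line count of the BED file; if it does not, the BED file likely has blank or malformed lines.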
I submitted this CMD_FILE to the cluster, requesting 8gb of memory per job, using the following command:
swarm -f CMD_FILE -q ccr --singleout --jobarray -g 4 -N HapCall -b 200
It took ~80 hrs to finish all the jobs; if we bundle fewer than 200 commands per job and can get more nodes, we could finish in under 24 hrs. If you instead run this one chromosome at a time on 25 nodes, chr1 alone would take around 500 hrs, which is unacceptable in this era.
The last thing to do once these jobs finish is to concatenate the VCF files in the same order as the regions appear in the BED file.
awk 'BEGIN{OFS="\t"}; {printf("%s ", $1"_"$2"_"$3".vcf");}' SeqCap_EZ_Exome_v3_capture_trimed.hg19.bed >CMD_FILE
The command above generates a space-separated list of all the VCF files we created. Open CMD_FILE, write "cat " at the beginning of it, and run it. Add the header from any one of the VCF files to the output, and your VCF file is ready for VQSR or any other post-processing.
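The manual "prepend cat and run it" step can also be scripted. The sketch below is one possible gather step, assuming the per-region files are named chrom_start_end.vcf as in the scatter command; it uses toy stand-ins for the BED file and the per-region VCFs so it is self-contained, and keeps a single header (the '#'-prefixed lines) from the first region:

```shell
set -eu
cd "$(mktemp -d)"

# Toy stand-ins for the real BED file and per-region HaplotypeCaller outputs.
printf 'chr1\t100\t200\nchr1\t300\t400\n' > MY.hg19.bed
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\nchr1\t150\n' > chr1_100_200.vcf
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\nchr1\t350\n' > chr1_300_400.vcf

# Take the header once, from the first region's VCF.
first=$(awk '{print $1"_"$2"_"$3".vcf"; exit}' MY.hg19.bed)
grep '^#' "$first" > merged.vcf

# Append the records (non-header lines) of every region, in BED order.
awk '{print $1"_"$2"_"$3".vcf"}' MY.hg19.bed | while read -r f; do
    grep -v '^#' "$f" >> merged.vcf || true   # '|| true': tolerate regions with no variants
done
```

Skipping the header of every file after the first matters: a concatenated VCF with repeated headers in the middle will be rejected by most downstream tools, including VQSR.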