DECA User Guide

Introduction

DECA is a copy number variant caller built on top of Apache Spark to allow rapid variant calling in cluster and cloud computing environments. DECA is built on ADAM’s APIs and is a reimplementation of the XHMM copy number variant caller. DECA provides an order of magnitude performance improvement over XHMM when running on a single machine. When running on a 1,024 core cluster, DECA can call copy number variants from the 1,000 Genomes exome reads in approximately 5 hours. DECA is highly concordant with XHMM, with >93% exact breakpoint concordance and <0.01% discordant CNV calls.

Running DECA

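DECA is run through the deca-submit command line script. Invoking the script without arguments prints its usage message and the list of available commands: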

./bin/deca-submit
Using SPARK_SUBMIT=/usr/local/bin/spark-2.2.1-bin-hadoop2.7/bin/spark-submit

Usage: deca-submit [<spark-args> --] <deca-args> [-version]

Choose one of the following commands:

normalize : Normalize XHMM read-depth matrix
coverage : Generate XHMM read depth matrix from read data
discover : Call CNVs from normalized read matrix
normalize_and_discover : Normalize XHMM read-depth matrix and discover CNVs
cnv : Discover CNVs from raw read data
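The usage line above places optional Spark arguments before a "--" separator and the DECA command after it. The following is a minimal sketch of that layout, written as a dry run that only prints the command it would execute. The helper name build_deca_cmd and the DECA_SUBMIT variable are hypothetical and not part of DECA; --master and --driver-memory are standard spark-submit options used here purely for illustration. Command-specific DECA options are omitted because they are not listed in the usage output shown here.

```shell
# Path to the deca-submit script; DECA_SUBMIT is an illustrative variable,
# not something DECA itself defines.
DECA_SUBMIT="${DECA_SUBMIT:-./bin/deca-submit}"

# Hypothetical helper: print a deca-submit command line.
#   $1      DECA command (e.g. coverage, normalize, discover, cnv)
#   $2...   optional Spark arguments, placed before the "--" separator
build_deca_cmd() {
  deca_command="$1"
  shift
  if [ "$#" -gt 0 ]; then
    # "$*" joins the remaining (Spark) arguments with single spaces.
    printf '%s %s -- %s\n' "$DECA_SUBMIT" "$*" "$deca_command"
  else
    # No Spark arguments, so the "--" separator is not needed.
    printf '%s %s\n' "$DECA_SUBMIT" "$deca_command"
  fi
}

# Dry run: show how Spark options and the DECA command are ordered.
build_deca_cmd cnv --master 'local[4]' --driver-memory 8G
build_deca_cmd discover
```

Printing the command rather than running it makes it easy to check the argument order before submitting a job. To run a real job, the printed line would still need the options of the chosen DECA command appended after the command name.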

The deca-submit script follows the same conventions as the adam-submit command line, which are described in the ADAM documentation. As a result, just like ADAM, DECA can be deployed on a local machine, on AWS, on an in-house cluster running YARN or SLURM, or using Toil. We provide a Toil workflow for running DECA as part of the bdgenomics.workflows package, which can be installed with pip.