scAVENGERS pipeline: running the whole pipeline for demultiplexing scATAC-seq data

scAVENGERS pipeline is a wrapper script that runs variant calling, allele count matrix generation, and demultiplexing at once. The pipeline is implemented in snakemake (https://snakemake.github.io/), which is a workflow management system. The pipeline is executed by running the command below.

Usage

To run scAVENGERS pipeline, you must provide path of the configuration file and number of jobs to use as --configfile and --jobs options respectively.

scAVENGERS pipeline --configfile config.yaml -j $THREADS

The final result is given as a tab-seperated file clusters.tsv in the output directory. The structure of the result is in the table below.

column name

description

barcode

barcode sequence

status

status of the barcode sequence: singlet or doublet

assignment

cluster number where the barcode is assigned. Doublets are expressed in a form {n}/{n}.

log_prob_singleton

log singleton probability calculated by troublet

log_prob_doublet

log doublet probabilty calculated by troublet

cluster{n}

log likelihood of the assignment on cluster n.

Configuration file

Through configuration file, you can set parameters for each step. The configuration below is provided in config.yaml in the scAVENGERS repository.

# number of threads to use
# to make this valid, you must set the number of cores when running snakemake pipeline.
# e.g. -j 10
THREADS: 10

# input and output data path
DATA:
  # path to directory for saving output files
  out_dir:
    outdir
  # path to genome sequence file. SHOULD NOT be compressed.
  genome:
    genome.fa
  # path to alignment file
  alignment:
    alignment.bam
  # path to text file containing line-seperated list of barcodes
  barcode:
    barcodes.txt

# cluster.py settings
CLUSTER:
  # number of genotypes in multiplexed sample
  n_genotypes: 10
  # ploidy of the organism of interest
  ploidy: 2
  # error rate correction parameter of probability model
  err_rate: 0.001
  # difference of likelihood to stop the iteration
  stop_criterion: 0.1
  # maximum number of iteration
  max_iter: 1000

# variant caller settings
VARIANT_CALLER:
  # variant caller to use. Choose "freebayes" or "strelka."
  caller: "strelka"
  # minimum quality score of variant, used only if the caller is "freebayes"
  min_qual: 0
  # whether to use low-GQX variants only, used only if the caller is "strelka"
  lowgqx: False

# vartrix settings
VARTRIX:
  # minimum mapping quality of read to use in vartrix
  mapq: 30

# troublet settings
TROUBLET:
  # doublet prior
  doublet_prior: 0.5
  # doublet posterior threshold
  doublet_threshold: 0.9
  # singlet posterior threshold
  singlet_threshold: 0.9