scAVENGERS pipeline: running the whole pipeline for demultiplexing scATAC-seq data
scAVENGERS pipeline is a wrapper script that runs variant calling, allele count matrix generation, and demultiplexing at once. The pipeline is implemented in snakemake (https://snakemake.github.io/), which is a workflow management system. The pipeline is executed by running the command below.
Usage
To run scAVENGERS pipeline, you must provide path of the configuration file and number of jobs to use as --configfile and --jobs options respectively.
scAVENGERS pipeline --configfile config.yaml -j $THREADS
The final result is given as a tab-seperated file clusters.tsv in the output directory. The structure of the result is in the table below.
column name |
description |
|---|---|
barcode |
barcode sequence |
status |
status of the barcode sequence: singlet or doublet |
assignment |
cluster number where the barcode is assigned. Doublets are expressed in a form {n}/{n}. |
log_prob_singleton |
log singleton probability calculated by troublet |
log_prob_doublet |
log doublet probabilty calculated by troublet |
cluster{n} |
log likelihood of the assignment on cluster n. |
Configuration file
Through configuration file, you can set parameters for each step. The configuration below is provided in config.yaml in the scAVENGERS repository.
# number of threads to use
# to make this valid, you must set the number of cores when running snakemake pipeline.
# e.g. -j 10
THREADS: 10
# input and output data path
DATA:
# path to directory for saving output files
out_dir:
outdir
# path to genome sequence file. SHOULD NOT be compressed.
genome:
genome.fa
# path to alignment file
alignment:
alignment.bam
# path to text file containing line-seperated list of barcodes
barcode:
barcodes.txt
# cluster.py settings
CLUSTER:
# number of genotypes in multiplexed sample
n_genotypes: 10
# ploidy of the organism of interest
ploidy: 2
# error rate correction parameter of probability model
err_rate: 0.001
# difference of likelihood to stop the iteration
stop_criterion: 0.1
# maximum number of iteration
max_iter: 1000
# variant caller settings
VARIANT_CALLER:
# variant caller to use. Choose "freebayes" or "strelka."
caller: "strelka"
# minimum quality score of variant, used only if the caller is "freebayes"
min_qual: 0
# whether to use low-GQX variants only, used only if the caller is "strelka"
lowgqx: False
# vartrix settings
VARTRIX:
# minimum mapping quality of read to use in vartrix
mapq: 30
# troublet settings
TROUBLET:
# doublet prior
doublet_prior: 0.5
# doublet posterior threshold
doublet_threshold: 0.9
# singlet posterior threshold
singlet_threshold: 0.9