scAVENGERS cluster: demultiplexing scATAC-seq data

Given reference and alternate allele count matrices generated by vartrix in mtx format, scAVENGERS cluster assigns each cell barcode to a donor in an unsupervised manner.

Usage

scAVENGERS cluster writes cluster assignment results in a tab-separated format.

Unless the -o option is specified, the results are written to stdout. To save the demultiplexing results to a file, you can therefore use either of the following:

# Strategy 1: specifying -o option
scAVENGERS cluster -r ref.mtx -a alt.mtx -b barcodes.txt -o clusters_tmp.tsv

# Strategy 2: redirecting the output
scAVENGERS cluster -r ref.mtx -a alt.mtx -b barcodes.txt > clusters_tmp.tsv

Because scAVENGERS cluster does not perform doublet detection, doublet barcodes must be excluded either before or after running it. Note that the output of scAVENGERS cluster is compatible with troublet from the souporcell pipeline, so you can use troublet to detect doublets after demultiplexing.

# Strategy 1: Keeping only singlet barcodes reported by a separate doublet detection tool
# ($SINGLET_BARCODES is a file with one singlet barcode per line)
LC_ALL=C grep -F -f "$SINGLET_BARCODES" clusters_tmp.tsv > clusters.tsv

# Strategy 2: Using troublet as the doublet detection tool
$TROUBLET_DIR/troublet -r ref.mtx -a alt.mtx --clusters clusters_tmp.tsv > clusters.tsv
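
Note that grep -F matches fixed strings anywhere on a line, so a barcode such as ACGT-1 would also match a line for ACGT-10. If that matters for your barcodes, the filter in Strategy 1 can be done with an exact match instead. The following is a minimal sketch, not part of scAVENGERS. It assumes the barcode is the first tab-separated column of the cluster table, and the function names filter_singlets and write_singlets are hypothetical. Like the grep command, it drops any line whose first column is not a listed barcode, including a header line if one is present.

```python
def filter_singlets(cluster_lines, singlet_barcodes):
    """Yield cluster-table lines whose first column is a singlet barcode.

    Assumption: the barcode is the first tab-separated field of each line.
    """
    # Exact-match lookup set; blank lines in the barcode list are ignored.
    singlets = {b.strip() for b in singlet_barcodes if b.strip()}
    for line in cluster_lines:
        barcode = line.rstrip("\r\n").split("\t", 1)[0]
        if barcode in singlets:
            yield line


def write_singlets(clusters_path, singlets_path, out_path):
    """File-based wrapper around filter_singlets (hypothetical helper)."""
    with open(clusters_path) as clusters, \
            open(singlets_path) as singlets, \
            open(out_path, "w") as out:
        out.writelines(filter_singlets(clusters, singlets))
```

For example, write_singlets("clusters_tmp.tsv", "singlets.txt", "clusters.tsv") would play the role of the grep command above, where singlets.txt stands in for the file named by $SINGLET_BARCODES.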

Parameters

scAVENGERS/scAVENGERS cluster --help
usage: cluster.py [-h] -r REF -a ALT [-v VCF] -b BARCODES -o OUTPUT -k CLUSTERS [--priors PRIORS [PRIORS ...]] [--ploidy PLOIDY] [--err_rate ERR_RATE]
                  [--stop_criterion STOP_CRITERION] [--max_iter MAX_ITER] [-t THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -r REF, --ref REF     Reference allele count matrix in mtx format
  -a ALT, --alt ALT     Alternate allele count matrix in mtx format
  -v VCF, --vcf VCF     Vcf file
  -b BARCODES, --barcodes BARCODES
                        Line-seperated text file of barcode sequences
  -o OUTPUT, --output OUTPUT
                        Output directory.
  -k CLUSTERS, --clusters CLUSTERS
                        Number of donors.
  --priors PRIORS [PRIORS ...]
                        Number or proportion of cells in each genotype.
  --ploidy PLOIDY       Ploidy. Defaults to 2.
  --err_rate ERR_RATE   Baseline probability. DO NOT set this parameter zero, because it leads to log-zeros. Defaults to 0.001.
  --stop_criterion STOP_CRITERION
                        log likelihood change to define convergence for EM algorithm
  --max_iter MAX_ITER   number of maximum iterations for a temperature step
  -t THREADS, --threads THREADS
                        number of threads
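
The -r, -a and -b inputs should describe the same variants and cells, so a quick consistency check before a long run can save time. The sketch below is not part of scAVENGERS and uses only the Python standard library. It assumes both matrices are uncompressed MatrixMarket files in coordinate format (size line "rows cols nnz") and that the barcodes index one axis of the matrices; it does not assume which axis. The names mtx_shape, count_barcodes and check_inputs are hypothetical.

```python
def mtx_shape(path):
    """Return (rows, cols, nnz) from the size line of a MatrixMarket file."""
    with open(path) as fh:
        for line in fh:
            # Header and comment lines start with '%'; the first other
            # non-blank line is the size line.
            if line.startswith("%") or not line.strip():
                continue
            rows, cols, nnz = (int(x) for x in line.split()[:3])
            return rows, cols, nnz
    raise ValueError(f"no size line found in {path}")


def count_barcodes(path):
    """Count non-blank lines in a line-separated barcode file."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())


def check_inputs(ref_path, alt_path, barcodes_path):
    """Raise ValueError if the three inputs disagree in shape."""
    ref_rows, ref_cols, _ = mtx_shape(ref_path)
    alt_rows, alt_cols, _ = mtx_shape(alt_path)
    if (ref_rows, ref_cols) != (alt_rows, alt_cols):
        raise ValueError("ref and alt matrices have different dimensions")
    n_barcodes = count_barcodes(barcodes_path)
    # Assumption: barcodes correspond to one axis of the matrices.
    if n_barcodes not in (ref_rows, ref_cols):
        raise ValueError("barcode count matches neither matrix dimension")
    return ref_rows, ref_cols, n_barcodes
```

Calling check_inputs("ref.mtx", "alt.mtx", "barcodes.txt") returns the shared matrix dimensions and the barcode count, or raises an error describing the mismatch.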