scAVENGERS cluster: demultiplexing scATAC-seq data
scAVENGERS cluster assigns cell barcodes to donors in an unsupervised manner, given the reference and alternate allele count matrices generated by vartrix in mtx format.
Usage
scAVENGERS cluster writes cluster assignment results in a tab-separated format.
Running this minimal command, with $N_DONORS set to the number of donors, will produce results under the directory $OUTDIR.
scAVENGERS cluster -r ref.mtx -a alt.mtx -b barcodes.txt -o $OUTDIR -k $N_DONORS
Because scAVENGERS cluster does not perform doublet detection, doublet barcodes must be excluded before or after running it. The output of scAVENGERS cluster is compatible with troublet from the souporcell pipeline, so you can use troublet to detect doublets after demultiplexing.
# Strategy 1: Keeping only the singlet barcodes reported by a separate doublet detection tool
LC_ALL=C grep -F -f "$SINGLET_BARCODES" clusters.tsv > clusters.final.tsv
# Strategy 2: Using troublet as the doublet detection tool
$TROUBLET_DIR/troublet -r ref.mtx -a alt.mtx --clusters clusters.tsv > clusters.final.tsv
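Strategy 1 can also be sketched in Python. The version below is a minimal illustration, not part of scAVENGERS: the function name `filter_singlets` and the toy barcodes are made up, and it assumes the singlet list holds one barcode per line. Unlike `grep -F`, which keeps any line containing a listed string anywhere, it compares the first column exactly.

```python
def filter_singlets(cluster_lines, singlet_barcodes):
    """Keep lines whose first tab-separated field is a listed singlet barcode."""
    # Hypothetical helper: strip newlines and skip blank lines in the barcode list.
    keep = {b.strip() for b in singlet_barcodes if b.strip()}
    return [
        line for line in cluster_lines
        if line.split("\t", 1)[0].strip() in keep
    ]


# Toy data (invented values) laid out as barcode, cluster, log likelihoods.
demo_clusters = [
    "AAAC-1\t0\t-10.5\t-42.0\n",
    "AAAG-1\t1\t-40.1\t-9.8\n",
    "AAAT-1\t0\t-12.0\t-35.3\n",
]
demo_singlets = ["AAAC-1\n", "AAAT-1\n"]

kept = filter_singlets(demo_clusters, demo_singlets)
```

With real files, open file handles can be passed in place of the lists. Any line whose first field is not a listed barcode, including a header line if one exists, is dropped.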
Results
Output files
Three files are generated under the designated output directory.
| name | description |
|---|---|
| clusters.tsv | cluster result file |
| gt_matrix.npz | donor-variant matrix of alternative allele counts |
| variant_index.npz | indices of variants used for demultiplexing |
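The array names stored inside the two `.npz` files are not documented here, so it may help to list them before loading anything. A `.npz` file is a zip archive with one `.npy` member per array, which the standard library can read. The sketch below relies only on that; the helper name `list_npz_arrays` and the toy archive are assumptions for illustration.

```python
import os
import tempfile
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive without loading them."""
    with zipfile.ZipFile(path) as archive:
        return [
            name[:-4] if name.endswith(".npy") else name
            for name in archive.namelist()
        ]


# Toy archive with a placeholder member; real output files hold actual arrays.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.npz")
    with zipfile.ZipFile(toy_path, "w") as archive:
        archive.writestr("indices.npy", b"")  # placeholder, not a valid array
    toy_names = list_npz_arrays(toy_path)
```

Once the names are known, `numpy.load("gt_matrix.npz")` returns an object whose `.files` attribute lists the same names and which can be indexed by them. Whether the matrix is stored densely or as a SciPy sparse matrix is not stated here, so check the member names first.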
Structure of cluster result file
The cluster result file clusters.tsv has one row per barcode, with tab-separated columns.
The first column contains the barcode sequence.
The second column contains the assigned cluster.
The third and subsequent columns contain the log likelihood for each cluster, one column per cluster.
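The column layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the file is assumed to have no header line, the names `Assignment` and `parse_cluster_line` are hypothetical, and the toy rows are invented. The cluster field is kept as a string so that labels rewritten by a downstream tool still parse.

```python
from collections import Counter
from typing import List, NamedTuple


class Assignment(NamedTuple):
    barcode: str
    cluster: str
    log_likelihoods: List[float]


def parse_cluster_line(line: str) -> Assignment:
    """Split one row into barcode, cluster, and per-cluster log likelihoods."""
    fields = line.rstrip("\n").split("\t")
    return Assignment(fields[0], fields[1], [float(x) for x in fields[2:]])


# Toy rows (invented values) for a two-donor run.
demo_rows = [
    "AAAC-1\t0\t-10.5\t-42.0\n",
    "AAAG-1\t1\t-40.1\t-9.8\n",
    "AAAT-1\t0\t-12.0\t-35.3\n",
]

assignments = [parse_cluster_line(row) for row in demo_rows]
cells_per_cluster = Counter(a.cluster for a in assignments)
```

Counting cells per cluster this way is a quick sanity check that the number of non-empty clusters matches the expected number of donors.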
Parameters
scAVENGERS/scAVENGERS cluster --help
usage: cluster.py [-h] -r REF -a ALT [-v VCF] -b BARCODES -o OUTPUT -k CLUSTERS [--priors PRIORS [PRIORS ...]]
[--ploidy PLOIDY] [--coverage COVERAGE] [--err_rate ERR_RATE] [--stop_criterion STOP_CRITERION]
[--max_iter MAX_ITER] [-t THREADS]
optional arguments:
-h, --help show this help message and exit
-r REF, --ref REF Reference allele count matrix in mtx format
-a ALT, --alt ALT Alternate allele count matrix in mtx format
-v VCF, --vcf VCF Vcf file
-b BARCODES, --barcodes BARCODES
Line-seperated text file of barcode sequences
-o OUTPUT, --output OUTPUT
Output directory.
-k CLUSTERS, --clusters CLUSTERS
Number of donors.
--priors PRIORS [PRIORS ...]
Number or proportion of cells in each genotype.
--ploidy PLOIDY Ploidy. Defaults to 2.
--coverage COVERAGE Minimum coverage of variant to use for clustering
--err_rate ERR_RATE Baseline probability. DO NOT set this parameter zero, because it leads to log-zeros. Defaults to
0.001.
--stop_criterion STOP_CRITERION
log likelihood change to define convergence for EM algorithm
--max_iter MAX_ITER number of maximum iterations for a temperature step
-t THREADS, --threads THREADS
number of threads
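The warning on --err_rate can be illustrated with a toy calculation. The sketch below is not the likelihood used by scAVENGERS. It assumes a simplified model in which the probability of seeing an alternate-allele read is the expected alternate fraction of a genotype, kept at least `err_rate` away from 0 and 1. The function name `log_prob_alt_read` is hypothetical.

```python
import math


def log_prob_alt_read(expected_alt_fraction: float, err_rate: float) -> float:
    """Log probability of one alternate read under a toy, error-floored model."""
    # Keep the probability at least err_rate away from 0 and 1.
    p = min(max(expected_alt_fraction, err_rate), 1.0 - err_rate)
    # math.log(0.0) raises ValueError, so report the log-zero explicitly.
    return math.log(p) if p > 0.0 else float("-inf")


# A homozygous-reference genotype (expected alt fraction 0) that still shows an alt read.
with_floor = log_prob_alt_read(0.0, 0.001)    # finite, equals log(0.001)
without_floor = log_prob_alt_read(0.0, 0.0)   # log of zero
```

With a nonzero error rate, a single unexpected read only lowers the likelihood. With an error rate of zero, the same read makes the log likelihood negative infinity.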