Background#

Motivation for this Pipeline#

Genetic ancestry is an important piece of information when assessing genetic and transcriptomic data across multiple individuals.

Often, single-cell RNA-sequencing (scRNA-seq) data are produced from samples whose donors' personal information is unknown to the researchers. For example, ancestral information is often not ascertained at sample collection, and self-reporting can sometimes be misleading. Single nucleotide polymorphism (SNP) genotyping of excess sample can be used to estimate ancestry. However, excess sample is not always available, especially when using publicly available data.

With this in mind, we established this pipeline to estimate genetic ancestry from scRNA-seq data by:

  1. Estimating genetic information from the scRNA-seq reads (using freebayes)

  2. Aligning the samples to 1000 Genomes principal component (PC) space

  3. Predicting the ancestry by training a k-nearest neighbors model on the 1000 Genomes data
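To give a feel for the last step, here is a minimal sketch of k-nearest neighbors classification in PC space. It is not the pipeline's own code: the function name `knn_predict`, the choice of `k=3`, and the toy PC coordinates are all hypothetical and chosen only for illustration. The labels (AFR, EAS, EUR) are three of the 1000 Genomes superpopulation codes. The sketch assumes the reference individuals and the query sample have already been projected onto the same principal components.

```python
import math
from collections import Counter


def knn_predict(ref_pcs, ref_labels, query_pc, k=3):
    """Predict a label for query_pc by majority vote among its k nearest
    reference points, using Euclidean distance in PC space."""
    if len(ref_pcs) != len(ref_labels):
        raise ValueError("ref_pcs and ref_labels must have the same length")
    if not 1 <= k <= len(ref_pcs):
        raise ValueError("k must be between 1 and the number of reference points")

    # Distance from the query to every reference individual, nearest first.
    neighbors = sorted(
        (math.dist(pc, query_pc), label)
        for pc, label in zip(ref_pcs, ref_labels)
    )

    # Majority vote over the k nearest reference individuals.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy reference panel: made-up (PC1, PC2) coordinates, NOT real 1000 Genomes values.
ref_pcs = [
    (-2.0, 0.1), (-1.8, -0.2), (-2.1, 0.3),   # hypothetical AFR cluster
    (1.5, 1.9), (1.7, 2.2), (1.4, 2.0),       # hypothetical EAS cluster
    (1.6, -1.8), (1.9, -2.1), (1.5, -2.0),    # hypothetical EUR cluster
]
ref_labels = ["AFR"] * 3 + ["EAS"] * 3 + ["EUR"] * 3

# A hypothetical scRNA-seq sample already projected into the same PC space.
predicted = knn_predict(ref_pcs, ref_labels, (1.6, 2.1), k=3)
print(predicted)
```

In practice, more principal components and a much larger reference panel would be used; the two-dimensional toy data here only keep the example short.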

Note

We have provided instructions to run this pipeline in two ways:

  1. Through a snakemake pipeline (suggested especially for datasets with multiple pools and individuals)

  2. Manually, which can be used when the data do not fit the assumptions of the snakemake pipeline, or simply to get a better idea of the steps that the snakemake pipeline implements.

First Steps#

First, proceed to the Installation section to download the required singularity image and set up the pipeline locally.

Support#

If you have any questions, suggestions or issues with any part of the Ancestry Prediction from scRNA-seq Data Pipeline, feel free to submit an issue or email Drew Neavin (d.neavin @ garvan.org.au).