Annotate Ancestry of Samples Using Reference Genotypes#
We have provided instructions here to annotate your samples that have genotype data on your samples with either SNP genotype data or whole genome or exome sequencing.
This will allow users that have additional genotype data for their samples to compare the predictions from the sc sequencihng to the reference genotype data.
To do this, you will need your data will need to be on the GRCh37 (hg19) reference - sorry, we haven’t provided instructions or resources for predicting ancestry for data directly on GRCh38 (hg38). So if your data is on GRCh38 (hg38), you will have to first lift your datat to hg19 (GRCh37).
The general steps are:
Convert Data to Plink pgen Format Optional
Identify Common SNPs Between 1000G Data and Your Data Required
Prune the SNPs in LD Required
Filter the 1000G and Freebayes SNPs for Pruned SNPs Required
Update Chromosome IDs Required
Calculate PCs for 1000G Required
Plot PC Results Required
Compare Reference to single cell-predicted Ancestry Optional
Setup#
There are a few variables that will be used throughout the example commands which are best to define in a file that you can easily run at each step or source for execution of each step.
$BIND
- the path(s) on your system to bind to singularity when executing commands. By default, Singularity binds just the directory (and downstream dierectory and files) from your current working directory when executing the command. However, if you have some files elsewhere on your system, you can provide a parent directory to the singulairty command to indicate which additional directories to bind. Multiple directories can be included and separated by a comma (i.e. $DIR1,$DIR2)$SIF
- the path to the singularity image downloaded in the Installation section$OUTDIR
- the path to the output directory where all results will be written and saved
BIND = /bind/path
SIF = /path/to/singularity/image/ancestry_prediction_scRNAseq.sif
OUTDIR = /path/to/base/outdir
Steps#
1. Convert Data to Plink pgen Format#
Optional
Expected Timing
< 10 minutes but may be longer with large datasets
This step will convert your file to a Plink pgen format which is required for input into the downstream steps. If your data is already in the Plink pgen format, you can skip this step and move directly to Identify Common SNPs Between 1000G Data and Your Data.
In preparation for this step, we set some additional parameters and create the required output directory. The parameters that we use for the command in this step are:
VCF=/path/to/vcf.vcf
This is pretty self-explanatory - this is the path to your sample genotype file.
BED_BASE=/path/to/plink/bed_basename
The
$BED_BASE
is the your sample genotype Plink bedfiles basename. For example if your files are nameddata.bed
,data.fam
anddata.bim
yourbed_basename
would bedata
.
singularity exec --bind $BIND $SIF plink2 --vcf $VCF --make-pgen --out $OUTDIR/data_name
Note
This command will generate three files using the out basename provided: $OUTDIR/data_name.pgen
, $OUTDIR/data_name.pvar
, $OUTDIR/data_name.psam
singularity exec --bind $BIND $SIF plink2 --bfile $BED_BASE --make-pgen --out $OUTDIR/data_name
Note
This command will generate three files using the out basename provided: $OUTDIR/data_name.pgen
, $OUTDIR/data_name.pvar
, $OUTDIR/data_name.psam
2. Identify Common SNPs Between 1000G Data and Your Data#
Required
This step will identify the genetic variants in common between your data and the 1000G reference and then subset your data and 1000G reference data for the genetic variants in common between the two datasets. The 1000G reference data is provided within the Singularity image so you will not have to download it separately. The paths to the 1000G reference data are the paths within the Singularity image.
singularity exec --bind $BIND $SIF awk 'NR==FNR{a[$1,$2,$4,$5];next} ($1,$2,$4,$5) in a{print $3}' $OUTDIR/data_name.pvar /opt/1000G/all_phase3_filtered.pvar > $OUTDIR/common_snps/snps_1000g.tsv
singularity exec --bind $BIND $SIF awk 'NR==FNR{a[$1,$2,$4,$5];next} ($1,$2,$4,$5) in a{print $3}' /opt/1000G/all_phase3_filtered.pvar $OUTDIR/data_name.pvar > $OUTDIR/common_snps/snps_data.tsv
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile $OUTDIR/data_name --extract $OUTDIR/common_snps/snps_data.tsv --make-pgen 'psam-cols='fid,parents,sex,phenos --out $OUTDIR/common_snps/subset_data
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile /opt/1000G/all_phase3_filtered --extract $OUTDIR/common_snps/snps_1000g.tsv --make-pgen --out $OUTDIR/common_snps/subset_1000g
3. Prune the SNPs in LD#
Required
This step will prune the genotype data so that the genetic variants that are in low linkage disequilibrium.
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile /opt/1000G/all_phase3_filtered \
--indep-pairwise 50 5 0.5 \
--out $OUTDIR/common_snps/subset_pruned_1000g
4. Filter the 1000G and Freebayes SNPs for Pruned SNPs#
Required
This step will prune the genotype data so that the genetic variants that are in low linkage disequilibrium (uniquely representing genetic variation across the genome).
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile /opt/1000G/all_phase3_filtered --extract $OUTDIR/common_snps/subset_pruned_1000g.prune.out --make-pgen --out $OUTDIR/common_snps/subset_pruned_1000g
if [[ $(grep "##" $OUTDIR/common_snps/subset_1000g.pvar | wc -l) > 0 ]]
then
singularity exec --bind $BIND $SIF grep "##" $OUTDIR/common_snps/subset_1000g.pvar > $OUTDIR/common_snps/subset_pruned_data_1000g_key.txt
fi
singularity exec --bind $BIND $SIF awk -F"\\t" 'BEGIN{OFS=FS = "\\t"} NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next} {ind = $1 FS $2 FS $4 FS $5} ind in a {print a[ind], $3}' $OUTDIR/common_snps/subset_pruned_1000g.pvar $OUTDIR/common_snps/subset_pruned_data.pvar | singularity exec --bind $BIND $SIF grep -v "##" >> $OUTDIR/common_snps/subset_pruned_data_1000g_key.txt
singularity exec --bind $BIND $SIF grep -v "##" $OUTDIR/common_snps/subset_pruned_data_1000g_key.txt | singularity exec --bind $BIND $SIF awk 'BEGIN{FS=OFS="\t"}{print $NF}' > $OUTDIR/common_snps/subset_data.prune.out
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile $OUTDIR/common_snps/subset_data --extract $OUTDIR/common_snps/subset_data.prune.out --make-pgen 'psam-cols='fid,parents,sex,phenos --out $OUTDIR/common_snps/subset_pruned_data
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/subset_pruned_data.pvar $OUTDIR/common_snps/subset_pruned_data_original.pvar
singularity exec --bind $BIND $SIF grep -v "#" $OUTDIR/common_snps/subset_pruned_data_original.pvar | singularity exec --bind $BIND $SIF awk 'BEGIN{FS=OFS="\t"}{print($3)}' > $OUTDIR/common_snps/SNPs2keep.txt
singularity exec --bind $BIND $SIF grep "#CHROM" $OUTDIR/common_snps/subset_pruned_data_1000g_key.txt > $OUTDIR/common_snps/subset_pruned_data.pvar
singularity exec --bind $BIND $SIF grep -Ff $OUTDIR/common_snps/SNPs2keep.txt $OUTDIR/common_snps/subset_pruned_data_1000g_key.txt >> $OUTDIR/common_snps/subset_pruned_data.pvar
singularity exec --bind $BIND $SIF awk 'BEGIN{FS=OFS="\t"}NF{NF-=1};1' < $OUTDIR/common_snps/subset_pruned_data.pvar > $OUTDIR/common_snps/subset_pruned_data_temp.pvar
singularity exec --bind $BIND $SIF grep "##" $OUTDIR/common_snps/subset_pruned_data_temp.pvar > $OUTDIR/common_snps/subset_pruned_data.pvar
singularity exec --bind $BIND $SIF cat $OUTDIR/common_snps/subset_pruned_data_temp.pvar >> $OUTDIR/common_snps/subset_pruned_data.pvar
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile $OUTDIR/common_snps/subset_pruned_1000g --make-bed --out $OUTDIR/common_snps/subset_pruned_1000g
### This is a contingency to remove duplicated snps from both
singularity exec --bind $BIND $SIF plink2 --rm-dup 'force-first' -threads 2 --pfile $OUTDIR/common_snps/subset_data --make-pgen 'psam-cols='fid,parents,sex,phenos --out $OUTDIR/common_snps/final_subset_pruned_data
singularity exec --bind $BIND $SIF plink2 --rm-dup 'force-first' -threads 2 --pfile $OUTDIR/common_snps/subset_data --make-bed --out $OUTDIR/common_snps/subset_pruned_data
5. Update Chromosome IDs#
Required
Plink requires specific chromosome encodings for chromosomes X, Y, and mitochondria so we will update them to be sure.
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/final_subset_pruned_data.bed $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/final_subset_pruned_data.bim $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF sed -i 's/^X/23/g' $OUTDIR/common_snps/split/final_subset_pruned_data.bim
singularity exec --bind $BIND $SIF sed -i 's/^Y/24/g' $OUTDIR/common_snps/split/final_subset_pruned_data.bim
singularity exec --bind $BIND $SIF sed -i 's/^XY/25/g' $OUTDIR/common_snps/split/final_subset_pruned_data.bim
singularity exec --bind $BIND $SIF sed -i 's/^MT/26/g' $OUTDIR/common_snps/split/final_subset_pruned_data.bim
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/final_subset_pruned_data.fam $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/subset_pruned_1000g.bed $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/subset_pruned_1000g.bim $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF cp $OUTDIR/common_snps/subset_pruned_1000g.fam $OUTDIR/common_snps/split/
singularity exec --bind $BIND $SIF sed -i 's/^X/23/g' $OUTDIR/common_snps/split/subset_pruned_1000g.bim
singularity exec --bind $BIND $SIF sed -i 's/^Y/24/g' $OUTDIR/common_snps/split/subset_pruned_1000g.bim
singularity exec --bind $BIND $SIF sed -i 's/^XY/25/g' $OUTDIR/common_snps/split/subset_pruned_1000g.bim
singularity exec --bind $BIND $SIF sed -i 's/^MT/26/g' $OUTDIR/common_snps/split/subset_pruned_1000g.bim
6. Calculate PCs for 1000G#
Required
Next, we will calculate principal components using the 1000G reference data.
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile $OUTDIR/common_snps/subset_pruned_1000g \
--freq counts \
--pca allele-wts \
--out $OUTDIR/pca_projection/subset_pruned_1000g_pcs
7. Project 1000G and Freebayes Data in PCs#
Required
Next, let’s project the data (both your data and the 1000G reference data) into the principal component space that you calculated in the last step.
export OMP_NUM_THREADS=2
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile $OUTDIR/common_snps/final_subset_pruned_data \
--read-freq $OUTDIR/pca_projection/subset_pruned_1000g_pcs.acount \
--score $OUTDIR/pca_projection/subset_pruned_1000g_pcs.eigenvec.allele 2 5 header-read no-mean-imputation \
variance-standardize \
--score-col-nums 6-15 \
--out $OUTIR/pca_projection/final_subset_pruned_data_pcs
singularity exec --bind $BIND $SIF plink2 --threads 2 --pfile {params.infile_1000g} \
--read-freq $OUTDIR/pca_projection/subset_pruned_1000g_pcs.acount \
--score $OUTDIR/pca_projection/subset_pruned_1000g_pcs.eigenvec.allele 2 5 header-read no-mean-imputation \
variance-standardize \
--score-col-nums 6-15 \
--out $OUTDIR/pca_projection/subset_pruned_1000g_pcs_projected
8. Plot PC Results#
Required
Fnally, let’s plot the results and predict the ancestry of the samples in your dataset.
singularity exec --bind $BIND $SIF echo $OUTDIR/pca_sex_checks_original > $OUTDIR/pca_sex_checks_original/variables.tsv
singularity exec --bind $BIND $SIF echo $OUTDIR/pca_projection/final_subset_pruned_data_pcs.sscore >> $OUTDIR/pca_sex_checks_original/variables.tsv
singularity exec --bind $BIND $SIF echo $OUTDIR/pca_projection/subset_pruned_1000g_pcs_projected.sscore >> $OUTDIR/pca_sex_checks_original/variables.tsv
singularity exec --bind $BIND $SIF echo $OUTDIR/common_snps/subset_1000g.psam >> $OUTDIR/pca_sex_checks_original/variables.tsv
singularity exec --bind $BIND $SIF Rscript /opt/ancestry_prediction_scRNAseq/scripts/PCA_Projection_Plotting_original.R $OUTDIR/pca_sex_checks_original/variables.tsv
9. Compare Reference to single cell-predicted Ancestry#
Optional
The last step is to compare the predicted ancestries in your genotyped data (either microarray or whole genome or exome sequencing data) to the predicted ancestries from your single cell data. We assume that the ID you use for the sample is the same in bpoth the single cell and genotyped data.
You will need to provide a metadata file for each of the Pools and samples in each pool as input for this step. The file is the same Sample metadata file.
In preparation for this step, we set some additional parameters and create the required output directory. The parameters that we use for the command in this step are:
META=/path/to/meta_file.tsv
We assume that the results for each of the samples that you have single cell sequencing are located in the same base directory in the following format:
.
├── Pool1
│ ├── individual_1
│ ├── ...
│ └── individual_n
│ ...
└── Poolm
├── individual_1
├── ...
└── individual_n
singularity exec --bind $BIND $SIF echo $OUTDIR > $OUTDIR/ref_sc_ancestry_prediction_comparison/variables.tsv
singularity exec --bind $BIND $SIF echo $OUTDIR/pca_sex_checks_original/ancestry_assignments.tsv >> $OUTDIR/ref_sc_ancestry_prediction_comparison/variables.tsv
singularity exec --bind $BIND $SIF echo $OUTDIR >> $OUTDIR/ref_sc_ancestry_prediction_comparison/variables.tsv
singularity exec --bind $BIND $SIF echo $META >> $OUTDIR/ref_sc_ancestry_prediction_comparison/variables.tsv
singularity exec --bind $BIND $SIF Rscript /opt/ancestry_prediction_scRNAseq/scripts/compare_ref_seq_snp_ancestries.R $OUTDIR/ref_sc_ancestry_prediction_comparison/variables.tsv
Results#
After running the final step, you should have the following results directories.
We’ve highlighted the key results files (Ancestry_PCAs.png
and ancestry_assignments.tsv
):
.
├── common_snps
│ ├── final_subset_pruned_data.bed
│ ├── final_subset_pruned_data.bim
│ ├── final_subset_pruned_data.fam
│ ├── final_subset_pruned_data.log
│ ├── final_subset_pruned_data.pgen
│ ├── final_subset_pruned_data.psam
│ ├── final_subset_pruned_data.pvar
│ ├── snps_1000g.tsv
│ ├── SNPs2keep.txt
│ ├── snps_data.tsv
│ ├── subset_1000g.log
│ ├── subset_1000g.pgen
│ ├── subset_1000g.psam
│ ├── subset_1000g.pvar
│ ├── subset_data.log
│ ├── subset_data.pgen
│ ├── subset_data.prune.out
│ ├── subset_data.psam
│ ├── subset_data.pvar
│ ├── subset_pruned_1000g.bed
│ ├── subset_pruned_1000g.bim
│ ├── subset_pruned_1000g.fam
│ ├── subset_pruned_1000g.log
│ ├── subset_pruned_1000g.pgen
│ ├── subset_pruned_1000g.popu
│ ├── subset_pruned_1000g.prune.in
│ ├── subset_pruned_1000g.prune.out
│ ├── subset_pruned_1000g.psam
│ ├── subset_pruned_1000g.pvar
│ ├── subset_pruned_data_1000g_key.txt
│ ├── subset_pruned_data.log
│ ├── subset_pruned_data_original.pvar
│ ├── subset_pruned_data.pgen
│ ├── subset_pruned_data.psam
│ ├── subset_pruned_data.pvar
│ └── subset_pruned_data_temp.pvar
├── pca_projection
│ ├── final_subset_pruned_data_pcs.log
│ ├── final_subset_pruned_data_pcs.sscore
│ ├── subset_pruned_1000g_pcs.acount
│ ├── subset_pruned_1000g_pcs.eigenval
│ ├── subset_pruned_1000g_pcs.eigenvec
│ ├── subset_pruned_1000g_pcs.eigenvec.allele
│ ├── subset_pruned_1000g_pcs.log
│ ├── subset_pruned_1000g_pcs_projected.log
│ └── subset_pruned_1000g_pcs_projected.sscore
├── pca_sex_checks_original
│ ├── ancestry_assignments.tsv
│ ├── Ancestry_PCAs.png
│ └── variables.tsv
├── reference.log
├── reference.pgen
├── reference.psam
└── reference.pvar
The
Ancestry_PCAs.png
figure shows the 1000G individual locations in PC space compared to the individuals in each pool. For example:The
ref_ancestry_assignments.tsv
file has the annotations and probabilities for each pool. For example:
FID |
IID |
PC1 |
PC2 |
PC3 |
PC4 |
PC5 |
PC6 |
PC7 |
PC8 |
PC9 |
PC10 |
AFR |
AMR |
EAS |
EUR |
SAS |
combined_assignment |
Final_Assignment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 |
1 |
0.137 |
-0.108 |
-0.025 |
-0.042 |
0.032 |
-0.042 |
0.001 |
-0.021 |
-0.086 |
-0.019 |
0 |
0 |
1 |
0 |
0 |
EAS |
EAS |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |