Table of Contents
First, an admixed dataset is simulated using the
SimulatePop function, considering N=50
individuals of ploidy P=8, M=200 markers with
L=10 alleles distributed on C=5 chromosomes,
and K=4 ancestral groups.
The resulting DataSim object includes:
Geno: list of genotyping matrices over markershead(DataSim$Geno[[1]])
#> Allele1 Allele2 Allele3 Allele4 Allele5 Allele6 Allele7 Allele8 Allele9
#> Ind1 0 1 0 0 1 1 1 0 1
#> Ind2 0 2 3 0 1 0 1 0 0
#> Ind3 0 5 2 0 0 0 1 0 0
#> Ind4 0 1 2 2 0 0 1 2 0
#> Ind5 0 3 1 1 0 0 2 0 0
#> Ind6 0 1 1 1 1 2 1 1 0
#> Allele10
#> Ind1 3
#> Ind2 1
#> Ind3 0
#> Ind4 0
#> Ind5 1
#> Ind6 0Prop: matrix of admixture proportionshead(DataSim$Prop)
#> K1 K2 K3
#> Ind1 0.120000 0.358125 0.521875
#> Ind2 0.511250 0.386250 0.102500
#> Ind3 0.826875 0.000000 0.173125
#> Ind4 0.532500 0.467500 0.000000
#> Ind5 0.802500 0.000000 0.197500
#> Ind6 0.288125 0.242500 0.469375Freq: list of matrices of allele frequencies in
ancestral groupsDataSim$Freq[[1]]
#> Allele1 Allele2 Allele3 Allele4 Allele5 Allele6 Allele7
#> K1 0.02999937 0.3216599409 0.2827303 0.06252135 0.04569980 0.01259175 0.1560145
#> K2 0.06062462 0.0007481235 0.3483231 0.01478613 0.04326888 0.04442491 0.1591765
#> K3 0.00337067 0.0736675715 0.3034438 0.06786697 0.17255898 0.11904414 0.1678521
#> Allele8 Allele9 Allele10
#> K1 0.01215938 0.019408693 0.05721497
#> K2 0.19739411 0.001450772 0.12980285
#> K3 0.03792193 0.036377344 0.01789647GeneticMap: genetic map dataframeFrom simulated data, admixture proportions can be estimated using the
AdmixGlobal function:
ResGlobalAdmix <- AdmixPoly::AdmixGlobal(Geno=DataSim$Geno, K=3L, MaxIter=100, Verbose=F, NbThreads = 1)The estimated admixture proportions are available from:
head(ResGlobalAdmix$Prop)
#> K1 K2 K3
#> Ind1 0.07282037 0.5866318131 3.405478e-01
#> Ind2 0.54379083 0.0285240072 4.276852e-01
#> Ind3 0.99986987 0.0001291292 1.000000e-06
#> Ind4 0.52300965 0.0000010000 4.769894e-01
#> Ind5 0.89501427 0.1049773323 8.402404e-06
#> Ind6 0.30286023 0.4752940257 2.218457e-01They can be represented using the ggplot2-based
GlobalPlot function:
Estimated proportions are very close to simulated ones (up to an arbitrary labeling of groups), as indicated by the root mean square error (RMSE):
group_corresp <- apply(cor(DataSim$Prop, ResGlobalAdmix$Prop), 2, which.max)
sqrt(mean((DataSim$Prop-ResGlobalAdmix$Prop[,group_corresp])^2))
#> [1] 0.05411083Estimated allele frequencies are available from:
ResGlobalAdmix$Freq[[1]]
#> Allele1 Allele2 Allele3 Allele4 Allele5 Allele6 Allele7
#> K1 0.02606307 0.27612436 0.2928982 0.08291088 0.0966980 0.0000010 0.1683648
#> K2 0.00000100 0.09006483 0.1418759 0.08626547 0.1543653 0.1179365 0.2086484
#> K3 0.10961349 0.00000100 0.3478162 0.00000100 0.0000010 0.0866857 0.1102827
#> Allele8 Allele9 Allele10
#> K1 0.00000100 0.00000100 0.05693766
#> K2 0.09163336 0.04238131 0.06682790
#> K3 0.17349010 0.00000100 0.17210782The convergence can be checked from the log-likelihood:
Let us consider a second admixed population from the same ancestral groups with a higher admixture level by specifying a higher ‘AlphaProp’ value.
DataSim2 <- AdmixPoly::SimulatePop(K=3L, N=50L, P=8L, M=200L, C=2L, L=10L,
AlphaProp=10, Freq=DataSim$Freq, NbThreads = 1)The higher admixture level can be illustrated using a barplot of simulated amdixture proportions.
In this case, it can be beneficial to initialize the algorithm with ancestral allele frequencies estimated using the first population ‘FreqInit=ResGlobalAdmix$Freq’ and only update admixture proportions ‘ParamToUpdate=“Prop”’.
ResGlobalAdmix2 <- AdmixPoly::AdmixGlobal(Geno=DataSim2$Geno, K=3L, MaxIter=100, Verbose=F,
FreqInit=ResGlobalAdmix$Freq, ParamToUpdate="Prop",
NbThreads=1)Again, estimated proportions are very close to simulated ones, as indicated by the RMSE:
Based on global admixture results, local admixture can be estimated
for a given individual using the AdmixLocal function:
Ind <- "Ind1"
ResLocalAdmix <- AdmixPoly::AdmixLocal(Geno=DataSim$Geno, ResAdmixGlobal=ResGlobalAdmix,
GeneticMap=DataSim$GeneticMap, Ind=Ind, P = 8L,
SmoothParam=1, Verbose=F, NbThreads = 1)Local ancestry dosages based on posterior probabilities are available from:
head(ResLocalAdmix$Posterior$Chr1)
#> K1 K2 K3
#> Marker1 0.02703237 5.294020 2.678948
#> Marker2 0.02187958 5.289784 2.688336
#> Marker3 0.01811682 5.259741 2.722142
#> Marker4 0.01743961 5.202371 2.780189
#> Marker5 0.01576887 5.137736 2.846495
#> Marker6 0.01484104 5.129085 2.856074Alternatively, local ancestry dosages based on the Viterbi algorithm are available from:
head(ResLocalAdmix$Viterbi$Chr1)
#> K1 K2 K3
#> Marker1 0 5 3
#> Marker2 0 5 3
#> Marker3 0 5 3
#> Marker4 0 5 3
#> Marker5 0 5 3
#> Marker6 0 5 3The RMSE of estimated local ancestry proportions (ancestry dosages divided by the ploidy) is:
sqrt(mean((DataSim$Ancestry[[Ind]]$Chr1/8-ResLocalAdmix$Posterior$Chr1[,group_corresp]/8)^2))
#> [1] 0.07136965
sqrt(mean((DataSim$Ancestry[[Ind]]$Chr1/8-ResLocalAdmix$Viterbi$Chr1[,group_corresp]/8)^2))
#> [1] 0.07971303Local ancestry dosages can be represented using the ggplot2-based
LocalPlot function:
Local admixture inference over the five first individuals can be run using:
ResLocalAdmix_list <- lapply(paste0("Ind",1:5),function(i){
res_i <- AdmixPoly::AdmixLocal(Geno=DataSim$Geno, ResAdmixGlobal=ResGlobalAdmix,
GeneticMap=DataSim$GeneticMap, Ind=i, P=8L,
SmoothParam=1, Verbose=F, NbThreads=1)
return(res_i)
})When computing time is large, each individual can be run in parallel on a high-performance computing cluster.
Genotyping data formatted as a variant call format (VCF) can be imported as:
vcf_path <- system.file("extdata", "Test.vcf", package="AdmixPoly")
DataVCF <- AdmixPoly::ReadVCF(File=vcf_path, NbThreads=1)The resulting DataVCF object includes:
MarkerInfo: first eight columns of the VCFhead(DataVCF$MarkerInfo)
#> # A tibble: 6 × 9
#> MARKER CHROM POS ID REF ALT QUAL FILTER INFO
#> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Chr1_15892318 Chr1 15892318 . A T . . .
#> 2 Chr1_31209423 Chr1 31209423 . T C . . .
#> 3 Chr1_39962954 Chr1 39962954 . C G . . .
#> 4 Chr1_40515463 Chr1 40515463 . G A . . .
#> 5 Chr1_44904209 Chr1 44904209 . A T . . .
#> 6 Chr1_47542399 Chr1 47542399 . T C . . .Geno: list of genotyping matriceshead(DataVCF$Geno[[1]])
#> A T
#> IND1 4 4
#> IND2 5 3
#> IND3 1 7
#> IND4 4 4
#> IND5 0 8
#> IND6 4 4GeneticMap: genetic map dataframe in which physical
between first and last marker of each chromosome are scaled to 100
cMhead(DataVCF$GeneticMap)
#> # A tibble: 6 × 3
#> Chromosome Marker Distance
#> <chr> <chr> <dbl>
#> 1 Chr1 Chr1_15892318 0
#> 2 Chr1 Chr1_31209423 20.4
#> 3 Chr1 Chr1_39962954 32.1
#> 4 Chr1 Chr1_40515463 32.9
#> 5 Chr1 Chr1_44904209 38.7
#> 6 Chr1 Chr1_47542399 42.2Instead of estimated dosages, read depth ratios can be imported by specifying the allele read depths field of the FORMAT column:
head(DataVCF2$Geno[[1]])
#> A T
#> IND1 0.41935484 0.5806452
#> IND2 0.50000000 0.5000000
#> IND3 0.08108108 0.9189189
#> IND4 0.68181818 0.3181818
#> IND5 0.03125000 0.9687500
#> IND6 0.53125000 0.4687500Please note that read depth ratios are not scaled to the ploidy level, which must then be informed in the inference functions.
Another function can be used to read the haplotype presence-absence (HPA) format obtained from HaploCharmer: