Welcome to pooledpeaks. This guide introduces researchers to
pooledpeaks for analyzing microsatellite markers, scoring
.fsa
files, and calculating genetic measures like Nei’s GST
and Jost’s D. Basic R skills are required, but most steps are
straightforward.
This vignette outlines:
.fsa
files.Each step builds on the previous, so follow the vignette sequentially. However, the three sections, Peak Scoring, Data Manipulation, and Genetic Analysis, can be run separately.
To get started you will need to set up the R environment by setting the working directory, loading the required libraries, reading in the source files with the functions written specifically for this analysis pipeline.
In addition to the pooledpeaks
library the following
packages are require to utilize this package to its fullest
capacity.
Identify where the .fsa
files are located on your
computer, load in the eggcount data (should be an excel or csv file but
can be a dataframe), and provide the expected peak size panels for your
markers and ladder.
file_path <- system.file("extdata", package = "pooledpeaks")
eggcount <- data.frame(
ID = c("X104.1", "X1084.1", "X1084.3", "X1086.3", "X1087.3", "X1205.3",
"X121.3", "X1222.3", "X1354.3", "X1453.3", "X1531.3", "X1540.1",
"X1550.3", "X1796.1", "X1809.1", "X1968.1", "X1968.3", "X2100.1",
"X2462.1", "X2463.1", "X473.1", "X620.1", "X620.3", "X679.1",
"X910.1", "X910.3"),
n = c(192, 126, 185, 171, 140, 20, 46, 80, 156, 154, 122, 19, 45, 117, 75,
22, 175, 100, 97, 183, 67, 90, 157, 104, 195, 145)
)
Shae10 <- c(161,164,167,170,173,176,179,182,185,188,191,194,197,200,203,206,209,
212,215,218)
mic_SMMS2 <- c(211, 215, 219, 223, 227, 231, 235, 239)
GS600LIZ <- c(20, 40, 60, 80, 100, 114, 120, 140, 160, 180, 200, 214, 220,
240, 250, 260, 280, 300, 314, 320, 340, 360, 380, 400, 414,
420, 440, 460, 480, 500, 514, 520, 540, 560, 580, 600)
With your data loaded, we can move on to the peak scoring section. To
facilitate this process, pooledpeaks incorporates
functionality adapted from the Fragman
package, originally
developed for microsatellite typing in plants. These adaptations allow
for the scoring of both allele sizes and their corresponding heights in
the newer .fsa
file versions.
This section demonstrates how to import the .fsa
files
from the file directory into R and combine all of them into a list of
data frames, wherein each file is stored as a data frame within the
list. channels
specifies that we are using a five channel
dye set of which 1-4 are fluorescent label colors, and 5 contains the
ladder. fourier
and saturated
should both be
set to TRUE
and lets.pullup
to
FALSE
. When rawPlot
is set to
TRUE
, the function will provide an overview of all peaks
across all files within each channel. Once .fsa
files have
been imported, the dyes must be associated with the channels. This can
be done using associate_dyes()
.
fsa_data <- fsa_batch_imp(file_path, channels = 5, rawPlot = FALSE,
fourier = TRUE, saturated = TRUE,
lets.pullup = FALSE)
#> Reading FSA files
#> Applying Fourier tranformation for smoothing...
#> Checking and correcting for saturated peaks...
fsa_data <- associate_dyes(fsa_data, file_path)
#> Columns have been renamed with associated fluoresecent dye.
Note: If you are encountering issues, you may want
to consider checking the file version and/or metadata using
check_fsa_v_batch()
or fsa_metadata()
.
To calibrate fragment sizes, the internal size marker peaks in each
sample must match the expected sizes from the ladder. For this example,
we use the LIZ600
object, which contains the expected
allele sizes for the ladder. If you are using a different ladder or wish
to adjust the fragment sizes, modify the c(...)
values in
the setup.
Next, we associate the ladder with the imported data. This step is
performed once per dataset and ensures proper sizing by comparing the
expected ladder sizes with the observed values. The program checks the
correlation between these values, outputs the correlation results, and
flags poorly correlated samples (<99.9%) in a vector named
bad
.
ladder.info.attach(stored = fsa_data,ladder = GS600LIZ,
ladd.init.thresh = 200, prog = FALSE, draw = FALSE)
corro <- unlist(sapply(list.data.covarrubias, function(x){x$corr}))
bad <- which(corro < .999)
Note: If warnings are thrown, lowering the
ladd.init.thresh
may resolve the issue, or certain samples
may need to be addressed manually as per the Fragman
documentation (run ?ladder.corrector
).
The above chunks set up all samples for all markers and only need to be done once per analysis. The following steps will need to be repeated as many times as the number of microsatellite markers you have.
Using the score_markers_rev3
function (adapted from
Fragman
), you can score genotyped peaks based on size
(weight) and intensity (height). This function bins peaks by comparing
observed fragment sizes to expected microsatellite fragment sizes.
Key Parameters to Customize
my.inds
: The object containing your .fsa
data.
channel
: The fluorescence channel to analyze (e.g.,
1 = blue, 2 = green, etc.).
panel
: Expected fragment sizes for your
sample.
Ladder
: The ladder associated with your
dataset.
init.thresh
: RFU value threshold to consider a peak
valid.
ploidy
: The number of possible alleles per marker
(e.g., for diploids, ploidy = 2).
Other options like window
(distance from the expected
size to count as a peak) and shift
(handling stutter peaks)
can be adjusted as needed. Refer to the Fragman
documentation for detailed explanations.
Additional Updates of
score_markers_rev390
Allows separate left/right “window” search specifications.
Disables progress bars and unused options like electrogram plotting.
Saves plots to a specified folder when
plotting = TRUE
and plotdir
is provided.
plotdir
should be formatted with the /
after
the directory name (eg. “plot_scoring/” for iOS).
scores_SMMS2 <- score_markers_rev3(my.inds = fsa_data,
channel = 1,
channel.ladder = 5,
panel = "mic_SMMS2",
ladder = GS600LIZ,
init.thresh = 100,
ploidy = length(mic_SMMS2),
shift = 1,
windowL = 1,
windowR= 1,
left.cond = c(0, 2.5),
right.cond = 0,
pref = 1,
plotting = FALSE
)
#>
#> Reducing threshold 2x to find ladder
#>
#> Reducing threshold 2x to find ladder
scores_Shae10 <- score_markers_rev3(my.inds = fsa_data,
channel = 1,
channel.ladder = 5,
panel = "Shae10",
ladder = GS600LIZ,
init.thresh = 100,
ploidy = length(Shae10),
shift = 1,
windowL = 1,
windowR= 1,
left.cond = c(0, 2.5),
right.cond = 0,
pref = 1,
plotting = FALSE
)
#>
#> Reducing threshold 2x to find ladder
#>
#> Reducing threshold 2x to find ladder
Note: The author recommends setting
plotting
to TRUE
and then visually inspecting
the PDFs to confirm that each peak is being called as expected. If they
are not, adjust the parameters until satisfied.
After scoring peaks, combine the data frames for all samples of the same marker into a single data frame instead of a list of lists You’ll also clean the sample IDs for consistency and prepare the data for downstream analyses.
Workflow
1. Combine Data and Create Simplified IDs
clean_scores()
row-binds all the individual data frames
and removes machine-added information from the ID column, keeping only
the collection number and replicate
(e.g., filename:
104.1a_FA060920_2020-06-09_C05.fsa
becomes
ID: 104.1a
).
scores_SMMS2_lf<-clean_scores(scores_SMMS2, pattern1 = "_FA.*",replacement1 = "",
pattern2 = "_Sample.*", replacement2 = "")
scores_Shae10_lf<-clean_scores(scores_Shae10, pattern1 = "_FA.*",replacement1 = "",
pattern2 = "_Sample.*", replacement2 = "")
2. Transform from Long Format to Table Format
3. Export Tables
To save time in future analyses, export the processed peak data as
.txt
files. This ensures you can access the data without
rerunning the entire pipeline.
write.table(scores_SMMS2_lf, file = "scores_SMMS2_lfex.txt", col.names = NA,
quote = FALSE, row.names = TRUE, sep = "\t")
write.table(scores_SMMS2_tdf, file = "scores_SMMS2_tdfex.txt", col.names = NA,
quote = FALSE, row.names = TRUE, sep = "\t")
write.table(scores_Shae10_lf, file = "scores_Shae10_lfex.txt", col.names = NA,
quote = FALSE, row.names = TRUE, sep = "\t")
write.table(scores_Shae10_tdf, file = "scores_Shae10_tdfex.txt", col.names = NA,
quote = FALSE, row.names = TRUE, sep = "\t")
This data manipulation section is important in order to prepare for the genetic analysis but it is much simpler than the peak scoring portion above. Begin by calling in the previously exported tdf data frames. For simplicity, this section will only focus on marker SMMS2.
head(SMMS2[, 1:9])
#> X721.2a X721.2b X730.1a X730.1b X735.2a X735.2b X766.2a X766.2b X766.3a
#> 211 0 0 0.0000 0 0 0 0 0 0
#> 215 11348 4113 120.2662 0 1882 0 4551 4953 4848
#> 219 0 0 0.0000 0 0 0 214 210 171
#> 223 0 0 0.0000 0 0 0 0 0 0
#> 227 0 0 0.0000 0 0 0 0 0 0
#> 231 0 0 0.0000 0 0 0 0 0 0
The data_manipulation
function should be used to clean
the data first. It
Removes samples without at least one peak exceeding the threshold.
Eliminates alleles that are absent in all samples.
SMMS2_IDM <- data_manipulation(SMMS2, threshold = 200)
head(SMMS2_IDM[, 1:9])
#> X721.2a X721.2b X735.2a X766.2a X766.2b X766.3a X766.3b X775.1a X775.1b
#> 211 0 0 0 0 0 0 0 197 0
#> 215 11348 4113 1882 4551 4953 4848 4112 1425 10165
#> 219 0 0 0 214 210 171 172 0 0
#> 223 0 0 0 0 0 0 0 0 0
#> 231 0 0 0 0 0 0 0 0 0
#> 235 4401 2001 865 1405 1425 1718 1546 706 4989
Replicate samples are compared in the cleaned data frame (you can
skip this step if you only ran each sample once). If you do have
replicate samples, it replaces the individual columns .a
and .b
(.c
, .d
, etc.) with an
average of the two and calculates the Jost’s D between the samples.
SMMS2_repcheck <- Rep_check(SMMS2_IDM)
#> Jost D between duplicate samples
#> CWRU.a, CWRU.b, X721.2, X735.2, X766.2, X766.3, X775.1, X775.2, X775.3, X776.2, X782.1, X782.2, X782.3, X783.3, X784.3, X785.1, X785.3, X825.1, X827.1, X827.2, X827.3, X840.2, X851.1, X851.2, X851.3, X854.1, X854.2, X854.3, X860.1, X860.2, X860.3, X861.1, X861.2, X861.3, X864.1, X864.2, X864.3, X870.11, X870.12, X872.2, X879.2, X881.3
#> 0.00395560228912828, 0.000281237851261329, 0.000281765496311648, 0.0109251475313559, 0.137849261262302, 0.000277779222130015, 0.000176335180481546, 6.1302811725783e-05, 0.0047542897971915, 0.0145011127620611, 0.000761668588801223, 1.64964676880874e-05, 5.07636748872109e-05, 0.000437262678883776, 0.000812426041037373, 5.68684675530395e-05, 1.13717846008665e-05, 0.00134921515936237, 0.000187337928136078, 0.000285444846679139, 0.00361823860435218, 0.00217964713145458, 0.000339468592148284, 0.000447663166460721, 0.0824474815786376, 2.56862342034037e-05, 0.86261592873236, 0.079387290335966, 4.86004568296394e-05, 0.0369644998122574, 0.0161201309865515, NA
#> Samples where duplicates have a Jost D exceeding 0.05
#> NA, NA, X775.2, X782.1, NA, X785.1, X785.3, X827.2, X827.3, NA, NA, X860.3, X861.2, X861.3, X864.2, X864.3, NA
head(SMMS2_repcheck[, 3:11])
#> X721.2 X735.2 X766.2 X766.3 X775.1 X775.2 X775.3 X776.2 X782.1
#> 211 0.0 0 0 0.0 98.5 237.0 0.0 60.5 63.0
#> 215 7730.5 1882 4752 4480.0 5795.0 24151.5 10705.5 14112.0 8759.0
#> 219 0.0 0 212 171.5 0.0 90.0 55.0 279.0 0.0
#> 223 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
#> 231 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
#> 235 3201.0 865 1415 1632.0 2847.5 3678.0 5074.5 4562.5 4233.5
PCDM
or the Post-Consolidation Manipulation function
prepares the data for the Genetic Analysis Section by:
Matching eggcount information for each sample.
Calculating allelic frequencies.
Adding the marker name to separate the data frames once combined.
SMMS2_PCM<-PCDM(SMMS2_repcheck,eggcount,'SMMS2')
head(SMMS2_PCM[,1:6])
#> Locus_allele CWRU.a CWRU.b X721.2 X735.2 X766.2
#> 1 SMMS2 <NA> <NA> <NA> <NA> <NA>
#> 2 n <NA> <NA> <NA> <NA> <NA>
#> 3 211 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> 4 215 0.3325501 0.3085097 0.7071765 0.6851110 0.7449443
#> 5 219 0.000000000 0.000000000 0.000000000 0.000000000 0.033234049
#> 6 223 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
If you have multiple markers, you can combine them into a single data
frame using functions like rbind.fill()
. This creates a
consolidated structure with one column per sample and replaces any empty
cells with NA. The processed data frame can be exported as a .txt file,
allowing for efficient reuse in future analyses without repeating these
steps.
Welcome to the final stage of the pipeline: genetic analysis! This section provides a high-level overview of the key steps involved in analyzing the processed data for population genetics. It introduces essential methods like calculating genetic distance, visualizing population structure, and creating phylogenetic trees.
Note: This is a general guide intended to demonstrate the pipeline’s capabilities, not a comprehensive or in-depth example. For detailed use cases or advanced analyses, you may need to adjust the parameters and explore additional functions in the package.
The LoadData
function modifies and saves the data frame
from the previous step as the gends object with an added column that
indexes the Locus number.
gends <- LoadData("./combined3.txt")
head(gends[1:8])
#> Locus Locus_allele J19 J82 J83 J95 J96 J143
#> 1 1 SMMS2 NA NA NA NA NA NA
#> 2 1 n 9376.000 13824.000 45144.000 42081.000 10728.000 17657.000
#> 3 1 212 0.002 0.006 0.005 0.006 0.005 0.005
#> 4 1 215.9 0.697 0.773 0.743 0.748 0.673 0.812
#> 5 1 219.8 0.000 0.000 0.000 0.000 0.000 0.000
#> 6 1 235.2 0.301 0.221 0.252 0.246 0.322 0.183
Next, we calculate the gene identity and genetic distances between samples. This step is fundamental to all downstream genetic analyses, as they are the basis for differentiation indices, clustering, phylogenetic trees, and other population genetic metrics. This involves:
TypedLoci
).GeneIdentityMatrix
).J <- GeneIdentityMatrix(gends,N)
head(J[,1:5])
#> J19 J82 J83 J95 J96
#> J19 0.3571634 0.3314657 0.3350303 0.3280159 0.3494865
#> J82 0.3314657 0.3376424 0.3383771 0.3226482 0.3341265
#> J83 0.3350303 0.3383771 0.3464639 0.3286256 0.3353571
#> J95 0.3280159 0.3226482 0.3286256 0.3276570 0.3189977
#> J96 0.3494865 0.3341265 0.3353571 0.3189977 0.3630649
#> J143 0.3531926 0.3378429 0.3381669 0.3286655 0.3565482
GeneticDistanceMatrix
).D <- GeneticDistanceMatrix(J)
head(D[,1:5])
#> J19 J82 J83 J95 J96
#> J19 0.00000000 0.015937125 0.016783286 0.014394321 0.01062764
#> J82 0.01593713 0.000000000 0.003676054 0.010001446 0.01622718
#> J83 0.01678329 0.003676054 0.000000000 0.008434786 0.01940727
#> J95 0.01439432 0.010001446 0.008434786 0.000000000 0.02636320
#> J96 0.01062764 0.016227179 0.019407268 0.026363196 0.00000000
#> J143 0.02708070 0.032669929 0.036756625 0.036854589 0.02667589
We can use the gene identity matrix to calculate Nei’s GST and Jost’s D.
print(head(GST(J)[,1:5]))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000000 0.012063240 0.012780877 0.010826266 0.008235933
#> [2,] 0.012063240 0.000000000 0.002785797 0.007437667 0.012335183
#> [3,] 0.012780877 0.002785797 0.000000000 0.006321440 0.014816082
#> [4,] 0.010826266 0.007437667 0.006321440 0.000000000 0.019738231
#> [5,] 0.008235933 0.012335183 0.014816082 0.019738231 0.000000000
#> [6,] 0.021381735 0.025293286 0.028561934 0.028223453 0.021167517
print(head(JostD(J)[,1:5]))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000000 0.04587505 0.04770505 0.04203824 0.02951187
#> [2,] 0.04587505 0.00000000 0.01074703 0.03006600 0.04631657
#> [3,] 0.04770505 0.01074703 0.00000000 0.02502455 0.05470467
#> [4,] 0.04203824 0.03006600 0.02502455 0.00000000 0.07633520
#> [5,] 0.02951187 0.04631657 0.05470467 0.07633520 0.00000000
#> [6,] 0.07121377 0.08817489 0.09803765 0.10082780 0.06960913
We can use the genetic distance matrix to visualize the “spread” of our population in space. This can be done using a PCA plot. It accepts the distance matrix, which PCs we want to include on the graph and how we want to differentiate the points.
This pipeline provides powerful tools for exploring population genetics, offering flexibility to adapt to various datasets and research questions. While this section highlights the main features of the pipeline, further customization may be required for specific analyses. The combination of reproducibility, offline capability, and user control makes this pipeline a valuable resource for genetic studies.