Genetic simulations are an important part of molecular biology. They are
useful for assessing the efficiency and the sensitivity of models of
evolution. Despite their relevance, hardly any simulator dedicated to
sequence generation for natural selection inference analyses exist on the
Bioconductor platform. In the broader molecular evolution genre, existing
genetic simulators are yet to fully appreciate the correspondence between
the population genomic and the phylogenetic literature. scoup
is
designed as a contribution toward filling these voids. With scoup
, it
is possible to explore the implications of the interplay between mutation
and genetic drift on the phenomenological inferences of natural selection
obtained from phylogenetic models. The ratio of the variance of
non-synonymous to synonymous selection coefficient (vN/vS) is also
presented as a reliable selection discriminant metric. Example code of
how to use the package are presented with elaborate comments.
scoup 1.1.4
Statistical inference of the extent at which Darwinian natural selection had
impacted on genetic data from multiple populations commands a healthy portion
of the phylogenetic literature (Jacques et al. 2023). Validation of these
codon-based models relies heavily on simulated data. A search of the entries
on the Bioconductor (Gentleman et al. 2004) platform, on 29 July 2024, with keywords
codon
, mutation
, selection
, simulate
and simulation
returned a total
of 72 unique (out of the 2300 available Software) packages. None of the
retrieved entries was dedicated to codon data simulation for natural selection
analyses. Given the ever increasing diversity natural selection inference
models that exist, there is indeed need for applicable packages on the
platform.
Population genomic studies provided the mathematical foundation upon which
phylogenetics thrived (Wright 1931; Fisher 1922; Hardy 1908; Weinberg 1908; Darwin 1859). The thirst to bridge the gap between these two genres of
evolutionary biology continue to drive the invention of more complex models
of evolution (Aris-Brosou and Rodrigue 2012). Consequently, there is need to develop codon
sequence simulators to match the growth. scoup
is designed on the
basis of the mutation-selection (MutSel) framework (Halpern and Bruno 1998) as a
contribution to this quest. Only a couple of existing selection-focused
simulators in the literature used the MutSel framework (Spielman and Wilke 2015; Arenas and Posada 2014). This is most probably due to the perceived complexity of the
methodology. In scoup
, the versatility of the Uhlenbeck and Ornstein (1930)
algorithm as a framework for evolutionary analyses (Bartoszek et al. 2017) was
used to circumvent the complexity.
In a bid to identify an appropriate quantifier that permits direct comparison
between the degree of selection signatures imposed during simulation and that
inferred, the ratio of the variance of the non-synonymous to synonymous
selection coefficients (vN/vS) was discovered to be appropriate. The vN/vS
statistic is consequently posited as a quality selection discriminant metric.
scoup
therefore represents an important contribution to the
phylogenetic modelling literature. Example code of how to successfully use
the package is presented below.
Use the following code to install scoup
from the Bioconductor platform.
if(!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("scoup")
Three primary evolutionary algorithms are available in scoup
. These
include the frequency-dependent (Jones et al. 2017; Ayala and Campbell 1974), the
Ornstein-Uhlenbeck (OU) and the episodic algorithms. Example of R
(R Core Team 2024) code where these functions are utilised are presented.
The homogeneous (Muse and Gaut 1994; Goldman and Yang 1994) and heterogeneous (site-wise
and branch-wise) (Nielsen and Yang 1998; Kosakovsky Pond and Frost 2005) selection inference modelling
contexts are explored. Data quality is assessed by comparing the
maximum likelihood inferred (\(\omega\)) and the analytically calculated
(\(\mathrm{d}N/\mathrm{d}S\)) estimates to the magnitude of the imposed
selection pressure (measured by vN/vS). Template code used to analyse
the output data (to obtain \(\omega\)) with PAML
(Álvarez-Carretero, Kapli, and Yang 2023; Yang 2007)
and FUBAR
(Murrell et al. 2013; Kosakovsky Pond, Frost, and Muse 2005) are presented in the Appendix. The
R
code presented subsequently, require that the user should have already
installed the scoup
package.
# Make package accessible in R session
library(scoup)
## Loading required package: Matrix
# Number of extant taxa
leaves <- 8
# Number of codon sites
sSize <- 15
# Number of data replications for each parameter combination
sims <- 1
# OU reversion parameter (Theta) value
eThta <- c(0.01)
# OU asymptotic variance value
eVary <- c(0.0001)
# OU landscape shift parameters
hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01))
# Sequence alignment size information
seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves))
# Iterate over all listed OU variance values
for(g in seq(1,length(eVary))){
# Iterate over all listed OU reversion parameter values
for(h in seq(1,length(eThta))){
# Create appropriate simulation function ("ou") object
adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h]))
# Iterate over the specified number of replicates
for(i in seq(1,sims)){
# Execute simulation
simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL)
}
}
}
# Print simulated alignment
seqCOL(simData)
## DNAStringSet object of length 8:
## width seq names
## [1] 45 AAGCGTTTCGACTATAGCGGTAACCACCTCTACGAGGCCGGATGG S001
## [2] 45 AAGCGTTTCGACTATAGCGGTAACCACCTCTACGAGGCCGGATGG S002
## [3] 45 AAGCGTTTCGACTATAGCGGTAACCTCCTCTACGAGGCCGGATGG S003
## [4] 45 AAGCGTTTCGACTATAGCGGTAACCTCCTCTACGAGGCCGAATGG S004
## [5] 45 AAGCGTTTCAGCTATAGCGGTAACCTCCACTACGAGGCCGAATGG S005
## [6] 45 AAGCGTTTCAGCTATAGCGGTAACCTCCACTACGAGGCCGAATGG S006
## [7] 45 AAGCGTTTCAGCTACAGCGGTAACCTCCACTACGAGGCCGAATGG S007
## [8] 45 AAGCGTTTCAGCTACAGCGGTAACCTCCACTACGAGGCCGAATGG S008
As expected, the correlation coefficient estimate was approximately \(0.9974\) when the means of the inferred (\(\omega\)) and the calculated (\(\mathrm{d}N/\mathrm{d}S\)) selection effects were first compared. The correlation estimation included more iterations.
# Make package accessible in R session
library(scoup)
# Number of extant taxa
xtant <- 8
# Number of codon sites
siteSize <- 15
# Number of data replications for each parameter combination
simSize <- 1
# Variance of the non-synonymous selection coefficients
nsynVary <- c(0.001)
# Ratio of the variance of the non-synonymous to synonymous coeff.
vNvSvec <- c(1)
# Sequence alignment size information
seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant))
# Iterate over all listed coefficient variance ratios
for(a in seq(1,length(vNvSvec))){
# Iterate over all listed non-synonymous coefficients variance
for(b in seq(1,length(nsynVary))){
# Create appropriate simulation function ("omega") object
adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b]))
# Iterate over the specified number of replicates
for(i in seq(1,simSize)){
# Execute simulation
simulateSeq <- alignsim(adaptData, seqStat, NA)
}
}
}
# Print simulated alignment
cseq(simulateSeq)
## Sequence
## S001 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S002 ATCCAAACAGCCGACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S003 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S004 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S005 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S006 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S007 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
## S008 ATCCAAACAGCCTACAAACGATCTAGAGTGAGCCGCCCCGATCTG
Sequences generated with the presented code (with more iterations included) produced strongly correlated selection inferences (correlation coefficient \(\approx 0.9923\)) when the average \(\mathrm{d}N/\mathrm{d}S\) and the \(\omega\) values were compared. This implementation is an example of how to execute the frequency-dependent evolutionary technique with the package.
# Make package accessible in R session
library(scoup)
# Number of codon sites
sitesize<- 15
# Variance of non-synonymous selection coefficients
nsynVary <- 0.01
# Number of extant taxa
## Commented value was used for results presented in article
taxasize <- 8
# Sequence alignment size information
seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize))
# Create the applicable ("ou") object for simulation function
## eVar= OU asymptotic variance, Theta=OU reversion parameter
adaptEntry <- ouInput(c(eVar=0.1,Theta=1))
# Ratio of the variance of the non-synonymous to synonymous coeff.
sratio <- c(1e-06)
# Iterate over all listed coefficient variance ratios
for(a0 in seq(1,length(sratio))){
# OU landscape shift parameters
mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary))
# Execute simulation
simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA)
}
# Print simulated codon sequence
cseq(simSeq)
## Sequence
## S001 ACCGACGTTGTGGGGCAATCGCGTCGGGCGTCGGACTTTCTAGAA
## S002 ACCGACGTTATGGAGGAATCGCGCCTGTCGCCTGAGTTTCTAGAA
## S003 AGCTATGGCGTGGGGAACTCGCAATTCAGGCTCGACTTTTTAGGG
## S004 ATCTATGCCGTGGCGGAATCGCATTTGTCGCCCGACTTAATAGAG
## S005 AACTGTACTTTGGGTCAATTGTCCCGGCAGCTCCAGTTTCATTGC
## S006 AGCTGTGTCCTGGATCAATTGCACCGTCAGCTCCAGTTCAACGAC
## S007 AACTATATGATTGGTCAATTGCGCCGGCAACCCCAATTTCACGGC
## S008 ATCTATATCATGGATCAATTGTACCGGCAACCCCAATTTCAAGGC
This is another example of how to call the OU shifting landscape evolutionary approach. The results obtained yielded a pairwise correlation coefficient estimate of approximately \(0.9988\) between the means of \(\mathrm{d}N/\mathrm{d}S\) and \(\omega\). The correlation coefficient estimates were approximately \(0.8123\) and \(0.8305\) when the averages were each compared to vN/vS, respectively.
# Make package accessible in R session
library(scoup)
# Number of internal nodes on the desired balanced tree
iNode <- 3
# Number of required codon sites
siteCount <- 15
# Variance of non-synonymous selection coefficients
nsnV <- 0.01
# Number of data replications for each parameter combination
nsim <- 1
# Ratio of the variance of the non-synonymous to synonymous coeff.
vNvSvec <- c(1e-06)
# Sequence alignment size information
seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10))
# Iterate over all listed coefficient variance ratios
for(h in seq(1,length(vNvSvec))){
# Iterate over the specified number of replicates
for(i in seq(1,nsim)){
# Create the parameter set applicable at each internal tree node
scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]),
nsynVar=rep(nsnV,iNode))
# Create the applicable ("discrete") object for simulation function
adaptBranch <- discreteInput(list(p02xnodes=scInput))
# Execute simulation
genSeq <- alignsim(adaptBranch, seqsBwise, NULL)
}
}
# Print simulated sequence data
seqCOL(genSeq)
## DNAStringSet object of length 8:
## width seq names
## [1] 45 GCCCTAGCCCACTGTAGGCCCAAAAACGGATCTCTATGGACTTGT S001
## [2] 45 GGCCTAATCTACTGTAGGTCTAAGGACGGATTTCTGGAGACTCGT S002
## [3] 45 GGCTTGACCTATTATCGACTTAAGATAAGATTTCTATGGGCGCGT S003
## [4] 45 GGCTTGGCCGATTGTAGGCCTAAGAACGGATTTCTACGGACGGGT S004
## [5] 45 GTCTTAGCGTACCGTAGGCCTAAGAATGGATTCCTAGGAGTCCGT S005
## [6] 45 ACCTCAGCGCACCGCAGGCCTAAGGATCGATTGCTGGAGACTCGC S006
## [7] 45 GGCTTGGCGTACTGTGGGTTTAAAAATAGGTTCCTATGGACTCGT S007
## [8] 45 GGCCCGGTGTACTGCAGCCCAAAATATGGATACCTAGGGACCCGG S008
The correlation coefficient between the averages of the analytical \(\mathrm{d}N/\mathrm{d}S\) and the inferred \(\omega\) estimates was approximately \(0.9998\), obtained from 50 independent iterations of the code for a set of vN/vS values. The correlation coefficient estimate was approximately \(0.6349\) for vN/vS vs \(\omega\) and \(0.6360\) for vN/vS vs \(\mathrm{d}N/\mathrm{d}S\).
Reference scoup
code are presented to facilitate use of the package.
Although not explicitly presented, it is also possible to generate data
with signatures of directional selection by setting the aaPlus
element
of the wInput
entry of the alignsim
function accordingly. The
capacity of the package is expected to be extended in future versions.
A more appropriate citation will be provided for the package after it has been accepted for publication. In the meantime, to cite this package, use Sadiq, H. 2024. “scoup: Simulate Codon Sequences with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process”. R Package. doi:10.18129/B9.bioc.scoup.
CODEML
script executed in PAML
to infer
single alignment-wide \(\omega\) estimates for data sets generated from 50
independent executions of the sensitivity analyses code presented
above. The same CODEML
script was used to analyse data (also 50
replicates) from the episodic analyses code, with the model
entry
replaced with 2
. The scoup
simulated sequence data and
tree are seq.nex
and seq.tre
, respectively.
seqfile = seq.nex
treefile = seq.tre
outfile = seq.out
noisy = 0
verbose = 0
seqtype = 1
ndata = 1
icode = 0
cleandata = 0
model = 0
NSsites = 0
CodonFreq = 3
estFreq = 0
clock = 0
fix_omega = 0
omega = 1e-05
FUBAR
command executed with HyPhy
through the terminal in MacBook. Note that HyPhy
was already installed on the computer. The seq.nex
input is the
scoup
simulated codon sequence data that is saved in the same
NEXUS file with the tree data. The NEXUS file resides in the
working directory.
hyphy fubar --code Universal --alignment seq.nex --tree seq.nex
The output of sessionInfo()
from the computer where this file is generated
is provided below.
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scoup_1.1.4 Matrix_1.7-1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 httr_1.4.7 cli_3.6.3
## [4] knitr_1.49 rlang_1.1.4 xfun_0.50
## [7] UCSC.utils_1.3.0 generics_0.1.3 jsonlite_1.8.9
## [10] S4Vectors_0.45.2 Biostrings_2.75.3 htmltools_0.5.8.1
## [13] stats4_4.5.0 sass_0.4.9 rmarkdown_2.29
## [16] grid_4.5.0 evaluate_1.0.1 jquerylib_0.1.4
## [19] fastmap_1.2.0 GenomeInfoDb_1.43.2 IRanges_2.41.2
## [22] yaml_2.3.10 lifecycle_1.0.4 bookdown_0.42
## [25] BiocManager_1.30.25 compiler_4.5.0 XVector_0.47.2
## [28] lattice_0.22-6 digest_0.6.37 R6_2.5.1
## [31] GenomeInfoDbData_1.2.13 bslib_0.8.0 tools_4.5.0
## [34] BiocGenerics_0.53.3 cachem_1.1.0