The purpose of this package is to compare the gene expression of the genes in two different conditions. The most typical case is when comparing gene expression in patients with a disease with gene expression in study participants without the disease. Hence, we may construct a network containing genes which are relevant for the development of the disease. The input data may come from different measurements of expression such as microarray, proteomics or RNA-seq as long as:
For differential gene-expression involving more than two separate conditions,
consider CoDiNA
(Morselli Gysi 2020) instead.
This package is hosted on Bioconductor. To install it, type:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("csdR")
Then,
library(csdR)
#>
should load the package into the current R session. For this vignette, we further load some auxiliary packages and set the random seed
suppressPackageStartupMessages({
library(magrittr)
library(igraph)
library(glue)
library(dplyr)
})
set.seed(45394534)
This is a re-implementation and slight modification of the CSD algorithm presented by Voigt et al. 2017(Voigt 2017). In the first phase, the algorithm finds the co-expression between all genes within each condition using the Spearman correlation. For each pair of genes, we apply bootstrapping across the samples and compute the mean Spearman correlation \(\rho_1\) and \(\rho_2\) for the two conditions and the associated standard errors \(\sigma_1\) and \(\sigma_2\). In the second stage, the values for the two conditions are compared and gives us the following differential co-expression scores:
In this example, we are provided by two expression expression matrices
from thyroid glands,
sick_expression
for patients with thyroid cancer
and normal_expression
for healthy controls.
To run the CSD analysis for these two conditions, we simply do the following:
data("sick_expression")
data("normal_expression")
csd_results <- run_csd(
x_1 = sick_expression, x_2 = normal_expression,
n_it = 10, nThreads = 2L, verbose = FALSE
)
After obtaining these results, we may write them to disk. However, for datasets with thousands of genes, we will get millions upon millions of gene pairs. Writing the results to disk is likely to fill up gigabytes of valuable storage space while the disk IO itself might take a considerable amount of time. Furthermore, we must reduce the information load to create meaningful results, so we better to that while the data is still in memory. We decide to select the 100 edges with highest C, S, and D-score.
pairs_to_pick <- 100L
c_filter <- partial_argsort(csd_results$cVal, pairs_to_pick)
c_frame <- csd_results[c_filter, ]
s_filter <- partial_argsort(csd_results$sVal, pairs_to_pick)
s_frame <- csd_results[s_filter, ]
d_filter <- partial_argsort(csd_results$dVal, pairs_to_pick)
d_frame <- csd_results[d_filter, ]
Why does the csdR
package provide a general partial_argsort
function
which takes in a numeric vector
and spits out the indecies of the largest elements
instead of a more specialized function
directly extracting the top results from the dataframe?
The answer is flexibility.
Writing an additional line of code and a dollar sign
is not that much work after all
and we may want more flexible approaches such
as displaying the union of the C, S- and D-edges:
csd_filter <- c_filter %>%
union(s_filter) %>%
union(d_filter)
csd_frame <- csd_results[csd_filter, ]
The next logical step is to construct a network and do some analysis.
This is outside the scope of this package,
but we will provide some pointers for completeness.
One viable approach is to use the ordinary write.table
function
to write the results of a file
and then use an external tools such as Cytoscape to further make conclusions.
Often, you may want to make an ontology enrichment of the genes.
The other option is of course to continue using R. Here, we provide an example of combining the C-, S- and D-networks and coloring the edges blue, green and red, respectively depending of where they come from.
c_network <- graph_from_data_frame(c_frame, directed = FALSE)
s_network <- graph_from_data_frame(s_frame, directed = FALSE)
d_network <- graph_from_data_frame(d_frame, directed = FALSE)
E(c_network)$edge_type <- "C"
E(s_network)$edge_type <- "S"
E(d_network)$edge_type <- "D"
combined_network <- igraph::union(c_network, s_network, d_network)
# Auxillary function for combining
# the attributes of the three networks in a proper way
join_attributes <- function(graph, attribute) {
ifelse(
test = is.na(edge_attr(graph, glue("{attribute}_1"))),
yes = ifelse(
test = is.na(edge_attr(graph, glue("{attribute}_2"))),
yes = edge_attr(graph, glue("{attribute}_3")),
no = edge_attr(graph, glue("{attribute}_2"))
),
no = edge_attr(graph, glue("{attribute}_1"))
)
}
E(combined_network)$edge_type <- join_attributes(combined_network, "edge_type")
layout <- layout_nicely(combined_network)
E(combined_network)$color <- recode(E(combined_network)$edge_type,
C = "darkblue", S = "green", D = "darkred"
)
plot(combined_network, layout = layout,
vertex.size = 3, edge.width = 2, vertex.label.cex = 0.001)
As with any bootstrap procedure the number of iterations represented by
the argument n_it
needs to be sufficiently large
in order to get reproducible results.
What this means is a matter of trial and error.
In general this means that you should re-run the computations
with a different random seed to see whether
the number of bootstrap iterations are sufficient.
Experience has shown that ~ 100 iterations might
be sufficient to reproduce almost the same results in some cases,
whereas in other cases,
especially when the values are close,
you may choose to run several thousand iterations.
The parallelization of csdR
occurs within each individual iteration. While parallelizing across the iterations would in theory yield better CPU utilization, this approach is unfeasible due the fact that memory consumption use would almost be proportional to the number of threads. Instead, parallelization is used in computing the co-expression matrix and updating the mean and variance for each iteration. As shown in the csdR
paper, parallelism can provide 2x-3x speedup, but there are diminishing returns in the performance gained, so running more than 10 threads is most likely a waste a resources. Limited memory bandwidth is the most likely culprit behind the failure to scale, so systems with higher memory bandwidth are more likely to utilize multiple threads better.
For datasets with 20 000 to 30 000 genes, a considerable amount of memory is consumed during the computations. It it therefore not recommended in such cases to run CSD on your laptop or even a workstation, but rather on a compute server with several hundreds GB of RAM.
How many gene pairs to select depends on the specific needs and how big a network you want to handle. A 10 000 edge network may not be easy to visualize, but quantitative network metrics can still be extracted. Also, generating more edges than necessary usually does not make any major harm as superfluous edges can quickly be filter out afterwards. However, if you select fewer edges than you actually need, you have to re-do all calculations to increase the number.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 glue_1.8.0 igraph_2.1.1 magrittr_2.0.3
#> [5] csdR_1.13.3 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 gridExtra_2.3 rlang_1.1.4
#> [4] matrixStats_1.4.1 compiler_4.5.0 RSQLite_2.3.7
#> [7] png_0.1-8 vctrs_0.6.5 stringr_1.5.1
#> [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
#> [13] magick_2.8.5 backports_1.5.0 XVector_0.47.0
#> [16] utf8_1.2.4 rmarkdown_2.29 UCSC.utils_1.3.0
#> [19] preprocessCore_1.69.0 tinytex_0.54 bit_4.5.0
#> [22] xfun_0.49 zlibbioc_1.53.0 cachem_1.1.0
#> [25] GenomeInfoDb_1.43.0 jsonlite_1.8.9 blob_1.2.4
#> [28] highr_0.11 parallel_4.5.0 cluster_2.1.6
#> [31] R6_2.5.1 bslib_0.8.0 stringi_1.8.4
#> [34] rpart_4.1.23 jquerylib_0.1.4 Rcpp_1.0.13-1
#> [37] bookdown_0.41 iterators_1.0.14 knitr_1.48
#> [40] WGCNA_1.73 base64enc_0.1-3 IRanges_2.41.0
#> [43] Matrix_1.7-1 splines_4.5.0 nnet_7.3-19
#> [46] tidyselect_1.2.1 rstudioapi_0.17.1 yaml_2.3.10
#> [49] doParallel_1.0.17 codetools_0.2-20 lattice_0.22-6
#> [52] tibble_3.2.1 Biobase_2.67.0 KEGGREST_1.47.0
#> [55] evaluate_1.0.1 foreign_0.8-87 survival_3.7-0
#> [58] Biostrings_2.75.1 pillar_1.9.0 BiocManager_1.30.25
#> [61] checkmate_2.3.2 foreach_1.5.2 stats4_4.5.0
#> [64] generics_0.1.3 S4Vectors_0.45.0 ggplot2_3.5.1
#> [67] munsell_0.5.1 scales_1.3.0 RhpcBLASctl_0.23-42
#> [70] Hmisc_5.2-0 tools_4.5.0 data.table_1.16.2
#> [73] fastcluster_1.2.6 grid_4.5.0 impute_1.81.0
#> [76] AnnotationDbi_1.69.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13
#> [79] htmlTable_2.4.3 Formula_1.2-5 cli_3.6.3
#> [82] fansi_1.0.6 gtable_0.3.6 dynamicTreeCut_1.63-1
#> [85] sass_0.4.9 digest_0.6.37 BiocGenerics_0.53.1
#> [88] htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.8.1
#> [91] lifecycle_1.0.4 httr_1.4.7 GO.db_3.20.0
#> [94] bit64_4.5.2