Contents

1 Installation

EnrichDO can be installed from Bioconductor:


if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("EnrichDO")

or github page


if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

library(devtools)
devtools::install_github("liangcheng-hrbmu/EnrichDO")

2 Introduction

Disease Ontology (DO) enrichment analysis is an effective means to discover the associations between genes and diseases. However, most current DO-based enrichment methods were unable to solve the over enriched problem caused by the “true-path” rule. To address this problem, we presents EnrichDO, a double weighted iterative model, which is based on the latest annotations of the human genome with DO terms and integrates the DO graph topology on a global scale. On one hand, to reinforce the saliency of direct gene-DO annotations, different initial weights are assigned to directly annotated genes and indirectly annotated genes, respectively. On the other hand, to detect locally most significant node between the parent and its children, less significant nodes are dynamically down-weighted. EnrichDO exhibits high accuracy that it can identify more specific DO terms, which alleviates the over enriched problem.

EnrichDO encompasses various statistical models and visualization schemes for discovering the associations between genes and diseases from biological big data. Currently uploaded to Bioconductor, EnrichDO aims to provide a more convenient and effective DO enrichment analysis tool.


library(EnrichDO)
#> 

3 Disease Ontology Enrichment Analysis

EnrichDO presents a double weighted iterative model for DO enrichment analysis. Based on the latest annotations of the human genome with DO terms, EnrichDO can identify locally significant enriched terms by applying different initial weights and dynamic weights for annotated genes and integrating the DO graph topology on a global scale. EnrichDO presents an effective and flexible model, which supplies various statistical testing models and multiple testing correction methods.

3.1 Data Preparation

The input data for EnrichDO is a predefined gene set, such as differentially expressed gene sets, significant genes from GWAS, gene sets from high-throughput biological data, etc. The interest gene sets should be protein-coding genes, in the ENTREZID format from NCBI. Taking the input_example as an example, it stores validated protein-coding genes associated with Alzheimer’s disease from the DisGeNET database.


Alzheimer<-read.delim(file.path(system.file("extdata", package="EnrichDO"), 
                                "Alzheimer_curated.csv"), header = FALSE)
input_example<-Alzheimer[,1]

In order to improve the speed of the case, in the following example, several genes (demo.data) are randomly selected from the protein-coding genes (NCBI Entrez ID) for analysis.


demo.data<-c(1636,351,102,2932,3077,348,4137,54209,5663,5328,23621,3416,3553)

3.2 doEnrich Function

EnrichDO provides disease ontology enrichment analysis through the doEnrich function. Depending on the setting of the traditional parameter, you can choose between weighted enrichment analysis or classic enrichment analysis. The following is a simple running example using weighted enrichment analysis by default:


demo_result<-doEnrich(interestGenes=demo.data)

3.2.1 Weighted Enrichment Function

In addition to performing weighted enrichment analysis with default parameters, the following parameters can also be modified to better meet the analysis requirements. The specific meanings of these parameters can be accessed by using the ?doEnrich method.


weighted_demo<-doEnrich(interestGenes=demo.data,
                        test="fisherTest",
                        method="holm",
                        m=1,
                        minGsize=10,
                        maxGsize=2000,
                        delta=0.05,
                        penalize=TRUE)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    12 nodes    2278 genes to be scored
#> LEVEL: 9 50 nodes    5376 genes to be scored
#> LEVEL: 8 116 nodes   7751 genes to be scored
#> LEVEL: 7 181 nodes   9463 genes to be scored
#> LEVEL: 6 193 nodes   10144 genes to be scored
#> LEVEL: 5 174 nodes   9756 genes to be scored
#> LEVEL: 4 85 nodes    9088 genes to be scored
#> LEVEL: 3 18 nodes    4599 genes to be scored
#> LEVEL: 2 1 nodes 1605 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored

3.2.2 Classic Enrichment Function

The doEnrich function can control the parameter traditional to perform traditional overexpression enrichment analysis (ORA), which means that it will not down-weight based on the topological structure of the disease ontology. Additionally, Parameters test, method, m, maxGsize and minGsize can be used flexibly.


Tradition_demo<-doEnrich(demo.data, traditional=TRUE)
#>       -- Traditional test--

3.3 Result description and Written

3.3.1 Result description

Running doEnrich will output the terms and total genes involved in each layer of Directed acyclic graph (DAG) to the user. The enrichment results can be summarized using the show function.


show(demo_result)
#> ------------------------- EnrichResult object -------------------------
#> Method of enrichment:
#>   Global Weighted Model
#>   'hypergeomTest' Statistical model with the 'BH' Multiple hypothesis correction
#> Enrichment cutoff layer: 1
#> interestGenes number: 13
#> 231 DOTerms scored: 231 terms with p < 0.01
#> Parameter setting:
#>   Enrichment cutoff layer: 1
#>   Doterm gene number limit: minGsize 5, maxGsize 5000
#>   Enrichment threshold: 0.01

The result of doEnrich is demo_result which contains enrich, interestGenes, test, method, m, maxGsize, minGsize, delta, traditional, penalize. There are 16 columns of enrich, including:

  • The standard ID corresponding to the DO Term (DOID).

  • the standard name of the DO Term (DOTerm), each DO Term has a unique DOID.

  • We constructed a directed acyclic graph according to the is_a relationship between each node in the DO database, and each DO Term has a corresponding level (level).

  • The DO database stores the parent node of each DO Term (parent.arr) and its number (parent.len). For example, “B-cell acute lymphoblastic leukemia” (DOID:0080638) is_a “acute lymphoblastic leukemia” (DOID:9952) and “lymphoma” (DOID:0060058), then the node “B-cell acute lymphoblastic leukemia” is a child of “acute lymphoblastic leukemia” and “lymphoma”, and the child is a more specific biological classification than its parent.

  • child nodes of the DO Term (child.arr) and their number (child.len).

  • the latest GeneRIF information is used to annotate DO Terms, each DO Term has its corresponding disease-associated genes (gene.arr), and its number (gene.len).

  • Assigning a weight to each gene helps assess the contribution of different genes to DO Terms (weight.arr).

  • The smaller the weights of indirectly annotated genes, the less contribution of these genes in the enrichment analysis.(gene.w).

  • the P-value of the DO Term (p), which arranges the order of enrich, and the value of P-value correction (p.adjust).

  • the genes of interest annotated to this DO Term (cg.arr) and its number (cg.len).

  • the number of genes in the interest gene set (ig.len), this represents the number of genes that are actually used for enrichment analysis.

Generally, a significant P value of the enrichment results should be less than 0.05 or 0.01, indicating a significant association between the interesting gene set and the disease node. In the enrich, the node with the most significant enrichment is DOID:0080832, and the DO Term is “mild cognitive impairment”, with its P-value being 9.22e-16. These results suggests that there is statistical significance between the interesting gene set and the DO Term of mild cognitive impairment.

3.3.2 Result writing

The writeResult function can output DOID, DOTerm, p, p.adjust, geneRatio, bgRatio and cg in the data frame enrich as text. The default file name is “result.txt”.


writeResult(EnrichResult = demo_result,file=file.path(tempdir(), "result.txt"), Q=1, P=1)

writeResult has four parameters. EnrichResult indicates the enrichment result of doEnrich, file indicates the write address of a file. The parameter Q (and P) indicates that a DO term is output only when p.adjust (and p value) is less than or equal to Q (and P). The default values for P and Q are 1.

In the output file result.txt, geneRatio represents the intersection of the DO term annotated genes with the interest gene set divided by the interest gene set, and bgRatio represents all genes of the DO term divided by the background gene set.

4 Visualization of enrichment results

EnrichDO provides four methods to visualize enrichment results, including bar plot (drawBarGraph), bubble plot (drawPointGraph), tree plot (drawGraphviz) and heatmap (drawHeatmap), which can show the research results more concisely and intuitively. Pay attention to the threshold setting for each visual method, if the threshold is too low, the display is insufficient.

4.1 drawBarGraph function

drawBarGraph can draw the top n nodes with the most significant p-value as bar chart, and the node’s p-value is less than delta (By default, n is 10 and delta is 1e-15).


drawBarGraph(EnrichResult=demo_result, n=10, delta=0.05)
bar plot

Figure 1: bar plot

4.2 drawPointGraph function

drawPointGraph can draw the top n nodes with the most significant p-value as bubble plot, and the node’s p-value is less than delta (By default, n is 10 and delta is 1e-15).


drawPointGraph(EnrichResult=demo_result, n=10, delta=0.05)
point plot

Figure 2: point plot

4.3 drawGraphViz function

drawGraphViz draws the DAG structure of the most significant n nodes, and labelfontsize can set the font size of labels in nodes (By default, n is 10 and labelfontsize is 14). Each node has a corresponding disease name displayed.

In addition, the drawGraphViz function can also display the P-value of each node in the enrichment analysis (pview=TRUE), and the number of overlapping genes of each DO term and interest gene set (numview=TRUE).


drawGraphViz(EnrichResult=demo_result, n=10, numview=FALSE, pview=FALSE, labelfontsize=17)
tree plot

Figure 3: tree plot

4.4 drawHeatmap function

drawHeatmap function visualizes the strength of the relationship between the top DOID_n nodes from enrichment results and the genes whose weight sum ranks the top gene_n in these nodes. And the gene must be included in the gene of interest. readable indicates whether the gene is displayed as its symbol.

drawHeatmap also provides additional parameters from the pheatmap function, which you can set according to your needs. Default DOID_n is 10, gene_n is 50, fontsize_row is 10, readable is TRUE.


drawHeatmap(interestGenes=demo.data,
            EnrichResult=demo_result,
            gene_n=10,
            fontsize_row=8,
            readable=TRUE)
#> gene symbol conversion result:
#> 
#> 'select()' returned 1:1 mapping between keys and columns
heatmap

Figure 4: heatmap

4.5 convenient drawing

Draw(drawBarGraph ,drawPointGraph ,drawGraphViz) from writeResult output files, so you don’t have to wait for the algorithm to run.

#Firstly, read the wrireResult output file,using the following two lines
data <- read.delim(file.path(system.file("examples", package="EnrichDO"), "result.txt"))
enrich <- convDraw(resultDO=data)
#> The enrichment results you provide are stored in enrich
#> Now you can use the drawing function

#then, Use the drawing function you need
drawGraphViz(enrich=enrich)    #Tree diagram


drawPointGraph(enrich=enrich, delta = 0.05)  #Bubble diagram


drawBarGraph(enrich=enrich, delta = 0.05)    #Bar plot

5 Session information


sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] EnrichDO_1.1.1   BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3               gson_0.1.0              rlang_1.1.4            
#>   [4] magrittr_2.0.3          DOSE_4.1.0              compiler_4.5.0         
#>   [7] RSQLite_2.3.8           png_0.1-8               vctrs_0.6.5            
#>  [10] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
#>  [13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.5           
#>  [16] XVector_0.47.0          labeling_0.4.3          utf8_1.2.4             
#>  [19] rmarkdown_2.29          enrichplot_1.27.1       graph_1.85.0           
#>  [22] UCSC.utils_1.3.0        tinytex_0.54            purrr_1.0.2            
#>  [25] bit_4.5.0               xfun_0.49               zlibbioc_1.53.0        
#>  [28] cachem_1.1.0            aplot_0.2.3             GenomeInfoDb_1.43.2    
#>  [31] jsonlite_1.8.9          blob_1.2.4              BiocParallel_1.41.0    
#>  [34] parallel_4.5.0          R6_2.5.1                bslib_0.8.0            
#>  [37] stringi_1.8.4           RColorBrewer_1.1-3      jquerylib_0.1.4        
#>  [40] GOSemSim_2.33.0         Rcpp_1.0.13-1           bookdown_0.41          
#>  [43] knitr_1.49              ggtangle_0.0.5          R.utils_2.12.3         
#>  [46] IRanges_2.41.1          igraph_2.1.1            Matrix_1.7-1           
#>  [49] splines_4.5.0           tidyselect_1.2.1        qvalue_2.39.0          
#>  [52] yaml_2.3.10             codetools_0.2-20        lattice_0.22-6         
#>  [55] tibble_3.2.1            plyr_1.8.9              withr_3.0.2            
#>  [58] treeio_1.31.0           Biobase_2.67.0          KEGGREST_1.47.0        
#>  [61] evaluate_1.0.1          gridGraphics_0.5-1      Biostrings_2.75.1      
#>  [64] ggtree_3.15.0           pillar_1.9.0            BiocManager_1.30.25    
#>  [67] stats4_4.5.0            clusterProfiler_4.15.1  ggfun_0.1.7            
#>  [70] generics_0.1.3          S4Vectors_0.45.2        ggplot2_3.5.1          
#>  [73] tidytree_0.4.6          munsell_0.5.1           scales_1.3.0           
#>  [76] glue_1.8.0              lazyeval_0.2.2          pheatmap_1.0.12        
#>  [79] tools_4.5.0             data.table_1.16.2       fgsea_1.33.0           
#>  [82] fs_1.6.5                fastmatch_1.1-4         cowplot_1.1.3          
#>  [85] grid_4.5.0              tidyr_1.3.1             ape_5.8                
#>  [88] AnnotationDbi_1.69.0    colorspace_2.1-1        nlme_3.1-166           
#>  [91] GenomeInfoDbData_1.2.13 patchwork_1.3.0         cli_3.6.3              
#>  [94] fansi_1.0.6             dplyr_1.1.4             Rgraphviz_2.51.0       
#>  [97] gtable_0.3.6            R.methodsS3_1.8.2       yulab.utils_0.1.8      
#> [100] hash_2.2.6.3            sass_0.4.9              digest_0.6.37          
#> [103] BiocGenerics_0.53.3     ggrepel_0.9.6           ggplotify_0.1.2        
#> [106] org.Hs.eg.db_3.20.0     farver_2.1.2            memoise_2.0.1          
#> [109] htmltools_0.5.8.1       R.oo_1.27.0             lifecycle_1.0.4        
#> [112] httr_1.4.7              GO.db_3.20.0            bit64_4.5.2