CytoMethIC Quick Start

5 November 2024

Installation

CytoMethIC is distributed through Bioconductor. The snippet below installs the BiocManager package if it is missing and then uses it to install CytoMethIC:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("CytoMethIC", quietly = TRUE)) {
  BiocManager::install("CytoMethIC")
}

Introduction

CytoMethIC provides model data and functions for applying machine learning models that use DNA methylome data to classify the cancer type and phenotype of a sample. The package was developed to abstract away the low-level, model-specific code otherwise needed to run machine learning models in R. It provides this abstraction for randomForest, e1071 support vector machine, extreme gradient boosting, and TensorFlow models. The package is paired with an ExperimentHub component that hosts our lab’s models for epigenetic cancer classification and phenotype prediction, including CNS tumor classification, pan-cancer classification, race prediction, cell-of-origin classification, and subtype classification.

library(CytoMethIC)
library(ExperimentHub)
library(sesame)
sesameDataCache()

Data from ExperimentHub

For these examples, we’ll be using models from ExperimentHub and a sample from sesameData.

Table: CytoMethIC supported models

ModelID                 Prediction label description
rfc_cancertype_TCGA33   TCGA cancer types (N=33)
svm_cancertype_TCGA33   TCGA cancer types (N=33)
xgb_cancertype_TCGA33   TCGA cancer types (N=33)
mlp_cancertype_TCGA33   TCGA cancer types (N=33)
rfc_cancertype_CNS66    CNS Tumor Class (N=66)
svm_cancertype_CNS66    CNS Tumor Class (N=66)
xgb_cancertype_CNS66    CNS Tumor Class (N=66)
mlp_cancertype_CNS66    CNS Tumor Class (N=66)
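The ExperimentHub records behind these models can also be listed programmatically. The sketch below is an illustration rather than part of the package documentation: it assumes network access and assumes that the CytoMethIC records can be found with the search term "CytoMethIC". The variable names eh, cmi_records, and model_index are placeholders.

```r
library(ExperimentHub)

# Connect to ExperimentHub (creates or refreshes the local metadata cache).
eh <- ExperimentHub()

# Search the hub metadata for records that mention CytoMethIC.
cmi_records <- query(eh, "CytoMethIC")

# Pair each ExperimentHub ID (for example "EH8395") with its record title.
model_index <- data.frame(
  id = names(cmi_records),
  title = cmi_records$title
)
print(model_index)
```

Any id from this listing can then be passed to eh[[id]] to retrieve the model object, as in the examples that follow.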

Pan-Cancer type classification

The snippet below applies the same cmi_predict call to a random forest model and a support vector machine model, both retrieved from ExperimentHub.

cmi_predict(sesameDataGet("HM450.1.TCGA.PAAD")$betas, ExperimentHub()[["EH8395"]],
    lift_over=TRUE)
## # A tibble: 1 × 2
##   response  prob
##   <chr>    <dbl>
## 1 PAAD     0.852
cmi_predict(sesameDataGet("HM450.1.TCGA.PAAD")$betas, ExperimentHub()[["EH8396"]],
    lift_over=TRUE)
## # A tibble: 1 × 2
##   response  prob
##   <chr>    <dbl>
## 1 PAAD     0.986
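Because both calls above share the same input and differ only in the model, the comparison can be wrapped in a small helper. The following is a minimal sketch and not part of CytoMethIC: compare_models and its predict_fun argument are hypothetical names introduced here, and the code assumes that each prediction is a table with response and prob columns, as in the output above.

```r
# Hypothetical helper (not part of CytoMethIC): apply several models to the
# same beta values and stack the results into a single data frame.
# `models` is a named list of model objects; extra arguments in `...`
# (for example lift_over = TRUE) are forwarded to the prediction function.
compare_models <- function(betas, models,
                           predict_fun = CytoMethIC::cmi_predict, ...) {
  rows <- lapply(names(models), function(id) {
    pred <- predict_fun(betas, models[[id]], ...)
    # Keep the model name next to its predicted label and probability.
    data.frame(model = id, response = pred$response, prob = pred$prob)
  })
  do.call(rbind, rows)
}
```

With betas holding the sample's beta values and eh an ExperimentHub() connection, a call such as compare_models(betas, list(EH8395 = eh[["EH8395"]], EH8396 = eh[["EH8396"]]), lift_over = TRUE) would return one row per model. Creating the hub connection once and reusing it also avoids calling ExperimentHub() for every prediction.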

Pan-Cancer subtype classification

The snippet below uses cmi_predict to predict the cancer subtype.

cmi_predict(sesameDataGet("HM450.1.TCGA.PAAD")$betas, ExperimentHub()[["EH8422"]])
## # A tibble: 1 × 2
##   response  prob
##   <chr>    <dbl>
## 1 GI.CIN   0.462

Ethnicity classification

The snippet below uses cmi_predict to predict the patient’s ethnicity.

cmi_predict(sesameDataGet("HM450.1.TCGA.PAAD")$betas, ExperimentHub()[["EH8421"]])
## # A tibble: 1 × 2
##   response  prob
##   <chr>    <dbl>
## 1 WHITE    0.886

Pan-Cancer COO classification

The snippet below uses cmi_predict to predict the cancer’s cell of origin (COO).

cmi_predict(sesameDataGet("HM450.1.TCGA.PAAD")$betas, ExperimentHub()[["EH8423"]])
## # A tibble: 1 × 2
##   response                    prob
##   <chr>                      <dbl>
## 1 C20:Mixed (Stromal/Immune) 0.768

Ethnicity From GitHub Link

In addition to ExperimentHub models, the package supports models loaded from GitHub URLs. The repository https://github.com/zhou-lab/CytoMethIC_models is our lab’s most frequently updated public collection of classifiers.

base_url <- "https://github.com/zhou-lab/CytoMethIC_models/raw/main/models"
cmi_model <- readRDS(url(sprintf("%s/Race3_rfcTCGA_InfHum3.rds", base_url)))
betas <- openSesame(sesameDataGet("EPICv2.8.SigDF")[[1]])
cmi_predict(betas, cmi_model, lift_over=TRUE)
## # A tibble: 1 × 2
##   response  prob
##   <chr>    <dbl>
## 1 WHITE        1
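Reading a model through url() downloads it again on every run. One possible alternative, sketched below under stated assumptions, is to download the file once and reuse the local copy. The helper load_model_cached and its cache location are hypothetical and not part of CytoMethIC; only base R functions are used.

```r
# Hypothetical helper (not part of CytoMethIC): download an .rds model once
# and read the local copy on later calls.
load_model_cached <- function(model_url,
                              cache_dir = file.path(tempdir(), "cmi_models")) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  local_path <- file.path(cache_dir, basename(model_url))
  if (!file.exists(local_path)) {
    # mode = "wb" keeps the binary .rds file intact on Windows.
    download.file(model_url, local_path, mode = "wb", quiet = TRUE)
  }
  readRDS(local_path)
}
```

For example, load_model_cached(sprintf("%s/Race3_rfcTCGA_InfHum3.rds", base_url)) would load the same file as the readRDS(url(...)) call above. Because the default cache lives under tempdir(), it lasts only for the current R session unless cache_dir points to a persistent directory.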
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.48           sesame_1.25.0        sesameData_1.25.0   
##  [4] CytoMethIC_1.3.0     ExperimentHub_2.15.0 AnnotationHub_3.15.0
##  [7] BiocFileCache_2.15.0 dbplyr_2.5.0         BiocGenerics_0.53.1 
## [10] generics_0.1.3      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            dplyr_1.1.4                
##  [3] blob_1.2.4                  filelock_1.0.3             
##  [5] Biostrings_2.75.0           fastmap_1.2.0              
##  [7] digest_0.6.37               mime_0.12                  
##  [9] lifecycle_1.0.4             KEGGREST_1.47.0            
## [11] RSQLite_2.3.7               magrittr_2.0.3             
## [13] compiler_4.5.0              rlang_1.1.4                
## [15] tools_4.5.0                 utf8_1.2.4                 
## [17] yaml_2.3.10                 S4Arrays_1.7.1             
## [19] bit_4.5.0                   curl_5.2.3                 
## [21] DelayedArray_0.33.1         plyr_1.8.9                 
## [23] RColorBrewer_1.1-3          abind_1.4-8                
## [25] BiocParallel_1.41.0         withr_3.0.2                
## [27] purrr_1.0.2                 grid_4.5.0                 
## [29] stats4_4.5.0                preprocessCore_1.69.0      
## [31] fansi_1.0.6                 wheatmap_0.2.0             
## [33] e1071_1.7-16                colorspace_2.1-1           
## [35] ggplot2_3.5.1               MASS_7.3-61                
## [37] scales_1.3.0                SummarizedExperiment_1.37.0
## [39] cli_3.6.3                   rmarkdown_2.29             
## [41] crayon_1.5.3                reshape2_1.4.4             
## [43] httr_1.4.7                  tzdb_0.4.0                 
## [45] proxy_0.4-27                DBI_1.2.3                  
## [47] cachem_1.1.0                stringr_1.5.1              
## [49] zlibbioc_1.53.0             parallel_4.5.0             
## [51] AnnotationDbi_1.69.0        BiocManager_1.30.25        
## [53] XVector_0.47.0              matrixStats_1.4.1          
## [55] vctrs_0.6.5                 Matrix_1.7-1               
## [57] jsonlite_1.8.9              IRanges_2.41.0             
## [59] hms_1.1.3                   S4Vectors_0.45.0           
## [61] bit64_4.5.2                 glue_1.8.0                 
## [63] codetools_0.2-20            stringi_1.8.4              
## [65] gtable_0.3.6                BiocVersion_3.21.1         
## [67] GenomeInfoDb_1.43.0         GenomicRanges_1.59.0       
## [69] UCSC.utils_1.3.0            munsell_0.5.1              
## [71] tibble_3.2.1                pillar_1.9.0               
## [73] rappdirs_0.3.3              htmltools_0.5.8.1          
## [75] randomForest_4.7-1.2        GenomeInfoDbData_1.2.13    
## [77] R6_2.5.1                    lattice_0.22-6             
## [79] evaluate_1.0.1              Biobase_2.67.0             
## [81] readr_2.1.5                 png_0.1-8                  
## [83] memoise_2.0.1               BiocStyle_2.35.0           
## [85] class_7.3-22                Rcpp_1.0.13-1              
## [87] SparseArray_1.7.0           xfun_0.49                  
## [89] MatrixGenerics_1.19.0       pkgconfig_2.0.3