Contents

1 Introduction

The aim of this short document is to measure the performance of the HDF5Array package for normalization and PCA, two operations commonly performed in the context of single cell analysis.

The goal is to facilitate comparison with other tools like Seurat and Scanpy.

2 Install and load required packages

Let’s install and load HDF5Array as well as the other packages used in this vignette:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)

Load the packages:

library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)

3 The test datasets

3.1 Sparse vs dense representation

The datasets that we will use in this document are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.

The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It’s available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:

hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally provided by TENx Genomics"
hub["EH1040"]$description  # dense representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. Full rectangular, block-compressed format, 1GB block size."

The two datasets are big HDF5 files hosted at a remote location. Let’s download them to the local ExperimentHub cache if they are not there yet:

## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5  <- hub[["EH1040"]]
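
The content of each file can be inspected with h5ls() (output not shown here; this is only needed to find the group/dataset names used in the constructors below):

h5ls(full_sparse_h5)  # lists the HDF5 groups/datasets in the sparse file
h5ls(full_dense_h5)   # same for the dense file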

3.2 TENxMatrix vs HDF5Matrix object

We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data in memory.

Bring the sparse dataset into R:

## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")
class(full_sparse)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_sparse)
## [1]   27998 1306127

Bring the dense dataset into R:

## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")
class(full_dense)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_dense)
## [1]   27998 1306127

Note that the dense HDF5 file does not contain the dimnames of the matrix, so we add them manually:

dimnames(full_dense) <- dimnames(full_sparse)

3.3 Create the test datasets

For our benchmarks below, we’ll use subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells:

sparse1 <- full_sparse[ , 1:12500]
dense1  <- full_dense[ , 1:12500]

sparse2 <- full_sparse[ , 1:25000]
dense2  <- full_dense[ , 1:25000]

sparse3 <- full_sparse[ , 1:50000]
dense3  <- full_dense[ , 1:50000]

sparse4 <- full_sparse[ , 1:100000]
dense4  <- full_dense[ , 1:100000]

sparse5 <- full_sparse[ , 1:200000]
dense5  <- full_dense[ , 1:200000]
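
Note that subsetting a DelayedArray derivative is a delayed operation, so the objects above are still small in-memory wrappers around the on-disk data and no counts have been read yet. A quick way to see this:

## The counts stay on disk, so the wrapper objects themselves are tiny:
print(object.size(sparse1), units="Kb")
print(object.size(dense1), units="Kb")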

4 Block-processed normalization and PCA

4.1 Code used for normalization and PCA

We’ll use the following code for normalization:

## Keep the 1000 most variable genes by default.
simple_normalize <- function(mat, num_variable_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    ## Drop genes with zero counts across all cells.
    mat <- mat[rowSums(mat) > 0, ]
    ## Scale each cell (column) to a total of 10,000 counts.
    mat <- t(t(mat) * 10000 / colSums(mat))
    ## Keep the most variable genes and log-transform.
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_variable_genes)
    mat <- log1p(mat[variable_idx, ])
    ## Standardize each gene (row) to unit standard deviation.
    mat / rowSds(mat)
}

and the following code for PCA:

simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    ## Multiply the row-centered matrix (and its transpose) with a vector
    ## without ever materializing the centered matrix:
    ##   (mat - row_means) %*% x  ==  mat %*% x - row_means * sum(x)
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    ## Truncated SVD (top k singular triplets) of the row-centered matrix.
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}
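
As a quick check that the cross-product trick above really computes the SVD of the centered matrix, here is a minimal sketch on a small random matrix (hypothetical toy data, not part of the benchmarks); the top singular values should match those returned by base R’s svd() on the explicitly centered matrix:

set.seed(123)
m <- matrix(rpois(6000, lambda=2), nrow=60)  # small 60 x 100 toy matrix
res <- simple_PCA(m, k=5)
ref <- svd(m - rowMeans(m))                  # explicit row-centering
stopifnot(all.equal(res$d, ref$d[1:5], tolerance=1e-6))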

4.2 Block processing and block size

Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object, regardless of its exact type/representation, e.g. an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, or a DelayedMatrix derivative (TENxMatrix, HDF5Matrix, TileDBMatrix, etc.).
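
For instance, here is a minimal sketch on a small ordinary matrix of simulated counts (hypothetical toy data, not used in the benchmarks):

toy <- matrix(rpois(3000, lambda=2), nrow=150,
              dimnames=list(paste0("gene", 1:150), paste0("cell", 1:20)))
toy_n <- simple_normalize(toy, num_variable_genes=50)
dim(toy_n)  # 50 most variable genes x 20 cells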

However, when the input is a DelayedMatrix object or derivative, it’s important to be aware that:

  • Summarization methods like sum(), colSums(), rowVars(), or rowSds(), and matrix multiplication (%*%), are block-processed operations.

  • The block size is 100 Mb by default. Increasing or decreasing the block size will increase or decrease the memory usage of block-processed operations.

  • The current block size can be queried with DelayedArray::getAutoBlockSize() and changed with DelayedArray::setAutoBlockSize(), as shown in the sketch below.
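
For example (a minimal sketch; the benchmark sections below set the block size in exactly this way before each run):

DelayedArray::getAutoBlockSize()       # current block size, in bytes
DelayedArray::setAutoBlockSize(2.5e8)  # use blocks of 250 Mb
DelayedArray::setAutoBlockSize(1e8)    # back to the 100 Mb default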

For our benchmarks below, we’ll use the following block sizes:

  • normalization of the sparse datasets: 250 Mb
  • normalization of the dense datasets: 100 Mb
  • PCA on the normalized sparse datasets: 100 Mb
  • PCA on the normalized dense datasets: 100 Mb

4.3 Monitoring memory usage

While manually running our benchmarks below on a Linux system, we also monitored memory usage from the command line with:

(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where <PID> was the process id of our R session. This allowed us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
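
Note that the process id of the current R session can be obtained from within R with Sys.getpid():

Sys.getpid()  # the <PID> to use in the 'ps' loop above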

5 Normalization benchmarks

In this section we run simple_normalize() on the two smallest test datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings observed on various systems section at the end of this document for simple_normalize() and simple_PCA() timings observed on all our test datasets on various systems.

5.1 Normalizing the sparse datasets

DelayedArray::setAutoBlockSize(2.5e8)  # blocks of 250 Mb
## automatic block size set to 2.5e+08 bytes (was 1e+08)

5.1.1 27,998 x 12,500 sparse dataset

dim(sparse1)
## [1] 27998 12500
system.time(sparse1n <- simple_normalize(sparse1))
##    user  system elapsed 
##  92.018   7.257 102.762
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used  (Mb)
## Ncells  9699687 518.1   17603600 940.2         NA  12972297 692.8
## Vcells 25320413 193.2   98936025 754.9      98304 128739616 982.3

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse1n)
sparse1n_path <- tempfile()
writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)

5.1.2 27,998 x 25,000 sparse dataset

dim(sparse2)
## [1] 27998 25000
system.time(sparse2n <- simple_normalize(sparse2))
##    user  system elapsed 
## 151.444  16.675 183.243
gc()
##            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells  9718457 519.1   17603600  940.2         NA  12972297  692.8
## Vcells 25476720 194.4  171253616 1306.6      98304 210854634 1608.7

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse2n)
sparse2n_path <- tempfile()
writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)

5.1.3 About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above) remained < 3.7 Gb at all times.

5.2 Normalizing the dense datasets

DelayedArray::setAutoBlockSize(1e8)  # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 2.5e+08)

5.2.1 27,998 x 12,500 dense dataset

dim(dense1)
## [1] 27998 12500
system.time(dense1n <- simple_normalize(dense1))
##    user  system elapsed 
##  87.830  11.741 130.083
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9718796 519.1   17603600 940.2         NA  12972297  692.8
## Vcells 25550907 195.0   95506509 728.7      98304 214103341 1633.5

Saving the normalized dataset to a temporary file for PCA:

dim(dense1n)
dense1n_path <- tempfile()
writeHDF5Array(dense1n, dense1n_path, name="normalized_counts", level=0)

5.2.2 27,998 x 25,000 dense dataset

dim(dense2)
## [1] 27998 25000
system.time(dense2n <- simple_normalize(dense2))
##    user  system elapsed 
## 177.011  19.953 200.542
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9721913 519.3   17603600 940.2         NA  12972297  692.8
## Vcells 25646476 195.7   91674138 699.5      98304 214103341 1633.5

Saving the normalized dataset to a temporary file for PCA:

dim(dense2n)
dense2n_path <- tempfile()
writeHDF5Array(dense2n, dense2n_path, name="normalized_counts", level=0)

5.2.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above) remained < 2.8 Gb at all times.

6 PCA benchmarks

In this section we run simple_PCA() on the normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings observed on various systems section at the end of this document for simple_normalize() and simple_PCA() timings observed on all our test datasets on various systems.

6.1 PCA on the normalized sparse datasets

DelayedArray::setAutoBlockSize(1e8)  # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 1e+08)

6.1.1 1000 x 12,500 sparse dataset

sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
## [1]  1000 12500
system.time(pca1s <- simple_PCA(sparse1n))
##    user  system elapsed 
## 230.495  34.416 267.271
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9728208 519.6   17603600 940.2         NA  12972297  692.8
## Vcells 25943858 198.0   73339311 559.6      98304 214103341 1633.5

6.1.2 1000 x 25,000 sparse dataset

sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
## [1]  1000 25000
system.time(pca2s <- simple_PCA(sparse2n))
##    user  system elapsed 
## 172.955  98.223 273.840
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9728232 519.6   17603600 940.2         NA  12972297  692.8
## Vcells 26544298 202.6   88087173 672.1      98304 214103341 1633.5

6.1.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above) remained < 2.4 Gb at all times.

6.2 PCA on the normalized dense datasets

DelayedArray::setAutoBlockSize(1e8)  # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 1e+08)

6.2.1 1000 x 12,500 dense dataset

dense1n <- HDF5Array(dense1n_path, name="normalized_counts")
dim(dense1n)
## [1]  1000 12500
system.time(pca1d <- simple_PCA(dense1n))
##    user  system elapsed 
##  80.334  47.454 132.628
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9729118 519.6   17603600 940.2         NA  12972297  692.8
## Vcells 26808759 204.6   95823201 731.1      98304 214103341 1633.5

6.2.2 1000 x 25,000 dense dataset

dense2n <- HDF5Array(dense2n_path, name="normalized_counts")
dim(dense2n)
## [1]  1000 25000
system.time(pca2d <- simple_PCA(dense2n))
##    user  system elapsed 
## 150.527  74.329 230.818
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9728514 519.6   17603600 940.2         NA  12972297  692.8
## Vcells 27370111 208.9   94380468 720.1      98304 214103341 1633.5

6.2.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above) remained < 2.7 Gb at all times.

6.3 Sanity checks

stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))
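
Each result is the list returned by RSpectra::svds(), with the singular values in d and the left/right singular vectors in u and v:

str(pca1s, max.level=1)  # d, u, v, plus convergence info (niter, nops)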

7 Timings observed on various systems

Here we report simple_normalize() and simple_PCA() timings observed on all our test datasets on various systems.

7.1 DELL XPS 15 laptop (model 9520)

  • RAM: 32GB
  • Disk: 1TB SSD
  • OS: Linux Ubuntu 24.04
  • Bioconductor/R versions: 3.21/4.5

7.1.1 Normalization

                    sparse (TENxMatrix)            dense (HDF5Matrix)
                    block size = 250 Mb            block size = 100 Mb
object dimensions   object name   time (seconds)   object name   time (seconds)
27,998 x 12,500     sparse1                 40.7   dense1                  52.8
27,998 x 25,000     sparse2                 85.2   dense2                 111.4
27,998 x 50,000     sparse3                172.7   dense3                 223.3
27,998 x 100,000    sparse4                346.6   dense4                 456.4
27,998 x 200,000    sparse5                742.0   dense5                 942.5

7.1.2 PCA

                    sparse (TENxMatrix)            dense (HDF5Matrix)
                    block size = 100 Mb            block size = 100 Mb
object dimensions   object name   time (seconds)   object name   time (seconds)
1000 x 12,500       sparse1n                47.9   dense1n                 42.2
1000 x 25,000       sparse2n                70.0   dense2n                 78.9
1000 x 50,000       sparse3n               152.3   dense3n                176.4
1000 x 100,000      sparse4n               289.8   dense4n                458.7
1000 x 200,000      sparse5n               637.4   dense5n                867.4

7.2 DELL PowerEdge R440 Server

  • RAM: 128GB
  • Disk: 1.92TB SSD SATA Mix Use
  • OS: Linux Ubuntu 24.04
  • Bioconductor/R versions: 3.21/4.5

7.2.1 Normalization

Timings coming soon…

7.2.2 PCA

Timings coming soon…

7.3 Mac Pro (Apple M2 Ultra)

  • RAM: 192GB
  • Disk: 2TB SSD
  • OS: macOS 13.7.1
  • Bioconductor/R versions: 3.21/4.5

7.3.1 Normalization

                    sparse (TENxMatrix)            dense (HDF5Matrix)
                    block size = 250 Mb            block size = 100 Mb
object dimensions   object name   time (seconds)   object name   time (seconds)
27,998 x 12,500     sparse1                 33.6   dense1                  35.3
27,998 x 25,000     sparse2                 67.4   dense2                  74.2
27,998 x 50,000     sparse3                140.3   dense3                 148.1
27,998 x 100,000    sparse4                279.9   dense4                 305.4
27,998 x 200,000    sparse5                608.1   dense5                 617.8

7.3.2 PCA

                    sparse (TENxMatrix)            dense (HDF5Matrix)
                    block size = 100 Mb            block size = 100 Mb
object dimensions   object name   time (seconds)   object name   time (seconds)
1000 x 12,500       sparse1n                33.7   dense1n                 30.0
1000 x 25,000       sparse2n                58.0   dense2n                 56.1
1000 x 50,000       sparse3n               117.5   dense3n                127.9
1000 x 100,000      sparse4n               255.1   dense4n                372.6
1000 x 200,000      sparse5n               570.8   dense5n                677.1

8 Conclusions

The sparse representation (TENxMatrix) seems to perform slightly better than the dense representation (HDF5Matrix) for normalization and PCA of single cell data. The performance gap between the two representations also tends to widen slightly as the size of the dataset increases.

For both representations (sparse and dense), normalization and PCA times grow roughly linearly with the size of the dataset.

Both operations run in nearly constant memory, regardless of representation (sparse or dense).

9 Session information

sessionInfo()
## R Under development (unstable) (2024-11-20 r87352)
## Platform: x86_64-apple-darwin20
## Running under: macOS Monterey 12.7.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TENxBrainData_1.27.0        SingleCellExperiment_1.29.1
##  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [7] RSpectra_0.16-2             DelayedMatrixStats_1.29.1  
##  [9] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [11] BiocFileCache_2.15.0        dbplyr_2.5.0               
## [13] HDF5Array_1.35.5            rhdf5_2.51.2               
## [15] DelayedArray_0.33.3         SparseArray_1.7.3          
## [17] S4Arrays_1.7.1              IRanges_2.41.2             
## [19] abind_1.4-8                 S4Vectors_0.45.2           
## [21] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [23] BiocGenerics_0.53.3         generics_0.1.3             
## [25] Matrix_1.7-1                BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0          xfun_0.50                bslib_0.8.0             
##  [4] lattice_0.22-6           rhdf5filters_1.19.0      vctrs_0.6.5             
##  [7] tools_4.5.0              curl_6.1.0               tibble_3.2.1            
## [10] AnnotationDbi_1.69.0     RSQLite_2.3.9            blob_1.2.4              
## [13] pkgconfig_2.0.3          sparseMatrixStats_1.19.0 lifecycle_1.0.4         
## [16] GenomeInfoDbData_1.2.13  compiler_4.5.0           Biostrings_2.75.3       
## [19] htmltools_0.5.8.1        sass_0.4.9               yaml_2.3.10             
## [22] pillar_1.10.1            crayon_1.5.3             jquerylib_0.1.4         
## [25] cachem_1.1.0             mime_0.12                tidyselect_1.2.1        
## [28] digest_0.6.37            purrr_1.0.2              dplyr_1.1.4             
## [31] bookdown_0.42            BiocVersion_3.21.1       fastmap_1.2.0           
## [34] grid_4.5.0               cli_3.6.3                magrittr_2.0.3          
## [37] withr_3.0.2              filelock_1.0.3           UCSC.utils_1.3.1        
## [40] rappdirs_0.3.3           bit64_4.5.2              rmarkdown_2.29          
## [43] XVector_0.47.2           httr_1.4.7               bit_4.5.0.1             
## [46] png_0.1-8                memoise_2.0.1            evaluate_1.0.3          
## [49] knitr_1.49               rlang_1.1.4              Rcpp_1.0.14             
## [52] glue_1.8.0               DBI_1.2.3                BiocManager_1.30.25     
## [55] jsonlite_1.8.9           R6_2.5.1                 Rhdf5lib_1.29.0