HDF5Array 1.35.5
The aim of this short document is to measure the performance of the HDF5Array package for normalization and PCA, two operations commonly involved in single cell analysis.
The goal is to facilitate comparison with other tools like Seurat and Scanpy.
Let’s install and load HDF5Array as well as the other packages used in this vignette:
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)
Load the packages:
library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)
The datasets that we will use in this document are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.
The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It’s available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:
hub <- ExperimentHub()
hub["EH1039"]$description # sparse representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally provided by TENx Genomics"
hub["EH1040"]$description # dense representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. Full rectangular, block-compressed format, 1GB block size."
The two datasets are big HDF5 files stored at a remote location. Let’s download them to the local ExperimentHub cache if they are not there yet:
## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5 <- hub[["EH1040"]]
We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data in memory.
Bring the sparse dataset into R:
## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")
class(full_sparse)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_sparse)
## [1] 27998 1306127
Bring the dense dataset into R:
## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")
class(full_dense)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_dense)
## [1] 27998 1306127
Note that the dense HDF5 file does not contain the dimnames of the matrix, so we add them manually:
dimnames(full_dense) <- dimnames(full_sparse)
For our benchmarks below, we’ll use subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells:
sparse1 <- full_sparse[ , 1:12500]
dense1 <- full_dense[ , 1:12500]
sparse2 <- full_sparse[ , 1:25000]
dense2 <- full_dense[ , 1:25000]
sparse3 <- full_sparse[ , 1:50000]
dense3 <- full_dense[ , 1:50000]
sparse4 <- full_sparse[ , 1:100000]
dense4 <- full_dense[ , 1:100000]
sparse5 <- full_sparse[ , 1:200000]
dense5 <- full_dense[ , 1:200000]
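Note that these subsetting operations are delayed: no data gets read from the HDF5 files at this point. As a quick illustration (not needed for the benchmarks), the tree of delayed operations carried by one of these objects can be displayed with showtree():
## Display the delayed operations carried by 'sparse5'. This does not
## load any data in memory either.
showtree(sparse5)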
We’ll use the following code for normalization:
## Keep 1000 most variable genes by default.
simple_normalize <- function(mat, num_variable_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    mat <- mat[rowSums(mat) > 0, ]
    mat <- t(t(mat) * 10000 / colSums(mat))
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_variable_genes)
    mat <- log1p(mat[variable_idx, ])
    mat / rowSds(mat)
}
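As a quick illustration (not part of the benchmarks), simple_normalize() can be tried on a small ordinary matrix of counts, as long as it has rownames. The toy_counts object below is made up for this purpose:
## Toy data: 100 hypothetical genes x 20 hypothetical cells.
set.seed(2009)
toy_counts <- matrix(rpois(100 * 20, lambda=0.5), nrow=100,
                     dimnames=list(paste0("gene", 1:100), paste0("cell", 1:20)))
toy_norm <- simple_normalize(toy_counts, num_variable_genes=50)
dim(toy_norm)  # should be 50 x 20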
and the following code for PCA:
simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}
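Here is a minimal sanity check of simple_PCA() on a small ordinary matrix (toy data, not part of the benchmarks): the singular values it returns should match those of the explicitly row-centered matrix.
## Toy data: 10 hypothetical genes x 30 hypothetical cells.
set.seed(123)
toy <- matrix(rnorm(10 * 30), nrow=10,
              dimnames=list(paste0("gene", 1:10), paste0("cell", 1:30)))
res <- simple_PCA(toy, k=3)
centered <- toy - rowMeans(toy)  # what simple_PCA() implicitly operates on
stopifnot(all.equal(res$d, svd(centered)$d[1:3], tolerance=1e-6))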
Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object regardless of its exact type/representation, e.g. it can be an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, a DelayedMatrix derivative (TENxMatrix, HDF5Matrix, TileDBMatrix), etc.
However, when the input is a DelayedMatrix object or derivative, it’s important to be aware that:
- Summarization methods like sum(), colSums(), rowVars(), or rowSds(), and matrix multiplication (%*%), are block-processed operations.
- The block size is 100 Mb by default. Increasing or decreasing the block size will increase or decrease the memory usage of block-processed operations.
- The block size can be controlled with DelayedArray::getAutoBlockSize() and DelayedArray::setAutoBlockSize() (see the short example after this list).
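For example, to query the current block size and temporarily change it (not needed for the benchmarks below, shown only for illustration):
DelayedArray::getAutoBlockSize()     # 1e+08, i.e. 100 Mb, by default
DelayedArray::setAutoBlockSize(2e8)  # use blocks of 200 Mb
DelayedArray::setAutoBlockSize(1e8)  # back to the 100 Mb default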
For our benchmarks below, we’ll use the following block sizes:
- normalization of the sparse datasets: 250 Mb
- normalization of the dense datasets: 100 Mb
- PCA on the normalized sparse datasets: 100 Mb
- PCA on the normalized dense datasets: 100 Mb
While manually running our benchmarks below on a Linux system, we also monitor memory usage at the command line in a terminal with:
(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &
where <PID> is the process id of our R session. This allows us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
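Note that the process id to plug into the ps command above can be obtained from within the R session itself:
Sys.getpid()  # process id of the current R session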
In this section we run simple_normalize() on the two smaller test datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and we report timings and memory usage.
See the Timings observed on various systems section at the end of this document for simple_normalize() and simple_PCA() times observed on all our test datasets on various systems.
DelayedArray::setAutoBlockSize(2.5e8) # blocks of 250 Mb
## automatic block size set to 2.5e+08 bytes (was 1e+08)
dim(sparse1)
## [1] 27998 12500
system.time(sparse1n <- simple_normalize(sparse1))
## user system elapsed
## 92.018 7.257 102.762
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9699687 518.1 17603600 940.2 NA 12972297 692.8
## Vcells 25320413 193.2 98936025 754.9 98304 128739616 982.3
Saving the normalized dataset to a temporary file for PCA later:
dim(sparse1n)
sparse1n_path <- tempfile()
writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)
dim(sparse2)
## [1] 27998 25000
system.time(sparse2n <- simple_normalize(sparse2))
## user system elapsed
## 151.444 16.675 183.243
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9718457 519.1 17603600 940.2 NA 12972297 692.8
## Vcells 25476720 194.4 171253616 1306.6 98304 210854634 1608.7
Saving the normalized dataset to a temporary file for PCA later:
dim(sparse2n)
sparse2n_path <- tempfile()
writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)
With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 3.7 Gb at all times.
DelayedArray::setAutoBlockSize(1e8) # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 2.5e+08)
dim(dense1)
## [1] 27998 12500
system.time(dense1n <- simple_normalize(dense1))
## user system elapsed
## 87.830 11.741 130.083
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9718796 519.1 17603600 940.2 NA 12972297 692.8
## Vcells 25550907 195.0 95506509 728.7 98304 214103341 1633.5
Saving the normalized dataset to a temporary file for PCA:
dim(dense1n)
dense1n_path <- tempfile()
writeHDF5Array(dense1n, dense1n_path, name="normalized_counts", level=0)
dim(dense2)
## [1] 27998 25000
system.time(dense2n <- simple_normalize(dense2))
## user system elapsed
## 177.011 19.953 200.542
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9721913 519.3 17603600 940.2 NA 12972297 692.8
## Vcells 25646476 195.7 91674138 699.5 98304 214103341 1633.5
Saving the normalized dataset to a temporary file for PCA:
dim(dense2n)
dense2n_path <- tempfile()
writeHDF5Array(dense2n, dense2n_path, name="normalized_counts", level=0)
With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.8 Gb at all times.
In this section we run simple_PCA() on the two normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse and dense), and we report timings and memory usage.
See the Timings observed on various systems section at the end of this document for simple_normalize() and simple_PCA() times observed on all our test datasets on various systems.
DelayedArray::setAutoBlockSize(1e8) # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 1e+08)
sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
## [1] 1000 12500
system.time(pca1s <- simple_PCA(sparse1n))
## user system elapsed
## 230.495 34.416 267.271
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9728208 519.6 17603600 940.2 NA 12972297 692.8
## Vcells 25943858 198.0 73339311 559.6 98304 214103341 1633.5
sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
## [1] 1000 25000
system.time(pca2s <- simple_PCA(sparse2n))
## user system elapsed
## 172.955 98.223 273.840
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9728232 519.6 17603600 940.2 NA 12972297 692.8
## Vcells 26544298 202.6 88087173 672.1 98304 214103341 1633.5
With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.4 Gb at all times.
DelayedArray::setAutoBlockSize(1e8) # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 1e+08)
dense1n <- HDF5Array(dense1n_path, name="normalized_counts")
dim(dense1n)
## [1] 1000 12500
system.time(pca1d <- simple_PCA(dense1n))
## user system elapsed
## 80.334 47.454 132.628
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9729118 519.6 17603600 940.2 NA 12972297 692.8
## Vcells 26808759 204.6 95823201 731.1 98304 214103341 1633.5
dense2n <- HDF5Array(dense2n_path, name="normalized_counts")
dim(dense2n)
## [1] 1000 25000
system.time(pca2d <- simple_PCA(dense2n))
## user system elapsed
## 150.527 74.329 230.818
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 9728514 519.6 17603600 940.2 NA 12972297 692.8
## Vcells 27370111 208.9 94380468 720.1 98304 214103341 1633.5
With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.7 Gb at all times.
As a sanity check, we verify that the sparse and dense runs produced the same PCA results:
stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))
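A common follow-up (shown as a sketch only): with the conventions used by simple_PCA(), the per-cell PC coordinates can be derived from the svds() result as V scaled by the singular values, one row per cell and one column per PC.
pc_coords <- sweep(pca1s$v, 2, pca1s$d, "*")  # same as pca1s$v %*% diag(pca1s$d)
dim(pc_coords)  # should be 12500 x 25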
Here we report simple_normalize() and simple_PCA() times observed on all our test datasets on various systems.
Normalization (sparse TENxMatrix block size = 250 Mb, dense HDF5Matrix block size = 100 Mb):

| object dimensions | sparse object | time (seconds) | max. memory used | dense object | time (seconds) | max. memory used |
|---|---|---|---|---|---|---|
| 27,998 x 12,500 | sparse1 | 40.7 | | dense1 | 52.8 | |
| 27,998 x 25,000 | sparse2 | 85.2 | | dense2 | 111.4 | |
| 27,998 x 50,000 | sparse3 | 172.7 | | dense3 | 223.3 | |
| 27,998 x 100,000 | sparse4 | 346.6 | | dense4 | 456.4 | |
| 27,998 x 200,000 | sparse5 | 742.0 | | dense5 | 942.5 | |
PCA (sparse TENxMatrix block size = 100 Mb, dense HDF5Matrix block size = 100 Mb):

| object dimensions | sparse object | time (seconds) | max. memory used | dense object | time (seconds) | max. memory used |
|---|---|---|---|---|---|---|
| 1000 x 12,500 | sparse1n | 47.9 | | dense1n | 42.2 | |
| 1000 x 25,000 | sparse2n | 70.0 | | dense2n | 78.9 | |
| 1000 x 50,000 | sparse3n | 152.3 | | dense3n | 176.4 | |
| 1000 x 100,000 | sparse4n | 289.8 | | dense4n | 458.7 | |
| 1000 x 200,000 | sparse5n | 637.4 | | dense5n | 867.4 | |
Timings coming soon…
Timings coming soon…
Normalization (sparse TENxMatrix block size = 250 Mb, dense HDF5Matrix block size = 100 Mb):

| object dimensions | sparse object | time (seconds) | max. memory used | dense object | time (seconds) | max. memory used |
|---|---|---|---|---|---|---|
| 27,998 x 12,500 | sparse1 | 33.6 | | dense1 | 35.3 | |
| 27,998 x 25,000 | sparse2 | 67.4 | | dense2 | 74.2 | |
| 27,998 x 50,000 | sparse3 | 140.3 | | dense3 | 148.1 | |
| 27,998 x 100,000 | sparse4 | 279.9 | | dense4 | 305.4 | |
| 27,998 x 200,000 | sparse5 | 608.1 | | dense5 | 617.8 | |
PCA (sparse TENxMatrix block size = 100 Mb, dense HDF5Matrix block size = 100 Mb):

| object dimensions | sparse object | time (seconds) | max. memory used | dense object | time (seconds) | max. memory used |
|---|---|---|---|---|---|---|
| 1000 x 12,500 | sparse1n | 33.7 | | dense1n | 30.0 | |
| 1000 x 25,000 | sparse2n | 58.0 | | dense2n | 56.1 | |
| 1000 x 50,000 | sparse3n | 117.5 | | dense3n | 127.9 | |
| 1000 x 100,000 | sparse4n | 255.1 | | dense4n | 372.6 | |
| 1000 x 200,000 | sparse5n | 570.8 | | dense5n | 677.1 | |
The sparse representation (TENxMatrix) seems to perform slightly better than the dense representation (HDF5Matrix) for normalization and PCA of single cell data. Also, the gap in performance between sparse and dense tends to increase slightly with the size of the dataset.
Normalization and PCA are both very roughly linear in time, regardless of representation (sparse or dense).
Normalization and PCA both run in almost constant memory, regardless of representation (sparse or dense).
sessionInfo()
## R Under development (unstable) (2024-11-20 r87352)
## Platform: x86_64-apple-darwin20
## Running under: macOS Monterey 12.7.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] TENxBrainData_1.27.0 SingleCellExperiment_1.29.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] RSpectra_0.16-2 DelayedMatrixStats_1.29.1
## [9] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [11] BiocFileCache_2.15.0 dbplyr_2.5.0
## [13] HDF5Array_1.35.5 rhdf5_2.51.2
## [15] DelayedArray_0.33.3 SparseArray_1.7.3
## [17] S4Arrays_1.7.1 IRanges_2.41.2
## [19] abind_1.4-8 S4Vectors_0.45.2
## [21] MatrixGenerics_1.19.1 matrixStats_1.5.0
## [23] BiocGenerics_0.53.3 generics_0.1.3
## [25] Matrix_1.7-1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 xfun_0.50 bslib_0.8.0
## [4] lattice_0.22-6 rhdf5filters_1.19.0 vctrs_0.6.5
## [7] tools_4.5.0 curl_6.1.0 tibble_3.2.1
## [10] AnnotationDbi_1.69.0 RSQLite_2.3.9 blob_1.2.4
## [13] pkgconfig_2.0.3 sparseMatrixStats_1.19.0 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.13 compiler_4.5.0 Biostrings_2.75.3
## [19] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [22] pillar_1.10.1 crayon_1.5.3 jquerylib_0.1.4
## [25] cachem_1.1.0 mime_0.12 tidyselect_1.2.1
## [28] digest_0.6.37 purrr_1.0.2 dplyr_1.1.4
## [31] bookdown_0.42 BiocVersion_3.21.1 fastmap_1.2.0
## [34] grid_4.5.0 cli_3.6.3 magrittr_2.0.3
## [37] withr_3.0.2 filelock_1.0.3 UCSC.utils_1.3.1
## [40] rappdirs_0.3.3 bit64_4.5.2 rmarkdown_2.29
## [43] XVector_0.47.2 httr_1.4.7 bit_4.5.0.1
## [46] png_0.1-8 memoise_2.0.1 evaluate_1.0.3
## [49] knitr_1.49 rlang_1.1.4 Rcpp_1.0.14
## [52] glue_1.8.0 DBI_1.2.3 BiocManager_1.30.25
## [55] jsonlite_1.8.9 R6_2.5.1 Rhdf5lib_1.29.0