Reduced dimension plotting is one of the essential tools for the
analysis of single cell data. However, as the number of cells/nuclei in
these these plots increases, the usefulness of these plots decreases.
Many cells are plotted on top of each other obscuring information, even
when taking advantage of transparency settings. This package provides
binning strategies of cells/nuclei into hexagon cells. Plotting
summarized information of all cells/nuclei in their respective hexagon
cells presents information without obstructions. The package seemlessly
works with the two most common object classes for the storage of single
cell data; SingleCellExperiment
from the SingleCellExperiment
package and Seurat
from the Seurat package.
In order to demonstrate the capabilities of the schex package, I will
use the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely
available from 10x Genomics. There are 2,700 single cells that were
sequenced on the Illumina NextSeq 500. This data is handly availabe in
the TENxPBMCData
package.
tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
rownames(tenx_pbmc3k) <- uniquifyFeatureNames(
rowData(tenx_pbmc3k)$ENSEMBL_ID,
rowData(tenx_pbmc3k)$Symbol_TENx
)
In the next few sections, I will perform some simple quality control steps including filtering and normalization. I will then calculate various dimension reductions and cluster the data. These steps do by no means constitute comprehensive handling of the data. For a more detailed guide the reader is referred to the following guides:
I filter cells with high mitochondrial content as well as cells with low library size or feature count.
rowData(tenx_pbmc3k)$Mito <- grepl("^MT-", rownames(tenx_pbmc3k))
colData(tenx_pbmc3k) <- cbind(
colData(tenx_pbmc3k),
perCellQCMetrics(tenx_pbmc3k,
subsets = list(Mt = rowData(tenx_pbmc3k)$Mito)
)
)
rowData(tenx_pbmc3k) <- cbind(
rowData(tenx_pbmc3k),
perFeatureQCMetrics(tenx_pbmc3k)
)
tenx_pbmc3k <- tenx_pbmc3k[, !colData(tenx_pbmc3k)$subsets_Mt_percent > 50]
libsize_drop <- isOutlier(tenx_pbmc3k$total,
nmads = 3, type = "lower", log = TRUE
)
feature_drop <- isOutlier(tenx_pbmc3k$detected,
nmads = 3, type = "lower", log = TRUE
)
tenx_pbmc3k <- tenx_pbmc3k[, !(libsize_drop | feature_drop)]
I filter any genes that have 0 count for all cells.
I normalize the data by using a simple library size normalization.
I use both Principal Components Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) in order to obtain reduced dimension representations of the data. Since there is a random component in the UMAP, I will also set a seed.
At this stage in the workflow we usually would like to plot aspects of our data in one of the reduced dimension representations. Instead of plotting this in an ordinary fashion, I will demonstrate how schex can provide a better way of plotting this.
First, I will calculate the hexagon cell representation for each cell
for a specified dimension reduction representation. I decide to use
nbins=40
which specifies that I divide my x range into 40
bins. Note that this might be a parameter that you want to play around
with depending on the number of cells/ nuclei in your dataset.
Generally, for more cells/nuclei, nbins
should be
increased.
First I plot how many cells are in each hexagon cell. This should be
relatively even, otherwise change the nbins
parameter in
the previous calculation.
Next I colour the hexagon cells by some meta information, such as the majority of cells cluster membership and the median total count in each hexagon cell.
While for plotting the cluster membership the outcome is not too different from the classic plot, it is much easier to observe differences in the total count.
For convenience there is also a function that allows the calculation
of label positions for factor variables. These can be overlayed with the
package ggrepel
.
Finally, I will visualize the gene expression of the POMGNT1 gene in the hexagon cell representation.
gene_id <- "POMGNT1"
plot_hexbin_feature(tenx_pbmc3k,
type = "logcounts", feature = gene_id,
action = "mean", xlab = "UMAP1", ylab = "UMAP2",
title = paste0("Mean of ", gene_id)
)
Again it is much easier to observe differences in gene expression using the hexagon cell representation than the classic representation.
We can overlay the gene expression data with the clusters for convenience.
schex
output as ggplot
objectsThe schex
packages renders ordinary ggplot
objects and thus these can be treated and manipulated using the ggplot
grammar.
For example the non-data components of the plots can be changed using
the function theme
.
gene_id <- "CD19"
gg <- schex::plot_hexbin_feature(tenx_pbmc3k,
type = "logcounts", feature = gene_id,
action = "mean", xlab = "UMAP1", ylab = "UMAP2",
title = paste0("Mean of ", gene_id)
)
gg + theme_void()
The fact that schex
renders ggplot
objects
can also be used to save these plots. Simply use ggsave
in
order to save any created plot.
To find the details of the session for reproducibility, use this:
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggrepel_0.9.6 scran_1.35.0
#> [3] scater_1.35.0 scuttle_1.17.0
#> [5] TENxPBMCData_1.25.0 HDF5Array_1.35.1
#> [7] rhdf5_2.51.0 DelayedArray_0.33.1
#> [9] SparseArray_1.7.1 S4Arrays_1.7.1
#> [11] abind_1.4-8 Matrix_1.7-1
#> [13] igraph_2.1.1 Seurat_5.1.0
#> [15] SeuratObject_5.0.2 sp_2.1-4
#> [17] dplyr_1.1.4 schex_1.21.0
#> [19] SingleCellExperiment_1.29.0 SummarizedExperiment_1.37.0
#> [21] Biobase_2.67.0 GenomicRanges_1.59.0
#> [23] GenomeInfoDb_1.43.0 IRanges_2.41.0
#> [25] S4Vectors_0.45.0 BiocGenerics_0.53.1
#> [27] generics_0.1.3 MatrixGenerics_1.19.0
#> [29] matrixStats_1.4.1 ggplot2_3.5.1
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.5.0 later_1.3.2
#> [4] filelock_1.0.3 tibble_3.2.1 polyclip_1.10-7
#> [7] fastDummies_1.7.4 lifecycle_1.0.4 edgeR_4.5.0
#> [10] globals_0.16.3 lattice_0.22-6 MASS_7.3-61
#> [13] magrittr_2.0.3 limma_3.63.1 plotly_4.10.4
#> [16] sass_0.4.9 rmarkdown_2.29 jquerylib_0.1.4
#> [19] yaml_2.3.10 metapod_1.15.0 httpuv_1.6.15
#> [22] sctransform_0.4.1 spam_2.11-0 spatstat.sparse_3.1-0
#> [25] reticulate_1.39.0 cowplot_1.1.3 pbapply_1.7-2
#> [28] DBI_1.2.3 RColorBrewer_1.1-3 zlibbioc_1.53.0
#> [31] Rtsne_0.17 purrr_1.0.2 rappdirs_0.3.3
#> [34] tweenr_2.0.3 GenomeInfoDbData_1.2.13 irlba_2.3.5.1
#> [37] listenv_0.9.1 spatstat.utils_3.1-1 goftest_1.2-3
#> [40] RSpectra_0.16-2 dqrng_0.4.1 spatstat.random_3.3-2
#> [43] fitdistrplus_1.2-1 parallelly_1.38.0 leiden_0.4.3.1
#> [46] codetools_0.2-20 ggforce_0.4.2 tidyselect_1.2.1
#> [49] UCSC.utils_1.3.0 farver_2.1.2 viridis_0.6.5
#> [52] ScaledMatrix_1.15.0 BiocFileCache_2.15.0 spatstat.explore_3.3-3
#> [55] jsonlite_1.8.9 BiocNeighbors_2.1.0 progressr_0.15.0
#> [58] ggridges_0.5.6 survival_3.7-0 systemfonts_1.1.0
#> [61] tools_4.5.0 ica_1.0-3 Rcpp_1.0.13-1
#> [64] glue_1.8.0 gridExtra_2.3 xfun_0.49
#> [67] withr_3.0.2 BiocManager_1.30.25 fastmap_1.2.0
#> [70] bluster_1.17.0 rhdf5filters_1.19.0 fansi_1.0.6
#> [73] rsvd_1.0.5 entropy_1.3.1 digest_0.6.37
#> [76] R6_2.5.1 mime_0.12 colorspace_2.1-1
#> [79] scattermore_1.2 tensor_1.5 spatstat.data_3.1-2
#> [82] RSQLite_2.3.7 utf8_1.2.4 tidyr_1.3.1
#> [85] hexbin_1.28.4 data.table_1.16.2 FNN_1.1.4.1
#> [88] httr_1.4.7 htmlwidgets_1.6.4 uwot_0.2.2
#> [91] pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
#> [94] lmtest_0.9-40 XVector_0.47.0 htmltools_0.5.8.1
#> [97] dotCall64_1.2 scales_1.3.0 png_0.1-8
#> [100] spatstat.univar_3.1-1 knitr_1.48 reshape2_1.4.4
#> [103] curl_6.0.0 nlme_3.1-166 cachem_1.1.0
#> [106] zoo_1.8-12 stringr_1.5.1 BiocVersion_3.21.1
#> [109] KernSmooth_2.23-24 vipor_0.4.7 parallel_4.5.0
#> [112] miniUI_0.1.1.1 concaveman_1.1.0 AnnotationDbi_1.69.0
#> [115] pillar_1.9.0 grid_4.5.0 vctrs_0.6.5
#> [118] RANN_2.6.2 promises_1.3.0 BiocSingular_1.23.0
#> [121] beachmat_2.23.0 dbplyr_2.5.0 xtable_1.8-4
#> [124] cluster_2.1.6 beeswarm_0.4.0 evaluate_1.0.1
#> [127] locfit_1.5-9.10 cli_3.6.3 compiler_4.5.0
#> [130] rlang_1.1.4 crayon_1.5.3 future.apply_1.11.3
#> [133] labeling_0.4.3 ggbeeswarm_0.7.2 plyr_1.8.9
#> [136] stringi_1.8.4 BiocParallel_1.41.0 viridisLite_0.4.2
#> [139] deldir_2.0-4 Biostrings_2.75.0 munsell_0.5.1
#> [142] lazyeval_0.2.2 spatstat.geom_3.3-3 V8_6.0.0
#> [145] ExperimentHub_2.15.0 RcppHNSW_0.6.0 patchwork_1.3.0
#> [148] bit64_4.5.2 future_1.34.0 Rhdf5lib_1.29.0
#> [151] statmod_1.5.0 KEGGREST_1.47.0 shiny_1.9.1
#> [154] highr_0.11 AnnotationHub_3.15.0 ROCR_1.0-11
#> [157] memoise_2.0.1 bslib_0.8.0 bit_4.5.0