simPIC 1.5.3
simPIC is an R package for simulating single-cell ATAC sequencing (scATAC-seq) data. It provides flexible control over cell-type composition and batch structure. Users can generate synthetic scATAC-seq matrices that reflect realistic scATAC-seq data characteristics, including sparsity, heterogeneity, and technical variation.
simPIC supports simulation of:
scATAC-seq data that mimics key characteristics— library size, peak means, and cell sparsity of real scATAC-seq data.
Multiple cell types with varying proportions and distinct accessibility profiles.
Batch effects that mimic real experimental scenarios.
The output includes peak-level accessibility counts and metadata, which can be used in downstream analysis pipelines. This vignette provides an overview of simPIC’s key functions and demonstrates how to simulate datasets under different experimental designs.
simPIC can be installed from Bioconductor:
BiocManager::install("simPIC")
To install the latest development version from GitHub use:
BiocManager::install(
"sagrikachugh/simPIC",
dependencies = TRUE,
build_vignettes = TRUE
)
To get started with simPIC, follow these two simple steps to simulate a dataset:
Load your existing count matrix (similar to the one you wish to simulate) and estimate parameters using simPICestimate
function
Run the simulation with the simPICsimulate
function to generate a synthetic dataset that mimics the characteristics of your real data.
# Load package
suppressPackageStartupMessages({
library(simPIC)
})
# Load test data
set.seed(567)
counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Estimate parameters
est <- simPICestimate(counts)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
# Simulate data using estimated parameters
sim <- simPICsimulate(est)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
simPIC recommends using a Paired-Insertion Count (PIC) matrix for optimal utilization of the quantitative information in scATAC-seq data. To convert your own matrix to a PIC format, refer to the PICsnATAC
GitHub vignette for detailed instructions. In brief, you will need three input files:
singlecell.csv
)peaks.bed
)fragment.tsv.gz
)pic_mat <- PIC_counting(
cells = cells,
fragment_tsv_gz_file_location = fragment_tsv_gz_file_location,
peak_sets = peak_sets
)
The core of the simPIC simulation framework is a gamma-Poisson model
, which generates a uniform peak-by-cell count matrix reflecting chromatin accessibility. Mean accessibility values for each peak are drawn from a Weibull
distribution by default, though users can optionally choose from gamma
, lognormal-gamma
, or pareto
distributions to better match the characteristics of their data.
Each cell is assigned an expected library size, simulated using a log-normal
distribution to mimic real datasets. Finally, sparsity is introduced by applying a Bernoulli
distribution to counts simulated using Poisson
distribution, capturing the high sparsity nature of single-cell ATAC-seq data.
All simulation parameters in simPIC are stored in a dedicated simPICcount
object — a class specifically designed to hold settings and metadata for simulating single-cell ATAC-seq data. Let’s create one and explore its structure.
sim.params <- newsimPICcount()
The default values of the simPICcount
object are pre-loaded based on the included test dataset, providing a starting point for simulation.
sim.params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 5000
#>
#> Slot "nCells":
#> [1] 700
#>
#> Slot "seed":
#> [1] 77365
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 6.687082
#>
#> Slot "lib.size.sdlog":
#> [1] 0.344361
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.18901691 0.02974688 0.31947875 0.34352699 0.55482363 0.34194764
#> [7] 0.69342824 0.39277430 0.01497159 0.31065606 0.55345955 0.43382719
#> [13] 0.25510441 0.57139502 0.29426623 0.16502075 0.30144439 0.03273580
#> [19] 0.57669530 0.74481656 0.97554470 0.59718436 0.82872739 0.53715636
#> [25] 0.60640269 0.66897665 0.05996589 0.82303377 0.72567261 0.30723965
#> [ reached 'max' / getOption("max.print") -- omitted 199970 entries ]
#>
#> Slot "nGroups":
#> [1] 1
#>
#> Slot "group.prob":
#> [1] 1
#>
#> Slot "da.prob":
#> [1] 0.13
#>
#> Slot "da.downProb":
#> [1] 0.5
#>
#> Slot "da.facLoc":
#> [1] 0.1
#>
#> Slot "da.facScale":
#> [1] 0.4
#>
#> Slot "bcv.common":
#> [1] 0.1
#>
#> Slot "bcv.df":
#> [1] 60
#>
#> Slot "nBatches":
#> [1] 1
#>
#> Slot "batchCells":
#> [1] 100
#>
#> Slot "batch.facLoc":
#> [1] 0.1
#>
#> Slot "batch.facScale":
#> [1] 0.1
#>
#> Slot "batch.rmEffect":
#> [1] FALSE
To access a specific parameter, such as the number of peaks, the user can use the simPICget
function:
simPICget(sim.params, "nPeaks")
#> [1] 5000
Alternatively, to assign a new value to a parameter, we can use the setsimPICparameters
function:
sim.params <- setsimPICparameters(sim.params, nPeaks = 2000)
simPICget(sim.params, "nPeaks")
#> [1] 2000
The above functions also allows getting or setting multiple parameters simultaneously:
# Set multiple parameters at once (using a list)
sim.params <- setsimPICparameters(sim.params,
update = list(nPeaks = 8000, nCells = 500)
)
# Extract multiple parameters as a list
params <- simPICgetparameters(
sim.params,
c("nPeaks", "nCells", "peak.mean.shape")
)
# Set multiple parameters at once (using additional arguments)
params <- setsimPICparameters(sim.params,
lib.size.sdlog = 3.5, lib.size.meanlog = 9.07
)
params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 8000
#>
#> Slot "nCells":
#> [1] 500
#>
#> Slot "seed":
#> [1] 77365
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 9.07
#>
#> Slot "lib.size.sdlog":
#> [1] 3.5
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.18901691 0.02974688 0.31947875 0.34352699 0.55482363 0.34194764
#> [7] 0.69342824 0.39277430 0.01497159 0.31065606 0.55345955 0.43382719
#> [13] 0.25510441 0.57139502 0.29426623 0.16502075 0.30144439 0.03273580
#> [19] 0.57669530 0.74481656 0.97554470 0.59718436 0.82872739 0.53715636
#> [25] 0.60640269 0.66897665 0.05996589 0.82303377 0.72567261 0.30723965
#> [ reached 'max' / getOption("max.print") -- omitted 199970 entries ]
#>
#> Slot "nGroups":
#> [1] 1
#>
#> Slot "group.prob":
#> [1] 1
#>
#> Slot "da.prob":
#> [1] 0.13
#>
#> Slot "da.downProb":
#> [1] 0.5
#>
#> Slot "da.facLoc":
#> [1] 0.1
#>
#> Slot "da.facScale":
#> [1] 0.4
#>
#> Slot "bcv.common":
#> [1] 0.1
#>
#> Slot "bcv.df":
#> [1] 60
#>
#> Slot "nBatches":
#> [1] 1
#>
#> Slot "batchCells":
#> [1] 100
#>
#> Slot "batch.facLoc":
#> [1] 0.1
#>
#> Slot "batch.facScale":
#> [1] 0.1
#>
#> Slot "batch.rmEffect":
#> [1] FALSE
simPIC enables the estimation of its parameters from a SingleCellExperiment object or a count matrix using the simPICestimate
function.
In this section, we estimate parameters from a counts matrix using the default settings. The estimation process involves the following steps:
Estimating mean parameters by fitting a Weibull distribution (default) to the peak means.
Estimating library size parameters by fitting a log-normal distribution to the library sizes.
Estimating the sparsity parameter by fitting a Bernoulli distribution.
# Get the counts from test data
#counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Check that counts is a dgCMatrix
class(counts)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
typeof(counts)
#> [1] "S4"
# Check the dimensions, each row is a peak, each column is a cell
dim(counts)
#> [1] 90000 500
# Show the first few entries
counts[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> GTCACAAGTGGCATAG-1 CCATACCGTGAGGGTT-1
#> chr11-3622698-3623526 . .
#> chr1-40703222-40703996 . .
#> chr2-21316881-21317790 . .
#> chr18-3411756-3412657 . .
#> chr10-12446853-12447562 . .
#> AACAAAGGTCGTAGTT-1 ATTGTGGTCGATGTGT-1
#> chr11-3622698-3623526 . .
#> chr1-40703222-40703996 . .
#> chr2-21316881-21317790 . .
#> chr18-3411756-3412657 . .
#> chr10-12446853-12447562 . 1
#> TCTATTGGTCTGGGAA-1
#> chr11-3622698-3623526 .
#> chr1-40703222-40703996 .
#> chr2-21316881-21317790 .
#> chr18-3411756-3412657 .
#> chr10-12446853-12447562 .
new <- newsimPICcount()
new <- simPICestimate(counts)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
## estimating using gamma distribution
## new <- simPICestimate(counts, pm.distr = "gamma")
For more details on the estimation procedures, refer to ?simPICestimate
.
Once we have a set of parameters, we can use the simPICsimulate
function to simulate counts. To modify the parameters, simply provide them as additional arguments. If no parameters are supplied, the default values will be used:
sim <- simPICsimulate(new, nCells = 500)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim
#> class: SingleCellExperiment
#> dim: 90000 500
#> metadata(1): Params
#> assays(3): BatchCellMeans BaseCellMeans counts
#> rownames(90000): Peak1 Peak2 ... Peak89999 Peak90000
#> rowData names(2): Peak exp.peakmean
#> colnames(500): Cell1 Cell2 ... Cell499 Cell500
#> colData names(3): Cell Batch exp.libsize
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
## simulating using gamma distribution
## sim <- simPICsimulate(new, nCells =500, pm.distr = "gamma")
The output of simPICsimulate
is a SingleCellExperiment
object, where sim
contains 90000 features (peaks) and 500 samples (cells). The main component of this object is a matrix of simulated counts, organized by features (peaks) and samples (cells), which can be accessed using the counts
function. In addition, the SingleCellExperiment
object stores metadata for each cell (accessible via colData
) and each peak (accessible via rowData
). simPIC uses these slots, along with assays
, to store intermediate values during the simulation process.
# Access the counts
counts(sim)[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Peak1 0 0 0 0 0
#> Peak2 0 0 0 0 0
#> Peak3 2 0 0 2 0
#> Peak4 0 0 0 1 1
#> Peak5 0 0 0 0 0
# Information about peaks
head(rowData(sim))
#> DataFrame with 6 rows and 2 columns
#> Peak exp.peakmean
#> <character> <numeric>
#> Peak1 Peak1 3.85424e-06
#> Peak2 Peak2 8.55952e-06
#> Peak3 Peak3 6.38664e-05
#> Peak4 Peak4 3.20358e-05
#> Peak5 Peak5 1.23968e-05
#> Peak6 Peak6 1.24275e-06
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 3 columns
#> Cell Batch exp.libsize
#> <character> <character> <numeric>
#> Cell1 Cell1 Batch1 6152
#> Cell2 Cell2 Batch1 8176
#> Cell3 Cell3 Batch1 7966
#> Cell4 Cell4 Batch1 14514
#> Cell5 Cell5 Batch1 8053
#> Cell6 Cell6 Batch1 8274
# Peak by cell matrices
names(assays(sim))
#> [1] "BatchCellMeans" "BaseCellMeans" "counts"
The simPICsimulate
function provides additional simulation details:
Cell Information (colData
)
- Cell
: Unique cell identifier.
- exp.libsize
: Expected library size for that cell (not derived from the final simulated counts).
Peak Information (rowData
)
- Peak
: Unique peak identifier.
- exp.peakmean
: Expected peak means for that peak (not derived from the final simulated counts).
Peak-by-Cell Information (assays
)
- counts
: Final simulated counts.
For more information on the simulation process, see ?simPICsimulate
.
simPIC provides a function simPICcompare
that helps simplify the comparison between simulations and real data. This function takes a list of SingleCellExperiment
objects, combines the datasets, and produces comparison plots. Let’s create two small simulations and see how they compare.
sim1 <- simPICsimulate(nPeaks = 2000, nCells = 500)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim2 <- simPICsimulate(nPeaks = 2000, nCells = 500)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
comparison <- simPICcompare(list(real = sim1, simPIC = sim2))
names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means" "Variances" "MeanVar" "LibrarySizes" "ZerosPeak"
#> [6] "ZerosCell" "MeanZeros" "PeakMeanNzp"
The returned list from simPICcompare
contains three items:
The first two items are the combined datasets:
By peak (RowData
)
By cell (ColData
)
The third item contains comparison plots (produced using ggplot2
), such as a plot of the distribution of means.
comparison$Plots$Means
simPIC allows you to simulate data for multiple cell types, each with its own distinct accessibility profile. This is useful when you want to model more complex scenarios where different cell types contribute to the overall chromatin accessibility landscape.
To simulate multiple cell-types, set the method
parameter to groups
. This tells simPIC to simulate counts for multiple cell-types. You also need to specify the number of groups (cell-types) and their respective proportions using the nGroups
and group.prob
parameters.
For example, to simulate two cell types with equal proportions, you can use:
#counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
sim <- simPICsimulate(new, method = "groups",
nGroups = 2, group.prob = c(0.5, 0.5))
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating group DA
#> Simulating group (cell) means
#> Simulating BCV
#> Simulating true counts groups...
#> Done!!
library(SingleCellExperiment)
library(scater)
#> Loading required package: scuttle
#> Loading required package: ggplot2
sim <- logNormCounts(sim)
sim <- scater::runPCA(sim)
plotPCASCE(sim,color_by="Group")
In above example, two cell-types were simulated in equal proportions (50% each). You can adjust the proportions to reflect your experimental design, such as having one dominant cell type. The group.probs
parameter should sum to 1. You can simulate any number of cell types by increasing nGroups
.
sim_multi <- simPICsimulate(new, method ="groups",
nGroups = 2,
group.prob = c(0.7, 0.3))
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating group DA
#> Simulating group (cell) means
#> Simulating BCV
#> Simulating true counts groups...
#> Done!!
sim_multi <- logNormCounts(sim_multi)
sim_multi <- runPCA(sim_multi)
plotPCASCE(sim_multi, color_by="Group")
simPIC also supports adding batch effects, allowing you to simulate variations in peak accessibility across batches or conditions for different cell types. To simulate batch effects, you can use the nBatches
parameter to specify the number of batches, and the batchCells
parameter to define the number of cells in each batch.
set.seed(567)
sim_batch <- simPICsimulate(new, method="groups",
nGroups=2, nBatches=2,
group.prob=c(0.5, 0.5),
batchCells=c(250,250))
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating batch effects...
#> Simulating group DA
#> Simulating group (cell) means
#> Simulating BCV
#> Simulating true counts groups...
#> Done!!
sim_batch <- logNormCounts(sim_batch)
sim_batch <- runPCA(sim_batch)
plotPCASCE(sim_batch, color_by="Batch", shape_by="Group")
If you use simPIC in your work please cite our paper:
citation("simPIC")
#> To cite package 'simPIC' in publications use:
#>
#> Chugh S, Shim H, McCarthy D (2025). _simPIC: Flexible simulation of
#> paired-insertion counts for single-cell ATAC-sequencing data_.
#> doi:10.18129/B9.bioc.simPIC
#> <https://doi.org/10.18129/B9.bioc.simPIC>, R package version 1.5.3,
#> <https://bioconductor.org/packages/simPIC>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {simPIC: Flexible simulation of paired-insertion counts for single-cell
#> ATAC-sequencing data},
#> author = {Sagrika Chugh and Heejung Shim and Davis McCarthy},
#> year = {2025},
#> note = {R package version 1.5.3},
#> url = {https://bioconductor.org/packages/simPIC},
#> doi = {10.18129/B9.bioc.simPIC},
#> }
sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scater_1.37.0 ggplot2_3.5.2
#> [3] scuttle_1.19.0 simPIC_1.5.3
#> [5] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
#> [7] Biobase_2.69.0 GenomicRanges_1.61.0
#> [9] GenomeInfoDb_1.45.1 IRanges_2.43.0
#> [11] S4Vectors_0.47.0 BiocGenerics_0.55.0
#> [13] generics_0.1.3 MatrixGenerics_1.21.0
#> [15] matrixStats_1.5.0 BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 vipor_0.4.7
#> [4] dplyr_1.1.4 farver_2.1.2 viridis_0.6.5
#> [7] fastmap_1.2.0 digest_0.6.37 rsvd_1.0.5
#> [10] lifecycle_1.0.4 survival_3.8-3 magrittr_2.0.3
#> [13] compiler_4.5.0 rlang_1.1.6 sass_0.4.10
#> [16] tools_4.5.0 yaml_2.3.10 knitr_1.50
#> [19] S4Arrays_1.9.0 labeling_0.4.3 fitdistrplus_1.2-2
#> [22] DelayedArray_0.35.1 RColorBrewer_1.1-3 abind_1.4-8
#> [25] BiocParallel_1.43.0 withr_3.0.2 grid_4.5.0
#> [28] beachmat_2.25.0 scales_1.4.0 MASS_7.3-65
#> [ reached 'max' / getOption("max.print") -- omitted 44 entries ]