SimSeq - Nonparametric Simulation of RNA-Seq Data
RNA sequencing analysis methods are often derived from
hypothetical parametric models for read counts that are
unlikely to hold exactly in practice. Methods
are often tested by analyzing data that have been simulated
according to the assumed model. This testing strategy can
result in an overly optimistic view of the performance of an
RNA-seq analysis method. We develop a data-based simulation
algorithm for RNA-seq data. The vector of read counts simulated
for a given experimental unit has a joint distribution that
closely matches the distribution of a source RNA-seq dataset
provided by the user. Users control the proportion of genes
simulated to be differentially expressed (DE) and can provide a
vector of weights to control the distribution of effect sizes.
The algorithm requires a matrix of RNA-seq read counts with
large sample sizes in at least two treatment groups. Many
available datasets meet this requirement.
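
The idea of simulating from a source dataset rather than from a parametric
model can be illustrated with a minimal sketch. This is not the SimSeq
implementation: the function names (simulate_counts, weighted_sample), the
argument names, and the column-swapping scheme are assumptions made for
illustration, and any library-size normalization is left out. In the sketch,
every simulated sample copies counts from whole source columns, so the
dependence between genes in the source data carries over. Genes chosen as DE
take their second-group counts from the other source treatment group, while
the remaining genes take them from additional columns of the same group.

```python
import random


def weighted_sample(rng, items, weights, k):
    """Draw k distinct items with probability proportional to weight."""
    pool, w, chosen = list(items), list(weights), []
    for _ in range(k):
        i = rng.choices(range(len(pool)), weights=w)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return chosen


def simulate_counts(counts, groups, n_genes, n_per_group, prop_de,
                    weights=None, seed=None):
    """Hypothetical sketch of a data-based two-group simulation.

    counts : list of rows (genes), each a list of read counts per sample
    groups : list of 0/1 treatment labels, one per source column
    Returns (simulated matrix, source gene indices, DE flags).
    """
    rng = random.Random(seed)
    cols0 = [j for j, g in enumerate(groups) if g == 0]
    cols1 = [j for j, g in enumerate(groups) if g == 1]
    if len(cols0) < 2 * n_per_group or len(cols1) < n_per_group:
        raise ValueError("source groups are too small for the requested design")

    gene_ids = list(range(len(counts)))
    if weights is None:
        weights = [1.0] * len(gene_ids)  # assumption: uniform if not supplied

    # Choose DE genes using the weights, then fill up with other genes.
    n_de = round(prop_de * n_genes)
    de_set = set(weighted_sample(rng, gene_ids, weights, n_de))
    rest = [g for g in gene_ids if g not in de_set]
    genes = sorted(list(de_set) + rng.sample(rest, n_genes - n_de))

    # Source columns: a and b come from group 0 (disjoint), c from group 1.
    picked0 = rng.sample(cols0, 2 * n_per_group)
    a, b = picked0[:n_per_group], picked0[n_per_group:]
    c = rng.sample(cols1, n_per_group)

    sim = []
    for g in genes:
        second = c if g in de_set else b  # only DE genes cross groups
        sim.append([counts[g][j] for j in a] + [counts[g][j] for j in second])
    is_de = [g in de_set for g in genes]
    return sim, genes, is_de
```

Because the second simulated group needs its own columns for the non-DE
genes, the sketch uses up twice as many columns from one source group as the
simulated group size, which is one reason a source dataset with large sample
sizes is needed.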