Kybernetika 48 no. 3, 478-493, 2012

Significance tests to identify regulated proteins based on a large number of small samples

Frank Klawonn


Modern biology is interested in better understanding mechanisms within cells. For this purpose, products of cells like metabolites, peptides, proteins or mRNA are measured and compared under different conditions, for instance healthy cells vs. infected cells. Such experiments usually yield regulation or expression values - the abundance or absence of a cell product in one condition compared to another one - for a large number of cell products, but with only a few replicates. In order to distinguish random fluctuations and noise from true regulations, suitable significance tests are needed. Here we propose a simple model which is based on the assumption that the regulation factors follow normal distributions with different expected values, but with the same standard deviation. Before suitable significance tests can be derived from this model, a reliable estimate of the standard deviation in the context of many small samples is needed. We therefore also include a discussion of the properties of the sample MAD (Median Absolute Deviation from the median) and the sample standard deviation for small sample sizes.
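The setting described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the method of the paper: the number of cell products, the number of replicates, the value of sigma, the range of the expected values and the function names `scaled_mad` and `simulate` are all hypothetical choices made for illustration. It draws many small normal samples that share one standard deviation and averages the sample standard deviation and the scaled sample MAD over them, so that one can inspect how far each estimator is from the true value at a small sample size.

```python
import random
import statistics

# Consistency factor 1 / Phi^{-1}(3/4), roughly 1.4826, which makes the MAD
# estimate sigma for normally distributed data in the large-sample limit.
MAD_SCALE = 1.0 / statistics.NormalDist().inv_cdf(0.75)


def scaled_mad(sample):
    """Median absolute deviation from the median, scaled for normal data."""
    med = statistics.median(sample)
    return MAD_SCALE * statistics.median(abs(x - med) for x in sample)


def simulate(num_products=2000, replicates=4, sigma=0.5, seed=1):
    """Hypothetical setup: each product has its own mean, all share sigma."""
    rng = random.Random(seed)
    sds, mads = [], []
    for _ in range(num_products):
        mu = rng.uniform(-2.0, 2.0)  # product-specific expected value (assumed range)
        sample = [rng.gauss(mu, sigma) for _ in range(replicates)]
        sds.append(statistics.stdev(sample))  # sample SD with n - 1 denominator
        mads.append(scaled_mad(sample))
    # Average each estimator over all small samples.
    return statistics.mean(sds), statistics.mean(mads)


mean_sd, mean_mad = simulate()
print(f"true sigma = 0.5, mean sample SD = {mean_sd:.3f}, "
      f"mean scaled MAD = {mean_mad:.3f}")
```

The large-sample scaling factor is used here on purpose: it is the conventional default, and the simulation makes visible that it need not be appropriate for samples of only a few replicates.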


MAD, standard deviation, small samples, significance test


93E12, 62A10

