Multiple Mean Comparison for Clusters of Gene Expression Data through the t-SNE Plot and PCA Dimension Reduction
DOI:
https://doi.org/10.6000/1929-6029.2025.14.01
Keywords:
Dimension reduction, F-test, Gene expression data, Multiple mean comparison, t-SNE plot
Abstract
This paper introduces a methodology for multiple mean comparison of clusters identified in gene expression data through the t-distributed Stochastic Neighbor Embedding (t-SNE) plot, a powerful dimensionality reduction technique for visualizing high-dimensional gene expression data. Our approach integrates the t-SNE visualization with rigorous statistical testing to validate the differences between identified clusters, bridging the gap between exploratory and confirmatory data analysis. We applied the methodology to two real-world gene expression datasets for which the t-SNE plots clearly separated clusters corresponding to different expression levels. Our findings underscore the value of combining the t-SNE visualization with multiple mean comparison in gene expression analysis. This integrated approach enhances the interpretability of complex data and provides a robust statistical framework for validating observed patterns. The classical MANOVA method can be applied to the same multiple mean comparison, but it requires the total sample size to exceed the data dimension and mostly relies on an asymptotic null distribution. The proposed approach applies broadly to high-dimensional data with small sample sizes, and its test statistic has an exact null distribution.
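The sample-size requirement of MANOVA can be illustrated numerically. The following is a minimal sketch, assuming simulated Gaussian data and illustrative sizes (`n_per_group`, `k`, and `p` are not taken from the paper): with n total samples in k groups, the within-group sum-of-products matrix E has rank at most n - k, so when the dimension p exceeds n - k its determinant is zero and Wilks' Lambda, which is built from det(E), degenerates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): 15 samples in 3 groups, 50 features.
n_per_group, k, p = 5, 3, 50
labels = np.repeat(np.arange(k), n_per_group)
X = rng.standard_normal((k * n_per_group, p))

# Center each group at its own mean; E = R.T @ R is the within-group
# sum-of-products matrix used by MANOVA statistics such as Wilks' Lambda.
R = np.vstack([X[labels == g] - X[labels == g].mean(axis=0) for g in range(k)])

# rank(E) equals rank(R), which cannot exceed n - k.
rank_E = np.linalg.matrix_rank(R)
print(f"rank of E: {rank_E}, dimension p: {p}")
```

Because the rank is smaller than p, E is singular in this setting, which is the situation the projected test is meant to handle.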
Objective: To propose a two-step approach to the analysis of gene expression data.
Methods: Gene expression data usually possess a complicated nonlinear structure that cannot be visualized with simple linear dimension reduction such as principal component analysis (PCA). We propose to first apply the existing t-SNE approach to dimension reduction so that clusters in the data can be clearly visualized, and then to apply multiple mean comparison methods to carry out statistical inference. For the second step we propose the PCA-type projected exact F-test for multiple mean comparison among the clusters. It is superior to the classical MANOVA method in the case of high dimension and a relatively large number of clusters.
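The second step can be sketched as follows. This is a minimal illustration in the spirit of Läuter-type exact tests, not necessarily the authors' exact procedure: it assumes the projection direction is the leading eigenvector of the total sum-of-products matrix (computed without using the cluster labels), projects each sample onto that direction, and runs a one-way ANOVA F-test on the resulting scores. The function name `projected_f_test` and the toy data are hypothetical. The exactness of the F distribution under the null relies on multivariate normality with a common covariance matrix.

```python
import numpy as np
from scipy import stats


def projected_f_test(X, labels):
    """One-way ANOVA F-test on the first principal-component score.

    Assumption of this sketch: the direction is the leading eigenvector of
    the total sum-of-products matrix, so it does not depend on the labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Center at the overall mean (not at the group means).
    Xc = X - X.mean(axis=0)

    # First right singular vector of Xc = leading eigenvector of Xc.T @ Xc.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[0]

    # Univariate one-way ANOVA on the projected scores.
    groups = [scores[labels == g] for g in np.unique(labels)]
    f_stat, p_value = stats.f_oneway(*groups)
    df = (len(groups) - 1, len(scores) - len(groups))
    return f_stat, p_value, df


# Toy data (assumed): 3 clusters of 10 samples, 200 features, shifted means.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)
X = rng.standard_normal((30, 200)) + labels[:, None]

f_stat, p_value, df = projected_f_test(X, labels)
print(f"F = {f_stat:.2f}, df = {df}, p = {p_value:.3g}")
```

The test works here even though the dimension (200) far exceeds the sample size (30), because only a single projected score per sample enters the F-statistic.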
Results: In a simple Monte Carlo comparison between the projected F-test and the classical MANOVA Wilks’ Lambda test, and in an illustration on two real datasets, the projected F-test shows better empirical power than the classical Wilks’ Lambda test. Applying the t-SNE plot to real gene expression data reveals a clear cluster structure. The projected F-test then strengthens the interpretation of the t-SNE plot by confirming that the visualized clusters differ significantly.
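A Monte Carlo comparison of this kind can be sketched as below. Everything specific here is an assumption for illustration rather than the paper's design: the simulated Gaussian data, the sizes, the mean-shift pattern, the first-principal-component projection, and the use of Bartlett's chi-square approximation for Wilks' Lambda. The sketch only estimates rejection rates; it makes no claim about which test wins in a given setting.

```python
import numpy as np
from scipy import stats


def projected_f_pvalue(X, labels):
    # F-test on the first principal-component score (label-free direction).
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    z = Xc @ vt[0]
    return stats.f_oneway(*[z[labels == g] for g in np.unique(labels)]).pvalue


def wilks_pvalue(X, labels):
    # Wilks' Lambda = det(E) / det(T) with Bartlett's chi-square approximation.
    n, p = X.shape
    uniq = np.unique(labels)
    k = len(uniq)
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                       # total sum-of-products matrix
    E = np.zeros((p, p))                # within-group sum-of-products matrix
    for g in uniq:
        R = X[labels == g] - X[labels == g].mean(axis=0)
        E += R.T @ R
    log_lambda = np.linalg.slogdet(E)[1] - np.linalg.slogdet(T)[1]
    stat = -(n - 1 - (p + k) / 2) * log_lambda
    return stats.chi2.sf(stat, df=p * (k - 1))


def rejection_rates(shift, n_per_group=10, k=3, p=5, reps=200, alpha=0.05, seed=0):
    """Estimate rejection rates of both tests; shift=0 gives the null case."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n_per_group)
    hits_f = hits_w = 0
    for _ in range(reps):
        X = rng.standard_normal((k * n_per_group, p))
        X += shift * labels[:, None]    # group g is shifted by g * shift
        hits_f += int(projected_f_pvalue(X, labels) < alpha)
        hits_w += int(wilks_pvalue(X, labels) < alpha)
    return hits_f / reps, hits_w / reps


size_f, size_w = rejection_rates(shift=0.0)
power_f, power_w = rejection_rates(shift=1.5)
print("empirical size  (F, Wilks):", size_f, size_w)
print("empirical power (F, Wilks):", power_f, power_w)
```

Note that the Wilks comparison is only possible when p is smaller than n - k, so this sketch uses a low dimension; the projected test has no such restriction.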
Conclusion: Our findings suggest that the combination of the t-SNE visualization and multiple mean comparison through the PCA-projected exact F-test is a valuable tool for gene expression analysis. It not only enhances the interpretability of high-dimensional data but also provides a rigorous statistical framework for validating the observed patterns.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.