A PCA-Enhanced t-SNE Plot and Its Application in Biological and Medical Research

Authors

  • Meng Guo, Department of Statistics and Data Science, Beijing Normal–Hong Kong Baptist University, 2000 Jintong Road, Tangjiawan, Zhuhai 519087, China
  • Haoyun Liu, Department of Statistics and Data Science, Beijing Normal–Hong Kong Baptist University, 2000 Jintong Road, Tangjiawan, Zhuhai 519087, China
  • Jiajuan Liang, Department of Statistics and Data Science, Beijing Normal–Hong Kong Baptist University, 2000 Jintong Road, Tangjiawan, Zhuhai 519087, China; and Guangdong Provincial/Zhuhai Key Laboratory of Interdisciplinary Research and Application for Data Science, Beijing Normal–Hong Kong Baptist University, 2000 Jintong Road, Tangjiawan, Zhuhai 519087, China

DOI:

https://doi.org/10.6000/1929-6029.2025.14.76

Keywords:

Clustering, Gene expression data, k-means algorithm, Principal component analysis, projected F-test, t-SNE plot

Abstract

In this paper, we apply a two-step dimension reduction method, PCA-t-SNE, to a real gene expression dataset as a case study. The results show that PCA-t-SNE can significantly improve the visualization and cluster separation of high-dimensional biological data. While t-SNE alone often fails to reveal clear cluster structures in complex datasets, our approach first applies Principal Component Analysis (PCA) to reduce noise and dimensionality, then applies t-SNE to condense the data into a two-dimensional space, and finally applies the k-means algorithm to cluster the two-dimensional data. We demonstrate that PCA-t-SNE produces more distinct and interpretable clusters than standard t-SNE. Statistical validation via a projected F-test for MANOVA confirms that clusters derived from PCA-t-SNE exhibit significantly greater mean separation, with lower p-values, underscoring the enhanced discriminative power of the method. The proposed PCA-t-SNE plot proves particularly effective for nonlinear data where conventional t-SNE performs poorly, offering a robust visualization tool and supporting the utility of sequential dimension reduction in exploratory data analysis for biological and medical research.
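The three steps named in the abstract (PCA, then t-SNE, then k-means) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function name `pca_tsne_kmeans`, the standardization step, the synthetic data, and all parameter values (number of principal components, perplexity, number of clusters) are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def pca_tsne_kmeans(X, n_pcs=20, n_clusters=3, perplexity=30.0, seed=0):
    """Hypothetical helper: PCA -> t-SNE (2-D) -> k-means on the 2-D embedding."""
    # Assumption: features are standardized before PCA.
    X_std = StandardScaler().fit_transform(X)
    # Step 1: linear reduction to a moderate number of principal components.
    scores = PCA(n_components=n_pcs, random_state=seed).fit_transform(X_std)
    # Step 2: nonlinear reduction of the PCA scores to two dimensions.
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(scores)
    # Step 3: cluster the two-dimensional points.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)
    return embedding, labels


# Synthetic stand-in for an expression matrix: 90 samples, 200 features,
# three groups that differ in the mean of the first 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 200))
X[30:60, :10] += 3.0
X[60:, :10] -= 3.0

embedding, labels = pca_tsne_kmeans(X)
```

The returned `embedding` can be passed to any scatter-plot routine and colored by `labels`. In practice, the number of retained components and the perplexity would need to be tuned to the dataset at hand.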

Purpose: This study evaluates the effect of combining classical PCA with the modern t-SNE technique for dimension reduction when clustering high-dimensional gene expression data, from the perspectives of both visualization and MANOVA.

Methods: This paper presents a combined approach to dimension reduction for high-dimensional gene expression data. The visual evidence is reinforced by the classical MANOVA method for large sample sizes (n > p) and by the newly developed MANOVA method for small sample sizes (n < p).
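For the large-sample case (n > p), a classical one-way MANOVA on the two-dimensional embedding can be sketched as below. This uses Wilks' lambda with the F transformation that is exact for two response variables under the usual MANOVA assumptions; it is a stand-in for illustration and is not the projected F-test used in the paper. The function name `wilks_lambda_2d` and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import f as f_dist


def wilks_lambda_2d(Y, labels):
    """Hypothetical helper: one-way MANOVA (Wilks' lambda) for 2-D data."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    n, p = Y.shape
    if p != 2:
        raise ValueError("this F transformation assumes exactly two variables")
    groups = np.unique(labels)
    g = len(groups)
    grand_mean = Y.mean(axis=0)
    W = np.zeros((2, 2))  # within-group sum of squares and cross-products
    B = np.zeros((2, 2))  # between-group sum of squares and cross-products
    for k in groups:
        Yk = Y[labels == k]
        mean_k = Yk.mean(axis=0)
        resid = Yk - mean_k
        W += resid.T @ resid
        d = (mean_k - grand_mean)[:, None]
        B += len(Yk) * (d @ d.T)
    lam = np.linalg.det(W) / np.linalg.det(W + B)
    root = np.sqrt(lam)
    # For p = 2 this statistic follows F with (2(g-1), 2(n-g-1)) degrees of
    # freedom when the group means are equal.
    F = (1.0 - root) / root * (n - g - 1) / (g - 1)
    p_value = f_dist.sf(F, 2 * (g - 1), 2 * (n - g - 1))
    return lam, F, p_value


# Synthetic 2-D points in three well-separated groups (illustration only).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
Y = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
groups = np.repeat([0, 1, 2], 30)

lam, F, p_value = wilks_lambda_2d(Y, groups)
```

A value of `lam` close to 0 indicates that most of the total variation lies between the group means, while a value close to 1 indicates little separation. Comparing `lam` or `p_value` between two embeddings of the same data is one simple way to quantify which embedding separates the clusters more clearly.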

Results: For the selected gene expression dataset, the proposed PCA-t-SNE approach significantly improves on pure t-SNE, giving a clearer classification of the data in terms of both visual inspection and statistical significance tests. PCA thus serves as a pre-processing step for high-dimensional gene expression data before the nonlinear dimension reduction is applied, making the t-SNE approach more effective.

Contribution: We successfully apply the two-step dimension reduction method PCA-t-SNE to a real gene expression dataset as a case study. The PCA-t-SNE approach to visualizing high-dimensional gene expression data, reinforced by projection-type MANOVA tests, opens a new way to discriminate complex high-dimensional data with statistical significance when the dimension is high and the sample size is small (n < p). It enhances the clustering of nonlinear data for which pure t-SNE almost fails to discriminate the clusters, and provides insight into two-step dimension reduction.

Published

2025-12-26

How to Cite

Guo, M., Liu, H., & Liang, J. (2025). A PCA-Enhanced t-SNE Plot and Its Application in Biological and Medical Research. International Journal of Statistics in Medical Research, 14, 844–854. https://doi.org/10.6000/1929-6029.2025.14.76

Section

Special Issue: New Advances in Multiple Statistical Comparison and Its Applications in Medicine