A PCA-Enhanced t-SNE Plot and Its Application in Biological and Medical Research
DOI:
https://doi.org/10.6000/1929-6029.2025.14.76
Keywords:
Clustering, Gene expression data, k-means algorithm, Principal component analysis, projected F-test, t-SNE plot
Abstract
In this paper, we apply a two-step dimension reduction method, PCA-t-SNE, to a real gene expression dataset as a case study. We find that PCA-t-SNE can significantly improve the visualization and cluster separation of high-dimensional biological data. While t-SNE alone often fails to reveal clear cluster structures in complex datasets, our approach first applies Principal Component Analysis (PCA) to reduce noise and dimensionality, then applies t-SNE to embed the data in a two-dimensional space, and finally applies the k-means algorithm to cluster the two-dimensional data. We demonstrate that PCA-t-SNE produces more distinct and interpretable clusters than standard t-SNE. Statistical validation via a projected F-test for MANOVA confirms that clusters derived from PCA-t-SNE exhibit significantly greater mean separation, with lower p-values, underscoring the enhanced discriminative power of the method. The proposed PCA-t-SNE plot proves particularly effective for nonlinear data where conventional t-SNE performs poorly, offering a robust visualization tool and supporting the utility of sequential dimension reduction in exploratory data analysis for biological and medical research.
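The three steps described in the abstract (PCA, then t-SNE, then k-means on the two-dimensional embedding) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function name `pca_tsne_kmeans`, the number of retained principal components, the perplexity, the number of clusters, and the synthetic data from `make_blobs` are all assumptions chosen for the example, since the abstract does not state the settings used in the case study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_tsne_kmeans(X, n_pcs=10, n_clusters=3, perplexity=30.0, seed=0):
    """Hypothetical helper: PCA -> t-SNE -> k-means on the 2-D embedding.

    All default values are illustrative assumptions, not values from the paper.
    """
    # Step 1: linear reduction with PCA to suppress noise and shrink the dimension.
    reduced = PCA(n_components=n_pcs, random_state=seed).fit_transform(X)
    # Step 2: nonlinear embedding of the PCA scores into two dimensions.
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(reduced)
    # Step 3: k-means clustering of the two-dimensional points.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)
    return embedding, labels


# Synthetic stand-in for an expression matrix: 150 samples, 50 features.
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)
embedding, labels = pca_tsne_kmeans(X)
print(embedding.shape, np.unique(labels))
```

For the pure t-SNE comparison mentioned in the abstract, the same sketch could be run with the PCA step skipped, passing `X` directly to `TSNE`.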
Purpose: This study aims to evaluate the effect of combining classical PCA with the modern t-SNE technique for dimension reduction when clustering high-dimensional gene expression data, in terms of both visualization and MANOVA.
Methods: This paper presents a combined approach to dimension reduction for high-dimensional gene expression data. The visual result is reinforced by the classical MANOVA method for large sample sizes (n > p) and by the newly developed MANOVA method for small sample sizes (n < p).
Results: The proposed PCA-t-SNE approach significantly improves on the pure t-SNE approach for the selected gene expression dataset, yielding a clearer classification of the data under both visual inspection and statistical significance tests. PCA thus serves as a pre-processing step for high-dimensional gene expression data before the nonlinear dimension reduction, making the t-SNE approach more effective.
Contribution: We carry out a successful application of the two-step dimension reduction method PCA-t-SNE to a real gene expression dataset as a case study. The PCA-t-SNE approach to visualizing high-dimensional gene expression data, supported by projection-type MANOVA tests, opens a new way to discriminate complex high-dimensional data with statistical significance when the dimension is high and the sample size is small (n < p). It improves the clustering of nonlinear data for which pure t-SNE almost fails to discriminate the clusters, and it provides insight into the two-step dimension reduction approach.
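For the large-sample case (n > p), a classical one-way MANOVA on a two-dimensional embedding can be sketched with Wilks' lambda. This is only an illustration of the classical test, not the projected F-test developed for n < p, whose construction is not given here. The function name `wilks_lambda_2d` and the synthetic clusters are assumptions; the F transformation used is the standard exact result for two response variables.

```python
import numpy as np
from scipy import stats


def wilks_lambda_2d(points, labels):
    """Hypothetical helper: classical one-way MANOVA for 2-D points.

    Returns Wilks' lambda, the exact F statistic for p = 2, and its p-value.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n, p = points.shape
    if p != 2:
        raise ValueError("this exact F transformation assumes two dimensions")
    groups = np.unique(labels)
    g = len(groups)
    grand_mean = points.mean(axis=0)
    within = np.zeros((p, p))   # within-group sums of squares and cross-products
    between = np.zeros((p, p))  # between-group sums of squares and cross-products
    for k in groups:
        block = points[labels == k]
        mean_k = block.mean(axis=0)
        centered = block - mean_k
        within += centered.T @ centered
        diff = (mean_k - grand_mean)[:, None]
        between += len(block) * (diff @ diff.T)
    lam = np.linalg.det(within) / np.linalg.det(within + between)
    root = np.sqrt(lam)
    # For p = 2 this statistic follows F with 2(g-1) and 2(n-g-1) degrees of freedom.
    f_stat = (1.0 - root) / root * (n - g - 1) / (g - 1)
    p_value = stats.f.sf(f_stat, 2 * (g - 1), 2 * (n - g - 1))
    return lam, f_stat, p_value


# Synthetic example: three well-separated 2-D clusters of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
points = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
labels = np.repeat(np.arange(3), 30)
lam, f_stat, p_value = wilks_lambda_2d(points, labels)
print(lam, f_stat, p_value)
```

A small Wilks' lambda (close to 0) indicates that most of the total variation lies between the cluster means, which corresponds to the "greater mean separation" compared across methods.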
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.