Exploring the Future of Corpus Linguistics: Innovations in AI and Social Impact


  • Ersilia Incelli, Sapienza University of Rome, Italy




Keywords: AI-driven corpus linguistics, AI-generated text, AI integration, collocations, language processing, social implications


This paper explores the evolving landscape of corpus linguistics, focusing on the impact of artificial intelligence (AI) and its social implications. Over the past two decades, the study of language through corpus linguistics has evolved significantly, prompting ongoing reflection on the field's transformation. These reflections raise pressing questions about how corpus linguistics will develop in a world defined by rapid technological progress and changing societal priorities. To test the suppositions and reflections addressed in this contribution, the study examines a corpus comprising scholarly papers from scientific journals and a collection of AI-related articles taken from the media. This dual corpus enables a comparative analysis of how AI-driven corpus linguistics is represented in each domain, and thereby of how the integration of AI is transforming the field and what methodological, theoretical, and socio-political implications follow from this shift. The methodological framework combines quantitative corpus analysis with qualitative discourse analysis: collocation and keyword frequency retrieval are applied to identify prevalent themes. As expected, academic literature emphasizes methodological advancements and data-driven rigor, while media discourse highlights ethical concerns and societal implications. These findings support the overview and contribute to understanding how AI is shaping both the practice and perception of corpus linguistics in contemporary society.






How to Cite

Incelli, E. (2025). Exploring the Future of Corpus Linguistics: Innovations in AI and Social Impact. International Journal of Mass Communication, 3, 1–10. https://doi.org/10.6000/2818-3401.2025.03.01


