Reflections on Distributions: Human, Machines, Languages

Geng, Mingmeng

Humans create and continuously improve machines, and machines in turn influence human life. Language, one of the key differences between humans and animals, is also one of the main ways humans and machines communicate. Many behaviors of machines differ markedly from those of humans, and I analyze these differences from the perspective of "distribution". Survey data is a central focus of my research. With the help of tree-based models, including Random Forests, XGBoost, and LightGBM, I sought to identify potentially more efficient combinations of variables for survey sample selection and post-stratification. Large language models (LLMs) also play an important role in my projects. Using European Social Survey (ESS) data as a benchmark, I measured biases and stereotypes in several LLMs across different subjective questions and their corresponding demographic variables, as well as the influence of different prompts. The widespread use of LLMs is affecting human society. I proposed a model based on word frequency and simulation to estimate the impact of LLMs on academic writing and presentations. Across more than a million papers and more than 1,000 conference presentations, the estimated impact of LLMs has increased over time. I have also observed the co-evolution of humans and LLMs. The lens of "distribution" has proven beneficial for these tasks and deserves further attention and reflection. I also analyze and discuss other impacts of artificial intelligence (AI) on human society, including ethical issues, model collapse, and paradigm shifts.



Degree programme: Theory and Numerical Simulation of Condensed Matter
Publication date: 3 February 2025
Language: English
Supervisor, advisor, or tutor: Trotta, Roberto; Rozza, Gianluigi
Publisher: SISSA
Publisher city: Trieste
Collection: Scuola Internazionale Superiore di Studi Avanzati di Trieste

Files in this item:

File	Size	Format
Mingmeng Geng PhD Thesis.pdf (open access; license: all rights reserved)	27.48 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/190121

The NBN code of this thesis is URN:NBN:IT:SISSA-190121