Modeling Relational and Contextual Information into Topic Models and their Evaluation

Terragni, Silvia

Textual knowledge is one of the main pillars of our society. Indeed, human knowledge is often passed along using words. Since the invention of writing, humans have narrated and described their existence in words on paper. This body of knowledge amounts to what civilization has collected over more than 5,000 years. Historians and social and political scientists look for ways to better understand this vast amount of collective knowledge, which cannot be explored manually. To this end, researchers in machine learning, statistics, and computational linguistics have developed topic models, a suite of algorithms that aim to annotate large archives of documents with thematic information. These models are popular because they are unsupervised and interpretable. Topic models analyze and summarize the main themes, or topics, of large collections of documents, presenting the information in a compact and understandable form. Most topic models focus only on the words in the documents. However, additional information can be introduced into topic models to improve their performance. In many real-world cases, we rarely have only the texts to analyze. Instead, we also have additional information or metadata related to the documents, e.g., the document's author, the date, hyperlinks to other documents, or a set of hashtags, mentions, or labels. We can use this prior information to help a topic model discover better topics. For example, knowing that a document cites another document increases our confidence that the two documents talk about the same topics. In addition, topic models often ignore word order and contextual information, which makes it difficult to infer high-quality topics. Another problem in the field concerns the hyperparameters used to train topic models. The hyperparameters are often fixed in experimental settings, and few researchers have studied their impact on the results. 
In this thesis, we aim to tackle these problems. We introduce novel families of topic models that incorporate relational and contextual information to obtain better performance. We also explore the issues related to hyperparameter optimization by designing and developing a novel tool that gives researchers better guidelines on how to train a topic model.

Textual knowledge is one of the main pillars of our society. Indeed, human knowledge is often transmitted through words. Since the invention of writing, humans have narrated and described their existence in words on paper. This body of knowledge amounts to what civilization has collected over more than 5,000 years. Historians and social and political scientists look for ways to better understand this vast amount of collective knowledge, which cannot be explored manually. To this end, researchers in machine learning, statistics, and computational linguistics have developed topic models, a suite of algorithms that aim to annotate large archives of documents with thematic information. These models are popular because they are unsupervised and interpretable. Topic models analyze and summarize the main themes, also called topics, of large collections of documents, presenting the information in a compact and understandable form. Most topic models focus only on the words in the documents. However, additional information can be introduced into topic models to improve their performance. In many real-world cases, we rarely have only the texts to analyze. Instead, we have additional information or metadata related to the documents, for example the document's author, the date, hyperlinks to other documents, or a set of hashtags, mentions, or labels. We can use this prior information to help a topic model discover high-quality topics. For example, knowing that a document cites another document increases our confidence that the two documents talk about the same topics. In addition, topic models often ignore word order and contextual information, which makes it difficult to infer coherent and meaningful topics.
Another problem in the field concerns the hyperparameters used during the learning process of topic models. The hyperparameters are often fixed, and few researchers have studied their impact on the results. In this thesis, we aim to tackle these problems. We introduce novel families of topic models to obtain better performance. We also explore the issues related to hyperparameter optimization by designing and developing a novel tool that gives researchers better guidelines on how to use a topic model.