Casting one's net wide in the web: BootCaT as a tool for comparable and diachronic specialized-corpora collection

2020

Abstract

The BootCaT software has been under development since 2004 to help linguists quickly build disposable corpora for translation, terminological databases and machine-learning tasks through the automatic collection of web pages based on user-defined keywords. The present work uses the software to create comparable and diachronic web corpora in three languages (English, German and Italian). It reports how the standard BootCaT procedure was adapted and supplemented for this purpose and discusses the quality and usefulness of the resulting corpora. In light of these results, it recommends adjustments for future research in this direction and outlines desirable developments in linguistics toward the automation of text collection and analysis.
it
Dipartimento di Studi Linguistici e Culturali
Università degli Studi di Modena e Reggio Emilia
Files in this item:
File: Francesco_Luccarda_846252_casting_ones_net_wide_in_the_web____bootcat_as_a_tool_for_comparable_and_diachronic_specialized_corpora_collection.pdf

not available

Type: Other attached material
License: All rights reserved
Size: 3.44 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/302017
The NBN code of this thesis is URN:NBN:IT:UNIMORE-302017