IIPC RSS Webinar: Mining web archives for linguistic analysis
The IIPC Research Speaker Series (RSS) focuses on the research use of web archives and features presentations of use cases, collaborative projects and new tools for researchers. This webinar introduces two projects that use data from the UK Web Archive and the web archive collections of the BnF. The presentations will be followed by a Q&A session.
DETECTING SEMANTIC CHANGE IN THE UK WEB ARCHIVE
Barbara McGillivray, Research Fellow at the Alan Turing Institute and the University of Cambridge
Pierpaolo Basile, Assistant Professor at the Department of Computer Science, University of Bari Aldo Moro, Italy
The project uses data from the UK Web Archive JISC dataset (1996-2013) to develop a system for detecting semantic change of words in the English language. The system is based on distributional semantic models and Temporal Random Indexing, a simple and effective way of building geometrical spaces of concepts from large textual datasets.
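As a rough illustration of the idea behind Temporal Random Indexing, the sketch below assigns each word a fixed sparse random vector, builds one word space per time period by summing the random vectors of neighbouring words, and compares a word's vectors across periods with cosine similarity. It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the dimensionality, the window size and the toy sentences are all invented for the example.

```python
import math
import random
from collections import defaultdict

DIM = 200      # size of the reduced space (illustrative value, not from the project)
NONZERO = 10   # number of +1/-1 entries per random vector (illustrative value)


def random_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector, seeded by the word itself so that the
    same word gets the same vector in every time period."""
    rng = random.Random(word)
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def build_space(sentences, window=2):
    """Word space for one period: each word's vector is the sum of the
    random vectors of the words found within `window` positions of it."""
    space = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx = random_vector(tokens[j])
                    vec = space[target]
                    for k in range(DIM):
                        vec[k] += ctx[k]
    return space


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sentences invented for the example (not taken from the UK Web Archive).
corpus_1996 = [["the", "mouse", "ate", "cheese"], ["a", "mouse", "ran", "away"]]
corpus_2013 = [["click", "the", "mouse", "button"], ["plug", "the", "usb", "mouse"]]

space_1996 = build_space(corpus_1996)
space_2013 = build_space(corpus_2013)

# A low similarity between periods hints that the word's contexts have shifted.
print(cosine(space_1996["mouse"], space_2013["mouse"]))
```

Because the random vectors are shared across periods, the per-period spaces live in the same coordinate system and can be compared directly, without an alignment step.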
NÉONAUTE: MINING WEB ARCHIVES FOR LINGUISTIC ANALYSIS
Emmanuel Cartier and Loïc Galand, Laboratoire d’Informatique de Paris Nord (LIPN)
Néonaute is a project that studies the use of neologisms in French using the web archive collections of the BnF. Initially a one-year project funded by the French Ministry of Culture, it uses a corpus drawn from the daily crawl of around one hundred news sites carried out by the BnF since December 2010 (representing 900 million files and 11 TB of data). Néonaute builds on the existing projects Neoveille and Logoscope, which seek to detect and track the life-cycle of neologisms.
Starting from the full-text indexing of the news collection carried out by the BnF, additional analyses and processing steps are applied to identify the documents relevant to the project (press articles), to retrieve their textual content (boilerplate removal), and to enrich the indexes with linguistic information (morphosyntactic analysis) and extracted metadata (named entities, domain assignment). The presentation will discuss the technical challenges and the solutions adopted.
Néonaute includes three use cases:
- multidimensional analysis of the life-cycle of previously identified neologisms;
- comparative use of terms recommended by the DGLFLF (General Delegation for the French language and the languages of France, in charge of linguistic policy in France), versus terms already in use (especially Anglicisms);
- use of terms in feminine gender over the period.
The search engine interface is complemented by an interactive visualization module that allows users to explore the lifecycle of terms over the period according to various parameters (themes of articles, journals, named entities involved, etc.).
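To give a sense of the kind of data such a lifecycle view might plot, the sketch below computes the relative frequency of a term per year, optionally restricted to one article theme. It is only an illustrative sketch: the document structure (the `year`, `theme` and `tokens` keys), the `lifecycle` function and the toy documents are assumptions made for the example, not Néonaute's actual index schema or interface.

```python
from collections import Counter


def lifecycle(documents, term, theme=None):
    """Relative frequency of `term` per year.

    Each document is a dict with hypothetical keys 'year', 'theme' and
    'tokens'. If `theme` is given, only documents with that theme count.
    """
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # total number of tokens per year
    for doc in documents:
        if theme is not None and doc["theme"] != theme:
            continue
        totals[doc["year"]] += len(doc["tokens"])
        hits[doc["year"]] += doc["tokens"].count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Toy documents invented for the example; "neo" stands in for a neologism.
docs = [
    {"year": 2011, "theme": "tech", "tokens": ["a", "b", "c", "d"]},
    {"year": 2012, "theme": "tech", "tokens": ["neo", "b", "c", "d"]},
    {"year": 2012, "theme": "sport", "tokens": ["a", "b", "c", "d"]},
    {"year": 2013, "theme": "sport", "tokens": ["neo", "neo", "c", "d"]},
]

print(lifecycle(docs, "neo"))                # {2011: 0.0, 2012: 0.125, 2013: 0.5}
print(lifecycle(docs, "neo", theme="tech"))  # {2011: 0.0, 2012: 0.25}
```

Normalizing by the number of tokens per year keeps the curve comparable across years in which the crawl collected different amounts of text.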