IIPC RSS Webinar: PANDORÆ and Hyphe: Open-Source Tools and Methodologies for Researching Web Archives
The IIPC Research Speaker Series (RSS) focuses on the research use of web archives and features presentations of use cases, collaborative projects and new tools for researchers. This webinar will showcase two open-source tools designed to assist with the building and in-depth exploration of research corpora: PANDORÆ and the Hyphe crawler. A Q&A will follow the presentations.
PANDORÆ: A Methodology for Systematic Exploration of Full-Text Indexed Web Archives
Qualitative fieldwork on web archives can easily be daunting due to the profusion of available data. How can researchers make sure their fieldwork reaches a satisfactory “saturation level” at which they can isolate the signal from the noise? PANDORÆ is free and open-source software, designed by and for social scientists, that harvests documents from established sources and provides algorithms to normalize, curate, and explore the resulting corpora.
PANDORÆ has two pipelines that can work independently of each other: one to retrieve data and constitute corpora, and another to explore document collections. The first pipeline connects to the Solr endpoints of full-text indexed web archives and builds a corpus by retrieving document metadata in a way consistent with the applicable legal framework. It then normalizes the documents following the CSL-JSON standard and uploads them to a user-managed Zotero library, which serves as the database, search, and curation software. The second pipeline retrieves curated collections from Zotero and enables their exploration through data-visualization heuristics. These explorations rely on the document metadata and, if and only if the software is run within an authorized network, on in-document data as well (such as full-text search within a document and navigation among its available captures).
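The normalization step of the first pipeline can be illustrated with a minimal Python sketch. It is not PANDORÆ's actual code: the Solr field names (`id`, `title`, `url`, `crawl_date`) and the function name `solr_doc_to_csl` are assumptions made for illustration, and each archive's index may expose different fields. Only the output keys (`type`, `title`, `URL`, `accessed` with `date-parts`) come from the CSL-JSON schema.

```python
from datetime import datetime


def solr_doc_to_csl(doc: dict) -> dict:
    """Map one (hypothetical) Solr result document to a CSL-JSON item.

    The input field names are assumptions; adapt them to the index schema
    of the web archive actually being queried.
    """
    # Assumed capture timestamp format, e.g. "2015-03-07T12:30:00Z"
    captured = datetime.strptime(doc["crawl_date"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": doc["id"],
        "type": "webpage",
        "title": doc.get("title", ""),
        "URL": doc["url"],
        # The capture date is stored here as the CSL "accessed" date
        "accessed": {
            "date-parts": [[captured.year, captured.month, captured.day]]
        },
    }


# Made-up metadata record standing in for one Solr search hit
example_doc = {
    "id": "capture-0001",
    "title": "Example page",
    "url": "http://example.org/page",
    "crawl_date": "2015-03-07T12:30:00Z",
}

item = solr_doc_to_csl(example_doc)
print(item)
```

Items in this shape are metadata only, which matches the description above: the page content itself stays in the archive, while the normalized record can be curated in a reference manager.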
This software-led methodology also comes with two online interactive fieldwork notebooks that guide users in exploring their topic. Taken together, the software and the notebooks allow for an exhaustive exploration of a given topic over a preselected period in a full-text indexed web archive collection.
The presentation will detail PANDORÆ's capabilities on web archives through the example of research currently underway at the Bibliothèque Nationale de France (BnF).
SPEAKER:
Guillaume Levrier is a political scientist working on biotechnology. He is also the lead developer of PANDORÆ, free and open-source software for scientific research. He is currently an associate researcher at Sciences Po (CEVIPOF) and the Bibliothèque Nationale de France (BnF).
Building Web Corpora from the Live and Archived Web Using the Hyphe Crawler
Developed by Sciences Po médialab as open-source software (https://github.com/medialab/hyphe), Hyphe was designed to provide researchers and students with a research-oriented crawler to build and enrich corpora of websites through a qualitative fieldwork methodology. It provides a method and a tool to build a research corpus from web content (web pages and HTTP links) with an innovative “curation-oriented” step-by-step expansion approach meant to address two of the main problems the social sciences face when working with automated web mining: how to build a theme-focused corpus and how to delineate an actor’s presence on the web.
A step-by-step iterative process supports Hyphe users in dynamically curating and defining “web entities” in a way that is both granular and flexible: an entity can be a single page, a subdomain, a combination of websites, etc. The pages residing under these entities are then crawled in order to extract their outgoing links and part of their textual content. The most cited “web entities” can then be reviewed manually in order to enrich the corpus, which can finally be visualized as a network and exported for cleaning and analysis in other tools such as Gephi.
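The idea of grouping pages into “web entities” and counting citations between them can be sketched in a few lines of Python. This is a simplified illustration, not Hyphe's implementation: entities are modelled here as plain URL prefixes matched by longest prefix, and every name in the block (`ENTITIES`, `entity_of`, `entity_links`, the sample URLs) is hypothetical.

```python
from collections import Counter

# Hypothetical web entities, each defined by one or more URL prefixes
# (scheme and "www." removed). An entity may combine several prefixes.
ENTITIES = {
    "Example Org": ["example.org/"],
    "Example Blog": ["blog.example.org/", "example.org/blog/"],
    "Other Site": ["other.net/"],
}


def normalize(url: str) -> str:
    """Strip the scheme and a leading 'www.' so prefixes can be compared."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
    if url.startswith("www."):
        url = url[4:]
    return url


def entity_of(url: str):
    """Return the entity whose longest prefix matches the URL, or None."""
    page = normalize(url)
    best_name, best_len = None, -1
    for name, prefixes in ENTITIES.items():
        for prefix in prefixes:
            if page.startswith(prefix) and len(prefix) > best_len:
                best_name, best_len = name, len(prefix)
    return best_name


def entity_links(page_links):
    """Aggregate page-to-page links into weighted entity-to-entity edges."""
    edges = Counter()
    for source, target in page_links:
        s, t = entity_of(source), entity_of(target)
        if s and t and s != t:  # ignore unknown pages and internal links
            edges[(s, t)] += 1
    return edges


# Made-up crawl output: (source page, linked page)
PAGE_LINKS = [
    ("https://example.org/about", "https://other.net/page"),
    ("https://example.org/blog/post-1", "https://other.net/"),
    ("https://blog.example.org/post-2", "https://example.org/about"),
    ("https://other.net/page", "https://www.example.org/blog/post-1"),
]

edges = entity_links(PAGE_LINKS)

# Count incoming links per entity to find the "most cited" ones
citations = Counter()
for (_, target), weight in edges.items():
    citations[target] += weight

print(citations.most_common())
```

Longest-prefix matching is what lets a narrower entity (a blog section or subdomain) be carved out of a broader one without redefining the latter. The resulting weighted edge list is the kind of structure that can be exported to a network analysis tool.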
In partnership with the two French web archiving teams, Hyphe was recently adapted to crawl web archives from the National Library of France (BnF) and the National Audiovisual Institute (INA) as well as from archive.org, empowering users to build web corpora from the past or to complete web corpora from the live web with archives of websites that have disappeared.
SPEAKER:
Trained as a multidisciplinary engineer, Benjamin Ooghe-Tabanou specialises in applying computer science to scientific research. After multiple experiences within the astrophysics field at Johns Hopkins University in the USA and École Normale Supérieure in France, Benjamin entered the social sciences field first as an OpenData and Parliament transparency activist. He joined Sciences Po’s médialab as a research engineer in 2012, focused on web mining and developing open source tools for social sciences, and he has led médialab’s research engineers team since 2020.