IIPC RSS webinar: Web archiving social media and news websites
The IIPC Research Speaker Series (RSS) focuses on the research use of web archives and features presentations of use cases, collaborative projects and new tools for researchers. This webinar will introduce four recent projects which focus on different aspects of capturing social media and frequently updated news websites, particularly in the context of rapid response and event-based collections.
- The BESOCIAL Team: BESOCIAL – towards a sustainable social media archiving strategy for Belgium
- Gillian Lee & Ben O’Brien: Archiving Twitter in New Zealand
- Alex Osborne: Chronicrawl for capturing news content
BESOCIAL – towards a sustainable social media archiving strategy for Belgium
KBR, the Royal Library of Belgium, launched the BESOCIAL project in summer 2020. The aim of this two-year project (2020-2022) is to set up a sustainable strategy for archiving and preserving social media in Belgium. Initially, the objective is to archive social media content related to certain Belgian events selected during the project, as well as social media content related to KBR’s newspaper collections. Furthermore, the project will explore how to open up the social media archive for use. These collections will complement the collections of websites archived during the PROMISE project (2017-2019). The BESOCIAL project team is led by KBR and includes CENTAL (Centre for Natural Language Processing) at Université Catholique de Louvain, CRIDS (Research Centre in Information, Law and Society) at Namur University, GhentCDH (Ghent Centre for Digital Humanities), IDLab (Internet Technology & Data Science Lab) and MICT (Research Group for Media, Innovation and Communication Technologies) at Ghent University.
Archiving Twitter in New Zealand – Gillian Lee & Ben O’Brien
The National Library of New Zealand has been collecting tweets via the public Twitter API since late 2016. Our collecting has centred on NZ-based events. On March 15th 2019, we started rapidly collecting tweets in response to the Christchurch mosque attacks. This presentation will cover our rapid response collecting, workflows for post-processing Twitter data, and working with subsequent researcher requests.
Chronicrawl for capturing news content – Alex Osborne
The National Library of Australia has been experimenting with an alternative web crawling model based on scheduling revisits of individual pages rather than recrawling the whole website. Chronicrawl is a prototype browser-based crawler that explores the problem of capturing frequently changing pages, such as the front pages of news websites, while capturing pages that change infrequently, such as individual news stories, at a slower rate. In addition to manual scheduling rules, Chronicrawl has an automatic scheduling mode that makes use of information from sitemaps and also adapts the schedule over time based on its own observations of whether content has changed.
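To make the idea of adapting a revisit schedule from observed changes more concrete, here is a minimal Python sketch. It is not Chronicrawl's actual algorithm, which the abstract does not describe. The `PageSchedule` class, the halve-or-double rule, the SHA-256 comparison of page bodies and the interval bounds are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed bounds on how often a page may be revisited (illustrative values).
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(days=30)


@dataclass
class PageSchedule:
    """Hypothetical per-page revisit schedule, not part of Chronicrawl."""
    url: str
    interval: timedelta = timedelta(days=1)
    last_digest: Optional[str] = None
    next_visit: Optional[datetime] = None

    def record_visit(self, body: bytes, visited_at: datetime) -> bool:
        """Record a capture and adapt the interval; return True if content changed."""
        digest = hashlib.sha256(body).hexdigest()
        # The first visit has nothing to compare against, so it counts as unchanged.
        changed = self.last_digest is not None and digest != self.last_digest
        if self.last_digest is not None:
            if changed:
                # Content changed: revisit sooner, but not below the lower bound.
                self.interval = max(MIN_INTERVAL, self.interval / 2)
            else:
                # Content unchanged: back off, but not above the upper bound.
                self.interval = min(MAX_INTERVAL, self.interval * 2)
        self.last_digest = digest
        self.next_visit = visited_at + self.interval
        return changed
```

Under this rule, a news front page that differs on every visit would drift toward the minimum interval, while an individual story that stops changing would drift toward the maximum. A real crawler would likely compare a normalised form of the page rather than raw bytes, since adverts and timestamps can change on every request.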