Asking questions with web archives – introductory notebooks for historians

Project lead:
Andrew Jackson, British Library

Project lead & developer:
Tim Sherratt, University of Canberra

Project partners:
National Library of Australia & National Library of New Zealand

Funding:
3,500 USD

Brief description of the project
Goals, outcomes, and deliverables
How the project furthers the IIPC strategic plan
Detailed description of the project
Project schedule of completion
Project outcomes
Final report

Resources:

GitHub repository
GLAM Workbench (Web Archives)
Jupyter Notebooks
Zenodo

Brief description of the project

This project aims to create a set of Jupyter notebooks that will demonstrate how specific historical research questions can be explored by analysing data from web archives. The notebooks will be targeted at researchers who have limited understanding of, or interest in, the technology of web archives, but want to do more than simply browse snapshots.

To avoid overwhelming researchers with the scale and scope of web archives, the notebooks created for this project will work with data from IIPC members available through Wayback, Memento, and CDX APIs. They will introduce tools and technologies gradually – building the understanding and confidence of researchers. By using case studies and questions inspired by the collections of project partners, they will highlight both the research potential of web archives and the value of IIPC cooperation.

These will not just be another set of tutorials. By using Jupyter notebooks, the project will provide live code and practical examples that yield immediate research benefits, while also bringing together distributed documentation in a form that can be understood by researchers with limited digital skills. While the project is deliberately focused on helping historians understand how their research might be enriched by web archives, the information and examples provided will be useful to any researcher seeking to develop their knowledge and skills.

Goals, outcomes, and deliverables

The project will create between 5 and 10 Jupyter notebooks. The notebooks will be CC-BY licensed to encourage reuse and adaptation, and stored in a public GitHub repository. The repository will include a ‘requirements.txt’ file to document the required software environment, and to make it easy for the notebooks to be run live on Binder without any need for researchers to install software on their own systems.

At least 5 of the notebooks will provide detailed tutorials, exploring particular historical questions by introducing relevant sources of data, tools, and technologies. Any further notebooks will complement the tutorials with extra examples, quick hacks, or useful standalone tools.

The notebooks will be featured in a new section of the GLAM Workbench focused on web archives, and promoted through social media.

How the project furthers the IIPC strategic plan

This project will support the work of the IIPC’s Partnerships and Outreach portfolio by raising awareness of Internet preservation issues and initiatives through training activities. In particular, it will contribute directly to the first short-term action listed under this portfolio, to ‘engage and support researcher involvement in, and benefit from, member activities’. This project aims to build understanding of the value of web archives by getting researchers actively working with them. As such, this work supports the goals of the newly-formed IIPC Research Working Group.

The project will also contribute to efforts within the Membership Engagement portfolio by providing another example of how member activities can be documented and shared in reusable forms. The project partners will provide feedback and guidance throughout the project to ensure that the notebooks support their activities in an appropriate form. The notebooks themselves will be made publicly available under an open licence so that any IIPC member can adapt them to meet their own requirements and highlight their own activities.

This project also aligns with the aims of the Tools Development portfolio to develop a ‘rich interoperable tools environment’. The notebooks will make use of existing APIs and share code in an easily reusable form.

Detailed description of the project

Most historians of the 1980s or beyond will need to use web archives. While this statement seems self-evident, it hides a thorny problem. How do we encourage historians to develop the skills they need to make effective use of web archives? Or to put it slightly differently, how do we make web archives as accessible for appraisal and analysis as any other primary source by historians who would never claim to be specialists in digital research? How do we embed the use of web archives within historical practice?

Developing digital skills and confidence amongst historians can be difficult enough without confronting the immense challenges of scale that web archives bring. This project aims to overcome some of these barriers by focusing on the sorts of questions a researcher might ask of web archives, and introducing tools and technologies only as required to explore those questions. Jupyter notebooks will draw data as necessary from the Wayback, Time Travel, and CDX APIs, minimising the need to grapple directly with WARCs and distributed computing environments.
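As a rough illustration of drawing data from a CDX API, the following minimal sketch builds a query and tidies the result. It assumes the Internet Archive's public CDX endpoint and uses only Python's standard library. The function names build_cdx_query and rows_to_records are hypothetical, and the sample rows are invented to show the shape of a JSON response; they are not real capture data.

```python
from urllib.parse import urlencode

# Assumption: the Internet Archive's public CDX endpoint. Other archives
# expose similar services at different addresses.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL that asks for JSON output."""
    query = {"url": url, "output": "json"}
    query.update(params)
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


def rows_to_records(rows):
    """Convert CDX JSON (a header row followed by data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Invented sample in the shape of a CDX JSON response (not real captures).
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,example)/", "20010405120000", "http://example.org/", "text/html", "200", "AAAA", "1200"],
    ["org,example)/", "20030610080000", "http://example.org/", "text/html", "200", "BBBB", "1850"],
]

# "from" is a Python keyword, so date limits are passed by unpacking a dict.
query_url = build_cdx_query("example.org", **{"from": "2000", "to": "2005"})
records = rows_to_records(sample_rows)

print(query_url)
for record in records:
    print(record["timestamp"], record["statuscode"], record["original"])
```

In a live notebook the query URL would be passed to an HTTP client (for example the third-party requests library, which is an assumption here) and the decoded JSON handed to rows_to_records in place of the sample rows.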

For example, a notebook might take familiar questions relating to change over time and start exploring these within the context of a single organisation’s home page. How can you identify spans and shifts? A next step might involve assembling a set of domains to compare their trajectories. Where particular changes are observed, the notebooks can show how to drill down and assemble a set of captures for more detailed analysis. Methods for comparing, categorising, and extracting features from archived web pages can then be explored.
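One possible way to look for spans and shifts in a single page is sketched below. It is an assumption about how such a notebook might work, not a description of the project's actual method. It treats the CDX digest field (a hash of the captured content) as a rough signal of change: consecutive captures with the same digest form a span, and a shift falls between one span and the next. The capture data is invented, and find_spans and wayback_url are hypothetical helper names. The URL pattern is the one used by the Internet Archive's Wayback Machine.

```python
from itertools import groupby

# Invented captures as (timestamp, digest) pairs, already sorted by time.
captures = [
    ("20010405120000", "AAAA"),
    ("20010901093000", "AAAA"),
    ("20030610080000", "BBBB"),
    ("20040115170000", "BBBB"),
    ("20040720110000", "CCCC"),
]


def find_spans(captures):
    """Group consecutive captures that share a digest.

    Returns a list of (digest, first_timestamp, last_timestamp, count).
    """
    spans = []
    for digest, group in groupby(captures, key=lambda capture: capture[1]):
        timestamps = [timestamp for timestamp, _ in group]
        spans.append((digest, timestamps[0], timestamps[-1], len(timestamps)))
    return spans


def wayback_url(timestamp, original):
    """Build a Wayback Machine URL for one capture of a page."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


spans = find_spans(captures)

# A shift lies somewhere between the last capture of one span
# and the first capture of the next.
shifts = [(earlier[2], later[1]) for earlier, later in zip(spans, spans[1:])]

for before, after in shifts:
    print("Changed between", before, "and", after)
    print("  compare:", wayback_url(before, "http://example.org/"))
    print("     with:", wayback_url(after, "http://example.org/"))
```

A differing digest only shows that the captured content is not byte-for-byte identical, so a trivial edit counts as a shift here. The captures on either side of each shift are the natural candidates for closer comparison.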

While pursuing these sorts of questions the notebooks will also encourage researchers to reflect on the nature of the archive itself. What has been captured and why? What might be missing? As with any historical source, researchers need to ask questions about the context and creation of the archive if it is to be embedded within their methodologies.
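A simple starting point for asking what might be missing is to count captures per year and note the years with none. The sketch below assumes 14-digit capture timestamps of the kind returned by CDX APIs. The timestamps are invented and captures_per_year is a hypothetical helper name.

```python
from collections import Counter

# Invented capture timestamps in YYYYMMDDhhmmss form.
timestamps = [
    "20010405120000",
    "20010901093000",
    "20030610080000",
    "20040115170000",
    "20040720110000",
]


def captures_per_year(timestamps):
    """Count captures for every year between the first and last capture."""
    if not timestamps:
        return {}
    counts = Counter(int(timestamp[:4]) for timestamp in timestamps)
    first, last = min(counts), max(counts)
    # Include years with no captures so that gaps are visible.
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


per_year = captures_per_year(timestamps)
missing_years = [year for year, count in per_year.items() if count == 0]

for year, count in per_year.items():
    print(year, count)
print("Years without captures:", missing_years)
```

A year without captures does not show that the page was offline or unchanged, only that the archive holds nothing for it, which is exactly the kind of gap a researcher needs to account for.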

Jupyter notebooks provide an ideal platform for bringing together these elements of exploration, learning, and critical analysis. By combining live code, text, images and more into a ‘computational narrative’, Jupyter notebooks blur the line between tool and tutorial – they enable researchers to undertake real research tasks while learning about the underpinning data and technology. Jupyter notebooks are widely used in data science, and their use in the digital humanities is growing.

The development of the notebooks will be undertaken by Dr Tim Sherratt, who has assembled a large and growing collection of Jupyter notebooks for use by humanities researchers in the GLAM Workbench. The GLAM Workbench provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. For example, there are already notebooks to help researchers access and analyse data from the National Libraries of Australia and New Zealand via the Trove and DigitalNZ APIs.

A Slack instance (or similar) will be created to facilitate communication between project participants. Project partner institutions will provide guidance on data sources and technologies as required. Draft versions of the notebooks will be made available for their evaluation and feedback. Once finalised, the notebooks will be made available through a public GitHub repository and a new section in the GLAM Workbench. The support of the IIPC and the contributions of the project partners will be fully acknowledged in both the GLAM Workbench and the repository README file.

This is a small, targeted project with a clear set of deliverables that makes use of existing APIs. No obstacles are envisaged. The equivalent of 10 days’ work has been budgeted for the development of the notebooks. This will be spread over a period of three months to allow for feedback from project partners. It is expected that the project will be completed by July 2020.

Project schedule of completion

9 March 2020 – Project starts. Slack instance created for communication. Initial ideas for notebook themes and topics shared.

1 April 2020 – Development of the notebooks starts. GitHub repository created for sharing work in progress. Partners will provide feedback.

30 April 2020 – Finalised notebooks shared through GitHub repository and GLAM Workbench.

Project outcomes

Jupyter notebooks for web archives were presented at the IIPC RSS webinars on 5 and 6 August 2020.