Browser-based crawling system for all

Project leads:
Anders Klindt Myrvoll, Royal Danish Library
Andrew Jackson, British Library
Ben O’Brien, National Library of New Zealand
Lauren Ko, University of North Texas


Project lead & developer:
Ilya Kreymer, Webrecorder.net

Funding:
30,000 USD (2022)
30,000 USD (contingent upon 2023 project renewal)

Resources:

GitHub repository
Website
Updates (IIPC blog)


Brief description of the project

The goal of this proposal is to support the creation of a flexible, browser-based, high-fidelity crawling system driven by a full-featured user interface and accessible to curators and web archivists at any institution. The crawling system will focus on enabling the capture of complex, dynamic websites which cannot be adequately captured with existing crawling tools such as Heritrix. The system will be built to be ‘cloud native’ and support running in the cloud as well as in a local institutional environment. The core crawling engine will extend the Webrecorder Browsertrix Crawler system (https://github.com/webrecorder/browsertrix-crawler) for browser-based crawling, and the system will be built in a modular way to allow for future extensibility and customization.

Webrecorder Software will lead the development of the software system as part of its efforts to create an open source, high-fidelity web archiving system. The IIPC partners – Netarkivet (the Danish Web Archive at the Royal Danish Library), the UK Web Archive, the University of North Texas Libraries and the National Library of New Zealand – will help establish product design requirements, provide concrete use cases, help guide the development, and contribute to testing and local deployment of the crawling system. This group of partners will help ensure that the system can meet the varying needs of IIPC members, both libraries crawling at a national scale and smaller institutions. We hope this project will push the limits of browser-based crawling and provide a more concrete understanding of what is feasible and where the limits may lie with a browser-based crawling approach.

Goals, outcomes and deliverables

The deliverables will be an open-source, high-fidelity browser-based crawling system featuring:

  • A well-defined REST API (OpenAPI spec) for crawl management
  • A well-defined JSON-based crawl specification for defining crawl parameters and schedules (a hypothetical sketch follows this list)
  • An intuitive user interface for defining crawls (converted into the JSON-based spec)
  • An intuitive user interface for starting and monitoring crawls, and observing crawl logs/reports
  • Instantaneous replay of crawled content during or after a crawl
  • A UI-based workflow for logging into websites via a remote browser, saving a browser profile, and launching crawls with pre-existing profiles
  • An automated QA system (at least beta quality) for evaluating the completeness of a crawl according to one metric
  • A manual patching mechanism for interactive patching via a remote browser
  • Support for deployment under Docker and Kubernetes
  • Standardized crawl artifacts (definitions, logs, screenshots)
  • Documentation for deploying the system
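
To make the first two deliverables more concrete, the sketch below shows how a JSON-based crawl specification might be defined and submitted to the planned crawl-management REST API. Every field name and endpoint path is a hypothetical placeholder, since the actual spec and API are themselves deliverables of this project; the example uses Python purely for illustration.

    import json
    import urllib.request

    # Hypothetical crawl specification: every field name below is a placeholder,
    # since the JSON-based spec itself is a deliverable of this project.
    crawl_spec = {
        "name": "example-news-site",
        "seeds": ["https://example.com/news/"],
        "scopeType": "prefix",                         # stay under the seed path
        "browserWorkers": 4,                           # parallel browser workers
        "schedule": "0 3 * * 1",                       # cron-style weekly schedule
        "profile": "profiles/news-site-login.tar.gz",  # optional saved browser profile
    }

    # Submit the spec to a hypothetical endpoint of the planned crawl-management REST API.
    req = urllib.request.Request(
        "https://crawling.example.org/api/crawlconfigs",
        data=json.dumps(crawl_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # e.g. {"id": "...", "state": "scheduled"}

The intent is that the same specification would be produced by the user interface, so crawls defined in the UI and crawls submitted programmatically remain interchangeable.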

The development will be led by Webrecorder and carried out by a small team: the lead developer (Ilya Kreymer), a part-time senior frontend developer, a part-time UX/product designer and a project manager. The UX/product designer will help ensure that the planned features make sense within the overall product and that the interface is clean and easy to use; the frontend developer will be responsible for implementing the UI. The project manager will ensure that the project progresses on schedule and to spec.

How the project furthers the IIPC strategic plan

This project closely aligns with many of the goals in the IIPC strategic plan. A key goal of this work is to enhance the capability to archive complex and difficult web content in a consistent and easy-to-use way. Another goal is to work collaboratively to create a modular open-source system which provides a common feature set for browser-based crawling, and which can be used not only by the IIPC partners on this proposal, but by any IIPC member and the web archiving community at large. The partners on this project represent national libraries as well as smaller archiving institutions, with the goal of ensuring, through use-case-driven development, that this work can meet the varied needs of different web archiving institutions. In addition to creating a working product, we hope the approaches taken in this project will further the shared understanding of browser-based crawling and its limits and possibilities, as well as fostering further work in this area.

Detailed description of the project

The growing complexity of the web makes browser-based crawling essential for institutions to be able to capture many modern websites, including most social media sites. To date, no integrated, open source solution exists to run crawls that are comprehensive, scheduled, and browser-based (to enable automated capture of sites that cannot be fully captured with Heritrix and other traditional approaches). During several of the IIPC open source community calls, the participants reached a consensus that browser-based crawling is a top priority, or even “needed yesterday”.

The Webrecorder project has specialized in developing high-fidelity capture tools, focusing on interactive browser-based capture. Webrecorder has also built the Browsertrix Crawler system, which currently provides a low-level browser-based crawler inside a single Docker container.

Browsertrix Crawler can now be launched via the command line to run a single crawl at a time with a variety of low-level configuration options, including crawl scope, the number of browser workers, and optional full-text extraction for search.
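
For context, a single Browsertrix Crawler run today looks roughly like the sketch below (wrapped in Python purely for illustration). The flags shown exist in current browsertrix-crawler releases, though exact names and defaults may differ between versions.

    import os
    import subprocess

    # Minimal sketch of launching a single Browsertrix Crawler run via Docker.
    # Flag names reflect browsertrix-crawler at the time of writing and may change.
    cwd = os.getcwd()
    subprocess.run(
        [
            "docker", "run",
            "-v", f"{cwd}/crawls:/crawls/",        # WARC/WACZ output lands in ./crawls
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", "https://example.com/",
            "--scopeType", "prefix",               # stay within the seed's path prefix
            "--workers", "4",                      # number of parallel browser workers
            "--text",                              # extract page text for search
            "--generateWACZ",                      # package the crawl as a WACZ file
            "--collection", "example-crawl",
        ],
        check=True,
    )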

In this project, the goal will be to build on the existing Browsertrix Crawler component to provide a full-fledged user-friendly system with internationalization support, accessible to institutional curators and the web archiving community at large.

The system will be used to support the institutions in running high-fidelity crawls for a subset of seeds which currently do not work well with traditional crawling methods. The goal of the system is not to replace Heritrix, but to augment existing approaches with a new, targeted approach and explore the possibilities of browser-based crawling.

To provide maximum flexibility, the system will be ‘cloud-native’ and run in Kubernetes, as well as in Docker (via Docker Compose or Swarm), to allow for comparable local or on-site deployment.

The development of the project is divided into four quarterly phases (see below).

The IIPC partners will be active partners in testing the system from the early stages to ensure that it meets their goals and expectations. Webrecorder will also use this as a key component for any crawling projects that it undertakes, and will seek to find supplementary funding to support development, such as through additional contract work. During the lifetime of this project, the institutional project partners will have the opportunity to experiment with and evaluate this new approach to crawling, and decide whether and how they wish to support the project outputs in the future (e.g. through direct funding or future IIPC-funded work).

We expect the impact on the field to be significant, as this crawling system is designed to meet an urgent need within the web archiving community. We hope that the testing and deployment by the IIPC partners on this project will help pave the way for broader adoption by a greater number of IIPC members and by anyone in the web archiving community interested in setting up a high-fidelity crawling system.

Risks and mitigation

We understand that this is an ambitious project and carries some risks, especially around whether it proves possible to deliver in full and on time. We will address these risks through transparent development and project management via GitHub, open communication about progress, and by using the four project phases to manage delivery and payment.

Webrecorder will hold a quarterly check-in with the IIPC partners, in the middle and at the end of each package. At the end of the first quarter of a package, progress towards the milestones will be evaluated and any necessary changes made to ensure completion by the end of the second quarter.

This project aligns with Webrecorder’s broader goal of creating an open source, high-fidelity crawling service. Webrecorder anticipates that building and operating such a service will become a key part of its sustainability as an open source project, and that additional funding around automated high-fidelity web archiving could further support this development. At this time, Webrecorder is a subcontractor on two related grants, one from IMLS and one from the Mellon Foundation, both related to high-fidelity web archiving with Browsertrix Crawler.

All code deliverables will be licensed under the GNU Affero General Public License v3.0 (AGPL).

Packages 1 & 2 (2022)

Q1 & Q2

The goal will be to create a well-defined crawling API, and ensure that the core scheduling system is operational, can run crawls, deposit WARC files, and generate basic logs. Webrecorder will establish a cloud instance of the system for testing by all of the project partners, and begin working with each institution to support local deployment in their environment. A basic UI, perhaps mostly aimed at advanced users and developers, will begin to take shape.
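
As a sketch of how a project partner might exercise that crawling API during this testing phase, the snippet below polls a hypothetical status endpoint and lists the WARC files a finished crawl is expected to have deposited. The endpoint paths and response fields are placeholders, not a finalized API.

    import json
    import time
    import urllib.request

    API = "https://crawling.example.org/api"   # hypothetical base URL for the planned API
    crawl_id = "abc123"                        # placeholder id returned when the crawl was created

    # Poll a hypothetical status endpoint until the crawl finishes.
    while True:
        with urllib.request.urlopen(f"{API}/crawls/{crawl_id}") as resp:
            crawl = json.load(resp)
        if crawl["state"] in ("complete", "failed"):
            break
        time.sleep(60)

    # List the WARC files and logs the finished crawl is expected to have deposited.
    for resource in crawl.get("resources", []):
        print(resource["name"], resource["size"])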

Milestones

  • Initial design requirements are established
  • Initial crawling REST API operational
  • Crawling system deployable in Docker and Kubernetes
  • Webrecorder operates a hosted version of the system for testing

Q3 & Q4

The core user interface will be ready for testing by non-developers, allowing for crawls to be created and run on the infrastructure by end users. The user interface will receive an initial UX pass by this time.

Initial (beta) support for logged-in crawling will also be implemented, giving users the ability to interactively log in to certain sites (especially social media), save browser profiles, and use those profiles during a crawl.

Challenges around logged-in crawling (e.g. expiring credentials, IP-based restrictions) will also be identified. Initial deployment documentation focused on the IIPC community will be created and made usable, incorporating feedback from the IIPC partners.
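
The underlying mechanism already exists in command-line form in Browsertrix Crawler; the rough sketch below (again wrapped in Python, for illustration) assumes the create-login-profile command and --profile option as documented at the time of writing, with interactive port mappings and default paths omitted since they vary by version. The UI-based workflow planned here would wrap these steps for curators.

    import os
    import subprocess

    cwd = os.getcwd()

    # Step 1: interactively log in to a site and save the resulting browser profile.
    # (Port mappings for the interactive browser session are omitted; they vary by version.)
    subprocess.run(
        [
            "docker", "run", "-it",
            "-v", f"{cwd}/crawls/profiles:/crawls/profiles",
            "webrecorder/browsertrix-crawler", "create-login-profile",
            "--url", "https://example.com/login",
        ],
        check=True,
    )

    # Step 2: run a crawl with the saved profile so pages are captured while logged in.
    subprocess.run(
        [
            "docker", "run",
            "-v", f"{cwd}/crawls:/crawls",
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", "https://example.com/",
            "--profile", "/crawls/profiles/profile.tar.gz",
            "--generateWACZ",
        ],
        check=True,
    )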

Milestones

  • Webrecorder incorporates feedback from IIPC partners on initial deployment from Q1 & Q2.
  • Core UI operations implemented: logging in, creating crawl definitions, scheduling crawls, stopping crawls, viewing completed crawls, browsing replay.
  • Logged-in workflow implemented (users log in to sites, then crawl with logged-in browser profile) at least in a beta stage.
  • Initial deployment documentation for the IIPC community is created, with feedback from IIPC partners.

Packages 3 & 4 (2023)

Q5 & Q6

The focus will be on implementing manual QA features. Support for manual QA via interactive user-driven patching, either using a remote browser in capture mode or Webrecorder’s existing extension, will be implemented. By this phase, the requirements of deploying the system at each institution will also be clearly understood, and any core changes necessary to support local deployments will be implemented.

Any final improvements to the crawling API, as required at different institutions, will also be made.

Milestones

  • Webrecorder incorporates feedback from IIPC partners on initial deployment from Q3 & Q4.
  • Manual QA via patching is supported, using either a remote browser and/or the extension.
  • Backend API finalized, adjustments made to meet requirements from IIPC partners.
  • IIPC partners are able to deploy the system locally, deployment documentation updated.

Q7 & Q8

The focus will be on exploring various approaches to automated QA to reduce the amount of manual QA. We will explore several approaches, such as re-running a crawl against the archived data (the replay) and comparing results, for example missing URLs, JavaScript errors, and screenshots. This exploration will result in alpha/beta automated QA features and will help evaluate the feasibility of further work around automated QA. Webrecorder can operate a production-ready, cloud version of the system for those who do not wish to run their own.
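
As a rough illustration of one such comparison, the sketch below assumes that each run produces a simple JSONL log of captured page URLs (a hypothetical format, since the crawl artifacts are defined elsewhere in this project) and reports the pages missing when the crawl is re-run against the replay.

    import json

    def load_urls(path):
        """Load the set of page URLs from a JSONL log with one {"url": ...} record per line."""
        with open(path) as f:
            return {json.loads(line)["url"] for line in f if line.strip()}

    # Hypothetical artifacts: page logs from the original crawl and from a re-run against the replay.
    crawled = load_urls("crawl/pages.jsonl")
    replayed = load_urls("replay/pages.jsonl")

    missing = crawled - replayed
    print(f"{len(missing)} of {len(crawled)} pages missing on replay")
    for url in sorted(missing):
        print("  MISSING:", url)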

The capabilities of the system will also be evaluated to better understand how well browser-based crawling can scale (or where it cannot), and what limitations may exist. The documentation for running the system in different environments (Kubernetes and Docker) will also be created.

Milestones

  • Webrecorder incorporates feedback from IIPC partners on initial deployment from Q5 & Q6.
  • Initial exploration/alpha version of automated QA, using one approach, created.
  • Final blog post created documenting the features, limitations, and future plans for browser-based crawling.
  • Final documentation on deployment and capabilities of the system.

Schedule of completion

  • Package 1: by 1 June 2022
  • Package 2: by 1 December 2022
  • Package 3: by 1 June 2023 (contingent upon 2023 project renewal)
  • Package 4: by 1 December 2023 (contingent upon 2023 project renewal)