Regex Club Live: Collaborative Exploration of Crawl Logs and Crawler Traps – Session Two

This Regex Club Live event is for IIPC members. The registration link is available in the Members-only Archive. Contact staff[at]netpreserve.org if you don’t have access.

Crawler traps are a familiar problem for the web archiving community: they lead crawls to capture erroneous or useless material, create an unnecessary environmental and financial burden, and can cause in-scope content to be missed when crawls are force-finished. Dealing with crawler traps requires specialist knowledge which, at the UK Government Web Archive, has been built up informally over time and is therefore unevenly distributed across the team. Over the last few years, we have sought to address this with a regular Crawl Log Review workshop (informally known as ‘Regex Club’), where problematic crawls are discussed, their logs analysed, and appropriate reject regex patterns identified and tested. This has become a forum for knowledge sharing and capacity building within our team, and we would now like to share it with the wider web archiving community.
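As a rough illustration of what identifying and testing a reject pattern can involve, here is a minimal Python sketch. It is not the team's actual tooling: the URLs, the pattern, and the helper name `split_by_reject` are all hypothetical, and the pattern targets just one common kind of trap (the same path segment repeated several times in a row).

```python
import re

# Hypothetical reject pattern: a path segment repeated three or more times
# in a row, e.g. /docs/docs/docs. The lookahead ensures the last repeat ends
# at a segment boundary, so /a/a/ab is not treated as a repeat.
REJECT = re.compile(r"(/[^/]+)\1{2,}(?=/|\?|$)")


def split_by_reject(urls, pattern=REJECT):
    """Return (rejected, kept) lists for the given URLs."""
    rejected, kept = [], []
    for url in urls:
        (rejected if pattern.search(url) else kept).append(url)
    return rejected, kept


# Made-up sample URLs standing in for lines taken from a crawl log.
sample_urls = [
    "https://example.gov.uk/news/2024/report.html",
    "https://example.gov.uk/docs/docs/docs/docs/index.html",
    "https://example.gov.uk/a/a/ab/page.html",
]

rejected, kept = split_by_reject(sample_urls)
print("rejected:", rejected)
print("kept:", kept)
```

Checking the kept list matters as much as checking the rejected one, since a pattern that is too broad would exclude in-scope content.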

Following a successful first session in March 2025, we invite you to join us for a follow-up event. This time, we would like you to bring your crawler trap conundrums for us to look at as a group. If you have an example you would like to share, send the details listed below to webarchive[AT]nationalarchives.gov.uk by Friday, October 3 so we can put together a schedule:

  • Short description of the problem
  • Either the full crawl log or a snippet that shows examples of the type of URL causing the problem

You will need to be willing to explain the case and answer questions during the session. If you don’t have an example to share this time, you can still sign up! The aim is to share experiences and learn from one another. We will also launch the knowledge bank we have compiled, which collects the solutions we have developed for common problems.

Speakers

The UK Government Web Archive Team

About The UK Government Web Archive

The UK Government Web Archive, part of The National Archives (UK), preserves and makes accessible UK central government information published on the web. The Web Archive includes videos, tweets, images and websites dating from 1996 to the present day. Our initial work on crawler traps was presented as a poster at IIPC WAC 2024.

The event is finished.

Date

16 Oct 2025

Time

UTC
1:00 PM - 2:00 PM

Local Time

  • Timezone: America/New_York
  • Date: 16 Oct 2025
  • Time: 9:00 AM - 10:00 AM

Labels

Member's event,
Members only
