The IIPC Collaborative Collections are an ongoing, annually funded project led by the IIPC Content Development Working Group (CDG). Because they represent more perspectives than similar collections created by a single member archive, the collections are of higher value to research. They are available on Archive-It and also through Bibliotheca Alexandrina (BA), using SolrWayback and LinkGate.
Archive-It access & data
IIPC Collaborative Collections are publicly available through the Internet Archive’s Archive-It service, which provides full-text and faceted search. Currently, 18 IIPC collections are available for browsing. In addition, IIPC members can request logins for the IIPC Archive-It account to download WARC, WAT (Web Archive Transformation), LGA (Longitudinal Graph Analysis) and WANE (Web Archive Named Entities) files for research purposes. Non-IIPC researchers can sign agreements with IIPC for research use of IIPC collection WARC data (see guidelines).
Archives Research Compute Hub (ARCH), a new interface for web archive analysis currently being developed by the Archives Unleashed and Archive-It Teams, will provide more datasets derived from the collections. ARCH integrates Archives Unleashed Cloud’s analytical tools with Archive-It’s web archiving platform. Two of the Archives Unleashed cohort teams have been analysing the IIPC Covid-19 collection using the new service (See AWAC2 Project).
Access through Bibliotheca Alexandrina
SolrWayback, developed by the Royal Danish Library, is a search interface and Wayback machine for the UK Web Archive’s Solr-based WARC-indexer framework. It adds features for research, including free-text search across all resources, interactive link graph and word cloud generation for domains, N-gram search, large-scale export of link graphs in Gephi format, and more.
LinkGate is a data service, data extraction tool, and visualization front-end for scalable temporal graph visualization for web archive research. LinkGate is the result of an IIPC-funded collaboration between Bibliotheca Alexandrina and the National Library of New Zealand. LinkGate uses three components: the link service where linked data is stored, the link indexer for extracting outlink information from the web archive and inserting it into the link service, and the link visualizer for rendering and navigating linked data retrieved through the link service.
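To make the division of labour between the three components concrete, here is a minimal sketch in Python. It is an illustration only, not LinkGate's actual code or API: the class and function names (`LinkService`, `index_capture`, `render_text`), the in-memory storage, and the use of 14-digit Wayback-style timestamps are all assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkService:
    """Hypothetical link service: stores outlinks per (source URL, capture time)."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, timestamp) -> set of target URLs

    def insert(self, source, timestamp, targets):
        self._edges[(source, timestamp)].update(targets)

    def outlinks(self, source, timestamp):
        """Outlinks of the latest capture of `source` at or before `timestamp`."""
        # 14-digit timestamps of equal length compare correctly as strings.
        captures = [ts for (src, ts) in self._edges if src == source and ts <= timestamp]
        if not captures:
            return set()
        return set(self._edges[(source, max(captures))])


class _AnchorParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def index_capture(service, url, timestamp, html):
    """Hypothetical link indexer: extract outlinks from one archived page and store them."""
    parser = _AnchorParser()
    parser.feed(html)
    targets = {urljoin(url, href) for href in parser.hrefs}
    service.insert(url, timestamp, targets)


def render_text(service, url, timestamp):
    """Stand-in for the link visualizer: list a node's outlinks as plain text."""
    lines = [f"{url} @ {timestamp}"]
    for target in sorted(service.outlinks(url, timestamp)):
        lines.append(f"  -> {target}")
    return "\n".join(lines)


if __name__ == "__main__":
    service = LinkService()
    index_capture(service, "http://example.org/", "20200301000000", '<a href="/a">A</a>')
    index_capture(service, "http://example.org/", "20200601000000", '<a href="/b">B</a>')
    # Query the graph as it stood in April 2020: only the March capture applies.
    print(render_text(service, "http://example.org/", "20200401000000"))
```

The point of the sketch is the temporal aspect: the same URL can have different outlinks at different capture times, so every query carries a timestamp. A real deployment would replace the in-memory dictionary with a persistent graph store and read pages from WARC files rather than from strings.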
One of the goals of this project is to create “web archiving sandboxes” for researchers. These small-scale subsets, extracted from the larger collections, are intended for getting started with web archive research and for demonstration purposes.
Collaborative collections: access
With the data hosted in Archive-It, BA provides the following to integrate the collections and republish them on an ongoing basis through SolrWayback and LinkGate as alternative access interfaces:
- Automation of the process of ongoing incremental data transfer from Archive-It to BA infrastructure
- Data storage for raw web crawl data and derived index
- Automation of the process of ongoing incremental indexing
- Compute time for indexing
- Server allocation for frontend and backend instances
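The incremental transfer and indexing steps listed above can be sketched as a comparison of file manifests. This is a simplified illustration, not BA's actual pipeline or any Archive-It API: the manifest format (a mapping from file name to checksum), the function names, and the sample file names are all hypothetical.

```python
def plan_transfer(source_manifest, local_manifest):
    """Return names of files that are new at the source or whose checksum differs locally."""
    return sorted(
        name
        for name, checksum in source_manifest.items()
        if local_manifest.get(name) != checksum
    )


def plan_indexing(local_manifest, indexed_names):
    """Return names of locally stored files that have not been indexed yet."""
    return sorted(name for name in local_manifest if name not in indexed_names)


if __name__ == "__main__":
    # Hypothetical manifests: file name -> checksum.
    source = {"crawl-001.warc.gz": "aa11", "crawl-002.warc.gz": "bb22", "crawl-003.warc.gz": "cc33"}
    local = {"crawl-001.warc.gz": "aa11", "crawl-002.warc.gz": "stale"}
    indexed = {"crawl-001.warc.gz"}

    print(plan_transfer(source, local))   # files to fetch on this run
    print(plan_indexing(local, indexed))  # files waiting for the indexer
```

Running such a comparison on a schedule means each run moves and indexes only what changed since the previous run, which is what keeps an ongoing synchronization affordable in storage and compute time.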
Note: URLs for LinkGate and SolrWayback access are placeholders. The collections are in the process of being gradually synchronized with Archive-It.
- Ian Milligan, Jefferson Bailey, Nick Ruest, Samantha Fritz, Valérie Schafer and Frédéric Clavert: Design, Build, Use: Building a Computational Research Platform for Web Archives. WAC2022 session.
- Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions
- Alex Thurman and Nicola Bingham: Collaborative Collection Development: Challenges and Opportunities. WAC2022 presentation.
- LinkGate on GitHub (https://github.com/arcalex/LinkGate), including research use cases for web archive graph visualization.