RESEARCHER ACCESS TO IIPC COLLECTIONS

IIPC Collaborative Collections are an ongoing, annually funded project led by the IIPC Content Development Working Group (CDG). Because they represent more perspectives than similar collections created by a single member archive, the collections are of higher value to research. They are available on Archive-It and also through Bibliotheca Alexandrina (BA) using SolrWayback and LinkGate.

Archive-It access & data

IIPC Collaborative Collections are publicly available through the Internet Archive’s Archive-It service, which provides full-text and faceted search. Currently, 18 IIPC collections are available for browsing. Additionally, IIPC members can request logins for the IIPC Archive-It account to download WARC, WAT (Web Archive Transformation), LGA (Longitudinal Graph Analysis) and WANE (Web Archive Named Entities) files for research purposes. Non-IIPC researchers can sign agreements with IIPC for research use of IIPC collection WARC data (See guidelines).

Archives Research Compute Hub (ARCH), a new interface for web archive analysis currently being developed by the Archives Unleashed and Archive-It Teams, will provide more datasets derived from the collections. ARCH integrates Archives Unleashed Cloud’s analytical tools with Archive-It’s web archiving platform. Two of the Archives Unleashed cohort teams have been analysing the IIPC Covid-19 collection using the new service (See AWAC2 Project).

Access through Bibliotheca Alexandrina

Initiated in 2021, this project brings together IIPC collections and tools developed by IIPC members, and creates synergy between two working groups: the Research WG and the Content Development WG.

Tools

SolrWayback, developed by the Royal Danish Library, is a search interface and Wayback machine for the UK Web Archive's Solr-based WARC-indexer framework. It adds features for research, including free-text search across all resources, interactive link graphs and word cloud generation for domains, N-gram search, and large-scale export of link graphs in Gephi format.

LinkGate provides scalable temporal graph visualization for web archive research, combining a data service, a data extraction tool, and a visualization front end. It is the result of an IIPC-funded collaboration between Bibliotheca Alexandrina and the National Library of New Zealand. LinkGate has three components: the link service, where linked data is stored; the link indexer, which extracts outlink information from the web archive and inserts it into the link service; and the link visualizer, which renders and navigates linked data retrieved through the link service.
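
The division of work among the three components can be illustrated with a small in-memory sketch. This is not LinkGate's actual API: the names LinkService, index_capture, and outlinks are hypothetical, and the example URLs and 14-digit timestamps are made up. The sketch only shows the general idea of storing timestamped outlinks and then asking for a time-bounded slice of the graph, as a visualizer might.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LinkService:
    """Hypothetical stand-in for a link service: stores timestamped edges."""
    # Maps a source URL to a list of (timestamp, target URL) pairs.
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_link(self, source: str, target: str, timestamp: str) -> None:
        self.edges[source].append((timestamp, target))

    def outlinks(self, source: str, start: str, end: str) -> list:
        """Return targets linked from `source` in captures within [start, end].

        Timestamps are fixed-width digit strings, so string comparison
        matches chronological order.
        """
        captured = self.edges.get(source, [])
        return sorted({target for ts, target in captured if start <= ts <= end})


def index_capture(service: LinkService, source: str, timestamp: str, targets: list) -> None:
    """Hypothetical indexer step: insert the outlinks found in one capture."""
    for target in targets:
        service.add_link(source, target, timestamp)


service = LinkService()
index_capture(service, "http://example.org/", "20200301120000",
              ["http://a.example/", "http://b.example/"])
index_capture(service, "http://example.org/", "20210301120000",
              ["http://b.example/", "http://c.example/"])

# A visualizer would request only the slice of the graph it needs to render.
links_2020 = service.outlinks("http://example.org/", "20200101000000", "20201231235959")
print(links_2020)
```

Keeping the timestamp on every edge, rather than on the node, is what makes the graph temporal: the same page can link to different targets in different captures.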

Sandboxes

One of the goals of this project is to create “web archiving sandboxes” for researchers. These small-scale subsets, extracted from the larger collections, are intended for getting started with web archive research and for demonstration purposes.

Collaborative collections: access

The collection data is held in Archive-It. To integrate the collections for ongoing republishing through SolrWayback and LinkGate as alternative access interfaces, BA is providing the following:

Automated, ongoing incremental data transfer from Archive-It to BA infrastructure
Data storage for raw web crawl data and the derived index
Automated, ongoing incremental indexing
Compute time for indexing
Server allocation for frontend and backend instances
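
The incremental transfer and indexing items listed above can be pictured as a comparison between what the source currently holds and what has already been processed locally. The following is a minimal sketch under stated assumptions: the file names, the checksum manifests, and the function plan_incremental_sync are hypothetical, and the sketch does not describe BA's actual pipeline or any Archive-It API.

```python
def plan_incremental_sync(remote_files, transferred, indexed):
    """Decide which files to transfer and which to index in the next run.

    remote_files: dict of file name -> checksum currently listed at the source
    transferred:  dict of file name -> checksum of the copy already held locally
    indexed:      set of file names that have already been indexed
    """
    # Transfer files that are new, or whose checksum differs from the local copy.
    to_transfer = sorted(
        name for name, checksum in remote_files.items()
        if transferred.get(name) != checksum
    )
    # Index every local file not yet indexed, and re-index anything re-transferred.
    present = set(transferred) | set(to_transfer)
    to_index = sorted((present - set(indexed)) | set(to_transfer))
    return to_transfer, to_index


# Hypothetical manifests: one unchanged file, one changed file, one new file.
remote = {"a.warc.gz": "111", "b.warc.gz": "222", "c.warc.gz": "333"}
transferred = {"a.warc.gz": "111", "b.warc.gz": "999"}
indexed = {"a.warc.gz", "b.warc.gz"}

to_transfer, to_index = plan_incremental_sync(remote, transferred, indexed)
print("transfer:", to_transfer)
print("index:", to_index)
```

Because the plan is derived only from manifests, a run that is interrupted can simply be repeated: files that were already copied and indexed drop out of the next plan.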

Note: URLs for LinkGate and SolrWayback access are placeholders. The collections are in the process of being gradually synchronized with Archive-It.

Access Note (2023-08-17): LinkGate links are currently unavailable, but access should be restored soon.

Year	Collection (Archive-It access)	Size	At Bibliotheca Alexandrina (LinkGate)	At Bibliotheca Alexandrina (SolrWayback)
2022-	War in Ukraine	964 GB	LinkGate	SolrWayback
2020-	Novel Coronavirus (COVID-19)	5.5 TB	LinkGate	SolrWayback
2016-	National Olympic and Paralympic Committees	1.7 TB	LinkGate	SolrWayback
2015-	Intergovernmental Organizations	4.4 TB	LinkGate	SolrWayback
2014	2014 Winter Paralympics	1.3 TB	LinkGate	SolrWayback
2022	2022 Winter Olympics and Paralympics	361 GB	LinkGate	SolrWayback
2021	2020 Summer Olympics and Paralympics [held in 2021]	610 GB	LinkGate	SolrWayback
2021	Afghanistan Regime Change (2021) and the International Response	630 GB	LinkGate	SolrWayback
2019	Climate Change	1.2 TB	LinkGate	SolrWayback
2019	Artificial Intelligence	644 GB	LinkGate	SolrWayback
2018	Online News Around the World	1.8 TB	LinkGate	SolrWayback
2018	2018 Winter Olympics and Paralympics	1.2 TB	LinkGate	SolrWayback
2015-18	World War I Commemoration	5 TB	LinkGate	SolrWayback
2016	2016 Summer Olympics and Paralympics	3.1 TB	LinkGate	SolrWayback
2016	European Refugee Crisis	824 GB	LinkGate	SolrWayback
2014	2014 Winter Olympics	1.6 TB	LinkGate	SolrWayback
2012	2012 Summer Olympics		LinkGate	SolrWayback
2012	2012 Summer Paralympics		LinkGate	SolrWayback
2010	2010 Winter Olympics		LinkGate	SolrWayback

Resources

ARCH

AWAC2

Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions

Collaborative Collections

LinkGate

SolrWayback