RESEARCHER ACCESS TO IIPC COLLECTIONS

IIPC Collaborative Collections are an ongoing, annually funded project led by the IIPC Content Development Working Group (CDG). Because they represent more perspectives than similar collections created by a single member archive, the collections are of higher value to research. They are available not only on Archive-It but also through Bibliotheca Alexandrina (BA), using SolrWayback and LinkGate.

Archive-It access & data

IIPC Collaborative Collections are publicly available through the Internet Archive’s Archive-It service, which provides full-text and faceted search. Currently, 18 IIPC collections are available for browsing. Additionally, IIPC members can request logins for the IIPC Archive-It account to download WARC, WAT (Web Archive Transformation), LGA (Longitudinal Graph Analysis), and WANE (Web Archive Named Entities) files for research purposes. Non-IIPC researchers can sign agreements with IIPC for research use of IIPC collection WARC data (see guidelines).

The Archives Research Compute Hub (ARCH), a new interface for web archive analysis currently being developed by the Archives Unleashed and Archive-It teams, will provide more datasets derived from the collections. ARCH integrates Archives Unleashed Cloud’s analytical tools with Archive-It’s web archiving platform. Two of the Archives Unleashed cohort teams have been analysing the IIPC Covid-19 collection using the new service (see AWAC2 Project).

Access through Bibliotheca Alexandrina

Initiated in 2021, this project brings together IIPC collections and tools developed by IIPC members, and creates synergy between two working groups: the Research Working Group and the Content Development Working Group.

Tools

SolrWayback, developed by the Royal Danish Library, is a search interface and Wayback machine for the UK Web Archive Solr-based WARC-indexer framework. It offers additional features for research, including free-text search in all resources, interactive link graph and word cloud generation for domains, N-gram search, large-scale export of link graphs in Gephi format, and more.
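To make the idea of N-gram search concrete, the following is a minimal sketch of counting how often a phrase appears per capture year. It is not SolrWayback's implementation, which runs against a Solr index. The function names, the whitespace tokenization, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts_by_year(documents, phrase):
    """Count occurrences of phrase per capture year.

    documents is an iterable of (year, text) pairs, a hypothetical
    stand-in for text extracted from archived pages.
    """
    n = len(phrase.split())
    target = phrase.lower()
    counts = defaultdict(int)
    for year, text in documents:
        counts[year] += Counter(ngrams(text, n))[target]
    return dict(sorted(counts.items()))


# Made-up sample data, for illustration only.
sample = [
    (2019, "Climate change policy and climate change research"),
    (2020, "Pandemic response overshadowed climate change coverage"),
    (2020, "Public health guidance"),
]

print(ngram_counts_by_year(sample, "climate change"))
# {2019: 2, 2020: 1}
```

Plotting such per-year counts is what turns a phrase query into a trend line over the lifetime of a collection.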

LinkGate is a data service, data extraction tool, and visualization front-end for scalable temporal graph visualization for web archive research. LinkGate is the result of an IIPC-funded collaboration between Bibliotheca Alexandrina and the National Library of New Zealand. LinkGate uses three components: the link service where linked data is stored, the link indexer for extracting outlink information from the web archive and inserting it into the link service, and the link visualizer for rendering and navigating linked data retrieved through the link service.
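The division of labour between the three components can be sketched in a few lines. This is a toy, in-memory illustration of the roles described above, not LinkGate's actual code or API: the class and function names, the edge representation, and the plain-text "visualizer" are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkService:
    """Toy link service: stores (source, target, timestamp) edges in memory."""

    def __init__(self):
        self.edges = set()

    def add_edge(self, source, target, timestamp):
        self.edges.add((source, target, timestamp))

    def outlinks(self, source, timestamp):
        """Return the sorted targets linked from source at the given timestamp."""
        return sorted(t for s, t, ts in self.edges
                      if s == source and ts == timestamp)


class _AnchorParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def index_capture(service, url, timestamp, html):
    """Toy link indexer: extract outlinks from one capture and store them."""
    parser = _AnchorParser()
    parser.feed(html)
    for href in parser.hrefs:
        service.add_edge(url, urljoin(url, href), timestamp)


def render_text(service, url, timestamp):
    """Toy stand-in for the link visualizer: a plain-text listing."""
    lines = [f"{url} @ {timestamp}"]
    lines += [f"  -> {target}" for target in service.outlinks(url, timestamp)]
    return "\n".join(lines)


# Made-up captures of the same page at two points in time.
service = LinkService()
index_capture(service, "https://example.org/", "2020-03-01",
              '<a href="/about">About</a> <a href="https://example.com/">Ext</a>')
index_capture(service, "https://example.org/", "2021-03-01",
              '<a href="/about">About</a>')

print(render_text(service, "https://example.org/", "2020-03-01"))
```

Keeping the timestamp on every edge is what makes the graph temporal: the same page can be queried for its outlinks at each capture date.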

Sandboxes 

One of the goals of this project is to create “web archiving sandboxes” for researchers. These small-scale subsets extracted from the larger collections are intended for getting started with web archive research and for demonstration purposes.

Collaborative collections: access

The collection data is held in Archive-It. To republish the collections on an ongoing basis through SolrWayback and LinkGate as alternative access interfaces, BA is providing the following:

  • Automation of the process of ongoing incremental data transfer from Archive-It to BA infrastructure
  • Data storage for raw web crawl data and derived index
  • Automation of the process of ongoing incremental indexing
  • Compute time for indexing
  • Server allocation for frontend and backend instances
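As an illustration of what incremental transfer means in practice, the sketch below compares a source listing against what has already been copied and returns only the files still pending. It does not describe BA's actual automation: the function name, the use of checksums, and the file names are assumptions.

```python
def plan_incremental_transfer(source_files, transferred):
    """Return the names of source files that still need to be copied.

    source_files: dict mapping file name -> checksum at the source
    transferred:  dict mapping file name -> checksum already at the destination
    A file is pending if it is missing at the destination or its checksum differs.
    """
    return sorted(
        name for name, checksum in source_files.items()
        if transferred.get(name) != checksum
    )


# Hypothetical file names and checksums, for illustration only.
source = {
    "crawl-0001.warc.gz": "aaa",
    "crawl-0002.warc.gz": "bbb",
    "crawl-0003.warc.gz": "ccc",
}
already = {
    "crawl-0001.warc.gz": "aaa",
    "crawl-0002.warc.gz": "old",
}

print(plan_incremental_transfer(source, already))
# ['crawl-0002.warc.gz', 'crawl-0003.warc.gz']
```

The same comparison can drive incremental indexing: only files that are new or changed since the last run need to be re-indexed.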

Note: URLs for LinkGate and SolrWayback access are placeholders. The collections are in the process of being gradually synchronized with Archive-It.

Access Note (2023-08-17): LinkGate links are currently unavailable, but access should be restored soon.

Year | Collection (Archive-It access) | Size | At Bibliotheca Alexandrina
2022- | War in Ukraine | 964 GB | LinkGate, SolrWayback
2020- | Novel Coronavirus (COVID-19) | 5.5 TB | LinkGate, SolrWayback
2016- | National Olympic and Paralympic Committees | 1.7 TB | LinkGate, SolrWayback
2015- | Intergovernmental Organizations | 4.4 TB | LinkGate, SolrWayback
2014 | 2014 Winter Paralympics | 1.3 TB | LinkGate, SolrWayback
2022 | 2022 Winter Olympics and Paralympics | 361 GB | LinkGate, SolrWayback
2021 | 2020 Summer Olympics and Paralympics [held in 2021] | 610 GB | LinkGate, SolrWayback
2021 | Afghanistan Regime Change (2021) and the International Response | 630 GB | LinkGate, SolrWayback
2019 | Climate Change | 1.2 TB | LinkGate, SolrWayback
2019 | Artificial Intelligence | 644 GB | LinkGate, SolrWayback
2018 | Online News Around the World | 1.8 TB | LinkGate, SolrWayback
2018 | 2018 Winter Olympics and Paralympics | 1.2 TB | LinkGate, SolrWayback
2015-18 | World War I Commemoration | 5 TB | LinkGate, SolrWayback
2016 | 2016 Summer Olympics and Paralympics | 3.1 TB | LinkGate, SolrWayback
2016 | European Refugee Crisis | 824 GB | LinkGate, SolrWayback
2014 | 2014 Winter Olympics | 1.6 TB | LinkGate, SolrWayback
2012 | 2012 Summer Olympics | | LinkGate, SolrWayback
2012 | 2012 Summer Paralympics | | LinkGate, SolrWayback
2010 | 2010 Winter Olympics | | LinkGate, SolrWayback

Resources

ARCH

AWAC2 

Collaborative Collections

LinkGate

SolrWayback