14:30 – 14:50

Nick Ruest & Ian Milligan: See a little Warclight: building an open-source web archive portal with Project Blacklight

Nick Ruest, York University
Ian Milligan, University of Waterloo

In 2014-15, close collaboration between UK-based researchers and the UK Web Archive led to the launch of the open-source Shine project. It allowed faceted search, trend diagram exploration, and other advanced methods of exploring web archives. It had two limitations, however: it was based on the Play framework (which is relatively obscure, especially within library settings), and after the Big UK Domain Data for the Arts and Humanities (BUDDAH) project came to an end, development largely languished.

The idea of Shine is an important one, however, and our project team wanted to explore how we could take this great work and begin to move it into the wider, open-source library community. Hence the idea of a Project Blacklight-based engine for exploring web archives. Blacklight, an open-source library discovery engine, would be familiar to library IT managers and other technical community members. But what if Blacklight could work with WARCs?

The Archives Unleashed team’s first foray towards what we now call “Warclight” (a portmanteau of Blacklight and the ISO-standardized Web ARChive file format) was building a standalone Blacklight Rails application. When we realized that this would not help those who would like to implement it themselves, development pivoted to building a Rails Engine, which “allows you to wrap a specific Rails application or subset of functionality and share it with other applications or within a larger packaged application.” Put another way, it allows others to use an existing Warclight template to build their own web archive search application. Drawing inspiration from UKWA’s Shine, it allows faceted full-text search, record view, and other advanced discovery options. Warclight is designed to work with web archive data that is indexed via the UK Web Archive’s webarchive-discovery project.

Webarchive-discovery is a utility that parses ARCs and WARCs and indexes them using Apache Solr, an open-source search platform. Once these ARCs and WARCs have been indexed, Solr provides us with searchable fields including title, host, crawl-date, and content type.
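As a rough illustration of how an index like this can be explored, the sketch below builds a faceted full-text query URL for Solr's standard select handler. The core name `webarchive`, the local Solr address, and the field names `host` and `content_type` are assumptions made for the example; the actual names depend on the schema used in a given webarchive-discovery deployment. The function name `build_facet_query` is hypothetical and is not part of Warclight or webarchive-discovery.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, rows=10):
    """Build a Solr /select URL for a full-text query with field facets.

    base_url and the facet field names are assumptions for illustration;
    q, rows, wt, facet and facet.field are standard Solr query parameters.
    """
    params = [
        ("q", text),          # full-text query string
        ("rows", rows),       # number of documents to return
        ("wt", "json"),       # ask Solr for a JSON response
        ("facet", "true"),    # switch faceting on
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core holding the indexed ARC/WARC records.
url = build_facet_query(
    "http://localhost:8983/solr/webarchive",
    "olympics",
    ["host", "content_type"],
)
print(url)
```

Sending a request to such a URL would return matching records together with per-host and per-content-type counts, which is the kind of data a faceted search interface displays alongside its results.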

One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open-source community, which could allow us to go further, in keeping with the old proverb: “If you want to go fast, go alone. If you want to go further, go together.”

This presentation will provide an overview of Warclight and its implementation patterns, including the Archives Unleashed at-scale implementation of over 1 billion Solr docs using Apache SolrCloud.

14:50 – 15:10

Ditte Laursen & Niels Brügger: A national Web Trend Index based on national web archives

Niels Brügger, School of Communication and Culture – Media Studies, Aarhus University
Ditte Laursen, Royal Danish Library

A number of historical studies of national webs already exist (Brügger and Laursen, in press), but systematic basic information about a national web and its changes over time is lacking. This could be information about the number of websites, of specific file types, or of hyperlinks to social media platforms, as well as information about hyperlink structures or prevailing languages on a national web.

In this presentation, we will argue for the establishment of what we call a national Web Trend Index. Such an index can support future studies of the history of the web and be relevant for researchers, web archives, web companies, and civil society as an important source for understanding national webs and their historical development. The national Web Trend Index should provide metrics for how national web domains have developed over time, and it must be flexible enough to accommodate new metrics as the online web, the web collections, and the interests of all stakeholders change. The presentation will illustrate some of the most obvious metrics to include in such a national Web Trend Index, and we will outline how the index can be built using a systematic, transparent and reproducible approach. We will argue that a national Web Trend Index is best made and sustained in an organisational setup that includes curators, developers and researchers. Finally, transnational perspectives for a Web Trend Index are discussed.

Brügger, N., & Laursen, D. (Eds.) (In press). The historical web and Digital Humanities: The case of national web domains. Routledge.

15:10 – 15:30

Jason Webber: Using secondary datasets for researchers under a legal deposit framework

The British Library

The UK Web Archive (UKWA) is a partnership of all six UK Legal Deposit Libraries that has attempted to collect the entire UK Web Space at least once per year since 2013. This material is collected under the Non-Print Legal Deposit Act 2003. This act allows UKWA to archive, without permission, all digitally published material that can be identified as UK owned or based. This generates millions of websites and billions of individual assets, all of which are indexed. This vast resource is, however, strictly only viewable on the premises and within the control of UK Legal Deposit Libraries.

Whilst UKWA has developed a new interface that makes searching the Legal Deposit collection possible, it doesn’t remove the significant barrier for researchers of having to come to a Library, apply for a reader’s pass (not simple) and use a Library terminal under some strict viewing conditions. And that is only the barrier for researchers wanting to look at a few web pages. There is currently no easy-to-use facility for researchers wanting to do big data analysis across the whole Legal Deposit collection, in large part because that research has to be done on-site at a Library.

A possible (and partial) solution is the use of secondary datasets. UKWA is legally unable to supply researchers with the actual websites or text or, in fact, anything that can be used to reconstruct the original works. What is possible, however, is to supply facts about the collection, and these facts can be incredibly valuable to researchers.

This presentation will discuss two use case projects that have utilised secondary datasets created by researchers with help from UKWA staff. The first project used geographical data extracted from the UK Web and compared it to information available on businesses through Companies House. The second project used an algorithm to attempt to identify the polarity of words in the UK over time – how words may have changed their meaning.

The use of secondary datasets within web archiving can potentially solve the difficult legal position of many national libraries that collect under legal deposit or other strict access conditions. This presentation will, in part, be a call for more work to be done in this area to create environments in which researchers can either work with existing datasets or create their own.

15:30 – 15:40

Q&A

Research use