National Library of China

The Key Technologies of Web Information Preservation and Service System Platform

The web archives of the National Library of China (NLC) date back to 2003, along with the blossoming of Internet technology in China. For the past 15 years, we have been committed to archiving government public information and records of important websites in China and abroad. In 2011 the NLC united libraries nationwide to create web archives, and by the end of 2018 the NLC and over 300 public libraries had joined the project. We work closely together to crawl and serve government public information, integrate the web archives and provide services to the public.

Based on Heritrix, the NLC’s web archives have been crawled, catalogued and preserved since 2005. With the explosive growth of the web archives and the rapid development of network technology, we have kept up with web archiving technology and developed software to improve the platform’s services. We aim to build a more open, shared and compatible platform to meet different demands. To attract more libraries nationwide to join us, we strive to make web archiving easier and more convenient. To offer better crawling operations, we have upgraded the technology and developed a “web archiving service platform” on top of a distributed cloud storage infrastructure.

Its data-processing capability supports the management and use of at least one million metadata records. The platform is designed with eight functional modules for modular, lightweight deployment, allowing other libraries to deploy their own modules. It adopts a distributed cloud infrastructure, which enables us to work closely with multiple libraries and organizations to archive online. This poster will highlight the functions and characteristics of the platform and how we make it work. We look forward to sharing our ideas and methods and to learning from you.


Ben O’Brien, National Library of New Zealand
Jeffrey Van Der Hoeven, KB – National Library of the Netherlands

Technical Uplift of the Web Curator Tool

Colleagues at the National Library of New Zealand and the National Library of the Netherlands are continuing to develop the Web Curator Tool (WCT) after releasing version 2.0 in December 2018. This poster will highlight the 2019 enhancements and what we learned in the process.

The goal of v2.0 was to uplift the crawling capability of the WCT by integrating Heritrix 3. This addressed what was seen as the most deficient area of the WCT. It was discovered during a proof-of-concept that the Heritrix 3 integration could be achieved without significant upgrade of the WCT’s outdated libraries and frameworks. But further functionality could not be developed until those libraries and frameworks had been uplifted, providing a stable modern base for new functionality. Now that v2.0 has been completed, the next milestone in the WCT development is to perform this technical uplift.

Besides the technical uplift, two other items of work on the development plan for WCT in the first half of 2019 are: component-based REST APIs and documenting user journeys.

We want to make the WCT much more flexible and less tightly coupled by exposing each component via an API layer. To make that API development easier, we are looking to migrate the existing SOAP API to REST and to change components so that they are less dependent on each other. One of those components is the Harvest Agent, which acts as a wrapper for the Heritrix 3 crawler we currently use. Our goal is to develop this component to integrate with additional web crawlers, such as Brozzler.
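The decoupling described above can be illustrated with a small sketch. The class and method names here are hypothetical stand-ins, not actual WCT or crawler APIs: the point is that a Harvest Agent written against a crawler-neutral interface can drive Heritrix 3 or Brozzler without any change to the rest of the system.

```python
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Minimal crawler-neutral interface (hypothetical, not real WCT code)."""

    @abstractmethod
    def start(self, seed_url: str) -> str:
        """Launch a crawl job and return a job identifier."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Report the state of a running job."""


class Heritrix3Crawler(Crawler):
    def start(self, seed_url: str) -> str:
        # Real code would talk to the Heritrix 3 engine here.
        return f"h3-job-{abs(hash(seed_url)) % 1000}"

    def status(self, job_id: str) -> str:
        return "RUNNING"


class BrozzlerCrawler(Crawler):
    def start(self, seed_url: str) -> str:
        # Real code would enqueue a site in Brozzler instead.
        return f"brozzler-job-{abs(hash(seed_url)) % 1000}"

    def status(self, job_id: str) -> str:
        return "ACTIVE"


class HarvestAgent:
    """The agent depends only on the interface, not on a concrete crawler."""

    def __init__(self, crawler: Crawler):
        self.crawler = crawler

    def harvest(self, seed_url: str) -> str:
        return self.crawler.start(seed_url)


# The same agent code drives either crawler:
agent = HarvestAgent(BrozzlerCrawler())
job = agent.harvest("https://example.org/")
```

Swapping the constructor argument is the only change needed to crawl with a different engine, which is the property the planned REST component boundaries are meant to preserve.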

The process of mapping user journeys, the way users interact with the WCT, is long overdue. Future development will involve writing unit and/or integration tests that cover those essential user journeys. These tests will be used to ensure that all essential functionality remains through all development changes.
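As a sketch of what a journey-level test might look like (the workflow object and state names below are hypothetical stand-ins, not the real WCT API), an integration test can walk a target through a create–schedule–harvest–endorse journey and assert the outcome at each step:

```python
class TargetWorkflow:
    """Toy stand-in for a curator workflow (hypothetical, not WCT code)."""

    def __init__(self, name: str):
        self.name = name
        self.state = "created"

    def schedule(self):
        assert self.state == "created"
        self.state = "scheduled"

    def harvest(self):
        assert self.state == "scheduled"
        self.state = "harvested"

    def endorse(self):
        assert self.state == "harvested"
        self.state = "endorsed"


def test_basic_curator_journey():
    """Create a target, schedule it, harvest it, endorse the result."""
    target = TargetWorkflow("example.org")
    target.schedule()
    target.harvest()
    target.endorse()
    assert target.state == "endorsed"


test_basic_curator_journey()
```

A suite of such tests, one per essential journey, is what would guard the existing functionality while the underlying libraries and frameworks are uplifted.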

This poster and lightning talk will cover the exercise of upgrading a 13-year-old Java application, migrating components of it to use REST APIs, and the likely challenges and pitfalls that we will encounter. We also hope to share any insights from documenting the WCT user journeys. If possible, we would prefer to submit a digital poster so that we can embed short demos of any new WCT functionality and demonstrate invoking another crawler from within the WCT.


British Library

From the sidelines to the archived web: What are the most annoying football phrases in the UK?

As news and TV coverage of football has increased in recent years, there has been growing interest in the language and phrases used to describe the game. Online, there have been numerous news articles, blog posts and lists on public internet forums about the most annoying football clichés. However, all of these lists focus on the men’s game, and finding a similar list on women’s football online was very challenging. Only by posting a tweet with a survey asking the public “What do you think are the most annoying phrases to describe women’s football?” was I able to collate an appropriate sample to work with.

The lack of any such list in a similar format highlights gender inequality online, a reflection of wider society. I filtered a sample of the phrases from men’s and women’s football to find the top five most annoying phrases, then ran these phrases through the UK Web Archive Shine interface to determine their popularity on the archived web. The Shine interface was first developed in 2015 as part of the Big UK Domain Data for the Arts and Humanities project; it searches across 3,520,628,647 distinct records from the .uk domain, captured from January 1996 to 6 April 2013. This presentation will assess how useful the Trends function of the Shine interface is for determining the popularity of a sample of selected football phrases on the UK web between 1996 and 2013.
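Trend views of this kind typically plot, for each year, the proportion of captures containing a phrase rather than raw counts, so that the growth of the archive itself is not mistaken for growth of the phrase. With invented figures (not real UK Web Archive data), the normalisation looks like this:

```python
# Hypothetical per-year figures: captures containing a phrase
# versus all captures indexed for that year.
phrase_hits = {2005: 120, 2009: 940, 2012: 2100}
total_captures = {2005: 1_000_000, 2009: 4_700_000, 2012: 7_000_000}


def trend(hits: dict, totals: dict) -> dict:
    """Relative frequency per year, as a percentage of all captures."""
    return {year: 100.0 * hits[year] / totals[year] for year in hits}


result = trend(phrase_hits, total_captures)
# Raw hits rise every year, but the normalised trend shows
# whether the phrase is actually gaining ground in the archive.
```

Here the 2012 figure works out to 0.03% of captures against 0.012% in 2005, a genuine rise even after accounting for the much larger 2012 crawl.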

It is hoped that the findings from this study will be of interest to the footballing world but more importantly, encourage further research in sports and linguistics using the UK Web Archive.

Helena Byrne. (2018). What do you think are the most annoying phrases to describe women’s football? Retrieved from: https://footballcollective.org.uk/2018/05/18/what-do-you-think-are-the-most-annoying-phrases-to-describe-womens-football/ (Accessed August 26, 2018)
Andrew Jackson. (2016). Introducing SHINE 2.0 – A Historical Search Engine. Retrieved from: http://blogs.bl.uk/webarchive/2016/02/updating-our-historical-search-service.html (Accessed August 26, 2018)


University Library in Bratislava

Archiving and LTP of websites and Born Digital Documents in the Slovak Republic

Electronic documents and websites should be preserved in the same way as physical objects of lasting value, using a long-term storage platform. In 2015 the University Library in Bratislava (ULB) put into operation a system for controlled web harvesting and e-Born archiving, the result of the national project Digital Resources – Web Harvesting and e-Born Content Archiving. The project is now in its sustainability phase, and all activities are carried out by the Deposit of Digital Resources department.

This contribution focuses on the specific solution for archiving Slovak websites and born-digital documents and for their long-term preservation (LTP). Archiving is carried out in the Information System Digital Resources (IS DR), and archived resources are delivered to the Central Data Archive (CDA), which serves as the LTP storage. The CDA is designed and operated in compliance with the requirements and standards for trusted long-term repositories (ISO 16363, ISO 14721).

We will present the process from the archiving of content in the IS DR to its storage in the CDA. Data are delivered to the CDA in the form of Submission Information Packages (SIPs). The integrated creation of SIP files in the Deposit of Digital Resources is an efficient semi-automatic solution requiring minimal intervention by the curator. Every SIP is a compressed ZIP file (in compliance with the CDA requirements) containing descriptive metadata and the archived files. A script creates the packages, signs them and saves them in a temporary repository; every SIP is signed with an SSL certificate, for which the certificate authority is the CDA. SIPs confirmed by the curator are transferred to a temporary CDA repository, where they await further processing. After successful validation, verification and format control, SIPs are transformed into Archival Information Packages (AIPs), and the generated AIP number is recorded in the IS DR.
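A highly simplified sketch of the packaging step is shown below. The file names and metadata fields are illustrative only, and the real workflow signs packages with an SSL certificate issued by the CDA; a SHA-256 digest stands in for that signature here.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def build_sip(sip_path: Path, metadata: dict, payload_files: dict) -> str:
    """Write a SIP as a ZIP holding descriptive metadata plus archived files,
    and return a SHA-256 digest of the package (a stand-in for the real
    certificate-based signature applied in the CDA workflow)."""
    with zipfile.ZipFile(sip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Descriptive metadata travels inside the package itself.
        zf.writestr("metadata/descriptive.json", json.dumps(metadata))
        for name, data in payload_files.items():
            zf.writestr(f"data/{name}", data)
    return hashlib.sha256(sip_path.read_bytes()).hexdigest()


digest = build_sip(
    Path("example-sip.zip"),
    metadata={"title": "Archived website", "source": "IS DR"},
    payload_files={"capture.warc.gz": b"...archived content..."},
)
```

In the production pipeline the curator confirms each package before transfer, and the CDA side performs the validation, verification and format control that turn a SIP into an AIP.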


Common Crawl

Accessing WARC files via SQL

Like many other web archiving initiatives, Common Crawl uses WARC as its primary storage format and a CDX index to look up WARC records by URL. Recently we have made available a columnar index in the Apache Parquet format, which can be queried and analysed using SQL with a variety of big data tools and managed cloud computing services. The analytical power of SQL makes it possible to gain insight into the archives and to aggregate statistics and metrics within minutes. We also demonstrate how the WARC web archives can now be processed “vertically” at scale, enabling users to pick captures not only by URL but by any metadata provided (e.g., content language, MIME type), or even by a combination of URL and metadata.
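The columnar index can be queried by any SQL engine that reads Parquet. As a self-contained illustration of the query shape, the sketch below loads a few invented index rows into SQLite: the column names echo the kind of fields the columnar index exposes (URL, MIME type, content language, plus the WARC file, offset and length needed to fetch a record), but the data and the exact schema here are assumptions for the example, not the real index.

```python
import sqlite3

# Invented index rows: url, mime type, language, WARC file, offset, length.
rows = [
    ("https://example.org/", "text/html", "deu",
     "crawl-00002.warc.gz", 4096, 1024),
    ("https://example.com/", "text/html", "eng",
     "crawl-00001.warc.gz", 0, 2048),
    ("https://example.org/logo.png", "image/png", None,
     "crawl-00001.warc.gz", 2048, 512),
]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ccindex (
    url TEXT, content_mime_type TEXT, content_languages TEXT,
    warc_filename TEXT, warc_record_offset INTEGER,
    warc_record_length INTEGER)""")
con.executemany("INSERT INTO ccindex VALUES (?, ?, ?, ?, ?, ?)", rows)

# Pick captures by metadata rather than by URL: German-language HTML
# pages, together with the WARC location needed to fetch each record.
hits = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE content_mime_type = 'text/html' AND content_languages = 'deu'
""").fetchall()
```

The returned `warc_filename`/`warc_record_offset`/`warc_record_length` triple is exactly what a client needs to range-request the matching record out of the archive, which is what makes this “vertical” selection by metadata practical at scale.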