PAPERS

PANELS

POSTERS AND LIGHTNING TALKS

WORKSHOPS


PAPERS

TOLULOPE BALOGUN & TRYWELL KALUSOPA

University of Zululand

Developing a Framework for Web Archiving of Indigenous Knowledge Systems in Selected Academic Institutional Repositories in Africa

Indigenous Knowledge is an important knowledge and developmental resource in Africa. There is a growing trend towards the digitisation of heritage materials in Africa, but while there is consensus on the importance of digitising Indigenous Knowledge, the issue of ensuring long-term preservation of digital materials, so that digital information is permanently secured and protected, remains unresolved. The paper therefore examines the digital preservation of Indigenous Knowledge in selected academic institutional repositories in South Africa with a view to developing a framework for Web archiving that will ensure their authenticity, reliability and trustworthiness.

Anchored in the interpretivist research paradigm, the multiple case study method was used to collect qualitative data for the paper. Data was collected from five academic institutions in South Africa. Purposive sampling was used to select the participants in the academic institutions. The data for the study was collected through comprehensive face-to-face interviews, observation and content analysis. The Web Archiving Life Cycle Model and the Open Archival Information System (OAIS) Reference Model were also used to generate qualitative data for the paper in order to develop a feasible framework for web archiving of the Indigenous Knowledge and heritage materials in academic institutional repositories.

The practical implication of the paper lies in how it deals with the issues of ensuring long-term digital preservation so as to permanently secure and protect the integrity and authenticity of, and ensure future access to, indigenous knowledge in academic institutional repositories in Africa.


DUNJA MAJSTOROVIĆ & KAROLINA HOLUB

Dunja Majstorović, Faculty of Political Science, University of Zagreb
Karolina Holub, National and University Library in Zagreb

Knowledge, Habits and Prospects of the Use of Web Archives among the Academic Community in Croatia

The archiving of Web resources in Croatia has a history of almost 15 years: it started in September 2004, when the Croatian Web Archive (HAW) was founded. Until 2011, Web content was collected only selectively; later that year, harvesting of the entire national .hr domain and thematic harvesting began. HAW stores news media (national and local); websites of institutions, associations, clubs and research projects; portals; blogs; official websites of counties and cities; and journals and books. By September 2018, the entire Croatian Web Archive amounted to over 40 TB; all content is publicly available and can be searched and browsed in several ways via its website and through the Library catalogue.

In Croatia there has been no research into the use of HAW, which is surprising given the sheer volume of data that can be found in the Web archive. The potential audience that could be interested in HAW is large and includes academic users (researchers and students) as well as the general public. Given this lack of research, there is a need to further the discussion about Web archive visibility among at least one group of potential users: the academic community.

Therefore, the goal of this research is to survey the academic community in Croatia in order to gain insight into their knowledge of available web archives (both Croatian and international) as well as their habits of web archive use. A survey will be conducted among university staff, researchers in research institutions and students within academic fields (disciplines) that have a potentially large interest in studying web archives and that could benefit from the material they offer. We chose the following fields: history, journalism and mass media, sociology, political science, information science, computer science and linguistics. The survey will use open- and close-ended questions to investigate participants' knowledge of, habits of use of, and attitudes towards Web archives.

We hope this research will provide valuable insight into a topic not yet researched in Croatia, and also be a valuable contribution to the study of Web archive user communities in general.


JOE CARRANO

Massachusetts Institute of Technology

From the Foundational Web to Founding a Web Archives: Creating a Formalized Web Archiving Program at the MIT Libraries

The Massachusetts Institute of Technology (MIT) had, and continues to have, a high level of engagement and involvement in foundational work on the internet and the support mechanisms of the World Wide Web. MIT students also took to the early web with gusto and started creating their own sites, even registering the www.mit.edu domain themselves. From that point thirty years ago, the mit.edu domain and the websites produced at the Institute grew and sprawled in many different directions. Despite MIT's early involvement and its creation of websites, the MIT Libraries Institute Archives and Special Collections (IASC) only recently started capturing this content for long-term preservation and access. Where does one begin when there is such a vast amount of material to capture, much of which is already included in other web archiving efforts?

This presentation will describe MIT IASC's web archiving strategies, from a pilot project in 2016-2017 to its build-up over the past year and a half into a formalized program. It will discuss our initial selection, curation, and appraisal methods in determining where to start crawling and focus initial energies, including a focus on diversity, inclusion, and social justice values; outreach with web administrators in different departments to develop seed lists; and tools used to determine what has already been crawled by others on the MIT domain. This presentation will also describe our approach to developing web archives collections of student and MIT-affiliated content, with the goal of doing so ethically and with the informed consent of content creators.

Additionally, this talk will go into how MIT IASC developed its web archiving metadata application profile, based on archival standards and on involvement with a group developing emerging practice for describing web archives in the United States. This will include our method of integrating web archives description into archival collections and the ArchivesSpace collection management system. It will also discuss the initial rollout of user access in the reading room and on the Archive-It website, and attempts to promote the collections on campus.


BARTŁOMIEJ KONOPA

Nicolaus Copernicus University in Toruń, State Archive in Bydgoszcz

Studying the past Web in Poland – current state and perspectives

The development of research on the past Web around the world prompts the question: what does it look like in Poland? Currently, there is no general national Web archive in the country; however, there have been, and still are, projects that aim to collect and preserve a fragment of such resources. The collections of the past Polish Web are, of course, also gathered and stored by the Internet Archive. Are researchers aware of their existence, and do they see their use in science? Perhaps research using them has already been carried out? What was its character and subject matter, what sources were used, and what methods were applied to them? These are some of the main questions that can be asked about the state of Polish studies of the past Web.

The main purpose of the paper is to analyse the state of research on the past Web in Poland. First of all, it is necessary to identify the source base that can be used in such research, so one should answer the question of what Web archiving in the country looks like. Knowledge of researchers' opinions on this type of material, as presented in scholarly literature, would also be useful. A review of publications discussing research on the past Web is needed as well, in order to identify their subject matter and how they were carried out. Moreover, the attitude of Polish scholars towards conducting such research is equally interesting. By learning the opinions of members of the academic community about the past Web and the prospects of research on it, it will be possible to consider the directions of its further development.

The paper will consist of two main parts. The first will include a discussion of the actions taken in Poland to preserve the Web, as well as reflection on its archiving and the sources created in the process. This will show the foundation for conducting research on the archived Web, which will be discussed in the second part of the paper. The second part will also present the results of a survey that the author will conduct among representatives of various humanities and social sciences at the Nicolaus Copernicus University in Toruń concerning their attitude to researching the past Web. It will contain questions about their knowledge of the existence of such resources as well as their potential use in science. In addition, respondents who have previously used the past Web will be asked about the character of their work. This part will then present and analyse Polish studies of the past Web whose results have been published in scholarly publications. Their characteristics, the sources used, attitudes towards them, and the methods employed during the research will be examined in particular.


MARK PHILLIPS, CORNELIA CARAGEA, KRUTARTH PATEL & NATHAN FOX

Mark Phillips, University of North Texas Libraries
Cornelia Caragea, University of Illinois at Chicago
Krutarth Patel, Kansas State University
Nathan Fox, University of North Texas

Leveraging Machine Learning to Extract Content-Rich Publications from Web Archives

The University of North Texas (UNT) Libraries in partnership with the University of Illinois at Chicago were awarded a National Leadership Grant (IMLS:LG-71-17-0202-17) from the Institute of Museum and Library Services (IMLS) to research the efficacy of using machine-learning algorithms to identify and extract content-rich publications contained in web archives.

With the increase in institutions collecting web-published content into web archives, there has been growing interest in mining these archives to extract publications or documents that align with existing collections or collection development policies. These identified publications could then be integrated into existing digital library collections, where they would become first-order digital objects instead of content discoverable only by traversing the web archive or through a well-crafted full-text search. This project focuses on the first piece of this workflow: identifying the publications that exist and separating them from content that does not align with existing collections.

To operationalize this research, the project focuses on three primary use cases: extracting scholarly publications for an institutional repository from a university domain's web archive (unt.edu domain), extracting state documents from a state-level domain crawl (texas.gov domain crawl), and extracting technical reports from the web presence of a federal agency (usda.gov from the End of Term 2008 web archive).

This project is separated into two phases. The first is increasing our understanding of the workflows, practices, and selection criteria of librarians and archivists through ethnographic-based observations and interviews. The research from this first phase informs the second where we are using novel machine learning techniques to identify content-rich publications collected in existing web archives.

For the first phase of research, we identified and interviewed individuals who have worked to collect publications from the web. We worked to assemble a representative group of collection types that align with our three use cases of institutional repository, state publications, and federal documents. Our interviews and subsequent analysis have helped us better understand the mindset of these selectors, as well as identify potential features that we can experiment with in our machine learning models as the project moves forward.

The machine learning phase of research has focused on building a pipeline to run experiments over sharable datasets created by the project team that cover the three use cases. A number of experiments with both traditional machine learning approaches and newer deep learning and neural network methods have been conducted, and early results have identified areas where we can focus to improve our overall accuracy in predicting publications that should be reviewed for inclusion in existing collections of web-published documents.

We hope the findings of this work will guide future work that will empower libraries and archives to leverage the rich web archives they’ve been collecting to provide better access to publications and documents embedded in the web archives.


JEFFERSON BAILEY

Internet Archive

From Open Access to Perpetual Access: Archiving Web-Published Scholarship

In 2018, the Internet Archive undertook a large-scale project to build as complete a collection as possible of open scholarly outputs published on the web, as well as to improve the discoverability and accessibility of scholarly works archived as part of past global and domain-scale web harvests. This project involved a number of areas of work: targeted harvests of known open access publications; archiving and processing of related identifier and registry services (CrossRef, ISSN, DOAJ, ORCID, etc.); partnerships and joint services with projects working in similar domains (Unpaywall, CORE, Semantic Scholar); and development of machine learning approaches and training sets for identifying scholarly work in historical domain- and global-scale web collections. The project also identifies and archives associated research outputs such as blogs, datasets, code repositories, and other affiliated research objects. Technical development has included new crawling approaches, system and API development, near-duplicate analysis tools, and other supporting infrastructure.

Project leads will talk about their work on web harvesting, indexing, and access; the role of artificial intelligence and machine learning in these projects; joint service provisioning; and their collaborative work and partnership development with libraries, publishers, and non-profit organizations furthering the open infrastructure movement. The project will demonstrate how adding automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of specialized curation and ingest systems, the project has worked to identify the scholarly content already collected in general web collections, both those of the Internet Archive and of collaborating partners, and to implement automated systems that ensure at-risk scholarly outputs on the web are well collected and associated with the appropriate metadata. Conceptually, the project demonstrates that the scalability and technology of “web archiving” can facilitate automated content ingest and deposit strategies for specific types or domains of resources (in this case scholarly publishing, but also datasets, nanopublications, audio-video, or other non-documentary resources) that have traditionally been collected via more bespoke and manual workflows. Repositioning web collecting as an extensible and default technical approach to acquisition for all types of content has the potential to reframe the practice of web archiving as crucial to all areas of digital library and archive activities.


FERNANDO MELO

Arquivo.pt – Fundação para a Ciência e a Tecnologia

Searching images from the Past with Arquivo.pt

Arquivo.pt is a research infrastructure that enables search and access to information preserved from the Web since 1996. On the 27th of December 2018, Arquivo.pt made publicly available an experimental image search prototype (https://arquivo.pt/images.jsp?l=en).

This presentation will consist of a brief overview of the workflow of the Arquivo.pt image search system, followed by a demo and a presentation of initial usage statistics. Arquivo.pt image search enables users to input a text query and receive a set of image results that were embedded in web-archived pages.

The workflow of Arquivo.pt image search consists of three main steps, namely:

  1. Image extraction from ARC/WARC files;
  2. Image classification;
  3. SOLR indexing.

In step 1, images are extracted from ARC/WARC files. The input is a set of ARC/WARC files and the output is a set of JSON image indexes.

Each image index holds information about a specific image, such as its source URL, title, crawl timestamp and dimensions in pixels, as well as information about the page where the image was embedded, such as the page URL, timestamp and title. Arquivo.pt uses a Hadoop 3 cluster and a MongoDB sharded cluster to process large collections of ARC/WARC files and store the image indexes in a database.

In step 2, the extracted images are passed to a GPU cluster and automatically scored for being safe for work on a scale from 0.0 to 1.0 using neural networks. The input is the set of JSON image indexes from step 1 and the output is again a set of JSON image indexes, with a safe field added to each. All images with a safe score below 0.5 are considered Not Safe for Work and are hidden in default image searches.

In step 3, the JSON image indexes obtained from step 2 are indexed using Apache SolrCloud. The input is a set of JSON image indexes and the output is a set of Lucene image indexes (used by Solr). Once step 3 is concluded, the new images are automatically searchable using the Arquivo.pt image search system.
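To make step 1 concrete, the following is a minimal, illustrative Python sketch of building JSON image index records from an archived page's HTML using only the standard library. It is not Arquivo.pt's actual code, and the field names (imgSrc, pageURL, etc.) are assumptions, not the production schema.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageExtractor(HTMLParser):
    """Collects the attributes of every <img> tag found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src"):
                self.images.append(a)


def image_index_records(page_url, page_title, timestamp, html):
    """Emit one JSON-serializable index record per image embedded in the page.

    Field names are illustrative only; the real Arquivo.pt indexes also
    carry image dimensions and other metadata extracted from the WARC record.
    """
    parser = ImageExtractor()
    parser.feed(html)
    records = []
    for img in parser.images:
        records.append({
            "imgSrc": urljoin(page_url, img["src"]),  # resolve relative URLs
            "imgTitle": img.get("alt", ""),
            "pageURL": page_url,
            "pageTitle": page_title,
            "timestamp": timestamp,
        })
    return records


recs = image_index_records(
    "http://example.pt/news.html", "News", "20181227000000",
    '<html><body><img src="/logo.png" alt="Logo"></body></html>')
print(json.dumps(recs, indent=2))
```

In the real pipeline this extraction runs inside Hadoop jobs over whole ARC/WARC collections, with the records stored in MongoDB rather than printed.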

An API enabling automated access to the image search prototype is under development and openly available for testing (https://github.com/arquivo/pwa-technologies/wiki/ImageSearch-API-v1-(beta)).


SARA ELSHOBAKY & YOUSSEF ELDAKAR

Bibliotheca Alexandrina

Identifying Egyptian Arabic websites using machine learning during a web crawl

Identifying Egyptian Arabic websites while crawling is challenging, since most of them are not in the ‘.eg’ domain or are not hosted in Egypt. Generally, a crawl begins with initial seeds of curated URLs from which all possible links are followed recursively. In such a crawl, using content language as the means for deciding what to include could lead to crawling Arabic websites that are not Egyptian, because most Arabic websites use the same Modern Standard Arabic form that all native speakers uniformly understand.

A human curator can distinguish the website's country of origin from the overall character of its home page. Clues for such a judgement include the topics discussed, calendar differences between the Levant, the Gulf and other regions, and term usage. For example, the word “bank” is transliterated as-is in some countries, while the formal Arabic translation is used in others.

In the last few years, artificial intelligence, and especially machine learning, has made great strides in helping machines make better sense of the context and meaning of data. There are now various machine learning algorithms that can analyze a labelled training dataset in order to build a model. If the model is well designed and trained, it can provide accurate predictions for new, unseen input.

From that perspective, we worked to enhance the quality of Egyptian crawls using the power of machine learning. We started by collecting a few seed URLs from the ‘.eg’ domain and another set of seed URLs from other Arab country domains (e.g., ‘.sa’, ‘.ly’, ‘.iq’). Home pages were harvested and their HTML content parsed to extract only the plain text. After various pre-processing and normalization phases, features were extracted from the text based on TF-IDF (Term Frequency – Inverse Document Frequency) weights. The extracted features and their labels were used to train a linear classifier. The output of this process is a trained model that can identify whether a newly encountered Arabic website is Egyptian or not.
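The TF-IDF plus linear classifier pipeline described above can be sketched in a few lines of scikit-learn. This is a toy illustration, not the authors' code: the corpus below uses transliterated placeholder tokens standing in for the normalized Arabic page texts, and labels mark Egyptian (1) versus other Arab domains (0).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in corpus: in the real pipeline these would be normalized plain
# texts extracted from harvested home pages in '.eg' vs. '.sa'/'.ly'/'.iq'.
texts = [
    "masr cairo gineh bank",        # placeholder tokens, Egyptian pages
    "masr giza gineh akhbar",
    "riyadh riyal mamlaka akhbar",  # placeholder tokens, other Arab pages
    "baghdad dinar iraq akhbar",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear classifier, as in the described approach.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Predict whether a newly encountered page looks Egyptian.
print(model.predict(["cairo gineh masr"]))
```

In practice the evaluation would use a held-out split (as in the authors' 90/10 experiment) and report an F1-score rather than eyeballing single predictions.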

As proof of concept, an initial experiment used the Arabic content of 300 URLs equally divided between being Egyptian or not. From that dataset, 90% of URLs were used for training and 10% for testing. The resulting average F1-score is approximately 84%.

In the future, we plan to increase the training dataset and experiment with alternative machine learning algorithms and parameters to enhance the classification accuracy. In addition, we hope to apply the same method to identifying Egyptian websites of different languages.


IVY HUEY SHIN LEE & CHARLES WIJAYA

National Library Board Singapore

Sharing by the National Library Singapore on the journey towards collecting digital materials

The National Library, Singapore is a knowledge institution under the National Library Board (NLB). It has a mandate to preserve the published heritage of the nation through legal deposit and has been collecting works published in Singapore for the last 60 years. As the mandate was limited to physical items, NLB updated its legislation in 2018 to enable it to collect digital publications, including websites, to keep up with technological changes. This will strengthen its national collection and create a lasting legacy for future generations of Singaporeans.

The legislative review included a major policy and process reassessment. NLB studied other national libraries’ legislation, researched copyright issues, and conducted extensive public consultation to address stakeholders’ concerns. In July 2018, the Bill to amend the NLB Act was passed by the Singapore Parliament, empowering NLB to collect, preserve and provide access to Singapore websites and electronic publications, amongst other revisions. The changes are slated to take effect in early 2019.

As part of the preparation for the legislative review, NLB invested in systems and infrastructure enhancements to better process and support the web archiving collection. The Web Archive Singapore (WAS) (eresources.nlb.gov.sg/webarchives) is a portal that hosts NLB’s collection of archived Singapore websites. First launched in 2006, the original portal had only two functions: keyword search and subject browsing. In August 2018, the WAS portal was revamped with a new interface that includes five new functions: curation, full-text search, public nomination of websites, a data visualiser, and rights management. The supporting infrastructure was also enhanced with an in-house Task Management System to manage the selection, crawling and quality assessment of archived websites.

NLB adopts a multi-pronged approach to web archiving the nation’s published works online. First, NLB will conduct domain archiving of the more than 180,000 registered .sg websites. Next, it will selectively archive non-.sg websites via consent-seeking. These will be done in a systematic manner to ensure that the National Library does not miss websites of heritage and research value to Singapore. With the prevalence of social media, NLB has also been exploring and experimenting with the collection of social media content. More resources will be allocated to further explore social media archiving in the coming years.

This presentation will highlight the efforts NLB took to update the legislation and revamp its web archive portal, the planning that went into the .sg domain crawl, and other web archiving activities in the pipeline.


FRIEDEL GEERAERT & SÉBASTIEN SOYEZ

Friedel Geeraert, State Archives and Royal Library of Belgium
Sébastien Soyez, State Archives of Belgium

The first steps towards a Belgian web archive: a federal strategy

This paper focuses on the research project PROMISE that aims to set up a long-term web archiving strategy for Belgium.[1]

The project was initiated by the State Archives and the Royal Library in 2017 and will run until December 2019. The goals of the project are to 1) identify (inter)national best practices in the field of web archiving, 2) define and develop a Belgian web archiving strategy and policy, 3) pilot the web archiving service and 4) make recommendations for a sustainable web archiving service in Belgium. The State Archives and the Royal Library partnered with the universities of Ghent and Namur and the university college Bruxelles-Brabant to form an interdisciplinary team encompassing information professionals and legal and technical experts.

Cooperation is at the heart of the PROMISE project. Even though the State Archives and the Royal Library each work within their own legal framework (the Law on archives and the Law on legal deposit, respectively), they wish to create a Belgian web archive together and to share technical infrastructure and know-how. They have worked on a shared strategy based on the Open Archival Information System (OAIS) reference model in order to cover the entire web archiving workflow.

With regard to selection and curation, a dual approach has been chosen: selective crawls on the one hand and broad crawls on the other. The Royal Library and the State Archives have each created their own seed lists for the selective crawls. The State Archives focused on the websites of public institutions, while the Royal Library chose to select (parts of) websites based on themes related to its core functions and missions, such as Belgian comics or e-magazines. A shared model for descriptive metadata, based on OCLC recommendations, is used by both institutions for these selective collections. This choice ensures interoperability of the metadata so that it can be integrated into a shared access platform without compromising the use of specific metadata models based on archival or library principles. The broad crawl, on the other hand, is managed by both institutions together and consists of taking a representative sample of the Belgian web. Defining what can be considered the ‘Belgian web’ is one of the cornerstones of this task.

Given that web archiving is a new activity for both institutions, interesting lessons can be drawn from these first experiences with regard to organisational approaches and (training in) selection and curation. Curating collections of websites required a significant change in mindset for the cataloguers who worked on the seed list for the Royal Library, for example.

In conclusion, this paper will provide insight into the organisation of the pilot of the ‘Belgian web archive’, the collaborative strategy, the selection and curation of the first web collections and the lessons learnt.

[1] PROMISE (Preserving online multiple information: towards a Belgian strategy) is a BRAIN project financed by the Belgian Science Policy Office.


COREY DAVIS, CAROLE GAGNÉ & NICHOLAS WORBY

Corey Davis, Council of Prairie and Pacific University Libraries, (COPPUL) & the University of Victoria
Carole Gagné, Bibliothèque et Archives nationales du Québec
Nicholas Worby, University of Toronto

True North: the current state of web archiving in Canada

Under the auspices of the Canadian Association of Research Libraries (CARL), the Canadian Web Archiving Coalition (CWAC) is an inclusive community of practice within Canadian libraries, archives, and other memory institutions engaged or otherwise interested in web archiving. The Coalition’s mission is to identify gaps and opportunities that could be addressed by nationally coordinated strategies, actions, and services, including collaborative collection development, training, infrastructure development, and support for practitioners and researchers. In this session, members of the Coalition, including the Chair, will provide the international community with an update on national projects and initiatives underway in Canada, with a special focus in several key areas: the evolving collaborative collections development environment, the development of infrastructure for the repatriation and long-term preservation of web archives data in Canada, and the development of a Canadian copyright code of best practices for web archiving.


NICK RUEST & IAN MILLIGAN

Nick Ruest, York University
Ian Milligan, University of Waterloo

See a Little Warclight: Building an Open-Source Web Archive Portal with Project Blacklight

In 2014-15, thanks to close collaboration between UK-based researchers and the UK Web Archive, the open-source Shine project was launched. It allowed faceted search, trend-diagram exploration, and other advanced methods of exploring web archives. It had two limitations, however: it was based on the Play framework (which is relatively obscure, especially within library settings), and after the Big UK Domain Data for the Arts and Humanities (BUDDAH) project came to an end, development largely languished.

The idea of Shine is an important one, however, and our project team wanted to explore how we could take this great work and begin to move it into the wider, open-source library community. Hence the idea of a Project Blacklight-based engine for exploring web archives. Blacklight, an open-source library discovery engine, would be familiar to library IT managers and other technical community members. But what if Blacklight could work with WARCs?

The Archives Unleashed team’s first foray towards what we now call “Warclight” — a portmanteau of Blacklight and the ISO-standardized Web ARChive (WARC) file format — was a standalone Blacklight Rails application. As we began to realize this didn’t help those who would like to implement it themselves, development pivoted to building a Rails Engine, which “allows you to wrap a specific Rails application or subset of functionality and share it with other applications or within a larger packaged application.” Put another way, it allows others to use an existing Warclight template to build their own web archive search application. Drawing inspiration from UKWA’s Shine, it offers faceted full-text search, record view, and other advanced discovery options. Warclight is designed to work with web archive data indexed via the UK Web Archive’s webarchive-discovery project.

Webarchive-discovery is a utility that parses ARCs and WARCs and indexes them using Apache Solr, an open-source search platform. Once the ARCs and WARCs have been indexed into Solr, it provides searchable fields including title, host, crawl date, and content type.
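Once those fields are in Solr, a faceted search like Warclight's boils down to building Solr select queries over them. The sketch below is a hypothetical, stdlib-only illustration of such a query URL; the field names follow the list above, and the core name and base URL are assumptions, not the actual webarchive-discovery schema or a Warclight API.

```python
from urllib.parse import urlencode


def build_search_url(solr_base, query, host=None, content_type=None, rows=10):
    """Build a faceted Solr /select URL over a webarchive-discovery-style index.

    Field names (host, content_type, crawl_date) mirror the fields named in
    the text; the real schema may differ.
    """
    params = [
        ("q", query),                     # full-text query
        ("rows", rows),
        ("facet", "true"),                # enable faceting for the sidebar
        ("facet.field", "host"),
        ("facet.field", "content_type"),
        ("sort", "crawl_date asc"),
    ]
    if host:                              # narrow results to one site
        params.append(("fq", f"host:{host}"))
    if content_type:
        params.append(("fq", f'content_type:"{content_type}"'))
    return f"{solr_base}/select?{urlencode(params)}"


url = build_search_url("http://localhost:8983/solr/warclight",
                       "climate change", host="example.org")
print(url)
```

A discovery layer like Warclight generates essentially this kind of request on every search, then renders the returned documents and facet counts.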

One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open source community, which could allow us to go farther if we’re following the old idiom: “If you want to go fast, go alone. If you want to go further, go together.”

This presentation will provide an overview of Warclight and implementation patterns, including the Archives Unleashed at-scale implementation of over 1 billion Solr documents using Apache SolrCloud.


NIELS BRÜGGER & DITTE LAURSEN

Niels Brügger, School of Communication and Culture – Media Studies, Aarhus University
Ditte Laursen, Royal Danish Library

A national Web Trend Index based on national web archives

A number of historical studies of national webs already exist (Brügger and Laursen, in press), but systematic basic information about a national web and its changes over time is lacking. This could be information about the number of websites, of specific file types, or of hyperlinks to social media platforms, as well as information about hyperlink structures or the prevailing languages on a national web.

In this presentation, we will argue for the establishment of what we call a national Web Trend Index. Such an index can support future studies of the history of the web and be relevant for researchers, web archives, web companies, and civil society as an important source for understanding national webs and their historical development. The national Web Trend Index should provide metrics for how national web domains have developed over time, and it must be flexible enough to accommodate new metrics as the online web, the web collections, and the interests of all stakeholders change. The presentation will illustrate some of the most obvious metrics to include in such a national Web Trend Index, and we will outline how the index can be built using a systematic, transparent and reproducible approach. We will argue that a national Web Trend Index is best made and sustained in an organisational setup including curators, developers and researchers. Finally, transnational perspectives for a Web Trend Index are discussed.
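A minimal sketch of how a few of the candidate metrics could be computed (the page records, hostnames and the choice of metrics are invented for illustration; a real index would be derived from a national web archive's crawl data):

```python
from collections import Counter
from urllib.parse import urlparse

# Invented sample of archived-page records.
pages = [
    {"url": "http://a.dk/index.html", "links": ["http://facebook.com/x", "http://b.dk/"]},
    {"url": "http://b.dk/report.pdf", "links": []},
    {"url": "http://c.dk/news.html", "links": ["http://twitter.com/y"]},
]

SOCIAL = {"facebook.com", "twitter.com"}

def trend_metrics(pages):
    """Compute candidate index metrics: site count, file types, social links."""
    sites = {urlparse(p["url"]).hostname for p in pages}
    filetypes = Counter(p["url"].rsplit(".", 1)[-1] for p in pages)
    social_links = sum(
        1 for p in pages for l in p["links"] if urlparse(l).hostname in SOCIAL
    )
    return {"sites": len(sites), "filetypes": filetypes, "social_links": social_links}

metrics = trend_metrics(pages)
print(metrics)
```

Run over successive annual crawls, such counts would yield the time series the index is meant to expose.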

Brügger, N., & Laursen, D. (Eds.) (In press). The historical web and Digital Humanities: The case of national web domains. Routledge.


JASON WEBBER

The British Library

Using Secondary Datasets for researchers under a Legal Deposit framework

The UK Web Archive (UKWA) is a partnership of the six UK Legal Deposit Libraries that has attempted to collect the entire UK web space at least once per year since 2013. This material is collected under the Legal Deposit Libraries Act 2003 and the subsequent non-print legal deposit regulations, which allow UKWA to archive, without permission, all digitally published material that can be identified as UK owned or based. This generates millions of websites and billions of individual assets, all of which are indexed. This vast resource is, however, strictly only viewable on the premises and within the control of the UK Legal Deposit Libraries.

Whilst UKWA has developed a new interface that makes searching the Legal Deposit collection possible, it doesn’t remove the significant barrier for researchers of having to come to a library, apply for a reader’s pass (not simple) and use a library terminal under strict viewing conditions. And this is only the barrier for researchers wanting to look at a few web pages. There is currently no easy-to-use facility for researchers wanting to do big data analysis across the whole Legal Deposit collection, in large part due to having to do that research on-site at a library.

A possible (and partial) solution is the use of secondary datasets. UKWA is legally unable to supply researchers with the actual websites or text or, in fact, anything that can be used to reconstruct the original works. What is possible, however, is to supply facts about the collection, and these facts can be incredibly valuable to researchers.

This presentation will discuss two use-case projects in which researchers, with help from UKWA staff, created and utilised secondary datasets. The first project used geographical data extracted from the UK web and compared it to information available on businesses through Companies House. The second project used an algorithm to attempt to identify the polarity of words in the UK over time – how words may have changed their meaning.

The use of secondary datasets within web archiving can potentially solve the difficult legal position of many national libraries that collect under legal deposit or other strict access conditions. This presentation will, in part, be a call for more work to be done in this area to create environments in which researchers can either work with existing datasets or create their own.


SUMITRA DUNCAN & LORI DONOVAN

Sumitra Duncan, Frick Art Reference Library
Lori Donovan, Internet Archive

Advancing Art Libraries: developing a collaborative national network for curated web archives

In mid-2018 the Internet Archive and the New York Art Resources Consortium (NYARC)—which consists of the Frick Art Reference Library of The Frick Collection and the research libraries of the Brooklyn Museum and The Museum of Modern Art—received a one-year National Forum grant from the Institute of Museum and Library Services (IMLS) in the Curating Collections project category: Advancing Art Libraries and Curated Web Archives: A National Forum. As part of this project, a National Forum and workshop will convene in February 2019, at the San Francisco Museum of Modern Art (SFMOMA), with librarians, archivists, and curators attending from diverse organizations, many of which are active members of the Art Libraries Society of North America (ARLIS/NA).

This project began with an initial round of outreach, research, and reporting that identified and summarized the challenges, opportunities, and potential areas for collaboration within the North American art and museum library community. Convening at the National Forum will allow this group of approximately 50 art librarians and archivists to coordinate current collection development practices, assess resource and program needs, and map out a national network for future collaborations and service models.

The Advancing Art Libraries and Curated Web Archives project builds naturally upon the Internet Archive’s more than 20 years of experience in web archiving and community building around digital stewardship, as well as NYARC’s successful program of art-specific web archiving. It leverages this joint expertise with a plan of action to catalyze the art and museum library community and create a roadmap for a sustainable national program of art-specific web archives.

A coordinated effort on program development at a networked level will ensure that at-risk born-digital art documentation and information will be collected, preserved, and remain accessible as a vital resource for current and future research. It is a central objective for NYARC and the Internet Archive to disseminate not only the research and resulting publications from this project, but also to share the resulting roadmap and collaborative model beyond the North American art library community and with those involved in web archiving efforts via the International Internet Preservation Consortium (IIPC). In this presentation, members of the project team, Sumitra Duncan, Head of the Web Archiving Program for NYARC, and Lori Donovan, Senior Program Manager, Web Archiving at the Internet Archive, will share key takeaways resulting from this initiative.


JESSICA CEBRA

Stanford University

Describing web archives: a standard with an identity crisis?

I’ve engaged with web archives for about one year in the role of metadata management librarian. At Stanford, our basic metadata requirements for archived websites are generally modeled after records for other digital resources in the Stanford Digital Repository, but with some tailored fields unique to captured web content. As I began to delve into this new world, with the perspective of a trained archivist, I was struck by the prevalence of bibliographic-oriented descriptive practice across institutions, and wondered: where is the archival description in web archives?

In the web archives community, recent publications of recommendations, best practices, and metadata application profiles promote consistency and tackle the challenges of describing web content for discovery (other efforts to update descriptive standards to encompass born-digital materials are also notable). While some of the recommended approaches claim to bridge and blend both bibliographic and archival description, they are primarily bibliographic in nature.

In light of these developments, along with recent literature highlighting user needs and what users deem missing from descriptive information, this paper examines existing descriptive records for a diverse sampling of web archives and their employment of bibliographic and/or archival description standards – and, ultimately, what “useful” information is gained or lost in a comparison of these approaches. As expected, description is often about the website content itself, but there is a rising call for more transparency in how and why the content was captured, since the collector is involved in shaping the focus of a collection and configuring the crawls, ultimately intervening in the way a website plays back (though technical limitations often play a part in a memento’s “incompleteness”). Could this descriptive gap be filled with something akin to an archival ‘acquisition information’ or ‘processing information’ note that provides contextual and technical details, from seed selection criteria to the crawling tools and techniques used in the processing of the material?

At Stanford, collectors of web content are librarians, archivists, and other academic staff, known as the Web Archivists Group. It is my hope that this paper will spark a more focused and informed conversation, not only within the group, but in the broader community as well, about what descriptive information is useful, and to whom? And, to apply those decisions to our descriptive practice as it evolves moving forward.


SABINE SCHOSTAG

Royal Danish Library

Why, what, when and how: Curatorial decisions and the importance of their documentation

The Royal Danish Library is obliged by the legal deposit law to collect and preserve the Danish web sphere. No one will ever be able to archive the entire web or one nation’s entire web sphere. You have to make choices – and, in particular, to keep track of your decisions. This is not only of prime importance for the curators’ work, but is definitely also to the benefit of users and researchers.

To master the task of archiving the Danish web sphere, the web curator team laid down a strategy: up to four broad crawls (snapshots) a year and a number of ongoing selective crawls. Our in-house developed curator tool, NetarchiveSuite, does not offer enough functionality and space for documenting all our decisions, in particular not for the selective crawls.

Thus, we had to decide on a documentation tool. We wanted a tool that:

  • Was easy to access for all involved persons
  • Was easy to edit, make changes, add content
  • Offered the options to document
    • selections and deselections with reasons for decisions
    • start and end of crawl periods,
    • QA observations and follow-ups

We built an internal folder system within the Windows file explorer. The folders represented the different steps of the workflow for selective crawls predetermined by the curators: identification of a domain to be crawled selectively, initial examination, analysis, data entry, quality assurance, monitoring and follow-up. We created a Word template and filled in a copy for each selectively crawled domain. Then we moved the documents around in the folder system according to their stage in the workflow. However, opening, editing and moving the documents between the folders according to the workflow required us to watch our step, and it soon became rather difficult to handle. We started by moving the content of all domain documents to wiki pages in MediaWiki (https://www.mediawiki.org/) and ended up migrating all our documentation to the Atlassian products Jira (https://www.atlassian.com/software/jira) and the Confluence wiki (https://www.atlassian.com/software/confluence). An important factor in this choice was access management: we can assign individual access for every single page, or deny it for pages with private content.

We converted the workflow for selectively crawled domains into a modified Jira space (issue tracker). The status of an issue represents its step within the workflow, and each selectively crawled domain became an issue.

In this way, we now have a flexible documentation system, particularly with regard to the selective crawls. By using a range of components (such as “with paywalls”, “uses https protocol”, “uses advanced JavaScript”, etc.), which can be added to any issue (domain), we can easily group the selectively crawled domains according to different challenges and, for instance, forward problems to be solved by a developer to the developers’ Jira space within the system.


LORENZ WIDMAIER

Cyprus University of Technology

Divide-and-conquer – artistic strategies to curate the web

Twenty-five thousand photographs are in the collection of MoMA New York. In contrast, 95 million photographs are uploaded to Instagram each day. Digitisation strategies of memory institutions are often merely about digitising physical objects, made accessible in databases like Europeana. If born-digital content is taken into consideration, it is often treated in the same manner as physical objects, differing only in the techniques needed for access and storage.

These techniques are indeed needed. Nevertheless, we should shed light on the inherent character of born-digital content, and its genesis within an algorithmic, data, social media, platform, or networked society. Aleida Assmann argued that a focus on storing is not enough, considering the inexorably growing amount of data, and pointed to the importance of forgetting. The selection process of what to remember and what to forget cannot be entirely outsourced to search engines. Instead, professional ‘gatekeepers’ are still needed (Assmann 2018: 202f). How can the GLAM sector engage with the masses of volatile and dynamic born-digital content?

We will take a look at artistic strategies for dealing with the flood of data within the digital society. What can archivists learn from these approaches? We examine works using Google Street View images to tell stories about the world, like Michael Wolf’s “Interface” or the “Agoraphobic Traveller”. We will listen to “Quotidian Record”, a soundtrack created by Brian House using a year of his location-tracking data. Lev Manovich’s argument for “cultural sampling” will be contrasted with “@paintguide” by Henrik Uldalen and Daniela Bezdan. We will explore the remarkable archive of Peter Piller and the work of Mishka Henner, who deals with satellite images and YouTube videos. As an example of revealing the hidden parts of the Internet, we will see a shopping bot created by “!Mediengruppe Bitnik”, which automatically bought random items on the darknet.

Can these artistic practices, their specific form of web curation, help the GLAM sector to discover new methods to archive the web? In a way that goes beyond storing websites as one-to-one copies? What can archivists as ‘gatekeepers’ learn from artistic strategies to curate and preserve an intelligible story of the web?

Assmann, Aleida (2018): Formen des Vergessens. Sonderausgabe. Bonn: bpb Bundeszentrale für Politische Bildung (Schriftenreihe / Bundeszentrale für Politische Bildung, Band 10296). p. 202f


MIROSLAV MILINOVIĆ & DRAŽENKO CELJAK

SRCE – University of Zagreb University Computing Centre

From Web Measurement to Archiving – Technical Development of Croatian Web Archive Over the Past 15 Years

Although the Croatian Web Archive was launched in 2004, the development of the appropriate tools started earlier, with the Croatian Web Measurement project (MWP). In 2002 the team from SRCE – University of Zagreb University Computing Centre measured the Croatian web for the first time. The goal was to estimate the size and complexity of the Croatian web and to acquire basic information about its content. For that purpose, the team developed custom software. Building on the experience gained, and in cooperation with the National and University Library, the software developed for the MWP project was extended with web capturing and archiving capabilities. In addition, a web interface was developed for configuring and managing the capturing process.

Based on those early results, the Croatian Web Archive was officially launched in 2004 as the system for gathering and storing legal deposit copies of Croatian web resources with scientifically or culturally relevant content. Selective capturing of web resources was based on the National and University Library’s online catalogue. Over time, selective capturing was complemented with domain harvesting and thematic harvesting features.

Today the Croatian Web Archive contains a collection of more than 63,000 instances of web sites as a result of selective capturing, 8 domain harvests and 10 thematic harvests. All of that content is available online, for end users and for services, via an OAI-PMH interface. End users can browse and search the Archive’s content using various criteria.
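As an illustration of the kind of OAI-PMH access described above (the base URL and record identifiers below are invented for the example; OAI-PMH itself standardises the verbs and parameter names), a harvesting service might build a ListRecords request and read record identifiers from the response:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical base URL; the real endpoint is operated by the archive.
BASE = "https://haw.example.hr/oai"

def list_records_url(base, metadata_prefix="oai_dc", set_spec=None):
    """Build a standard OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base + "?" + urlencode(params)

# A minimal sample response, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:haw:1</identifier></header></record>
    <record><header><identifier>oai:haw:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(SAMPLE)
ids = [h.text for h in root.findall(".//oai:identifier", NS)]
print(list_records_url(BASE), ids)
```

A real harvester would additionally follow the protocol's resumptionToken to page through large result sets.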

The first harvesting of the Croatian top-level internet domain .hr took place during July and August 2011. At that time, we gained valuable experience which allowed us to enhance the architecture of the harvesting part of the system to make it more efficient and faster.

This talk puts emphasis on experiences gained through the process of planning, execution and analysis of the results of web harvesting, selective web capturing and web measurement. We present the technical challenges we have encountered over time and remedies we used to ensure desired functionalities of the Croatian Web Archive.


MÁRTON NÉMETH & LÁSZLÓ DRÓTOS

National Széchényi Library

Metadata supported full-text search in a web archive

Content from web archives can be retrieved at various levels. The simplest solution is retrieval by URL; however, in this case we must know the exact URL address of the archived webpage in order to retrieve the desired information. The next level is to search on the title of a homepage (or other metadata elements found in its source code). The text of links pointing to a website can also be made searchable. In these cases, however, relevant hits can only be retrieved at the level of individual websites. Although metadata can also be extracted from various archived file types (such as HTML and PDF), in our experience such metadata are often missing, and even where they exist they are sometimes too general or ambiguous. Searching on exact, narrow topics is therefore only possible with a full-text search function. In this case, ranking by relevance is the biggest challenge. Google has a ranking algorithm that has been developed over 20 years and uses more than 200 parameters; the company is also building an enormous database of search and retrieval preferences, interactions and other user-based features. These algorithms and databases are not available to national libraries.

In the course of the Hungarian Web Archiving Project we have started an experiment to find out how website-level metadata recorded by librarians (e.g. genre, topic, subject, uniform title) can be used for filtering result lists generated by full-text search engines, how to refine search queries, and how to display hits in a more comprehensive and user-friendly way.

In the first part of the presentation we offer a brief overview of the metadata structure currently used at the National Széchényi Library. This schema follows the recommendations of the OCLC Web Archiving Metadata Working Group. We then briefly present the SolrWayback search engine developed by our Danish partners, which is currently running and being tested on our demo collection. Next, we introduce another Solr-based search system, developed at the National Széchényi Library, that can retrieve and take into account data from XML-based metadata records. In the last part of our presentation we offer an overview of some future opportunities for metadata enrichment using information automatically retrieved from namespaces and thesauri. In this way we could add a semantic layer to the search and retrieval process of web archives.
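A minimal sketch of the filtering idea (all URLs, hosts and field names are invented for illustration): full-text hits are joined, by host, against website-level metadata recorded by librarians, and only hits whose site matches the requested criteria are kept:

```python
from urllib.parse import urlparse

# Invented full-text search hits and librarian-recorded site metadata.
fulltext_hits = [
    {"url": "http://lit.example.hu/poem1", "score": 7.2},
    {"url": "http://news.example.hu/a/1", "score": 6.9},
    {"url": "http://lit2.example.hu/essay", "score": 5.1},
]

site_metadata = {
    "lit.example.hu": {"genre": "literature", "topic": "poetry"},
    "news.example.hu": {"genre": "news portal", "topic": "politics"},
    "lit2.example.hu": {"genre": "literature", "topic": "essays"},
}

def filter_by_metadata(hits, metadata, **criteria):
    """Keep only hits whose site-level metadata matches all criteria."""
    kept = []
    for hit in hits:
        host = urlparse(hit["url"]).hostname
        record = metadata.get(host, {})
        if all(record.get(k) == v for k, v in criteria.items()):
            kept.append(hit)
    return kept

literary = filter_by_metadata(fulltext_hits, site_metadata, genre="literature")
print([h["url"] for h in literary])
```

In a Solr-based system the same join would typically be expressed as filter queries over the indexed metadata fields rather than post-processing, but the principle is identical.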


RAFAEL GIESCHKE & KLAUS RECHERT

University of Freiburg

Preserving Web Servers

Preserving Web 2.0 sites can be a difficult task. For the most basic web sites (“static web pages”), it is sufficient to preserve a set of files – a task which can also be done from outside the system using a harvester – and serve them using any web server. With the advent of the so-called Web 2.0, a harvesting approach is limited, as these sites use server-side logic to process the user’s requests. While in some cases, especially if the range of inputs is known and fixed, a technique of recording HTTP requests and their respective responses (as used by webrecorder.io) can be employed, for more advanced and especially interactive cases traditional harvesting techniques have their limitations. These cases include retired content management systems, intranet servers, database-driven web frontends, scientific (project) web servers with functional services (WS/REST/SOAP), digital art, etc.

We present a concept for preserving the computer systems running the web servers themselves instead of only harvesting their output. This approach promises a more complete preservation of the original experience. The preserved systems are then accessible on demand using the Emulation as a Service framework. One of the main challenges for access workflows is the security of the archived machines: as these machines are archived to remain in their original state, a (permanent) Internet connection could be harmful. We present a solution for securely relaying the requests of a user’s web browser (or any other web client) to these emulated web servers. Different access scenarios are supported, e.g. using a current web browser, orchestrated access using an emulated web browser (e.g. for web sites featuring Adobe Flash or Java applications), as well as a “headless” mode for script or workflow integration.
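The relay idea can be pictured as a simple routing table (hostnames and addresses below are hypothetical, not part of the Emulation as a Service framework): requests from the user's browser are resolved against the set of emulated servers, and anything outside that set is refused rather than fetched from the live Internet:

```python
# Sketch of the relay idea: map requested hosts to emulated backends;
# unknown hosts are blocked so the archived machines never touch the
# live Internet.
EMULATED_HOSTS = {
    "www.example-art-project.org": ("10.0.0.17", 8080),  # emulated machine
    "intranet.example.edu": ("10.0.0.23", 80),
}

def route(host):
    """Return the emulator backend serving this host, or None to block."""
    return EMULATED_HOSTS.get(host)

print(route("www.example-art-project.org"))  # relayed to the emulator
print(route("cdn.live-tracker.com"))         # blocked: live web unreachable
```

A production relay would sit behind an HTTP(S) proxy and rewrite connections accordingly; the table above only illustrates the isolation policy.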

For a more complete user experience, integrating the presented techniques with traditional harvesting techniques is the next necessary step. For instance, a preserved web server might itself depend on external data from other web pages no longer accessible on the live Internet which have nevertheless been preserved by a harvester, and vice versa, so that a new level of orchestration across the various web preservation and access methods becomes necessary.


GIL HOGGARTH

The British Library

The Infrastructure behind Web Archiving at Scale

This conference promotes the value of web archiving and explains the services and tools used to create such systems. However, anyone who has ever ventured to put these components together for a production service (or even in preparation for one) will appreciate the complexity of the challenge. And before that first production service exists, the size of the task – especially in terms of the volume of data being handled – is often underestimated.

This presentation will delve into the conceptual areas of a production web archiving service that can manage both the volume of data and the impact that volume has on processing. These high-level areas include:

– The management of website targets, crawl dates, access licence/s, inclusion into subject collections
– Web site crawling
– Storage of crawled data as WARC files
– The link between website URLs and WARC records, handled by a CDX service
– Website presentation via a wayback player
– Making the crawled data searchable
– Managing access to the crawled website data

During the presentation an over-arching infrastructure should become clear, helping individuals and institutions alike to appreciate the necessary, and optional, components that make up a web archive service. After the presentation, this visual overview will be made available for attendees of the conference to consider and annotate, so that it becomes an enriched record of the components used by the (attending) web archive community.
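To illustrate the CDX component named in the list above (the field layout is simplified and the sample data invented; real CDX files vary between flavours), a lookup maps a canonicalised URL key to the WARC files and offsets holding its captures:

```python
# Simplified sample: urlkey, timestamp, original URL, MIME type, status,
# WARC file, byte offset.
SAMPLE_CDX = """\
uk,bl)/ 20130401120000 http://www.bl.uk/ text/html 200 crawl-2013.warc.gz 3048
uk,bl)/ 20140612093000 http://www.bl.uk/ text/html 200 crawl-2014.warc.gz 9812
"""

def lookup(cdx_text, urlkey):
    """Return (timestamp, warc_file, offset) for every capture of urlkey."""
    captures = []
    for line in cdx_text.splitlines():
        fields = line.split()
        if fields and fields[0] == urlkey:
            captures.append((fields[1], fields[5], int(fields[6])))
    return captures

captures = lookup(SAMPLE_CDX, "uk,bl)/")
print(captures)
```

This is the link the presentation describes between website URLs and WARC records: the wayback player asks the CDX service where a capture lives, then reads that record directly from the WARC file.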


TOBIAS BEINERT, MARKUS ECKL & FLORENCE REITER

Tobias Beinert, Bavarian State Library
Markus Eckl, University of Passau, Department of Digital Humanities
Florence Reiter, University of Passau, Jean Monnet Chair for European Politics

Archiving and Analysing Elections: How can Web Archiving, Digital Humanities and Political Science go together?

The Bavarian State Library has been running selective web archiving activities since 2011; however, academic use of the archived objects is still constrained to a reading-based approach. Additionally, collection development for the web archive has up to now been based on the decisions of the library’s staff.

To overcome these shortcomings, the Bavarian State Library has teamed up with experts from the Chair of Digital Humanities and researchers in political science from the Jean Monnet Chair at the University of Passau. In a joint study, methods and software from the Digital Humanities, as well as experimental tools developed in the web archiving community, will be applied to datasets from web archive collections. The focus lies on testing innovative and intuitive ways of accessing web-based resources and on implementing approaches for automated and user-based collection development. To prove that these methods and instruments are useful in academic settings, a case study on the Bavarian state election (2018) and the European Parliament election (2019) is being conducted. The case study explores if and how web archives can empirically answer the question of how political actors and parties frame the European Union throughout their election campaigns, i.e. in which regard issues are labelled as “European” or “national”. For studying the EU, the framing perspective is particularly relevant: according to Jörg Matthes, frames can be defined as “selective views on issues – views that construct reality in a certain way leading to different evaluations and recommendations”, so the concept is useful when dealing with political communication.

The presentation gives a first insight into the challenges of an election event crawl and the first steps of preparing and analysing the produced data. It will illustrate and evaluate the tools (Web Curator Tool/Heritrix, Webrecorder) used for crawling different sources (websites, social media, news sites) and for data analysis. The perceived gap between crawling the web as a means of standard collection development in a library and producing datasets for specific research purposes will also be addressed, as it can help lay the basis for a scientific theoretical framework for web archiving. The paper therefore not only discusses the results of the case study, but also addresses methodological questions along the research process, focusing on the interplay between the specific disciplinary questions and requirements of libraries, political science, and digital humanities methods.

We thus aim to contribute both to the debate on the scientific value of web archives in general and to the question of which methods are suitable for research in web archives.



LYNDA CLARK

British Library/Nottingham Trent University

Emerging Formats: Discovering and Collecting Contemporary British Interactive Fiction

The British Library’s ongoing Emerging Formats project seeks to archive complex digital works and, crucially, make them accessible. This case study is concerned with one type of emerging format, digital interactive fiction. It seeks to provide an overview of contemporary British interactive fiction, while highlighting the difficulties (and potential solutions) associated with selecting, archiving and offering access to such varied and complex material.

Digital interactive fiction, like many digital technologies, is an area of rapid growth and change. New tools and sharing platforms for writers and readers of interactive fiction emerge, while others become obsolete, or are lost altogether. In 2017, interactive fiction platform Intudia launched, while in August 2018, Inkle’s Inklewriter was officially shut down (although for now it remains online). ChoiceScript, Twine and Quest continue to be in general use for readers and writers of interactive works, while SubQ remains the only major paying online magazine publishing interactive work.

Since ‘[n]ot all groups or individuals creating publications define themselves as “publishers” and may not view their work in terms of a “publication”’ (Smith and Cooke, 2017, p. 176), there is very little standardisation in terms of production processes and archiving. Most major works appear on the Interactive Fiction Database, but the amount of information provided for each entry varies wildly, with some entries providing merely a record of a work’s existence but no means to play or view it. Creators using tools in unusual ways and seeking to subvert genre or format expectations further complicate matters for archivists, researchers and readers alike.

This case study attempts to capture a snapshot of the digital interactive fiction platforms and tools in use across the UK and suggest how these works might be retained for future researchers and readers.


RADOVAN VRANA & INGE RUDOMINO

Radovan Vrana, Faculty of Humanities and Social Sciences, University of Zagreb
Inge Rudomino, National and University Library in Zagreb

Croatian Web portals: from obscurity to maturity

Over two and a half decades have passed since the introduction of the Internet in Croatia and since the country’s top-level domain .hr came to life. The very first Croatian web sites within the .hr top-level domain were those of several Croatian academic institutions, as few other institutions or individuals had access to the internet or had a web site at that time. The number of newly published web sites soon grew as the web became more widely available as a publishing platform. This was also the era in which the first Croatian portals appeared. Around the time the first Croatian portals appeared, in 1997, the Law on Libraries in Croatia was passed. It introduced a new amendment to the legal deposit provision, extending legal deposit to online publications such as web portals. The Law on Libraries was also the basis for the development of the Croatian Web Archive, a joint project of the National and University Library in Zagreb and the University Computing Centre, University of Zagreb, whose tasks are to collect, store and give access to online resources. The content of the Croatian Web Archive is harvested daily, weekly, monthly, annually, etc. Additionally, annual harvestings of the top-level domain (.hr) and thematic harvestings of important events in Croatia are conducted throughout the year. These harvestings also include web portals, the most dynamic form of web sites, changing from one harvesting event to the next.

Our focus will be on web portals as the most dynamic form of web sites and on the changes they have gone through over time, as observed in the Croatian Web Archive. The analysis will show changes in design, content layout, URLs, titles, etc., with the aim of establishing the major development phases the Croatian web portals have undergone.


ANDREW JACKSON

The British Library

The Technological Evolution of the UK Web Archive

Our curators and researchers select sites from the live web: telling us what and how we should harvest, and building up collections and other metadata to describe what we’ve got. But we also need to curate the _archived web_.

When researchers work with us to understand the past, or when our decade-old curatorial efforts have to be revisited to reflect today’s needs, we need to break apart the data from the metadata and reshape how we describe what we’ve crawled. The metadata that made sense to the curator back when we harvested the material is not necessarily what today’s user needs in order to make sense of what we’ve got.

This presentation will look back over more than a decade of curator-driven crawling at the UK Web Archive, and explore how Legal Deposit, technological changes, and the process of developing a new user interface have forced us to revisit how we describe our holdings and led to new ways of handling how we manage our metadata. Instead of our metadata being closely bound to the crawl process we have begun to separate the two, starting by turning the crawl metadata into annotations of the live web that can be used to direct the crawl.

More recently, we have found we need to go further, and separate the metadata we need to _drive the crawls_ from the metadata we use to _describe what’s crawled_, with the latter re-framed as annotations on the _crawled material_ rather than associated with the crawl target or the live web. This new approach breaks with the past but will make it possible for our colleagues, readers and researchers to understand, cite and describe both the current and archived web in a consistent and interoperable manner.


JULIEN NIOCHE & SEBASTIAN NAGEL

Julien Nioche, CameraForensics
Sebastian Nagel, CommonCrawl

StormCrawler at Common Crawl and other use cases

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers with Apache Storm. This talk will introduce the project and its main features as well as the eco-system of tools that can be used alongside it. We will give a few examples of how it is being used by different organisations around the world with varying volumes of data.

The second part of the talk briefly presents the Common Crawl News data set, a continuously growing collection of news articles from news sites all over the world. We demonstrate how StormCrawler is used as an archiving crawler and adapted to fit the requirements: the utilisation and detection of news feeds and sitemaps, the prioritisation of recently published articles, and the challenge of avoiding crawling historic news from the archive sections of news sites and agencies.
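The prioritisation described above can be sketched as follows. This is a minimal illustration in Python rather than StormCrawler's actual Java API; the URL patterns and the age-based score are assumptions made for the example, not Common Crawl's real configuration.

```python
from datetime import datetime, timezone

# Hypothetical URL fragments hinting at historic "archive" sections
# of news sites, which the crawler should avoid (assumption).
ARCHIVE_HINTS = ("/archive/", "/archives/", "/year/")

def crawl_priority(url, published, now=None):
    """Score a discovered article URL for fetching.

    Returns None to skip likely historic-archive pages, otherwise the
    article's age in hours: the smaller the score, the sooner we fetch,
    so recently published articles are prioritised.
    """
    if any(hint in url for hint in ARCHIVE_HINTS):
        return None  # do not enqueue historic news from archive sections
    now = now or datetime.now(timezone.utc)
    return (now - published).total_seconds() / 3600.0
```

A fetch queue sorted ascending by this score would then naturally serve fresh articles first, which matches the "prioritisation of recently published articles" requirement described in the abstract.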


ELÉONORE ALQUIER

French Audiovisual Institute (INA)

From linear to non-linear broadcast contents: considering an “augmented audiovisual archive”

The French Audiovisual Institute (INA) has the mission to collect, preserve, restore and communicate France’s radio and television heritage. The law of 20 June 1992 gave INA responsibility for the legal deposit of broadcast audiovisual materials. Since 2006, when the French legal deposit was extended to online public web content, INA’s dlweb (“dépôt légal du web”) team has been responsible for the collection and preservation of French audiovisual (AV) and media-related web content.

Considered at first as additional documentation for broadcast media, that scope has evolved over time to become a full-fledged media extension, a major replay platform and even a broadcasting channel, gradually shaping its own editorial codes and logic. In consequence, INA’s methods for documenting and accessing TV and web archives have adapted to the evolving media ecosystem, highlighting interactions between TV, radio and the web in terms of the production and consumption of these media. In parallel, as INA’s archive data models are being redefined and new documentation tools developed, the opportunity to better articulate web and broadcast archives was seized.

Beyond sourcing, archiving and giving access to audiovisual websites since 2009, INA has thus expanded its collections to video and social network publications, and developed specific tools and methods to improve the user (researcher) experience. This archive is now approaching 80 billion items, with 880 million tweets and over 2 million hours of video. Dedicated access tools have been developed, including search engines and assisted browsing. The necessity to enhance the relation between this huge mass of information and INA’s TV and radio collections quickly emerged, and the current sourcing, curation and access methods reflect this need, especially regarding the Twitter archive.

This submission aims to present the evolution of (traditionally linear) audiovisual archiving methods, acknowledging the necessity of considering related web content: why and how broadcasters tend to create non-linear content, and the impact of these new practices on collecting, documenting, curating and accessing this so-called “augmented archive”. Issues will be tackled such as how to guarantee the coherence of audiovisual collections when a linear medium tends to produce more and more web-exclusive content, or what the impact is on the expected skills of information professionals.

The way broadcasters give and develop new access to their contents has to be taken into account to define, not only the processes of collecting and archiving, but also the design of user interfaces and tools that an archive institution will provide. From collecting to curation, broadcasters’ online practices challenge our ability to adapt to a permanently evolving archive.


DANIEL GOMES

Arquivo.pt – Fundação para a Ciência e a Tecnologia

Arquivo.pt Memorial and other goodies

Arquivo.pt is a research infrastructure that preserves information gathered from the web since 1996 and provides public online services to explore this historical data. Arquivo.pt contains over 4 billion pages collected from 14 million websites in several languages and provides user interfaces in English. In 2018, Arquivo.pt received over 170 000 users, 78% of whom originated outside of Portugal. Despite focusing on the preservation of the Portuguese web, Arquivo.pt has the inherent mission of serving the scientific community, so it also preserves selected international websites. The search and access services over the archived data have been stable since 2016. This presentation will highlight the main innovations made during the past 3 years to develop new services and expand the user community.

Efforts were focused on developing added-value features that would extend the utility of Arquivo.pt to new usage scenarios, such as robustify.arquivo.pt, which is based on René Voorburg’s robustify.js project, or the Arquivo.pt Memorial.

Robustify.arquivo.pt is a mechanism that automatically fixes broken links in web pages. When a broken link is detected, it offers web users the option to access a previous version of the linked web page that was preserved by a web archive. Webmasters just need to add a single line of code to benefit from this feature on their websites and can stop worrying about fixing the numerous broken links that arise among their older pages.
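The underlying idea can be sketched in Python as a minimal illustration. Robustify.js itself is a client-side JavaScript include; the replay URL pattern, timestamp format and function names below are assumptions for the sketch, not the actual robustify.arquivo.pt API.

```python
def archive_fallback(url, timestamp, replay_base="https://arquivo.pt/wayback"):
    """Build a replay URL for a broken link, pointing at the version
    preserved closest to the given YYYYMMDDhhmmss timestamp.
    The URL pattern shown is illustrative."""
    return f"{replay_base}/{timestamp}/{url}"

def fix_link(url, is_broken, timestamp):
    """Leave live links untouched; redirect broken ones to the archive."""
    return archive_fallback(url, timestamp) if is_broken else url
```

In the real mechanism the brokenness check and rewriting happen in the visitor's browser, but the decision logic is essentially this: serve the live link when it still resolves, and fall back to an archived snapshot when it does not.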

There are websites that are no longer updated with new content but have to be kept online because they provide important information, such as websites that document finished projects. However, the cost of maintaining these stale websites increases over time due to the obsolescence of the technologies that support them, which very often causes dangerous security vulnerabilities. The Arquivo.pt Memorial offers high-quality preservation of websites’ content with the possibility of maintaining their original domains. For example, UMIC – Knowledge Society Agency was a public institute that existed from January 2005 to February 2012, and its website was deactivated in 2017. However, its official domain www.english.umic.pt remains active but references a version preserved in Arquivo.pt.

The strategy adopted to extend the user community focused on stimulating training among power users so that they become disseminators of Arquivo.pt among their own communities. A training program about web preservation was put in place (arquivo.pt/training) with the objective of raising awareness of the importance of preserving online digital heritage. In 2018, we launched our first training activity for an international audience with a tutorial named “Research the Past Web using Web archives” as part of the TPDL conference. At the same time, other dissemination activities were performed, such as the production of training videos, regular activity on social network channels posting links to web-archived pages tied to a calendar of national and international celebrations, the creation of collaborative collections (e.g. national elections) and a public exhibition of posters highlighting historical web pages.


MARIA PRAETZELLIS, SHARON MCMEEKIN & ABBIE GROTKE

Maria Praetzellis, Internet Archive
Sharon McMeekin, Digital Preservation Coalition
Abbie Grotke, Library of Congress

Building the IIPC Training Program

This presentation will showcase the outcomes of IIPC’s Training Working Group (TWG), which is building a high-quality web archiving curriculum for IIPC members, web archivists, and technologists engaged in preserving web materials. Practitioners have varying approaches to archiving, reflecting different institutional mandates, legal contexts, technical infrastructure, etc., but share a need for expert training models and instructional methods. The TWG has been funded by the IIPC to fill this need by creating a series of openly accessible educational materials and training. This foundational work also aims to establish a framework for the creation of focused, topical training and educational resources going forward.

Together with the Digital Preservation Coalition, the TWG has produced the first set of training materials designed for the beginning practitioner covering technical, curatorial and policy related aspects of web archiving. Beyond the core purpose of providing this level of baseline training, the training materials can also be used by IIPC members and the larger digital preservation community for marketing and outreach, internal and external advocacy, and ongoing program and professional development.

This first delivery of training materials will be released to coincide with the IIPC WAC 2019 and includes 13 modules on topics ranging from web archives as primary sources to building a business case for ongoing program funding. During the presentation program chairs from the IIPC and DPC will share the educational materials produced as part of this project, including slide decks, online videos, and teaching and workshop plans. The session will seek feedback from attendees on future areas of curriculum development as well as further ideas for additional instructional approaches. As the TWG works towards developing intermediate and advanced level training materials, presenting at WAC 2019 will provide the opportunity for greater community involvement in the TWG’s work as the program advances.


JULIE FUKUYAMA & SIMON TANNER

Julie Fukuyama, National Diet Library
Simon Tanner, King’s College London

Developing impact assessment indicators for web archiving – making a proposal for the UK Web Archive

This paper presents the results of a study to examine, determine and propose the optimal approach to develop impact assessment indicators for the UK Web Archive (UKWA). In the United Kingdom, legal deposit libraries collaboratively operate a nationwide web archiving project, the UKWA, which has collected over 500 TB of data and is growing by approximately 60–70 TB a year. At the same time, UK publicly funded organisations face reduced funding and the challenge of convincing funders to finance their archival function by undergoing evaluations of their services’ values.

Under such circumstances, a proper assessment of the values and impacts of web archiving is a point of discussion for cultural heritage organisations. To the best of the authors’ knowledge, no comprehensive assessment or evaluation of the UKWA has yet been conducted. Thus, this paper seeks to answer the research question: “What would the indicators of impact assessment for the UKWA be?” As a result, we propose a set of impact assessment indicators for the UKWA (and web archiving in general) with broad strategic perspectives, including social, cultural, educational and economic impact.

This study examines and proposes the optimal approach to develop impact assessment indicators for the UKWA. The research began by analysing the literature of impact assessment frameworks for digital resources and the types of impact in related fields. Primarily drawing from Simon Tanner’s Balanced Value Impact Model (BVI Model), this research then proposes impact indicators for the UKWA and develops an impact assessment plan consisting of three stages: context setting, indicator development, and indicator evaluation.

This paper will present the method and results of the study. Firstly, it identified the UKWA’s foundational context: the mission, the principal values and the key stakeholder groups. The research project prioritised focal areas for the archive that seem most advantageous for stakeholders and aligned them with Tanner’s Value Lenses. Secondly, we proposed the UKWA impact assessment indicators, scrutinising existing indicators and various evidence collection methods. In the third stage, the developed indicators’ functionality was checked against set quality criteria and then tested through semi-structured interviews and survey submissions with 8 UKWA staff members.

Finally, the paper presents the thirteen potential indicators for the UKWA. Based on the lessons learned, presenters will also make recommendations for organisations which recognise the necessity of undertaking impact assessments of their web archives.


RICARDO BASÍLIO & DANIEL BICHO

Arquivo.pt – Fundação para a Ciência e a Tecnologia

Librarians as web curators in universities

Arquivo.pt aims to expand its community by widening the purposes for which web archives can be useful, such as research in the Humanities or the preservation of institutional memory at Portuguese universities. However, there is not yet a group of practitioners in web curation or a team of researchers familiar with handling preserved web content.

This presentation argues that librarians in universities have an important role in supporting the exploration of preserved web content. They can contribute as local experts on web preservation. By taking on the curation of institutional websites, librarians will acquire skills and knowledge about web archiving and will help researchers use web-archived materials. To achieve this, a three-part training for librarians has been prepared. The project also aims to reach researchers in order to gather their real requirements, so that Arquivo.pt and librarians as web curators can respond properly.

The first part is an introduction to the fundamentals of web archiving, namely the technologies, projects and terminology required to contextualize the area and integrate newcomers into it.

The second part has the trainees experiment with how web preservation can be performed at a very small scale. Web preservation is presented as a three-step sequence: capture, store and replay. During this training session, webrecorder.io is used to capture a set of institutional websites, social pages, and embedded video and audio relevant to the trainees. The resulting WARC files are stored in a local folder. Finally, Webrecorder Player replays the WARC files offline in a local environment.
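The capture–store–replay sequence can be illustrated with a deliberately simplified sketch in Python. This is a toy model written for the training context, not the actual WARC specification or Webrecorder's implementation: real WARC records carry several more mandatory headers (record IDs, dates, digests), and real tools handle HTTP headers and compression.

```python
import io

def write_record(buf, uri, payload):
    """Append a minimal WARC-style response record to a store.
    Illustrative only; real captures use Webrecorder, warcio, etc."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    buf.write(header + payload + b"\r\n\r\n")

def replay(buf, uri):
    """Naive 'replay': scan the stored records and return the payload
    captured for the requested URI, or None if it was never captured."""
    data = buf.getvalue()
    pos = 0
    while pos < len(data):
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            break
        header_lines = data[pos:head_end].decode("utf-8").splitlines()
        fields = dict(line.split(": ", 1) for line in header_lines[1:])
        length = int(fields["Content-Length"])
        payload = data[head_end + 4 : head_end + 4 + length]
        if fields["WARC-Target-URI"] == uri:
            return payload
        pos = head_end + 4 + length + 4  # skip payload + record separator
    return None
```

The point of the exercise is the shape of the workflow: capture produces records keyed by target URI, storage is just a file of concatenated records, and replay is a lookup by URI against that file.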

The third part of the training provides librarians with a set of practical suggestions for reusing web-archived content in their own institutions (e.g. lists, online exhibitions following the use case of www.memoriafcsh.wordpress.com, posts on social media).

As a result, librarians participating in this training are expected to acquire basic terminology and a working knowledge of the technologies involved, to carry out small projects of capturing, storing and replaying WARC files, and to exhibit and share preserved web content that documents the memory of institutional websites, regardless of the web archive where it is stored (e.g. Arquivo.pt, Internet Archive, Webrecorder collections).

BAD is the Portuguese association of information science professionals, with 175 registered libraries. As a curator of Arquivo.pt and also a librarian, the author of this presentation has proposed the inclusion of these topics in BAD’s training program for librarians, and the proposal has been accepted.

Since October 2018, webinars have taken place and a presentation was given on World Digital Preservation Day (WDPD2018); other workshops are scheduled until the end of the school year. Participants have been mostly librarians.


ALEJANDRA MICHEL

University of Namur

The legal framework for web archiving: focussing on GDPR and copyright exceptions

Web archiving is intimately linked to the freedom of expression protected by Article 10 of the European Convention on Human Rights, which also protects the right to information as such. This right to information is composed of two facets: on the one hand, an active component allowing the public to search for information and, on the other hand, a passive component allowing everyone to receive information. This explains the link between the “right to information” and the existence of web archiving initiatives. Indeed, these initiatives help to guarantee this right by facilitating research and access to information for the general public and society at large. In the Times Newspapers case, the European Court of Human Rights has already stated that web archives are protected by Article 10 of the Convention.

The importance of web archiving may be widely recognised. However, web archives raise many legal issues, including, among others, the respective missions and responsibilities of national cultural heritage institutions in charge of web archiving, the delimitation of the national scope of jurisdiction for web archiving activities, copyright rules, the sui generis right related to databases, data protection law, the probative value of web archives, and illegal content.

In this presentation, after having discussed the right to information and explored the legal issues raised by web archiving, we will focus on two specific issues. Firstly, the GDPR provides a specific regime for archiving in the public interest and for historical and scientific research or statistical purposes. Indeed, when personal data are processed in these specific contexts, the GDPR gives Member States the possibility to put in place a softened regime in terms of the principles to follow, the obligations to be respected and the rights to be implemented. Secondly, we will examine the copyright exceptions relevant to a web archiving context that are considered in the proposed Directive on copyright in the Digital Single Market. The analysis of the copyright legal framework will take the form of a mapping of all relevant considerations to establish a policy of selection and access to web archives.


TOM STORRAR & CHRIS DOYLE

Tom Storrar, The National Archives (UK)
Chris Doyle, MirrorWeb Ltd

Creating an archive of European Union law for Brexit

The UK is due to exit the EU on 29 March 2019. The Web Archiving team at The National Archives (UK) was given the job of producing a new publicly available, comprehensive archive of European law for Exit Day. This project was a vital part of the UK government’s plans for Brexit. Leaving the European Union is a fundamental constitutional and legal change that affects millions of people and businesses. The European Union (Withdrawal) Act 2018 makes The National Archives responsible for publishing European legislation and other relevant documents that will continue to be the law in the UK after it exits the European Union.

Creating a comprehensive archive of European law for Brexit involved harvesting the relevant parts of the EUR-Lex website (https://eurlex.europa.eu/), one of the largest and most complicated multilingual websites available online. This archive was created in partnership with MirrorWeb and involved deploying both existing technologies in a new way as well as developing some entirely new technologies and techniques.

We had a number of motivations for creating a web archive of this content, not least in order to demonstrate exactly what the law was at any given time, in its original form, along with other content, such as the extensive body of European case law, which provides important context for the collection.

The challenge called for a highly focused project and innovative approaches to capturing, verifying and replaying the content. Over the course of 15 months we performed 2 complete data-driven crawls of the target content, archiving over 20 million resources. Between January 2019 and Exit Day we have continued to capture all newly published and modified content so that the archive will reflect EUR-Lex as it stands on Exit Day.

Each and every archived resource was identified using various data sources before being captured and quality assured through multiple checks. We developed new approaches to quality assurance, as we had to be sure the collection was as complete as possible within our chosen scope. Finally, the archive was indexed for public access through a customised replay platform and a sophisticated full-text search service.

This presentation will detail the purpose of the collection, the challenges encountered, describe the archiving strategies we employed to build it, our approach to quality assurance and how we developed the public-facing service from an initial “alpha” to a mature collection within our web archives. We will also describe our approach to preserving the web archive content, alongside accompanying files. Finally, we will reflect on whether similar approaches can be employed in order to successfully and confidently archive other large, complicated, multilingual websites.


ELS BREEDSTRAET

Publications Office of the European Union

Setting up an EU web preservation service for the long-term – tales of a (sometimes) bumpy road

The EU web archive contains the main websites of the EU institutions, which are hosted on the europa.eu domain and subdomains. Its aim is to preserve EU web content in the long term and to keep it accessible for the public.

In 2016, IIPC gave us the opportunity to make a short and general presentation about our archive. The presentation was received enthusiastically by the audience, who were interested to hear more about our activities in this field. Since then, we have come a long way towards providing a more mature web preservation service for the EU institutions. So we feel that now is a good moment to share, during a 30-minute presentation, the lessons learned on our journey from a pilot project to a fully-fledged, durable, long-term service.

The presentation will address the following topics:

  • Introduction on the EU web archive and its history.
  • Description of how the EU web preservation service looks today: what we do, why we do it, how we do it and for whom.
  • Lessons learned, plans and challenges ahead.

These will be presented in a practical way, in order to give other practitioners in the audience ideas for tools to use “at home”.

By telling our story, we hope to provide other participants with useful tips and tricks. At the end of the presentation, the public will be invited to share questions, thoughts, suggestions and/or similar experiences. This way, we hope in return to learn from their know-how as well.

As the aim of the presentation is to tell the tale of our journey to a mature web preservation service, it fits well within the general theme of the conference (Maturing Practice Together).


MARINOS PAPADOPOULOS, CHARALAMPOS BRATSAS, MICHALIS GEROLIMOS, KONSTANTINOS VAVOUSSIS, ELIZA MAKRIDOU & DIMITRA HIOTI

Marinos Papadopoulos, Attorney-at-Law
Charalampos Bratsas, Open Knowledge Foundation
Eliza Makridou, Michalis Gerolimos & Dimitra Hioti, National Library of Greece
Konstantinos Vavoussis, TRUST-IT Ltd.

Text and Data Mining for the National Library of Greece in consideration of GDPR

Text and Data Mining (TDM) as a technological option is usually leveraged by large libraries worldwide in the technologically enhanced processes of web harvesting and web archiving, with the aim of collecting, downloading, archiving and preserving content and works that are found available on the Internet. TDM is used to index, analyze, evaluate and interpret mass quantities of works, including texts, sounds, images or data, through an automated “tracking and pulling” process of online material. Access to web content and works available online is subject to restrictions by legislation, especially laws pertaining to Copyright, Industrial Property Rights and Data Privacy. As far as Data Privacy is concerned, the application of the General Data Protection Regulation (GDPR) is considered an issue of vital importance, which among other requirements mandates the adoption of privacy-by-design and advanced security techniques. In the described framework, this paper focuses on the TDM design considerations and applied solutions employed by the National Library of Greece (NLG). NLG has deployed TDM since February 2017 in consideration of the provision of Art. 4(4)(b) of Law 4452/2017, as well as of the provisions of Regulation 2016/679/EU (GDPR). Art. 4(4)(b) of Law 4452/2017 places TDM activity in Greece under the responsibility of NLG, appointed as the organization to undertake, allocate and coordinate the action of archiving the Hellenic web, i.e. as the organization responsible for text and data analysis at the national level in Greece. The deployment of TDM by NLG, presented in this paper, caters for a framework of technical and legal considerations, so that the electronic service based on the TDM operation complies with the data protection requirements set by the new EU legislative framework. The paper further elaborates upon the set of technical and legal aspects considered by NLG for achieving GDPR compliance.
The study falls under the “Compliance with General Data Protection Regulation” thematic area in the framework of the 2019 IIPC WAC participation.


PANELS

DITTE LAURSEN, SALLY CHAMBERS, KEES TESZELSZKY, VALÉRIE SCHAFER, DANIEL GOMES & PETER WEBSTER

Ditte Laursen, Royal Danish Library
Sally Chambers, Ghent University
Kees Teszelszky, KB, Netherlands
Valérie Schafer, University of Luxembourg
Daniel Gomes, Arquivo.pt – Fundação para a Ciência e a Tecnologia
Peter Webster, Webster Research and Consulting

Opportunities and challenges in collecting and studying national webs

A key issue for web archivists (particularly in national libraries) and for scholars alike is the meaning of the national web. Archivists working with legal deposit must work with a definition of their national web, which may be based on the ccTLD, but also on domain registration, the location of servers and/or other criteria. Scholars must then interpret those archives in the light of those definitions. Others studying nations without such legal frameworks face different challenges in working with archives compiled on a selective basis, or with materials held in multiple archives.

This panel brings together several of the contributors to ‘The Historical Web and Digital Humanities: The Case of National Webs’ (Routledge, 2019), edited by Niels Brügger and Ditte Laursen. After briefly summarising their own contributions, they will discuss together the particular challenges of defining and then collecting the national web, and of studying the national web with the resulting archives.

The panel will be introduced, moderated and concluded by Peter Webster.

Ditte Laursen (Royal Library, Denmark) investigates how a corpus to support historical study of a national web can be established within national web archives, which usually hold several versions of the same web entity. Examining different datasets from the Danish national web archive 2005–2015, and the different ways these are handled, she demonstrates significant differences between results, with possible implications for research.

The Belgian web is currently not systematically archived. Sally Chambers (Ghent University) presents PROMISE, a research project into the feasibility of a sustainable web archiving service for Belgium. She traces the history of the Belgian web from the establishment of the .be domain in 1988 to the present, situating it in its historical, political, and legal context.

Kees Teszelszky (KB, Netherlands) explores the research opportunities of the Dutch national web for future historians by describing the development and unique characteristics of the Dutch national web. Using traditional historical methods and web archaeology, much historic data can be reconstructed, even though the KB web archive started only in 2007.

Valérie Schafer (University of Luxembourg) draws on the experience of the French Web90 project to show the approaches, tools and methodologies used to sketch a broad historical picture of the French web during the 1990s, and the challenges the project faced.

Peter Webster (Webster Research and Consulting Ltd, UK) outlines his chapter on the web estate of Irish churches. Though geographically concentrated in particular parts of the island, these churches’ web estate is dispersed across several ccTLDs and gTLDs. If this case study were matched elsewhere, it would suggest that the ccTLD is a weak proxy for the national web.

No organisation has formal, ongoing responsibility for whole-domain archiving of .eu, one of the largest and most popular European top-level domains. Daniel Gomes (Arquivo.pt) presents an overview of archiving activities related to .eu, including the only known effort to date to archive the entire domain. He also proposes a number of options for sustainable, long-term archiving of .eu.


KAROLINA HOLUB, INGE RUDOMINO, JANKO KLASINC, ŽARKO MIRKOVIĆ, LIDIJA POPOVIĆ, DRAGANA MILUNOVIĆ, NEMANJA KALEZIĆ & TAMARA HORVAT KLEMEN

Karolina Holub, Inge Rudomino & Dragana Milunović, National and University Library in Zagreb
Janko Klasinc, National and University Library of Slovenia
Žarko Mirković & Lidija Popović, National Library of Montenegro
Nemanja Kalezić, National Library of Serbia
Tamara Horvat Klemen, Central State Office for the Development of the Digital Society

Past, present and future of web archiving in Southeast Europe: experiences of Croatia, Montenegro, Serbia and Slovenia

Since the mid-1990s, the World Wide Web has occupied our lives and society and influenced them immensely. It is self-evident that in the last 25 years the web has defined and shaped our global digital history and become a priceless resource that needs to be preserved. Almost right from the start, the first major international initiatives were launched with the aim of collecting and preserving digital media, including content on the web, for future generations. By now, numerous countries consider web archiving to be a core mission and obligation, and their institutions and organisations collect, preserve and make accessible information from the web.

This panel will take a closer look at the countries of Southeast Europe and the different initiatives there related to web archiving. Several countries in this region started to archive the web in the mid-2000s. With national libraries taking the leading role, and based on national legislation, a couple of countries started to build their web archives.

The panel will present different stages in web archiving in Croatia, Montenegro, Serbia and Slovenia. It will also discuss different approaches in workflows – diverse types of tools, quality assurance, organisational, financial and legal issues.

Web archiving in Slovenia began with a government-funded research project carried out by the National and University Library (NUK) and the Jožef Stefan Institute between 2002 and 2004. After the legal framework for web archiving was established in 2006, NUK began selective and thematic harvesting of Slovenian websites in 2008. Access to archived content was enabled in 2011. Since 2014, NUK has also been performing national top-level domain crawls biannually.

In Croatia, based on legal provisions, the National and University Library in Zagreb, in collaboration with the University Computing Centre of the University of Zagreb, established the Croatian Web Archive (HAW) in 2004 and started to collect, archive and give access to Croatian web resources. Selective harvestings were conducted first, and since 2011 thematic harvestings and annual harvestings of the top-level domain (.hr) have been performed.

In addition to the harvestings conducted by the National and University Library in Zagreb, the Central State Office for the Development of the Digital Society has been harvesting public authorities’ web resources since 2004. Srce created the “Archive of the Web Documents”, and the Office has gathered the online resources providing content for the “Digital Archive of the Web Resources of the Republic of Croatia”.

In January 2015, the National Library of Serbia began experimental harvesting of the Serbian academic community domain (*.ac.rs), which was repeated twice more (June 2015 and September 2017). In addition, some domains that were known to be disappearing were also collected (Radio Srbija, Borba, E-novine…). The first thematic collection, which will contain news portals with an emphasis on local media, is planned for spring 2019.


LORI DONOVAN, MARIA RYAN, RENATE HANNEMANN & GARTH STEWART

Lori Donovan, Internet Archive
Maria Ryan, National Library of Ireland
Renate Hannemann, Bibliotheksservice-Zentrum Baden-Wuerttemberg, BSZ
Garth Stewart, National Records Scotland

Transition in the Time of Web Archiving: Building (and rebuilding) Web Archiving Programs

Representatives from the Internet Archive (IA), National Library of Ireland (NLI), National Records Scotland (NRS) and Bibliotheksservice-Zentrum Baden-Wuerttemberg (BSZ) will discuss strategies and organizational approaches around conceiving, expanding, and transitioning web archiving programs in a variety of different types of organizations.

The panelists will begin by briefly presenting the contexts for their web archiving programs, highlighting ways that web archiving is integrated into other institutional activities, services, and tools in each organization, and the unique structural, legal or other mandates each must work within. The panel will then discuss these points in more detail, digging into lessons learned and themes of program growth and sustainability, and outlining how the changing landscape of web archiving tools and services impacts programs in these organizations. While the panelists will be prepared for a wide-ranging 60-minute discussion, time will be saved at the end for audience questions, feedback and contributions.

Lori Donovan of the Internet Archive will help guide discussion during the panel, speak to the unique opportunities for web archiving in a non-profit digital library context, and focus on IA web archiving services and the various models for working with institutions both large and small to help them structure their web archiving programs to meet organizational goals and requirements.

Maria Ryan of NLI will focus on the development of the NLI’s web archive, from a pilot project in 2011, to launching a domain crawl in 2017. The NLI web archive is now an established collecting programme but it operates within a limiting national legal framework. This presentation will examine how the NLI has continued to expand its web archive in the absence of legal deposit legislation.

Renate Hannemann of BSZ will briefly present the BSZ and its tasks, the activities of BSZ in the field of web archiving since 2006 and the participation in Archive-It since 2016: the path to implementing Archive-It, including migration of historical data and BSZ’s consortium model.

Garth Stewart of National Records of Scotland will describe the story of how the NRS Web Continuity Service came to be: from a long-held institutional idea, to our current multi-faceted service that delivers web archiving and its associated benefits to our stakeholders. Garth will explain how ‘quality’ is a concept rooted in our service, and also reflect on how the Service is supporting NRS’s transformation into becoming a digital national archive.


SARA AUBRY, GÉRALDINE CAMILE & THOMAS DRUGEON

Sara Aubry & Géraldine Camile, Bibliothèque nationale de France (BnF)
Thomas Drugeon, Institut national de l’Audiovisuel (INA)

From videos to channels: archiving video content on the web

Archiving video content on the web poses particular challenges, and web archiving institutions must use particular technical strategies to not only collect this content but to give access and ensure its long-term preservation. Based on the experience of the Bibliothèque nationale de France (BnF), National Audiovisual Institute (INA) and other institutions, this panel will explore issues raised and different approaches used.

The BnF first crawled videos included in web pages, and from 2008 to 2013 performed a specific crawl of the most-used video platform in France, Dailymotion. The crawl used Heritrix, but it was necessary to use other tools and perform new analyses for each crawl and the BnF was unable to maintain this specific crawl. For the presidential elections in 2017 the BnF subcontracted the crawl of 28 channels on YouTube. The crawl by Internet Memory Research included the web pages, videos and also API metadata. Developments were necessary to include these videos in the preservation workflow, and to provide access.

With the lessons learned from this experience, the BnF was able to perform an in-house crawl of YouTube in 2018, using Heritrix 3 with additional tools to extract metadata and the URL of the video file. The process was included in our standard workflow, simplifying the preservation process. In the BnF access interface, based on OpenWayback, it is possible to view the web pages, with the video in an FLV player that replaces the YouTube player. Metadata collected during the crawl allow the creation of a link between the page and the video file, and also a list of all the videos from different crawls on a same channel.

INA has been continuously collecting videos from platforms since 2008. As of January 2019 we have collected 21 million videos from 16 platforms, including YouTube, Twitter, Facebook and the main TV/radio broadcast platforms. This represents 2 million hours of video, made accessible to researchers through a specialized search engine as well as directly from the archived pages they were published on. A unified TV/radio/web access is also in the making, giving access to web videos in the same context as the broadcast programs they are related to.

INA automatically crawls videos found embedded in archived web pages or published on one of the 7,000 followed channels. Crawling relies on specialized robots developed and maintained in-house, making it easier to follow technical changes in publication methods. Metadata are extracted and normalized, whereas videos are kept in their original format. Conversions are performed on the fly by the archive video server when deemed necessary (e.g., FLV to MP4 or WebM conversion when Flash has to be avoided) to ensure compatibility with the target device without having to resort to batch conversions.
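The on-the-fly conversion decision described above can be sketched as a simple policy: serve the stored original when the client can play it, otherwise pick a widely supported conversion target. This is an illustrative sketch only; the function name and the preference order are assumptions, not INA's actual implementation.

```python
def pick_delivery_format(original_format, client_supports):
    """Return (format_to_serve, needs_conversion) for one archived video.

    original_format: format the video is stored in (e.g. "flv")
    client_supports: set of formats the requesting device can play
    """
    if original_format in client_supports:
        # Serve the stored bytes unchanged: no conversion needed.
        return original_format, False
    # A conversion is required: prefer widely supported targets,
    # e.g. FLV -> MP4 when Flash has to be avoided.
    for target in ("mp4", "webm"):
        if target in client_supports:
            return target, True
    raise ValueError("no compatible delivery format for this client")
```

For example, a stored FLV requested by a Flash-less browser would be converted to MP4 (`pick_delivery_format("flv", {"mp4", "webm"})` yields `("mp4", True)`), while a stored MP4 is served as-is.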

This panel will present different approaches to the challenges of crawling, preserving and giving access to web video content. Speakers will be asked to briefly present a panorama of the strategies and techniques used, with time kept for discussion between the speakers and with the audience to compare the advantages and disadvantages of the different approaches and identify means of improvement.


ABBIE GROTKE, ALEXANDRE CHAUTEMPS, NICOLA BINGHAM, MARIA RYAN, ALEX THURMAN & DANIEL GOMES

Abbie Grotke, Library of Congress
Alexandre Chautemps, Bibliothèque nationale de France
Nicola Bingham, British Library
Maria Ryan, National Library of Ireland
Alex Thurman, Columbia University Libraries
Daniel Gomes, Arquivo.pt – Fundação para a Ciência e a Tecnologia

Access Policies, Challenges, and Approaches

Organizations involved in web archiving are often faced with questions around giving access to archived content. Many are faced with restrictions on some or all archived content that make it difficult, impossible, problematic or simply impractical to provide access outside of the walls of the collecting organization.

A variety of influences affect the extent and policies around access. These include legal deposit laws, intellectual property and copyright laws and data protection legislation, ethical questions and concerns, risk assessments, and concern about providing access to sensitive content. How these issues are addressed can depend on different legal contexts but also on institutional policy.

Other organizations not affected by legal deposit may have differing challenges/approaches to access — risk averseness when in some cases access can be provided, but should it be? And to what extent? And should permission be requested? Can archived web content be provided in a non-consumptive way to alert researchers to content that may not be available?

This panel will explore a number of strategies organizations have employed for providing access to web archives. Panelists will share challenges they’ve faced and specific approaches used when enabling access to archived content; often a mixture of approaches is necessary. Discussion topics will include risk averseness and assessment, permissions approaches, embargoing content, providing access to derivative data sets or descriptive records in lieu of access to archived content that may be restricted, creating dark archives or determining not to collect something because access cannot be provided.

Note: The intent of this proposed panel is to build upon and extend the discussion held in New Zealand in 2018 during the “Legal deposit in an era of transnational content and global tech titans” panel (http://netpreserve.org/ga2018/programme/abstracts/#panel04). While that panel was rooted in institutional policy and legislation, this follow-on discussion focuses on the different types of access provided in a variety of legal situations.


SYLVAIN BÉLANGER, NICK RUEST, IAN MILLIGAN & ANNA PERRICCI

Sylvain Bélanger, Library and Archives Canada
Nick Ruest, York University
Ian Milligan, University of Waterloo
Anna Perricci, Rhizome

Sustainability Panel: Preservation of Digital Collections, Webrecorder and the Archives Unleashed Cloud Project

One of the major issues facing the web archiving community is that while systems exist to acquire, analyse and preserve web archive content, they require a considerable level of resources to deploy, use and maintain. This panel will discuss the problems of long-term sustainability in the web archiving ecosystem, focussing on issues such as capacity for sustainable digital preservation, technical infrastructure development, tools development and project resilience. The panel will consider that reaching sustainability requires an approach combining organisational, financial and technical effort, and will share examples of how this has been achieved within the panellists' own organisations and projects.

Sylvain Bélanger: Preservation of Digital Collections, from obsolescence to sustainability

This presentation will delve into the issues Library and Archives Canada faces, as a national library and archives, in tackling obsolete formats and applying digital preservation principles while living within its means. It focuses on what LAC has been doing to address the petabytes of digital collections it ingests annually, including dozens of terabytes of web content each year, through the lenses of developing a sustainable digital preservation program and advancing its technical infrastructure.

Ever wonder what happens to digital collections once Library and Archives Canada (LAC) receives them from publishers, universities, archival donors and government institutions? With physical collections, they are stored in a vault, in a storage container, in specialized housing or simply on a shelf. With digital collections, it is not that straightforward, and in years past, it was tortuous.

Traditionally, over many hours of manual interaction, IT specialists in the Digital Preservation team, along with library and archival staff, would extract data bit by bit from carriers. Then they would face the daunting task of migrating data from archaic formats to modern, readable and accessible ones for client access and long-term preservation.

LAC developed what we called a Trusted Digital Repository in the late 2000s, which involved continued manual interaction with our collections but little in the way of automation or simplification.

In the early 2010s, the Digital Preservation unit was a fledgling team, barely visible and even less resourced. There were multiple internal and external pressures on LAC to increase its digital preservation capacity. In particular, an accelerating volume of digital materials needed to be preserved for the long term. The Auditor General of Canada issued a report in 2014 raising questions about the readiness of LAC to handle digital records as the format of choice by 2017. It stated that LAC “must articulate these plans in its vision, mission, and objectives. It must put in place strategies, policies, and procedures that will allow the transfer and preservation of digital information so that it is accessible to current and future generations.” The audit report noted: “An electronic archival system, such as a trusted digital repository, could help [LAC] acquire, preserve, and facilitate access to its digital collection.”

Although the overarching institutional goal for a trusted digital repository stayed constant throughout this decade, changing institutional priorities and the focus on technology and short-term projects stimulated a re-examination of what was needed to install digital preservation as a core and enduring business component.

The audit report was a call to action in dealing with our digital content, and it pushed LAC to attempt, for the umpteenth time, to tackle the problem head on. A team of stakeholders provided input and feedback into what would become a call-out to industry for a digital asset management solution that could support LAC’s requirements. Industry and partner consultations were held over many months and helped shape LAC’s request for proposals that finally went out in late summer 2017.

In summer 2018, LAC acquired digital asset management technology, along with associated technologies to allow us to implement a solution (for pre-ingest, ingest and preservation processes) for collections coming to LAC in digital format. This means no longer receiving hard drives and other technology carriers, but also a wholesale modernization of our digital work.

We have finally reached the starting point!

What this really means is that we are still in the early stages of implementing a viable solution. Teams from the Digital Operations and Preservation, Published Heritage, and Chief Information Officer branches have been working on the first series of collections to process from clients, through to preservation and future access. From using specialized managed file-transfer software to pre-ingest metadata and assets, to testing the preservation capabilities of Preservica, everything is being reviewed with the aim of transforming how we manage our digital operations. To ensure an effective approach as we test published workflows, staff within Published Heritage dedicated to this work full-time are working hand in hand with preservation and IT specialists to implement seamless processes.

For LAC, the implementation of a digital asset management system means being at the forefront of digital acquisition and preservation. Many partners, both nationally and internationally, are keen to understand the approach we have taken over the past four years, and how we are integrating various technologies to implement our long-term digital vision for both published and archival collections.

Even more important is what a digital asset management system may provide to Canadians in the long term: digital collections that are preserved and accessible to them when and where they want them.

This is but one step in LAC’s digital transformation.

Nick Ruest & Ian Milligan: Project Sustainability and Research Platforms: The Archives Unleashed Cloud Project

The Archives Unleashed Project, founded in 2017 with funding from the Andrew W. Mellon Foundation, aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. We respond to one of the major issues facing web archiving research: that while tools exist to work with WARC files and to enable computational analysis, they require a considerable level of technical knowledge to deploy, use, and maintain.

Our project uses the Archives Unleashed Toolkit, an open-source platform for analyzing web archives (https://github.com/archivesunleashed/aut). Due to space constraints we do not discuss the Toolkit at length in this abstract. While the Toolkit can analyze ARC and WARC files at scale, it requires knowledge of the command line and a developer environment. We recognize that this level of technical expertise is beyond the level of the average humanities or social sciences researcher, and our approaches discussed in this paper concern themselves with making these underlying technical infrastructures accessible.

This presentation expands upon the Archives Unleashed Cloud, building upon presentations of earlier work at the IIPC meeting in Wellington. It will introduce the Cloud to researchers, but will focus on stimulating a conversation around where the work of the researcher begins and the work of the research platform ends. It also discusses the problem of long-term project sustainability: researchers want services such as the Cloud, but how do we provide them in a cost-effective manner? This targeted discussion will speak not only to our project, but to broader issues within the web archiving ecosystem.

As we develop the working version of the Archives Unleashed Cloud, one of the main concerns of the project team is the future of the Cloud after Mellon funding ends in 2020. While we are currently exploring whether the Cloud makes sense as a stand-alone non-profit corporation, we are still unsure about the future direction. How do services like this, that meet demonstrated needs, survive in the long run? Our presentation discusses our current strategies but hopes to engage the audience around the state-of-the-field and how to best reach web archiving practitioners.

Projects and services like WebRecorder.io and Archive-It have made amazing strides in the world of web archive crawling and capture. The Archives Unleashed Cloud seeks to make web archiving analysis similarly easy and straightforward. Yet the scale of web archival data makes this less straightforward.

Anna Perricci: No one said this would be easy: sustaining Webrecorder as a robust web archiving tool set for all.

Sustaining projects both organizationally and financially is hard, especially in complex, fast-moving areas like web archiving. This presentation will give an overview of the steps the Webrecorder team has taken to achieve sustainability, both organizationally and financially.

Webrecorder is a project of Rhizome, an affiliate of the New Museum in New York City. Rhizome champions born-digital art and culture through commissions, exhibitions, digital preservation, and software development. Webrecorder (webrecorder.io) is a free, easy-to-use, browser-based web archiving tool set for building, maintaining and giving access to web archives. The development of Webrecorder has been generously supported by the Andrew W. Mellon Foundation since 2016 and by the Knight Foundation (2016-2018). In addition to offering a free hosted web archiving platform, Rhizome creates customizations of its Python Wayback (pywb) tool set for other web archives. Pywb is in use in some major web archiving programs, including the UK Web Archive (British Library), the Portuguese Web Archive (Arquivo.pt) and Perma.cc. The Webrecorder team also maintains other open-source software projects, such as Webrecorder Player (https://github.com/webrecorder/webrecorder-player) and command-line utilities such as warcit (https://github.com/webrecorder/warcit).

In 2017 strategic planning for Webrecorder began and further steps to build a business plan grew from that point. In this presentation an overview of the issues explored and conclusions reached so far will be given. These points will illuminate why Webrecorder has made certain choices and where we anticipate Webrecorder will go next.

It would be an honor to share the work we have done so far at the IIPC WAC 2019. Sharing our findings to date and explaining the decisions they helped us make might also be useful to others who need to figure out how to break down big problems into more manageable units. No one said reaching sustainability would be easy, and it has not been, but the Webrecorder team has made substantial progress so we would like to share what’s been learned with all conference attendees.


POSTERS AND LIGHTNING TALKS

JI SHIYAN & ZHAO DANYANG

National Library of China

The Key Technologies of Web Information Preservation and Service System Platform

The web information collection and preservation project of the National Library of China has been running for many years. It has accumulated abundant practical experience, developed a web information preservation and service platform, and united libraries across the country to jointly carry out web information preservation work. This paper analyzes the key technologies of the platform architecture in detail, based on an analysis of the platform's workflows, functions and practical effects, and provides a reference for other organizations carrying out related work.

The system platform is designed with the National Library as the central node and local libraries as sub-nodes; each node runs its own independent platform instance, realizing a hierarchical management framework. Each node establishes a local distributed storage system based on HDFS. A distributed work framework based on IIPC open-source software and virtualization technology supports multi-node processing. To meet business needs and remain easy to extend, the workflow is modularized and the platform is organized into seven functional modules, standardizing the entire workflow.


BEN O’BRIEN & JEFFREY VAN DER HOEVEN

Ben O’Brien, National Library of New Zealand
Jeffrey van der Hoeven, Koninklijke Bibliotheek

Technical Uplift of the Web Curator Tool

Colleagues at the National Library of New Zealand and the National Library of the Netherlands are continuing to develop the Web Curator Tool (WCT) after releasing version 2.0 in December 2018. This poster will highlight the 2019 enhancements and what we learned in the process.

The goal of v2.0 was to uplift the crawling capability of the WCT by integrating Heritrix 3. This addressed what was seen as the most deficient area of the WCT. It was discovered during a proof-of-concept that the Heritrix 3 integration could be achieved without significant upgrade of the WCT’s outdated libraries and frameworks. But further functionality could not be developed until those libraries and frameworks had been uplifted, providing a stable modern base for new functionality. Now that v2.0 has been completed, the next milestone in the WCT development is to perform this technical uplift.

Besides the technical uplift, two other items of work on the development plan for WCT in the first half of 2019 are: component-based REST APIs and documenting user journeys.

We want to make the WCT much more flexible and less tightly coupled by exposing each component via an API layer. To make that API development easier, we are looking to migrate the existing SOAP API to REST and to change components so that they are less dependent on each other. One of those components is the Harvest Agent, which acts as a wrapper for the Heritrix 3 crawler we currently use. Our goal is to develop this component to integrate with additional web crawlers, such as Brozzler.

The process of mapping user journeys, the way users interact with the WCT, is long overdue. Future development will involve writing unit and/or integration tests that cover those essential user journeys. These tests will be used to ensure that all essential functionality remains through all development changes.

This poster and lightning talk will cover the exercise of upgrading a 13-year-old Java application, migrating its components to REST APIs, and the challenges and pitfalls we are likely to encounter. We also hope to share insights from documenting the WCT user journeys. If possible, we would prefer to submit a digital poster so that we can embed short demos of any new WCT functionality and demonstrate invoking another crawler from within the WCT.


HELENA BYRNE

British Library

From the sidelines to the archived web: What are the most annoying football phrases in the UK?

As the news and TV coverage of football has increased in recent years, there has been growing interest in the type of language and phrases used to describe the game. Online, there have been numerous news articles, blog posts and lists on public internet forums on what are the most annoying football clichés. However, all these lists focus on the men’s game and finding a similar list on women’s football online was very challenging. Only by posting a tweet with a survey to ask the public “What do you think are the most annoying phrases to describe women’s football?” was I able to collate an appropriate sample to work through.

Consequently, the lack of any such list in a similar format highlights the issue of gender inequality online, a reflection of wider society. I filtered a sample of the phrases from men's and women's football to find the top five most annoying phrases, then ran these phrases through the UK Web Archive Shine interface to determine their popularity on the archived web. The Shine interface was first developed in 2015 as part of the Big UK Domain Data for the Arts and Humanities project. This presentation will assess how useful the Trends function of the Shine interface is for determining the popularity of a sample of selected football phrases on the UK web from 1996 to 2013. Shine searches across 3,520,628,647 distinct records from the .uk domain, captured from January 1996 to 6 April 2013.
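A trend of the Shine kind is usually computed as the share of captures per year that contain a phrase rather than the raw hit count, so that the growth of the archive itself is not mistaken for growing popularity. The sketch below illustrates the idea with invented numbers; the figures are not Shine results.

```python
# Hypothetical per-year counts (NOT real Shine data): captures that
# contain the phrase, and total captures in the archive for that year.
phrase_hits = {2005: 120, 2009: 940, 2013: 2100}
total_captures = {2005: 1_000_000, 2009: 4_000_000, 2013: 6_000_000}

def trend(hits, totals):
    """Relative frequency of the phrase per year (hits / total captures)."""
    return {year: hits[year] / totals[year] for year in hits}

t = trend(phrase_hits, total_captures)
```

Normalising this way, a phrase whose raw count triples can still show a flat trend if the archive tripled in size over the same period.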

It is hoped that the findings from this study will be of interest to the footballing world but more importantly, encourage further research in sports and linguistics using the UK Web Archive.

References:
Helena Byrne. (2018). What do you think are the most annoying phrases to describe women’s football? https://footballcollective.org.uk/2018/05/18/what-do-you-think-are-the-most-annoying-phrases-to-describe-womens-football/ (Accessed August 26, 2018)
Andrew Jackson. (2016). Introducing SHINE 2.0 – A Historical Search Engine. Retrieved from: http://blogs.bl.uk/webarchive/2016/02/updating-our-historical-search-service.html (Accessed August 26, 2018)


DANIELA LUCIA CALABRESE

Università della Calabria

Super-schema in PREMIS 3.0: a new way to store websites and preserve our memory

Preserving websites, especially those of public administrations, helps ensure the proper functioning of a nation. The future is clear: we are moving towards general digitalisation, and it is not science fiction.

Given the current state of the art, we are aware of the risk of losing important web information due to unclear laws and a lack of specific guidelines.

Preserving a public administration website means not only making its contents available in the future to researchers, scholars and historians, but also allowing access to the preserved contents "in progress", so that these stored materials can be worked on with the least possible effort in terms of human and economic resources. Constant access to the preserved contents is fundamental because of the continuous development and evolution of the law, such as GDPR EU 2016/679.

On the basis of these considerations, the need for a universal language and clear rules that guarantee access to the preserved information, both now and over time, is quite evident.

To this end, this poster and lightning talk propose an XML schema designed according to the PREMIS 3.0 rules and coupled with a user guide that provides end-users with guidelines and information for its compilation, avoiding any possible ambiguity.

The proposed super-schema, thanks to the new extension containers (with the suffix Extension), integrates PREMIS with other metadata sets specialized in certain domains (e.g., METS, Dublin Core, MODS). This solution yields a super-schema that is both flexible and fixed: flexible in its compilation but well structured and fixed in its basic structure.
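As an illustration of how an extension container embeds another metadata set, the fragment below places Dublin Core elements inside a PREMIS 3.0 `objectCharacteristicsExtension`. The element names follow the PREMIS 3.0 data dictionary, but the identifiers and the embedded record are invented for this example and are not part of the proposed schema itself.

```xml
<!-- Illustrative fragment: PREMIS 3.0 extension container carrying
     a Dublin Core record (values are invented for this example). -->
<premis:object xmlns:premis="http://www.loc.gov/premis/v3"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="premis:file">
  <premis:objectIdentifier>
    <premis:objectIdentifierType>local</premis:objectIdentifierType>
    <premis:objectIdentifierValue>website-001</premis:objectIdentifierValue>
  </premis:objectIdentifier>
  <premis:objectCharacteristics>
    <premis:objectCharacteristicsExtension>
      <dc:title>Example public administration website</dc:title>
      <dc:date>2019-03-01</dc:date>
    </premis:objectCharacteristicsExtension>
  </premis:objectCharacteristics>
</premis:object>
```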

The proposed solution supports and facilitates the simultaneous preservation and management of this kind of information. In particular, it supports preservers through its capability to make Archival Information Packages (AIPs) interoperable, and it provides producers (public administrations, corporate bodies) with guidelines on how to produce their own Submission Information Packages (SIPs). Furthermore, it facilitates the search and acquisition of information through the standardized description of data packages within a preservation system.

The mission of the proposed XML super-schema is not only to maintain the characteristics of authenticity, integrity, and intelligibility of information over time but also to assure access to the preserved heritage to global and designated communities, even in the case of work in progress.


ANDREJ BIZÍK, PETER HAUSLEITNER & JANA MATÚŠKOVÁ

University library in Bratislava

Archiving and LTP of websites and Born Digital Documents in the Slovak Republic

Electronic documents and websites should be preserved in the same way as physical objects of lasting value, using a long-term storage platform. In 2015 the University Library in Bratislava (ULB) put into operation a system for controlled web harvesting and e-born archiving (the result of the national project Digital Resources – Web Harvesting and e-Born Content Archiving). The project is now in its sustainability phase, and all activities are provided by the Deposit of Digital Resources department.

This contribution focuses on the specific solution of the archiving of the Slovak web sites and Born Digital documents and their long-term preservation (LTP). The archiving is carried out in the Information System Digital Resources (IS DR) and archived resources are delivered to the Central Data Archive (CDA), which serves as the LTP storage. The Central Data Archive is designed and operates in compliance with the requirements and standards for trusted long-time storages (ISO 16363, ISO 14721).

We will present the process from the archiving of content in the IS DR to its storage in the CDA. The data are delivered to the CDA in the form of Submission Information Packages (SIPs). The integrated creation of SIP files in the Deposit of Digital Resources is an efficient semi-automatic solution requiring minimal intervention by the curator. Every SIP is a compressed ZIP file (in compliance with the CDA requirements) containing descriptive metadata and the archived files. A script creates the packages, signs them and saves them in a temporary repository. Every SIP is signed with an SSL certificate, for which the CDA is the certificate authority. The SIPs, once confirmed by the curator, are transferred to a temporary CDA repository to await further processing. After successful validation, verification and format control, the SIPs are transformed into Archival Information Packages (AIPs). The generated AIP number is added to the IS DR.
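The packaging step of such a workflow can be sketched as follows. This is a minimal illustration only: the internal file names and layout are hypothetical, a real SIP would follow the CDA's packaging specification, and signing with the CDA-issued certificate is represented here only by a content fingerprint.

```python
import hashlib
import io
import zipfile

def build_sip(metadata_xml, payload):
    """Assemble a compressed SIP (ZIP) in memory.

    metadata_xml: descriptive metadata as bytes
    payload: mapping of file name -> file content (the archived files)
    Returns (zip_bytes, sha256_hex) where the digest stands in for the
    signing step performed in the real workflow.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata/descriptive.xml", metadata_xml)  # descriptive metadata
        for name, data in payload.items():
            zf.writestr(f"data/{name}", data)                  # archived files
    zip_bytes = buf.getvalue()
    return zip_bytes, hashlib.sha256(zip_bytes).hexdigest()
```

The curator-facing part of the workflow would then only need to confirm the generated package before it is transferred to the CDA's temporary repository.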


SEBASTIAN NAGEL

Common Crawl

Accessing WARC files via SQL

Like many other web archiving initiatives, Common Crawl uses WARC as its primary storage format and a CDX index to look up WARC records by URL. Recently we have made available a columnar index in the Apache Parquet format, which can be queried and analysed using SQL by multiple big data tools and managed cloud computing services. The analytical power of SQL makes it possible to gain insight into the archives and to aggregate statistics and metrics within minutes. We also demonstrate how the WARC web archives can now be processed "vertically" at scale, enabling users to pick captures not only by URL but by any metadata provided (e.g., content language, MIME type), or even by a combination of URL and metadata.
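The "vertical" access pattern can be sketched in two steps: a SQL query over the columnar index selects captures by metadata and returns, for each match, the WARC file and the byte offset/length of the record, which can then be fetched individually with an HTTP Range request. The query below is a sketch: the column and table names follow Common Crawl's published columnar index schema (the table is commonly registered as `ccindex`), while the crawl identifier and language value are examples.

```python
# Example SQL for a big-data engine (e.g. Athena/Presto/Spark) over the
# Parquet index: select captures by metadata, not by URL alone.
QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex
WHERE crawl = 'CC-MAIN-2019-09'
  AND subset = 'warc'
  AND content_languages = 'isl'   -- e.g. captures detected as Icelandic
LIMIT 100
"""

def range_header(offset, length):
    """HTTP Range header fetching exactly one WARC record.

    HTTP byte ranges are inclusive, hence the -1 on the end position.
    """
    return f"bytes={offset}-{offset + length - 1}"
```

Each result row thus maps directly to a single record that can be retrieved without downloading the whole (often gigabyte-sized) WARC file.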


WORKSHOPS

BEN O’BRIEN, STEVE KNIGHT, TRIENKA ROHRBACH, KEES TESZELSZKY & JEFFREY VAN DER HOEVEN

Ben O’Brien & Steve Knight, National Library of New Zealand
Trienka Rohrbach, Kees Teszelszky & Jeffrey van der Hoeven, National Library of the Netherlands

Web Curator Tool (WCT) Workshop

Description:

This workshop will

  1. provide participants with a hands-on opportunity to learn the basic features of the WCT – setting up accounts, selecting & scoping target websites, scheduling and running crawls, and performing QA on resulting harvests
  2. highlight new features developed through collaboration between National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KBNL)
  3. provide participants with an overview of the planned development roadmap and an opportunity to contribute to the roadmap by requesting enhancements
  4. provide a forum to openly discuss ways to create a vibrant community around WCT users and developers which makes WCT sustainable into the future.

Prior to the workshop, participants will be given instructions to install WCT on the laptops they will be encouraged to bring to the workshop.

Target Audience:

Existing and potential WCT users, plus interested individuals from the wider web archiving community. Through the WCT support channels we regularly encounter people and institutions new to web archiving who want to try the WCT. It is often viewed as having a low technical barrier to general use, which we believe is important in bringing new participants into web archiving. Even after 10+ years, the WCT remains one of the most widely used open-source enterprise solutions for web archiving.

Participant numbers estimated at 20.

Program:

Introductions (25 mins): names, institutions, experience with WCT

About WCT Collaboration (25 mins)

  • Year 1
  • What’s new in WCT version 2.x
  • Documentation
  • Current development roadmap in relation to other web archiving techniques

Demo with Q&A (40 mins)

  • General overview of the WCT features
  • Demo of version 2.x features
  • Install / upgrade demo (participants can install WCT if they have not already done so)
  • Advice on migrating Heritrix profiles
  • Technical implementation at the KBNL & NLNZ

Break (30 mins)

Hands-on Learning (45 mins)

  • Try the basic features of WCT
    • Set up user accounts
    • Select & scope target websites
    • Schedule & run crawl
    • QA crawl

Community Engagement (45 mins)

  • Making the wish list: what are the needs of users?
  • What are the barriers to adopting WCT and how can we overcome them?
  • How to contribute to this emerging community?
  • WCT support

Background:

In 2006 NLNZ and the British Library developed the WCT, a collaborative open-source software project conducted under the auspices of the IIPC. The WCT manages the web harvesting workflow, from selecting, scoping and scheduling crawls, through to harvesting, quality assurance, and archiving to a preservation repository. NLNZ has used the WCT for its selective web archiving programme since January 2007.

However, the software had fallen into a period of neglect, with mounting technical debt: most notably its tight integration with an outdated version of the Heritrix web crawler. While the WCT is still used day-to-day in various institutions, it had essentially reached its end of life, having fallen further and further behind the requirements for harvesting the modern web. The community of users has echoed these sentiments over the last few years.

During 2016/17 NLNZ conducted a review of the WCT, examining how it fulfilled business requirements and comparing it to alternative software and services. NLNZ concluded that the WCT was still the closest solution to meeting its requirements – provided the necessary upgrades could be made, in particular an upgrade to the Heritrix 3 web crawler.

At the same time, another WCT user, the KBNL, was going through a similar review process and had reached the same conclusions. This led to collaborative development between the two institutions to uplift the WCT technically and functionally into a fit-for-purpose tool within their respective web archiving programmes.


SARA AUBRY

Bibliothèque nationale de France (BnF)

The WARC file format: update and exchange on latest works

The WARC (Web ARChive) file format was defined to support the web archiving community in harvesting web resources, accessing web archives in a variety of ways, and preserving large numbers of born-digital files over the long term. It was initially released as an ISO international standard in May 2009 and first revised, within a minor scope, in August 2017. The next revision vote is currently scheduled for 2022, with publication in 2025.
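For orientation, a WARC record pairs a small set of named header fields with a content block. The minimal "response" record below is assembled by hand purely to make that layout concrete; in practice, production tools (Heritrix, warcio and the like) write these records.

```python
import uuid
from datetime import datetime, timezone

def build_warc_response_record(target_uri, http_payload):
    """Assemble a minimal WARC/1.1 'response' record by hand:
    a version line, named header fields, one blank line, the
    content block, then two CRLFs closing the record."""
    fields = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date",
         datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length covers the block only, not the headers.
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.1\r\n" + "".join("%s: %s\r\n" % kv for kv in fields)
    return head.encode("utf-8") + b"\r\n" + http_payload + b"\r\n\r\n"

payload = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
record = build_warc_response_record("http://example.org/", payload)
```

Several of the revision topics below (provenance headers, naming and compression conventions) concern exactly these header fields and how records are laid out inside files.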

This discussion aims at gathering IIPC members who expressed interest in introducing changes and evolutions to WARC during “The WARC file format: preparing next steps” workshop during the IIPC GA and WAC in Wellington, November 2018.

The objective is to exchange on the first use cases, tests and practical implementations around the topics already identified (related resources, possible extensions for HTTP/2, identifying provenance headers, keeping track of dynamic history, clarifying WARC file naming and compression), and beyond if needed.

Exchanges on IIPC Github (http://iipc.github.io/warc-specifications/) and Slack (https://iipc.slack.com) channels will be used to prepare and structure the discussion before the face-to-face meeting.


JULIEN NIOCHE

CameraForensics

Introduction to web crawling with StormCrawler (and Elasticsearch)

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we will put it to use for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we’ll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.
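Much of the behaviour covered in the session (politeness, URL filtering, parsing) is driven by the crawler's YAML configuration rather than code. The heavily abbreviated fragment below is a sketch only; the key names follow commonly documented StormCrawler conventions but have not been checked against any particular release.

```yaml
# crawler-conf.yaml -- abbreviated sketch; verify key names against
# the StormCrawler documentation for your release
config:
  http.agent.name: "my-test-crawler"   # identify your crawler (required)
  fetcher.server.delay: 1.0            # per-host politeness delay, seconds
  urlfilters.config.file: "urlfilters.json"
  parsefilters.config.file: "parsefilters.json"
```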

Agenda

We will cover the following topics:

  • Introduction to web crawling
  • Apache Storm: architecture and concepts
  • StormCrawler: basic building blocks
  • How to use the archetype
  • Building & configuring
  • URLFilters, ParseFilters
  • Simple recursive crawls
  • How to debug?
  • Distributed mode: UI, logs, metrics
  • Elasticsearch resources
  • WARC module
  • Q&As

Audience

This course will suit Java developers with an interest in big data, stream processing, web crawling and archiving. It will provide a practical introduction to Apache Storm and Elasticsearch, as well as, of course, StormCrawler, and will not require advanced programming skills.

Prerequisites

Attendees should bring their own laptop with Apache Maven and Java 8 or above installed. The examples and instructions will be conducted on a Linux distribution using the Eclipse IDE. Ideally, students should look at the Apache Storm and StormCrawler documentation beforehand and think about particular websites or crawl scenarios they might be interested in.


HELENA BYRNE & CARLOS RARUGAL

British Library

Reflecting on how we train new starters in web archiving

Web archiving is a niche area of expertise in the information management sector for both curatorial and technical staff. Professionals in this sector have built up their knowledge over time by learning on the job from colleagues and through research outputs disseminated through blog posts, academic papers and conference presentations. As there is no formal structure to introduce new staff members to web archiving, how these colleagues are trained depends on what resources individual institutions have to hand. The IIPC has worked on developing a collective approach to training within this sector but there are still gaps in our knowledge.[1]

At the British Library we use a range of strategies to train new starters and the external partners we work with on building new collections as part of the UK Web Archive. When training new members of staff on the curatorial side, we usually begin with a discussion that gives the background to web archiving at the British Library, an overview of the UK Non-Print Legal Deposit Regulations that came into force in April 2013, and a walkthrough of our curatorial tool W3ACT. New starters are then given a reading list of relevant blog posts about working at the UK Web Archive and its technical limitations and possibilities to read over the course of their first week, along with a set of practice seeds to become more familiar with W3ACT. Once the staff member is comfortable with W3ACT, they are given an overview of the Quality Assurance (QA) strategies used at the BL and put to task on a special project, such as doing QA and seeking open access permissions for a curated collection or a subsection of a large curated collection. Only recently has this process become more formalised, with additional materials produced to complement the original support materials such as the W3ACT User Guide: we have produced a comprehensive QA Guide for internal use and are developing training videos over the course of this year.

The technical staff at the British Library work in a separate department from the curatorial staff and go through a different training process. Most of their training is on the job, complemented by documentation on internal wiki pages, and there are plans to develop more formal support materials over the next year that could also be used by curatorial staff.

There is ongoing debate about whether there is such a thing as different learning styles, but one thing is certain: people have preferences for how they like to communicate with each other. [2]

Most people when learning can relate to the Benjamin Franklin quote ‘tell me and I forget, teach me and I may remember, involve me and I learn’. [3]

It can be very challenging to find the most effective way to involve a trainee in web archiving and to transfer your specialist knowledge. However, before taking on new strategies, it is important to understand your own beliefs about training and what actions you currently take when training new staff. Reflecting on these points can make you more aware of any biases you may have in terms of preferred training delivery style, which could run counter to what the trainee really needs.

This is an introductory workshop on how reflective practice can improve our work practices. Participants will reflect on the positive and negative methods they use when training new staff members in web archiving, as well as on how they can assess the success of this training, whether through a formal or informal review at the end of a probation period, a mid-year review, or a mandatory Performance Management Review (PMR).

One definition of reflection is that ‘it is a basic part of teaching and learning. It aims to make you more aware of your own professional knowledge and action by ‘challenging assumptions of everyday practice and critically evaluating practitioners’ own responses to practice situations’. The reflective process encourages you to work with others as you can share best practice and draw on others for support. Ultimately, reflection makes sure all students learn more effectively as learning can be tailored to their needs’. [4]

By being more aware of what we do and why we do it, we can create a trainee centred learning environment. This workshop will have the following outcomes:

  1. Participants will become more aware of their current practice of training new staff.
  2. Participants will learn from colleagues what strategies have worked or not worked in the past.
  3. Participants will develop their reflective practice skills that they can apply to other elements of their work practices.
  4. The main discussion points that come out of this workshop will be written up in a blog post that will be shared with the web archiving community. It is hoped that it can help inform the development of web archiving training materials.

References:

[1] IIPC Training Working Group, formed in December 2017, http://netpreserve.org/about-us/working-groups/training-working-group/ (accessed January 3, 2018); IIPC Past projects, ‘How to fit in? Integrate a web archiving program in your organization’, http://netpreserve.org/projects/how-fit-integrate-web-archiving-program-your-organization/ (accessed January 3, 2018).
[2] The Atlantic, ‘The Myth of Learning Styles’, https://www.theatlantic.com/science/archive/2018/04/the-myth-of-learning-styles/557687/ (accessed December 20, 2018).
[3] Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).
[4] Cambridge International Education Teaching and Learning Team, ‘What is reflective practice?’ https://www.cambridge-community.org.uk/professional-development/gswrp/index.html (accessed December 20, 2018).