Technical infrastructure

11:00 – 11:20

Miroslav Milinović & Draženko Celjak: Technical development of Croatian Web Archive in past 15 years

SRCE – University of Zagreb University Computing Centre

Although the Croatian Web Archive was launched in 2004, development of the necessary tools started earlier, with the Croatian Web Measurement project (MWP). In 2002 the team from SRCE – University of Zagreb University Computing Centre measured the Croatian Web for the first time. The goal was to estimate the size and complexity of the Croatian Web and to acquire basic information about its content. For that purpose, the team developed custom software. Building on the experience gained, and in cooperation with the National and University Library, the team extended the software developed for the MWP project with web capturing and archiving capabilities. In addition, a web interface was developed for configuring and managing the capturing process.

Based on those early results, the Croatian Web Archive was officially launched in 2004 as a system for gathering and storing legal deposit copies of Croatian web resources with scientifically or culturally relevant content. Selective capturing of web resources was based on the National and University Library’s online catalogue. Over time, selective capturing was complemented with domain harvesting and thematic harvesting features.

Today the Croatian Web Archive contains a collection of more than 63,000 instances of websites as a result of selective capturing, 8 domain harvests and 10 thematic harvests. All of that content is available online, to end users and to services via an OAI-PMH interface. End users can browse and search the Archive’s content using various criteria.
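As an illustration of how a service might address such an OAI-PMH interface, the following minimal sketch builds a ListRecords request URL. The base URL is a placeholder (the abstract does not give the Archive's real endpoint), and `list_records_url` is a hypothetical helper name; only the verb and argument names come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not the Archive's real address.
BASE_URL = "https://archive.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix or from.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


first_page = list_records_url(BASE_URL, from_date="2011-07-01")
next_page = list_records_url(BASE_URL, resumption_token="token-123")
```

A harvesting client would fetch the first URL, parse the returned XML, and keep requesting pages with the resumption token it receives until none is returned.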

The first harvest of the Croatian top-level Internet domain .hr took place in July and August 2011. That harvest gave us valuable experience, which let us enhance the architecture of the harvesting part of the system to make it more efficient and faster.

This talk emphasises the experience gained through planning, executing and analysing the results of web harvesting, selective web capturing and web measurement. We present the technical challenges we have encountered over time and the remedies we used to ensure the desired functionality of the Croatian Web Archive.

11:20 – 11:40

Márton Németh & László Drótos: Metadata supported full-text search in a web archive

National Széchényi Library

Content from web archives can be retrieved at various levels. The simplest solution is retrieval by URL. In this case, however, we must know the exact URL of the archived web page in order to retrieve the desired information. The next level is to search the titles of home pages (or other metadata elements found in their source code). The texts of links pointing to a website can also be made searchable. In these cases, however, relevant hits can be retrieved only at the level of individual websites. Although metadata can also be extracted from various archived file types (such as HTML and PDF), in our experience this kind of metadata is often missing, and even where it exists, it is sometimes too general or ambiguous. Searching for exact, narrow topics is therefore possible only with a full-text search function. Here, ranking by relevance is the biggest challenge. Google has a ranking algorithm that has been developed for 20 years and uses more than 200 parameters. The company is also building an enormous database based on search and retrieval preferences, interactions and other user-based features. These algorithms and databases are not available to national libraries.

In the course of the Hungarian Web Archiving Project we have started an experiment to find out how website-level metadata recorded by librarians (e.g. genre, topic, subject, uniform title) can be used to filter the result lists generated by full-text search engines, how to refine search queries, and how to display hits in a more comprehensive and user-friendly way.
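One way to picture metadata-based filtering of full-text results is a Solr request that combines a free-text query (`q`) with filter queries (`fq`) on librarian-assigned fields. The sketch below only assembles the request parameters; the field names `genre` and `topic` and the helper names are assumptions for illustration, not the schema or code used in the project.

```python
from urllib.parse import urlencode


def solr_quote(value):
    """Quote a value as a Solr phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_select_params(text, metadata_filters, rows=20):
    """Combine a full-text query with one filter query per metadata field."""
    params = [("q", text), ("rows", str(rows))]
    # Each fq narrows the full-text hits without influencing their ranking.
    for field, value in sorted(metadata_filters.items()):
        params.append(("fq", field + ":" + solr_quote(value)))
    return params


# Hypothetical field names and values, for illustration only.
params = build_select_params("folk music", {"genre": "blog", "topic": "music"})
query_string = urlencode(params)
```

Keeping the metadata restrictions in `fq` rather than in `q` means the relevance ranking still depends only on the full-text match, while the curated metadata decides which sites are eligible at all.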

In the first part of the presentation we offer a brief overview of the metadata structure currently used at the National Széchényi Library. This schema follows the recommendations of the OCLC Web Archiving Metadata Working Group. We then briefly present the SolrWayback search engine developed by Danish partners, which is currently running and being tested on our demo collection. Next, we introduce another Solr-based search system, developed at the National Széchényi Library, that can retrieve and take into account data from XML-based metadata records. In the last part of the presentation we give an overview of some future opportunities for metadata enrichment using information retrieved automatically from namespaces and thesauri. In this way we could add a semantic layer to the search and retrieval process of web archives.

11:40 – 12:00

Rafael Gieschke & Klaus Rechert: Preserving web servers

University of Freiburg

Preserving Web 2.0 sites can be a difficult task. For the most basic Web sites (“static Web pages”), it is sufficient to preserve a set of files (a task which can also be done from outside the system using a harvester) and serve them using any Web server. With the advent of the so-called Web 2.0, a harvesting approach is limited, as these sites use server-side logic to process the user’s requests. In some cases, especially if the range of inputs is known and fixed, a technique of recording HTTP requests and their respective responses (as used by webrecorder.io) can be employed; for more advanced and especially interactive cases, traditional harvesting techniques reach their limits. These cases include retired content management systems, intranet servers, database-driven Web frontends, scientific (project) Web servers with functional services (WS/REST/SOAP), digital art, etc.
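The limitation of record-and-replay can be made concrete with a small sketch. It stores responses keyed by the exact request and can therefore only answer requests that were seen during recording. The class and method names are hypothetical, and this is a conceptual illustration rather than a description of how webrecorder.io is implemented.

```python
class ReplayArchive:
    """Toy record/replay store keyed by the exact request."""

    def __init__(self):
        self._responses = {}

    def record(self, method, url, body, response):
        # The key is the full request, so any variation in the input
        # counts as a different, unrecorded request.
        self._responses[(method, url, body)] = response

    def replay(self, method, url, body=b""):
        # Returns None when the request was never recorded: there is
        # no server-side logic left to compute a fresh answer.
        return self._responses.get((method, url, body))


archive = ReplayArchive()
archive.record("GET", "http://example.org/search?q=web", b"", b"<html>results for web</html>")

recorded = archive.replay("GET", "http://example.org/search?q=web")
unseen = archive.replay("GET", "http://example.org/search?q=archive")
```

Here `recorded` holds the stored page, while `unseen` is `None`: a query that was not issued at capture time cannot be answered, which is the gap that preserving the server itself is meant to close.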

We present a concept for preserving the computer systems running the Web servers themselves instead of only harvesting their output. This approach promises a more complete preservation of the original experience. The preserved systems are then accessible on demand using the Emulation as a Service framework. One of the main challenges for access workflows is the security of the archived machines. As these machines are archived to remain in their original state, a (permanent) Internet connection could be harmful. We present a solution for securely relaying the requests of a user’s Web browser (or any other Web client) to these emulated Web servers. Different access scenarios are supported, e.g., using a current Web browser, orchestrated access using an emulated Web browser (e.g., for Web sites featuring Adobe Flash or Java applications), as well as a “headless” mode for script or workflow integration.

For a more complete user experience, integration of the presented techniques with traditional harvesting techniques is the necessary next step. For instance, a preserved Web server might itself depend on external data from other Web pages that are no longer accessible on the live Internet but have been preserved by a harvester, and vice versa, so that a new level of orchestration across the various Web preservation and access methods becomes necessary.

12:00 – 12:20

Gil Hoggarth: The infrastructure behind web archiving at scale

The British Library

This conference promotes the value of Web Archiving and explains the services and tools needed to create such a system. However, anyone who has ever ventured to put these components together for a production service (or just in preparation for one) will appreciate the complexity of the challenge. And before that first production service exists, the size of the task, especially in terms of the volume of data being handled, is often underestimated.

This presentation will delve into the conceptual areas of a production Web Archiving service that can manage both the volume of data and the impact that volume has on processing. These high-level areas include:

– The management of website targets, crawl dates, access licence/s, inclusion into subject collections
– Web site crawling
– Storage of crawled data as WARC files
– The link between website URLs and WARC records, handled by a CDX service
– Website presentation via a wayback player
– Making the crawled data searchable
– Managing access to the crawled website data
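To illustrate the link between website URLs and WARC records listed above, the following is a minimal in-memory sketch of what a CDX-style index does: for a URL and a requested time, it returns the location (WARC file, byte offset, length) of the closest capture. The class name, file names and numbers are invented for illustration; a production CDX service adds URL canonicalisation, persistence and far larger scale.

```python
import bisect
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamps, e.g. 20110701120000


class CdxIndex:
    """Toy index mapping (URL, timestamp) to a WARC record location."""

    def __init__(self):
        # url -> sorted list of (timestamp, warc_file, offset, length)
        self._captures = defaultdict(list)

    def add(self, url, timestamp, warc_file, offset, length):
        bisect.insort(self._captures[url], (timestamp, warc_file, offset, length))

    def closest(self, url, timestamp):
        """Return the capture of `url` nearest in time to `timestamp`, or None."""
        captures = self._captures.get(url)
        if not captures:
            return None
        wanted = datetime.strptime(timestamp, TS_FORMAT)
        # Position of the first capture at or after the requested time.
        i = bisect.bisect_left(captures, (timestamp,))
        # Only the neighbours on either side can be the closest capture.
        candidates = captures[max(i - 1, 0):i + 1]
        return min(
            candidates,
            key=lambda c: abs(datetime.strptime(c[0], TS_FORMAT) - wanted),
        )


index = CdxIndex()
index.add("http://example.org/", "20110701000000", "crawl-0001.warc.gz", 1024, 5120)
index.add("http://example.org/", "20110815000000", "crawl-0042.warc.gz", 2048, 4096)

hit = index.closest("http://example.org/", "20110710000000")
miss = index.closest("http://example.org/missing", "20110710000000")
```

A wayback player would use the returned file name, offset and length to read exactly one record from storage instead of scanning whole WARC files, which is what makes replay feasible at large data volumes.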

During the presentation, an overarching infrastructure should become clear, one that helps individuals and institutions alike to appreciate the necessary and optional components that make up a Web Archive service. After the presentation, this visual overview will be made available for conference attendees to consider and annotate, so that it becomes an enriched document of the components used by the (attending) Web Archive community.

12:20 – 12:30

Q&A
