WEB ARCHIVING IN NORWAY
The National Library of Norway (NB) is a legal deposit library that gathers publicly available information to preserve testimonies of Norwegian culture and social life, making them available for research and documentation.
NB is distributed in two locations: Oslo and Mo i Rana. The web archiving team includes staff from both locations.
From Usenet Experiments to Browser-based Crawling
Web archiving in Norway started with NB staff archiving Usenet newsgroups in the 1990s. Among the first archived web resources are the 1999 Local Elections, harvested with the now obsolete NedLib harvester. Joining forces with colleagues in Denmark, Finland, Iceland, and Sweden, NB led the Nordic Web Archive project (2000-2004) to develop software tools for gathering, preserving, and accessing web archive collections. This culminated in cooperation with the Internet Archive to develop the Heritrix crawler, which soon replaced NedLib and other legacy archiving tools.
In principle, the legal mandate to gather and preserve digital information was given already in 1989, with a media-independent Legal Deposit Act. However, between 2009-2012, there were disagreements between NB and the Norwegian Data Protection Agency about how the Legal Deposit Act interacted with the Personal Data Act. In periods, this limited NB’s ability to perform broad crawls. The revised June 2015 Norwegian Legal Deposit Act explicitly stated NB’s mandate to harvest any resource openly available through electronic communication networks, also including content behind paywalls if the content would be subject to legal deposit in other media formats, such as news content or music.
During the 2010s, web content became increasingly dynamic. In a situation without any of the later alternatives - such as Browsertrix - NB started developing its own browser-based crawler named Veidemann. The experiences from this will be presented during WAC2025, in the talk titled "Lessons Learned Building a Crawler From Scratch: The Development and Implementation of Veidemann".
Currently, NB harvests the web with several technologies: we use Veidemann for crawling broad or huge volumes of data. We are also testing Browsertrix Cloud for curated or smaller thematic and event-based crawls.
Towards a Research Infrastructure
In later years, NB has expanded its web archiving efforts to include making data available for researchers. We have a pilot service for searching full-text and metadata based on SolrWayback, developed by the Royal Danish Library in the context of NetArchiveSuite. This can be accessed after application with a relevant research purpose. Further, we have created a collection of Web News harvested between 2019-2022, facilitating distant reading and computational text analysis.
The Research Council of Norway (RCN) has recently indicated it will support the building of a state-of-the-art research infrastructure for the Norwegian Web Archive (NWA). The National Library is currently negotiating terms with RCN and will provide more information about the project when an agreement is in place.
How many in the National Library are doing web archiving?
Currently, seven staff members are allocated to the Web Archiving team. This includes developers, curators and a research librarian/data analyst. The Web Archiving team collaborates with NB’s laboratory for digital humanities (DH-LAB), making data available for various purposes like research and technological development. There are also interactions with NB’s AI-lab, and several librarians outside the team are currently being trained to run smaller, thematic crawls. In total, 12-14 staff members are to some degree involved in the web archiving efforts.