IIPC RSS Webinar: Nordic web archives for researchers: access, tools and services
This webinar will bring together speakers from three Nordic web archives to discuss access, tools, and services for researchers.
- Anders Klindt Myrvoll, Netarkivet, Royal Danish Library
- Jon Carlstedt Tønnessen, National Library of Norway
- Sanna Haukkala & Samuli Sairanen, National Library of Finland
Netarkivet, Royal Danish Library
- What we collect under the Danish Legal Deposit Act, via broad, selective, event and special harvests: a finely meshed net that captures as much relevant content as possible, with a focus on content behind paywalls and hard-to-crawl social media.
- Who can access the web archive? Currently the archive is open only to researchers and PhD students affiliated with a Danish research institution, following an application.
- How researchers can use the archive. The archive holds more than 1 PB of data, 3.7 million domains, more than 40 billion objects, and 118,000 billion characters (32 TB) in text/content fields. For search, discovery, visualization and playback we use the open source software SolrWayback, which originates from the Royal Danish Library and builds on the UKWA Solr-based warc-indexer framework. It is possible to run free-text searches across the entire content of the archive, and I’ll give some examples to showcase the strength of this approach.
We are also in the process of transitioning from OpenWayback to pywb, used in tandem with SolrWayback, for enhanced playback, especially for sites that rely on POST requests, such as Instagram and Facebook, and in many other cases. In connection with specific research projects, it is also possible to receive data from the web archive, typically as a paid service under a data agreement. We will have a look at the process as well as a few use cases, from smaller extractions to massive amounts of data.
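To give a rough idea of what free-text search against a Solr index can look like, here is a minimal sketch that only builds a query URL for Solr's standard select handler. It is an illustration under assumptions, not a description of Netarkivet's setup: the base URL and core name are hypothetical placeholders, and the field names (content, domain, crawl_year, url, crawl_date, title) are assumed to follow the warc-indexer schema and may differ in a given installation.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, domain=None, year=None, rows=10):
    """Build a URL for a phrase search against a Solr core.

    Field names below are assumptions based on the warc-indexer schema.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'content:"{phrase}"'),        # free-text phrase search
        ("fl", "url,crawl_date,title"),      # fields to return
        ("rows", str(rows)),                 # number of hits to return
        ("wt", "json"),                      # response format
    ]
    # Optional filter queries narrow the result set.
    if domain is not None:
        params.append(("fq", f"domain:{domain}"))
    if year is not None:
        params.append(("fq", f"crawl_year:{year}"))
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr core; replace with a real endpoint.
    print(build_solr_query(
        "http://localhost:8983/solr/netarchivebuilder",
        "klimaændringer",
        domain="dr.dk",
        year=2015,
        rows=5,
    ))
```

The sketch stops at building the URL because access to the archive itself is restricted; sending the request would require an index that the user is permitted to query.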
For more information see:
- https://www.kb.dk/en/policies-and-strategies/royal-danish-librarys-strategy-accession-digital-cultural-heritage under "Materiale, som offentliggøres i elektroniske kommunikationsnetværk (ustruktureret)" ("Material published in electronic communication networks (unstructured)"). This section is only in Danish for now, but it should translate well.
Bio: Anders Klindt Myrvoll
Anders Klindt Myrvoll has been the Programme Manager at the national Danish web archive, Netarkivet, at the Royal Danish Library since 2018. Together with colleagues, he collects, preserves and provides access to the Danish web. Before moving into web archiving, Anders worked for more than 13 years in the broadcast, film and media industry, collaborating globally on high-end localization, making original content for children, saving digital cultural heritage, strategy, optimization, leadership and much more. You can find him on LinkedIn or as @andersklindt on X/Twitter.
National Library of Norway
The Norwegian Web Archive (NWA) contains valuable sources for research on recent and ongoing transformations of culture and society. However, legal protection of copyright and privacy implies limitations on access and utilization of these data. How does the NWA work to provide access for researchers and develop relevant tools and services?
This talk will give an overview of the NWA’s collection (2001-), explain how the data are collected, and present current efforts to establish access for researchers. Further, it will share experiences from implementing SolrWayback and from developing computational tools for analysis on top of it. This involves enriching archival resources with relevant metadata and striving to align data, access, and services with the FAIR principles.
Bio: Jon Carlstedt Tønnessen
Research Librarian, National Library of Norway
Trained as a web historian, Jon has brought perspectives from the Humanities into the Norwegian Web Archive. With a passion for end-user oriented development, he has spent the last two years facilitating researcher access and prototyping tools and services for the research community. Jon is also affiliated with the DH-lab at the National Library of Norway.
National Library of Finland
This talk will introduce you to the basics of the Finnish Web Archive and its research use. Online materials have been harvested and deposited by the National Library of Finland since 2006, and they are available for use at legal deposit workstations on the premises of eight dedicated libraries in Finland, which together cover a large part of the country geographically.
The most recent amendment to the Copyright Act, in 2023, has also made it possible to make online materials available to researchers through new methods. Improving research use and making preserved materials usable, especially with digital humanities (DH) methods, will be key points in the library's strategies in the coming years.
This talk will also cover the Finnish language model FinGPT-3, which has utilized the contents of the Finnish Web Archive and the Electronic Legal Deposit Collection.
The Legal Deposit Office: https://www.kansalliskirjasto.fi/en/legal-deposit-office
The index of The Finnish Web Archive: https://verkkoarkisto.kansalliskirjasto.fi/
Turku NLP: https://turkunlp.org/gpt3-finnish
Sanna Haukkala has been an information specialist at the Legal Deposit Office of the National Library since 2021, working mostly with the content of the Finnish Web Archive. Sanna has also worked extensively with printed legal deposits, the ephemera collection and music recordings at the National Library of Finland since 2013.
Samuli Sairanen has been an information systems specialist at the Legal Deposit Office since 2019, working with the processes and challenges of electronic material management, and has lately been more involved in web archiving and the infrastructure around it.