IIPC RSS Webinar: Web Archive Collections as Data

The IIPC Research Speaker Series (RSS) focuses on the research use of web archives and features presentations of use cases, collaborative projects and new tools for researchers. This presentation will explore the National Library of Norway’s Web News Collection as well as the most recent projects related to access and data extraction at the Danish web archive, Netarkivet, through the Royal Danish Library.

Digital Text Analysis of the ‘Web News Collection’

The National Library of Norway (NB) recently released a ‘Web News Collection’ with more than 1.5 million news articles from 268 news websites. The objective is to facilitate digital text analysis and ‘distant reading’ of web news content. Providing content and metadata via an openly available API allows for a multitude of research approaches. Scholars, students and others can tailor their own text corpora, get snippets of text around a keyword (concordances), analyze related words and concepts (collocation analysis), and much more. This talk will:

  • Introduce how text data has been reframed for the purpose of Natural Language Processing (NLP)
  • Show how data and metadata can be retrieved through the open API
  • Show how researchers, students and others can utilize an experimental notebook to build corpora, get text snippets and analyze related words
  • Address current opportunities and limitations of working with the ‘Web News Collection’
  • Sketch out future efforts to make the NB’s Web Archive more accessible for research
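
As a rough illustration of the two techniques named above, the sketch below builds concordances and counts collocates over a small in-memory list of texts. It does not call the NB API: the function names (`tokenize`, `concordance`, `collocates`), the `window` parameter and the sample sentences are all assumptions made for this example, and the collocation step uses raw co-occurrence counts rather than the statistical association measures that corpus tools typically apply.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())


def concordance(texts, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    keyword = keyword.lower()
    hits = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, tok, right))
    return hits


def collocates(texts, keyword, window=3):
    """Count the words that occur within `window` tokens of keyword."""
    keyword = keyword.lower()
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                counts.update(tokens[max(0, i - window):i])
                counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample sentences, standing in for article texts from a corpus.
sample = [
    "The web archive preserves news articles for research.",
    "Researchers query the archive through an open API.",
]

for left, word, right in concordance(sample, "archive", window=2):
    print(f"{left} [{word}] {right}")

print(collocates(sample, "archive", window=2).most_common(3))
```

In practice the texts would come from a corpus built through the collection's API or notebook, and the raw counts would usually be compared against a reference corpus to find words that are distinctive rather than merely frequent.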

BIO: JON CARLSTEDT TØNNESSEN

Trained as a digital historian, Jon Carlstedt Tønnessen is a Research Librarian at the National Library of Norway who has brought perspectives from the Humanities into the Norwegian Web Archive. With a passion for end-user-oriented development, he has spent the last three years facilitating researcher access and prototyping tools and services for the research community. Jon is also affiliated with the DH-lab at the National Library of Norway.

 

Access to Netarkivet and Extraction of Data for Researchers

For researchers interested in Denmark’s web history, the Royal Danish Library offers access to Netarkivet, Denmark’s web archive. This collection preserves websites, ensuring historical documentation of the Danish digital landscape. Scholars can request access as well as get data extracted through specific procedures (adhering to legal regulations) and benefit from tools and services designed for data analysis. The archive covers a wide range of materials, from news to social media, reflecting the nation’s cultural and societal evolution online.

For more information, see the Royal Danish Library’s official Netarkivet page.

This talk explores Netarkivet’s procedures and considerations from the researchers’ initial interest in using web archive data to when access is obtained and/or data is eventually delivered. Netarkivet has been involved in projects ranging from analyzing a concept’s trend over the years to extracting almost all text – or all Danish text – for further use in research projects.

BIO: ANDERS KLINDT MYRVOLL

Anders Klindt Myrvoll has been the Programme Manager at the national Danish web archive, Netarkivet, at the Royal Danish Library since 2018. Together with colleagues, he is collecting, preserving and providing access to the Danish web. Prior to web archiving, Anders worked for more than 13 years in the broadcast, film and media industry, collaborating globally on high-end localization, making original content for children, saving digital cultural heritage, strategy, optimization, leadership and much more. You can find him on LinkedIn or @andersklindt on X/Twitter.

 

Date

06 Nov 2024

Time

2:00 PM - 3:00 PM

Local Time

  • Timezone: America/New_York
  • Date: 06 Nov 2024
  • Time: 9:00 AM - 10:00 AM
