Sara Aubry & Géraldine Camile, Bibliothèque nationale de France (BnF)
Thomas Drugeon, Institut national de l’Audiovisuel (INA)
Sabine Schostag, Royal Danish Library
From videos to channels: archiving video content on the web
Archiving video content on the web poses particular challenges: web archiving institutions need specific technical strategies not only to collect this content but also to give access to it and ensure its long-term preservation. Based on the experience of the Bibliothèque nationale de France (BnF), the National Audiovisual Institute (INA) and other institutions, this panel will explore the issues raised and the different approaches used.
The BnF first crawled videos included in web pages, and from 2008 to 2013 performed a specific crawl of Dailymotion, the most-used video platform in France. The crawl used Heritrix, but each crawl required other tools and new analyses, and the BnF was unable to maintain this specific crawl. For the 2017 presidential elections, the BnF subcontracted the crawl of 28 YouTube channels. The crawl, performed by Internet Memory Research, included the web pages, the videos and API metadata. Further development was necessary to include these videos in the preservation workflow and to provide access.
With the lessons learned from this experience, the BnF was able to perform an in-house crawl of YouTube in 2018, using Heritrix 3 with additional tools to extract metadata and the URL of the video file. The process was included in our standard workflow, simplifying the preservation process. In the BnF access interface, based on OpenWayback, it is possible to view the web pages, with the video in an FLV player that replaces the YouTube player. Metadata collected during the crawl make it possible to link each page to its video file and to list all the videos collected from the same channel across different crawls.
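The linking described above can be illustrated with a minimal sketch. It assumes that each crawled video yields one metadata record; the record fields (`crawl_id`, `channel_id`, `page_url`, `video_file_url`) and the function name are hypothetical and do not reflect the BnF's actual data model or tools.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRecord:
    """Hypothetical metadata record extracted during a crawl."""
    crawl_id: str
    channel_id: str
    page_url: str
    video_file_url: str


def build_indexes(records):
    """Build two lookups from crawl metadata.

    page_to_video: (crawl_id, page_url) -> video_file_url, so that the
        access interface can substitute its own player for the platform's.
    channel_to_videos: channel_id -> all records for that channel,
        regardless of which crawl collected them.
    """
    page_to_video = {}
    channel_to_videos = defaultdict(list)
    for record in records:
        page_to_video[(record.crawl_id, record.page_url)] = record.video_file_url
        channel_to_videos[record.channel_id].append(record)
    return page_to_video, dict(channel_to_videos)


# Illustrative records only; the URLs are placeholders.
records = [
    VideoRecord("crawl-A", "channel-1", "https://example.org/watch?v=a", "https://example.org/files/a.flv"),
    VideoRecord("crawl-B", "channel-1", "https://example.org/watch?v=b", "https://example.org/files/b.flv"),
    VideoRecord("crawl-B", "channel-2", "https://example.org/watch?v=c", "https://example.org/files/c.flv"),
]
page_to_video, channel_to_videos = build_indexes(records)
```

Keying the page lookup on both the crawl and the page URL keeps captures of the same page from different crawls separate, while the channel lookup deliberately merges them.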
INA has been continuously collecting videos from platforms since 2008. As of January 2019 we have collected 21 million videos across 16 platforms, including YouTube, Twitter, Facebook and the main TV/radio broadcast platforms. This represents 2 million hours of content, made accessible to researchers through a specialized search engine as well as directly from the archived pages on which the videos were published. A unified TV/radio/web access is also in the making, giving access to web videos in the same context as the broadcast programs they relate to.
INA automatically crawls videos found embedded in archived web pages or published on one of the 7000 followed channels. Crawling relies on specialized robots developed and maintained in-house, making it easier to follow technical changes in publication methods. Metadata are extracted and normalized, while videos are kept in their original format. Conversions are performed on the fly by the archive video server when deemed necessary (e.g. FLV to MP4 or WebM conversion when Flash has to be avoided) to ensure compatibility with the target device without having to resort to batch conversions.
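The on-the-fly conversion decision described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the format labels and the preference order are hypothetical, and the sketch covers only the choice of a delivery format, not the transcoding itself or INA's actual video server.

```python
def choose_delivery_format(original_format, supported_formats, preferred=("mp4", "webm")):
    """Pick the format to serve for one playback request.

    Returns the original format when the target device supports it
    (no conversion needed), otherwise the first preferred format the
    device supports, or None when no compatible format exists.
    The preference order is an assumption made for this sketch.
    """
    original = original_format.lower()
    supported = {fmt.lower() for fmt in supported_formats}
    if original in supported:
        return original  # serve the archived file as-is
    for candidate in preferred:
        if candidate in supported:
            return candidate  # convert on the fly to this format
    return None


# A device without Flash support requesting a video archived as FLV.
delivery = choose_delivery_format("flv", {"mp4", "webm"})
needs_conversion = delivery is not None and delivery != "flv"
```

Because the decision is made per request, the archived file stays in its original format and a new target format only requires extending the preference list.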
This panel will present different approaches to the challenges of crawling, preserving and giving access to web video content. Speakers will be asked to briefly present a panorama of the strategies and techniques used, with time kept for discussion among speakers and with the audience to compare the advantages and disadvantages of the different approaches and to identify means of improvement.