14:30 – 14:50

Julien Nioche, CameraForensics
Sebastian Nagel, CommonCrawl

Julien Nioche & Sebastian Nagel: StormCrawler at Common Crawl and other use cases

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers with Apache Storm. This talk will introduce the project and its main features as well as the ecosystem of tools that can be used alongside it. We will give a few examples of how it is being used by different organisations around the world with varying volumes of data.

The second part of the talk briefly presents the Common Crawl News data set, a continuously growing collection of news articles from news sites all over the world. We demonstrate how StormCrawler is used as an archiving crawler and adapted to fit the requirements: the detection and use of news feeds and sitemaps, the prioritisation of recently published articles, and the challenge of avoiding crawling historic news from the archive sections of news sites and agencies.

14:50 – 15:10

Eléonore Alquier: From linear to non-linear broadcast contents: how to think the “audiovisual augmented archive”

French Audiovisual Institute (INA)

The French Audiovisual Institute (INA) has a mission to collect, preserve, restore and communicate France’s radio and television heritage. The law of 20 June 1992 gave INA responsibility for the legal deposit of broadcast audiovisual materials. Since 2006, when the French legal deposit was extended to public online web content, INA’s dlweb (“dépôt légal du web”) team has been responsible for the collection and preservation of French audiovisual (AV) and media-related web content.

Considered at first as additional documentation for broadcast media, this web content has evolved over time to become a full-fledged media extension, a major replay platform, and even a broadcasting channel, gradually shaping its own editorial codes and logic. As a consequence, INA’s methods for documenting and accessing TV and web archives have adapted to the evolving media ecosystem, highlighting interactions between TV, radio and the web in terms of the production and consumption of these media. In parallel, as INA’s archive data models are being redefined and new documentation tools developed, INA has seized the opportunity to better articulate web and broadcast archives.

Beyond sourcing, archiving and giving access to audiovisual websites since 2009, INA has thus expanded its collections to video and social network publications, and developed specific tools and methods to improve the user (researcher) experience. This archive is now approaching 80 billion items, including 880 million tweets and over 2 million hours of video. Dedicated access tools have been developed, including search engines and assisted browsing. The need to strengthen the relation between this huge mass of information and INA’s TV and radio collections quickly emerged, and the current sourcing, curation and access methods reflect this need, especially regarding the Twitter archive.

This submission aims to present the evolution of (traditionally linear) audiovisual archiving methods, given the need to consider related web content: why and how broadcasters tend to create non-linear content, and the impact of these new practices on collecting, documenting, curating and accessing this so-called “augmented archive”. It will tackle issues such as how to guarantee the coherence of audiovisual collections when a linear medium tends to produce more and more web-exclusive content, and what the impact is on the skills expected of information professionals.

The way broadcasters provide and develop new access to their content has to be taken into account when defining not only the processes of collecting and archiving, but also the design of the user interfaces and tools that an archive institution will provide. From collecting to curation, broadcasters’ online practices challenge our ability to adapt to a permanently evolving archive.

15:30 – 15:40


Crawling strategies and tools