- Leah A. Lievrouw: Web history and the landscape of communication/media research
- John Sheridan: Web archiving the government
- Marc Weber: A Common language PAPER
- Steve Jones: Reviving PLATO: methods and challenges of pre-internet archival research PAPER
- Jane Winters: Demonstrating the value of Internet and web histories PAPER
- Elisabetta Locatelli: The role of Internet Archive in a multi-method research project
- Matthew Weber: The perils and promise of using archived web data for academic research
- Federico Nanni: A diachronic analysis of museum websites: methodology and findings SLIDES
- Ralph Schroeder: Web archives and theories of the web
- Jefferson Bailey: Advancing access and interfaces for research use of web archives SLIDES
- Peter Webster: Utopia, dystopia and Christian ethics: early religious understandings of the web AUDIO RECORDING
- Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive SLIDES
- Anat Ben-David & Adam Amram: Of DNS leaks and leaky archival sources: the history of North Korean websites on the Internet Archive
- Harry Raffal: Tracing the online development of the Ministry of Defence (MoD) and the armed forces through the UK Web Archive SLIDES
- Ian Milligan: ‘Pages by kids, for kids’: unlocking childhood and youth history through the geocities web archive SLIDES
- Niels Brügger, Ditte Laursen & Janne Nielsen: Methodological reflections about establishing a corpus of the archived web: the case of the Danish web from 2005 to 2015 PAPER
- Sharon Healy: The web archiving of Irish election campaigns: a case study into the usefulness of the Irish web archive for researchers and historians BLOG POST
- Bolette Jurik & Eld Zierau: Data management of web archive research data PAPER
- Helge Holzmann & Thomas Risse: Accessing web archives from different perspectives with potential synergies PAPER & SLIDES
- Marta Severo: Using web archives for studying cultural heritage collaborative platforms
- Sara Day Thomson: Preserving social media: applying principles of digital preservation to social media archiving PAPER & SLIDES
- Andrew Jackson: Digging documents out of the archived web SLIDES
- Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform SLIDES
- Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace SLIDES
- Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive SLIDES
- Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCS SLIDES
- Lozana Rossenova & Ilya Kreymer: Containerized browsers and archive augmentation SLIDES
- Fernando Melo & João Nobre: Arquivo.pt API: enabling automatic analytics over historical web data SLIDES
- Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture SLIDES
- Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration SLIDES
- Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback SLIDES
- Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS SLIDES
- Abbie Grotke: Oh my, how the archive has grown: Library of Congress challenges and strategies for managing selective harvesting on a domain crawl scale SLIDES
- Ian Cooke: The web archive in the Library: collection policy and web archiving – a British Library perspective
- Kees Teszelszky: The Web Archive as a Rubik’s Cube: the case of the Dutch National Web
- Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication SLIDES
- Nicholas Taylor: Understanding legal use cases for web archives SLIDES
- Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma.cc
- Alex Thurman & Helena Byrne: Archiving the Rio 2016 Olympics: scaling up IIPC collaborative collection development SLIDES
- Els Breedstraet: Creating a web archive for the EU institutions’ websites: achievements and challenges SLIDES
- Daniel Bicho: Preserving websites of research & development projects SLIDES
- Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project SLIDES
- Emily Maemura, Nicholas Worby, Christoph Becker, & Ian Milligan: Origin stories: documentation for web archives provenance
- Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs SLIDES
- Sabine Schostag: Less is more – reduced broad crawls and augmented selective crawls: a new approach to the legal deposit law SLIDES
- Mar Pérez Morillo & Juan Carlos García Arratia: Building a collaborative Spanish Web Archive and non-print legal deposit
- Karolina Holub, Inge Rudomino & Draženko Celjak: A glance at the past, a look at the future: approaches to collecting Croatian web PAPER
- Jane Winters: Moving into the mainstream: Web archives in the press
- Cynthia Joyce: Keyword “Katrina”: a deep dive through Hurricane Katrina’s unsearchable archive
- Colin Post: The unending lives of net-based artworks: Web archives, browser emulations, and new conceptual frameworks PAPER
- Valérie Schafer & Francesca Musiani: Do web archives have politics?
- Richard Deswarte: What can web link analysis reveal about the nature and rise of euroscepticism in the UK? SLIDES
- Tatjana Seitz: Digital Desolation
- Sally Chambers, Peter Mechant, Sophie Vandepontseele & Nadège Isbergue: Aanslagen, Attentats, Terroranschläge: Developing a special collection for the academic study of the archived web related to the Brussels terrorist attacks in March 2016
- Lucien Castex: The web as a memorial: Real-time commemoration of November 2015 Paris attacks on Twitter
- Gareth Millward: Lessons from lessons from failure with the UK Web Archive – the MMR Crisis, 1998-2010 PAPER
- Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices PAPER
- Andrew Jackson: The web archive and the catalogue BLOG POST & SLIDES
- Nicola Bingham: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections SLIDES
- Chris Wemyss: Tracing the virtual community of Hong Kong Britons through the archived web
- David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004 SLIDES
- Brendan Power & Svenja Kunze: The 1916 Easter Rising web archive project SLIDES & PAPER
- Maria Ryan: ‘Remembering 1916, Recording 2016’: community collecting at the National Library of Ireland SLIDES
- Pamela Graham: How do we do it?: Collection development for Web Archives SLIDES
- Helena Byrne: A comparative analysis of URLs referenced in British publications relating to London 2012 summer Olympic & Paralympic Games PAPER
- Steven Schneider: Blogging on September 11, 2001: demonstrating a toolkit to facilitate scholarly analysis of objects in web archives
- Federico Nanni: Two different approaches for collecting, analysing and selecting primary sources from web archive collections SLIDES
- Emily Maemura, Christoph Becker & Ian Milligan: Data, process, and results: connecting web archival research elements
- Sigrid Cordell: Diving in: strategies for teaching with web archives
- Jason Webber: The UK Web Archive SHINE dataset as a research tool
- Tommi Jauhiainen, Heidi Jauhiainen & Petteri Veikkolainen: Language identification for creating national web archives SLIDES
- Martin Klein & Herbert Van de Sompel: Robust links – a proposed solution to reference rot in scholarly communication SLIDES
- Shawn M. Jones, Herbert Van de Sompel, Lyudmila Balakireva, Martin Klein, Harihar Shankar & Michael L. Nelson: Uniform access to raw mementos SLIDES
- Sumitra Duncan: NYARC discovery: promoting integrated access to web archive collections
Department of Information Studies, UCLA
Web history and the landscape of communication/media research
Communication and media scholars have recognized computers as powerful tools and contexts for human communication for decades. The first “new media” researchers in the 1970s and 80s sought to describe and explain how people engage with digital technologies and one another in everyday life, work, and leisure, and how networked computing and telecommunications contrasted with traditional broadcasting, publishing, and cinema. As systems developed from the early non-profit, research-driven ARPANET, to the privatized Internet, to the introduction of browsers, search engines and the Web, to social media and ubiquitous mobile devices, apps, “big data” capture and algorithmic media, the study of communication online grew from a minor subspecialty on the fringes of the communication discipline to the dominant problem area of the field.
This talk surveys two major traditions within communication research and scholarship and their articulations with, and implications for, the emerging field of web history. Although both traditions predate the Web (indeed, the Internet itself), each has left its imprint on how digital media and information technologies are studied and understood. The first aligns with the broadly behavioral, sociological, and economic communication research tradition in North America and elsewhere. It focuses on how the Web and digital technologies are used as medium and milieu for communicative action, relations, and organizing. The second tradition aligns with the cultural-critical media studies perspective historically associated with British and European communication scholarship; here, the focus is on analyses of digital technologies themselves as cultural tools, products, and institutions.
Of course, this binary scheme oversimplifies the complex, nuanced relations and interactions between the two traditions. However, the aim is to review major schools of thought in each one, and to develop a basic scheme that may help situate web history scholarship within the landscape of communication and media research.
The National Archives
Web archiving the government
The National Archives provides and maintains the UK Government Web Archive. In this keynote talk, John Sheridan, the Digital Director at The National Archives, will give an overview of the collection, what distinguishes it from other web archives and how it is curated and maintained. He will offer a retrospective of 20 years of government on the web and reflect on the types of research the government web archive enables. The UK Government Web Archive is a vital national resource – a comprehensive and detailed record of government on the web. It is both an archive and itself a part of government on the web, facilitating and enabling changes to the government web estate as well as recording them. Through web archiving The National Archives has learnt vital lessons which have informed its future strategy as a digital archive. Looking to the future, John will talk about The National Archives' plans for the web archive as it moves to the cloud and how this will enable new types of research.
INTERNET AND WEB HISTORIES
CHAIR: Niels Brügger, NetLab, Aarhus University SPEAKERS: Marc Weber, Steve Jones and Jane Winters
Computer networks are relatively young, with the first wide deployments in the early 1970s, and are still undergoing considerable change as a distinctive area of communication and media. However, the Internet is already central to many domains of contemporary communication, social, political, and cultural life.
By the early 1990s, the previously research-only Internet had beaten out a number of rivals to become the global standard for connecting networks to each other. The World Wide Web in turn beat out a number of other online systems to become the most popular way to navigate information over the Internet. Together, the Web running over the Internet became the basis of our familiar online world.
Work on Internet and Web histories is expanding, and these histories face particular challenges in archives, methods, and concepts that have not yet been systematically discussed. In addition, Internet and Web histories have only recently found their way into more established, mainstream communication and media history, and indeed into historical research in general.
Marking the inauguration of a new journal, Internet Histories: Digital Technology, Culture and Society, the panel participants will focus on some of the interface(s) between Internet and Web histories and give examples of how histories of the Internet and the Web are entangled.
Computer History Museum
A Common language
Both the Internet and the Web beat out numerous rivals to become today’s dominant network and online system, respectively. Many of those rival systems and networks developed alternative solutions to issues that face us today, from micropayments to copyright. But few scholars, much less thought leaders, have any real sense of the origins of our online world, or of the many systems which came before. This exclusivity is a problem, since as a society we are now making some of the permanent decisions that will determine how we deal with information for decades and even centuries to come. Those decisions are about regulatory structures, economic models, civil liberties, publishing, and more. This presentation argues for the need to study the comparative architecture of online information systems across all these axes, and to thus develop a “common language” of known precedents and concepts. Doing so depends on two factors: 1) Preservation of enough historical materials about earlier systems to be able to meaningfully examine them. 2) Interdisciplinary, international efforts around the evolution of networks and online systems.
University of Illinois at Urbana-Champaign
Reviving PLATO: methods and challenges of pre-internet archival research
Contrary to the “canon” of the history of TCP/IP and ARPANET, there is no single, simple history of the Internet; instead, it should be seen as a set of multithreaded, parallel histories, most of which are still to be told. The purpose of this presentation will be to shed light on an early digital, networked system that was the hothouse of several applications that would become standards of social computing. PLATO was a pioneering educational computer platform developed at the Computer-based Education Research Laboratory (CERL) at the University of Illinois at Urbana-Champaign in the 1960s and 1970s. It quickly evolved into a communication system used for educational purposes, but also for social interaction (message boards, real time messaging), collaboration and online gaming.
The PLATO system was one of several precursors to today’s internet, but it has been little studied. Recent work examined the values exhibited by PLATO users, and noted that while PLATO “was not originally conceived of as a computer-mediated communication (CMC) platform or device,” it nevertheless “was reframed as a social platform used for educational purposes” (Latzko-Toth and Jones, 2014, p. 1). Its reframing illustrates the value of the study of internet histories and pre-histories (insofar as PLATO and other computer-mediated communication infrastructures like it predated ARPANET), particularly as those histories entail rhetorical discursive elements regarding technical resources, social values, and ethical norms that continue to shape the development of internet technologies. In this presentation I will describe the challenges related to doing historical research on the PLATO network. The focus will be on historical methods but the presentation will also examine interpretive and theoretical challenges associated with mining the history of pre-internet digital networks.
University of London
Demonstrating the value of Internet and web histories
This presentation will explore the challenges involved in demonstrating the value of web archives, and the histories that they embody, beyond media and Internet studies. Given the difficulties of working with such complex archival material, how can researchers in the humanities and social sciences more generally be persuaded to integrate Internet and web histories into their research? How can institutions and organisations be sufficiently convinced of the worth of their own online histories to take steps to preserve them? How can value be demonstrated to the wider general public? And finally how can universities be persuaded to include working with web archives in digital skills training? It will explore public attitudes to personal and institutional Internet histories, barriers to access to web archives — technical, legal and methodological — and the cultural factors within academia that have hindered the penetration of new ways of working with new kinds of primary source. Rather than providing answers, this presentation is intended to provoke discussion and dialogue between the communities for whom Internet and web histories can and should be of significance.
WEB 25: HISTORIES FROM THE FIRST 25 YEARS OF THE WORLD WIDE WEB
CHAIR: Niels Brügger, Aarhus University SPEAKERS: Elisabetta Locatelli, Matthew Weber and Federico Nanni
This panel celebrates the 25th anniversary of the web. Since the beginning of the 1990s the web has played an important role in the development of the internet as well as in the development of most societies at large, from its early grey and blue web pages introducing the hyperlink for a wider public, to today’s uses of the web as an integrated part of our daily lives, of politics, of culture, and more.
Taking as its point of departure that the World Wide Web was born between 1989 and 1994, this panel presents some historical case studies of how the web started and how it has developed, along with methodological reflections and accounts of how one of the most important source types is provided, namely the archived web.
The three papers of the panel are all part of the edited volume Web 25: Histories from the first 25 years of the World Wide Web (ed. N. Brügger, Peter Lang Publishing, to be published May 2017), and they constitute important steps towards establishing web history as a research field.
Università Cattolica del Sacro Cuore & OssCom
The role of Internet Archive in a multi-method research project
If, on the one hand, the web offers us a platform where content is searchable and replicable, on the other, it cannot be forgotten that web content is perishable, unstable and subject to continuous change. This is a challenge for scholarly research about the historical development of the web. The research presented here examined the historical development of weblogs in Italy, analyzing their technological, cultural, economic, and institutional dimensions.
The approach chosen mixed participant observation, in-depth interviews, and semiotic analysis of blogs and blog posts. Since an important part of the research was about the development of platforms, graphics, layouts, and technology, older versions of blogs were retrieved using the Internet Archive in addition to the interviews. Even though only partial versions of the blogs were archived, this part of the research was important for completing the data obtained from the interviews and the analysis of blogs, since individual memory is not always accurate and some blogs had in the meantime been closed, so that their original posts were no longer accessible.
The perils and promise of using archived web data for academic research
The Web today is an amalgamation of interconnected hubs and spokes of information, and increasingly, an integrated mix of media, including text, photos, audio, movies and live-streaming video. Scholars working with Web-based data face numerous critical challenges navigating the complexity of these data. This presentation takes a forward-looking perspective to consider key research problems associated with large-scale Web data that are likely to continue to challenge researchers in the future. First, the scope of the Web poses questions with regard to the size and time dimensions of research. Second, the nature of Web data points to questions regarding the reliability and validity of Web data. Third, the type of data on the Web poses significant questions with regard to the ethics of research using Web data. This talk will address each challenge in turn by highlighting relevant research and outlining proposed schema for scholars working to address these challenges in the context of large-scale Web-based data.
Università di Bologna & Universität Mannheim
A diachronic analysis of museum websites: methodology and findings
Present research on museum websites highlights mainly the vast potential of digital domains in improving communication. Museums can connect with their visitors by providing them with more tools for personalized visits to the institution through their websites. It is argued that websites also boost attendance to the physical museum, and in fact it is of the utmost importance for a museum to have an informative and well laid-out website in order to enhance potential visitors' desire to go and see it. However, there is very little literature on using these websites as a primary resource to study the recent evolution of the institutions behind them. In our chapter “The changing digital faces of science museums: A diachronic analysis of museum websites” in the edited volume Web 25: Histories from the first 25 years of the World Wide Web, Awesha and I investigated how prominent science museums develop and update their websites over a period of time to communicate better with their visitors. In the panel, I intend to present the methodology we employed for using archived websites as primary sources to trace and examine activities of scientific institutions through the years, as well as our findings on how these organisations have projected themselves on the web and have created a narrative regarding their identities.
Collecting and exploring the ‘now’ and the ‘flow’. A case study on Paris attacks archives
CHAIR: Nicola Bingham, The British Library
SPEAKERS: Valérie Schafer (Iscc, Cnrs/Paris-Sorbonne/UPMC), Marie Chouleur (BnF), Louise Merzeau (DICEN, Université Paris Ouest Nanterre La Défense) and Zeynep Pehlivan (Ina)
The terror attacks that hit France in January 2015, and on November 13th, 2015, sparked intense online activity through websites, live feeds, live Q&As and social media (Twitter or Facebook) (Merzeau, 2015; Badouard, 2016) both during the events, and in the following months.
The Bibliothèque Nationale de France (BnF) and the Institut national de l’audiovisuel (Ina) reacted quickly to archive this unforeseen flow and intense digital activity using specific methods, and also adapted access tools to make these archives meaningful to researchers.
How were their real-time collections and archiving processes conceived, adapted and achieved? What were the main issues regarding the capture of the online activity during these disruptive events? More generally, how does real time harvesting and preservation of ephemeral online traces and their subsequent heritage value affect archivists, researchers etc.? Why and how should we document this type of archive for scholarly uses? Which tools are capable of meeting the requirements of an interdisciplinary project conducted by researchers with heterogeneous technical backgrounds and diverse research questions?
At the crossroads of both topics of the conference, Creating and Using Web Archives, this one-hour panel aims to shed light on the challenges and opportunities related to the collaboration between Web archivists and social science researchers within the project ASAP (From #jesuischarlie to #offenturen: the born digital heritage and its archiving during the events).
The ASAP project aimed to document the archiving of the Web and of Twitter during the Paris attacks, to question the conditions and possibilities of elaborating and exploring corpora, and to bring out the first elements that can emerge from these massive data.
Marie Chouleur (BnF), Louise Merzeau (DICEN, Université Paris Ouest Nanterre La Défense), Zeynep Pehlivan (Ina) and Valérie Schafer (Iscc, Cnrs/Paris-Sorbonne/UPMC) will show how some challenges were addressed: preserving unusual contents, e.g. tweet archiving, and why the researchers felt the need to enter the black box of this Twitter and Web archiving process (Schafer, Musiani, Borelli, 2016) and to consider it within the broader scope of ephemeral traces; and conducting an interdisciplinary collaboration, gathering archivists and researchers in order to explore this abundant digital-born heritage.
- Badouard, R., 2016, « Je ne suis pas Charlie. Pluralité des prises de parole sur le web et les réseaux sociaux », in Lefébure P. et Sécail C., Le défi Charlie. Les médias à l’épreuve des attentats, Lemieux Editeurs.
- Merzeau, L., 2015, « #jesuischarlie ou le médium identité », Médium n°43, Charlie et les autres, 2015/2, https://halshs.archives-ouvertes.fr/halshs-01121510
- Mussou, C., 2012, « Et le Web devint archive : enjeux et défis », Le Temps des Médias 19, pp. 259-266.
- Schafer, V., Musiani, F., Borelli, M., 2016, « Negotiating the Web of the Past. Web archiving, governance and STS », French Journal for Media Research. http://frenchjournalformediaresearch.com/index.php?id=952
What’s in your web archive? Subject specialist strategies for collection development
CHAIR: Nicola Bingham, The British Library
CONTRIBUTORS: James R. Jacobs, Stanford University, Pamela M. Graham, Columbia University & Kris Kasianovitz, Stanford University
Internet research is a burgeoning scholarly pursuit, with corpus- and textual analysis and “distant reading” being used by scholars to look at their disciplines in new and different ways. Researchers can incorporate a growing volume, type and range of resources into their scholarship through the use and analysis of born digital, web-based sources. Web archiving is consequently being leveraged by libraries to expand their collecting activities and to meet the needs of their research communities for enduring access to online content. As researchers begin to use web archives more regularly, the rationale, method and process of collection development emerges as a critical aspect of assessing the value and validity of the web archive as a primary source. Yet the methods and practices of collection development — the first step in any Web archiving endeavor — have received relatively little attention at Web archiving meetings and conferences. Few guidelines, approaches and “best practices” have been developed and/or documented to guide librarians, curators, archivists and scholars who make critically important content decisions when creating web archives.
Using case studies of specific Web archives (including the End of Term crawl, the CA.gov cooperative collection, the Fugitive Documents Archive-It collection, and the Human Rights Web Archive), this panel aims to bring collection development to the forefront of the Web archiving community's attention and to begin the discussion about this critical piece of the Web archiving workflow and lifecycle.
This panel will begin the discussion to answer the following questions:
- How should the Web archiving community conceptualize collection development in the web archiving process?
- How do we develop collection policies and document policies and approaches for web archive end users?
- How do collecting strategies differ for thematic vs. domain-based vs. event focused collecting? How does collection development for web archives resemble/differ from more traditional forms of collecting (i.e. print, licensed digital resources, etc.)?
- What are some approaches to collection development? Is Web archiving collection development best done thematically, by domain, or by event? How do curators and subject specialists select and create content for web archives?
- What is/are the role(s) of the subject specialist in the Web archiving workflow? What *should* that role entail? How can subject specialists integrate web archives into their libraries’ discovery environment?
- What is the current state of the Web archiving toolset for subject specialists? What tools are needed for collection developers and usage of web archives by subject specialists?
- How do the costs (staff time, storage space, etc.) of web archiving impact the selection and collection decisions for these collections?
Bodleian Libraries, Oxford University
Web archives and theories of the web
The web has been with us for more than a quarter century. Web archives yield important clues about social changes during this period, and they will become more important as life moves ever more online. The web has thus clearly become a major medium, but it is equally clear that it does not fit existing theories of mass or interpersonal media. The main discipline that might otherwise, apart from media and communication studies, provide a theory of the web, is information science, but this perspective offers only limited insights into web uses.
The perspective that will be developed in this paper is that the web is a new information infrastructure – complementary to, but also shaping, new patterns of how people use other media in their everyday lives. Apart from outlining how the web extends other media infrastructures, the paper will conceptualize how seeking or receiving information has changed people’s daily routines. It will review what is currently known about this information intake, including the types of information people seek and consume, and ways to categorize these by distinguishing, for example, between leisure, health, political, and other types of information. It will also examine different groups that have been studied in different parts of the world, just as other media have been studied in different cultural settings. A further discussion is devoted to the shifting balance between text, sound, image and moving image: what are the implications of the fact that young people, for example, spend more time with sites like YouTube for a variety of purposes, and less with text?
These questions can be addressed from a comparative-historical perspective: how have web usage patterns changed? Studies of user behaviour are one approach here, but another approach is to examine the shifting shape of the web itself: how has its content changed? There are now a number of studies that have tried to chart its changing contours, for example by measuring different domains and subdomains. Second, how interconnected is the web, and how has this changed over time? Patterns of content and density of the web have implications for whether web content is more global, for example, than traditional media. Here some studies suggest that, contrary to popular perceptions, the web is quite nationally bounded, like traditional media. Finally, there is the question of where to draw the boundaries around the web and capture its protean nature, for example with the proliferation of apps.
The paper is based on a comprehensive synthesis of existing research, and will conclude with implications for archiving: the web will become the main way that people find information in the 21st century. It may seem that trying to fit the rapidly changing and growing web into an overarching analytical framework is futile. But a case can also be made for the opposite view: we can only understand the web as a historical record if we pin down how the web fits into people’s overall information and communication uses, placing these within an appropriate theoretical framework.
Advancing access and interfaces for research use of web archives
The Internet Archive has been archiving broad portions of the global web for 20 years. This historical dataset, currently totaling over 13 petabytes of data, offers unparalleled insight into how the web has evolved over time. In the past year, the Internet Archive has been building new tools, datasets, and access methods to advance how researchers can explore and study this collection at scale. This presentation will showcase a range of cutting-edge technical advancements across statistical, analytical, informational, and discoverability features that can be leveraged by researchers to study the origin, evolution, and content of IA’s, and other institutions’, web collections — from full national domains to subject-based archives.
The presentation will highlight a number of innovative projects both in production and in R&D stage and will link specific tools to the broader effort to advance the usability of web archives for computational research and more generalized, content-specific accessibility. Novel initiatives that will be featured include:
– New access tools: Highlighting new content-based access portals such as GifCities (http://gifcities.org/) that use data mining, content extraction, and search tools to provide unique access points into historic platforms and create new entry points into disappeared websites.
– Web domain profiling: Analyzing and providing API-based access to summary information about the scale, scope, and characteristics of web domains as represented in the archive, including the appearance and disappearance of hosts and domains over time and historical mime and media type distribution.
– Collaborative Archives: Exploring multi-institutional and citizen-engagement models for collecting at-risk web content and providing computing resources and organizational partnerships that support researcher data mining and study of web data at scale.
– Content Analysis: Utilizing tools such as natural language processing and named entity extraction for indexing purposes to allow new insights into webpage content, and using tools like simhash to represent longitudinal URL content change.
– Data Processing: New frameworks, such as ArchiveSpark, for the extraction of web content for use in analysis and the use of such tools for the generation of derivative datasets that can support specific methods of scholarly inquiry and analysis as well as the use of web applications like Jupyter Notebooks allowing in-browser analysis of web archives.
Taken together, these projects will illuminate a variety of approaches, both technical and user-driven, that accelerate how the broad custodial and scholarly communities conceptualize the methods by which they can facilitate access to, and use of, web archives.
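The simhash technique named in the Content Analysis item above can be illustrated with a minimal sketch. This is not the Internet Archive's implementation: the whitespace tokenisation, the MD5-based token hash and the 64-bit width are assumptions chosen for brevity. The idea is that two captures of the same URL with similar text yield fingerprints that differ in few bits, so the Hamming distance between successive fingerprints gives a rough measure of content change over time.

```python
import hashlib


def simhash(text, bits=64):
    """Return a simhash fingerprint of `text` (assumed: whitespace tokens)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Assumed token hash: the first 8 bytes of MD5, read as an integer.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical texts standing in for two captures of one URL.
capture_1997 = "welcome to the ministry homepage news and press releases"
capture_1999 = "welcome to the ministry homepage news press releases and jobs"
distance = hamming(simhash(capture_1997), simhash(capture_1999))
print(distance)
```

A small distance suggests the page changed little between the two captures; what counts as "small" has to be calibrated on the collection at hand.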
Webster Research & Consulting Ltd.
Utopia, dystopia and Christian ethics: early religious understandings of the web
It has been noted more than once that both the Internet and the Web have been the subject of overarching projections of cultural and social aspirations and fears, utopian and dystopian. The Internet has been feted as a great disruptor: a solvent of established privilege and the outlet for previously marginal opinions; a liberator of suppressed creative energy, in politics, commerce and the arts. It has equally well been denounced as the harbour of criminality, the accelerator of falsehood, the destroyer of traditional industries, communities, languages and cultures. But both positive and negative discourses of the Web have often been expressed in both implicit and explicit theological – or at the very least – ethical and philosophical terms.
Using a combination of the archived Web itself as it evolved over time, and offline commentary that accompanied, applauded, criticised and indeed preceded it, this paper examines the several analytical categories by means of which Christian commentators in Europe and North America have sought to understand the online experience: the nature and capabilities of the human person; appropriate forms of human interaction and the nature of community; and the economic and social effects on industries, countries and individuals. It will show that these concerns went beyond simple Luddism or concern about particular kinds of content such as pornography. It will show the continuity of these debates with earlier theological and ethical writing about early computing, and how they changed over the history of the Web. Finally, it will explore the degree to which secular utopian and dystopian writing about the Web owed its conceptual vocabulary to these older religious traditions.
Philip Webster, Webster Research & Consulting Ltd., Paul Clough & Gianluca Demartini, University of Sheffield & Claire Newing, the National Archives
A temporal exploration of the composition of the UK Government Web Archive
Hosted by The National Archives, the UK Government Web Archive (UKGWA) contains the historical record of the Web presence of UK central government, spanning a period from 1996 to the present day, and consists of over 4 billion entries, including text, images, documents, multimedia content and associated structural elements such as stylesheets and embedded code. This archive data is indexed using the CDX index format, a common archival file format that provides rich metadata about archive entries.
While many research projects focus on the content of Web archives, the scale of major Web archival endeavours also presents researchers with a rich source of metadata that can be an interesting source of insights in its own right. This metadata covers the temporal range of the archive, and provides a variety of data facets such as file sizes, data types, proliferation of URLs, and the ascendancy and decline of various Web technologies and file formats.
This paper describes a study of the CDX metadata for the entire UK Government Web Archive. It uses temporal analysis to show the changing nature of the UKGWA over time, and was carried out using a performance-optimised relational database management system implementation of the CDX index.
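As a rough illustration of this kind of metadata analysis, the sketch below counts captures and bytes per year and MIME type from a few CDX lines. It is not the study's method: the authors describe a relational database implementation, whereas this uses in-memory counters. The sample records are invented, and the 11-field, space-separated field order (urlkey, timestamp, original URL, MIME type, status, digest, redirect, meta tags, size, offset, filename) is an assumption about the CDX variant in use.

```python
from collections import Counter

# Invented sample records in an assumed 11-field, space-separated CDX layout.
SAMPLE_CDX = """\
uk,gov,example)/ 19970102030405 http://www.example.gov.uk/ text/html 200 AAAA - - 2048 0 a.warc.gz
uk,gov,example)/logo.gif 19970102030406 http://www.example.gov.uk/logo.gif image/gif 200 BBBB - - 512 2048 a.warc.gz
uk,gov,example)/ 20050607080910 http://www.example.gov.uk/ text/html 200 CCCC - - 8192 0 b.warc.gz
uk,gov,example)/report.pdf 20050607080911 http://www.example.gov.uk/report.pdf application/pdf 200 DDDD - - 40960 8192 b.warc.gz
"""


def summarise(cdx_text):
    """Count captures and total bytes per (year, MIME type)."""
    counts = Counter()
    sizes = Counter()
    for line in cdx_text.splitlines():
        fields = line.split(" ")
        if len(fields) != 11:
            continue  # skip header lines and malformed records
        year = int(fields[1][:4])  # timestamp is YYYYMMDDhhmmss
        mime = fields[3]
        counts[(year, mime)] += 1
        if fields[8].isdigit():  # size may be "-" when unknown
            sizes[(year, mime)] += int(fields[8])
    return counts, sizes


counts, sizes = summarise(SAMPLE_CDX)
for key in sorted(counts):
    print(key, counts[key], sizes[key])
```

At the scale of billions of entries the same grouping would be expressed as an aggregate query over an indexed table rather than a single in-memory pass.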
The Open University of Israel
Of DNS leaks and leaky archival sources: the history of North Korean websites on the Internet Archive
The Internet Archive’s Wayback Machine is considered a born-digital repository of historical facts. Yet end-users of the Wayback Machine know very little about the knowledge production processes that construct archived snapshots as primary sources and as evidence. Recently, the Wayback Machine introduced a new beta feature that adds provenance information about the collection of web captures associated with the specific web crawl the capture came from. In this paper, we study the Wayback Machine’s provenance feature as a means to unravel the complex socio-technical knowledge production process that constructs Website captures as historical facts. As a case study, we focus on the rare captures of North Korean Websites on the Wayback Machine.
Although the .kp domain was delegated to North Korea in 2007, until recently little was known about its Websites due to the country’s restrictive Internet policies. On 20 September 2016, an error in the configuration of North Korea’s name servers allowed the world to have a rare glimpse of 28 Websites hosted in the .kp domain. However, the Wayback Machine displays captures of these Websites from as early as 2010. How did the Internet Archive come to ‘know’ about the existence of the North Korean Websites years before the DNS leak? By narrating the history of the diverse sources that ‘informed’ the Internet Archive about the North Korean Websites over time, we argue that although most of the Websites have been contributed to the Internet Archive’s crawler by experts, activists and Wikipedians, the Internet Archive’s combination of a distributed and automated crawling system with manifold source contributions results in a crowd-sourced archive that circumvents Internet censorship, and that gradually aggregates knowledge that is otherwise known only through DNS misconfigurations or other ‘leaks’.
University of Hull
Tracing the online development of the Ministry of Defence (MoD) and the armed forces through the UK Web Archive
Abstract: Examining how the MoD and Armed Forces have developed their online identity provides an insight into the issues they regarded as a priority. Since 1996 the MoD and Armed Forces have been faced with many challenges, in particular reduced funding, controversies regarding the procurement of equipment, participation in unpopular conflicts and the restructuring of the British Army’s volunteer reserve force. By using the UK Web Archives it is possible to assess the extent to which these issues were reflected on the .mod.uk domain as well as determine how far traditional concerns such as recruitment have been the main influence in the online development of the MoD and Armed Forces.
This paper will discuss the need for an explicit search strategy when researching sources in the UK Web Archives. By initially adopting a qualitative approach, where five iterations of the websites of the MoD and Armed Forces were researched, it was possible to identify key concepts as the basis for search queries and gradually create a larger corpus of material. These initial findings also helped interpret the quantitative data which emerged from the link analysis conducted upon the JISC UK Web Domain Dataset (1996-2010), which was then used to create further queries. This method allowed for theories regarding the development of the .mod.uk domain to be created and then tested on a gradually increasing basis.
The paper will demonstrate how it has been possible to trace the online development of the MoD and Armed Forces through adopting a number of methodological approaches to interpreting web content. Discussion will concentrate on how to value non-traditional content such as interactive game elements, site navigation and the location of content upon a webpage. The use of qualitative thematic coding of a webpage in order to visualise the weight given to particular concepts will be discussed. The interpretation and contextualising of a webpage through its machine parsable metadata will be considered in the context of the MoD and Armed Forces websites. The paper will suggest, however, that there cannot be one interpretive method applied for the entire time period due to evolving standards in web design. It will be argued that as web designers gained a more sophisticated understanding of how users navigate the internet the layout of content changed forcing us to reappraise how we weight the value of information and navigational elements on a given page. This paper will also draw upon link analysis to consider the extent to which the Armed Forces and the MoD have established their online presence and whether this reflects their larger digital strategy. The online relationships that the Armed Forces and various agencies on the .mod.uk domain have established with other websites captured in the UK Web Archive will also be examined.
By combining a range of methods to interpreting source data this paper will argue that recruitment, the maintenance of an online corporate image and perception management have consistently been the main drivers behind the online development of the MoD and Armed Forces.
University of Waterloo
‘Pages by kids, for kids’: unlocking childhood and youth history through the geocities web archive
“Welcome to the Enchanted Forest,” the site’s welcome page declared, “home of the littlest GeoCitizens and some of the best homepages. This is a neighbourhood for pages by kids, for kids – from animals to zephyrs, the Forest has everything that a young mind could need.” An experiment in online publishing was playing out, amidst a broader social context of fears around privacy, undifferentiated accessibility to adult content, and online exploitation: the GeoCities’ Enchanted Forest. This online community had everything that a child, parent, or educator might want: the ability to reach large audiences as the potential of the Web promised, but patrolled by volunteer community leaders. Spirit-building awards and contests were held, as well as website building and other resources at virtual community centres. Represented and articulated as a cityscape of suburbs, the Enchanted Forest had a dark side, particularly around corporations who saw children as an avenue into the living rooms of the western world.
GeoCities was a website, existing between 1994 and 2009, that allowed people to create personal homepages on any topic of choice. Users were clustered in interest-based ‘neighbourhoods’ between 1994 and 1999, allowing the exploration of specific communities within the archive. Our team received a 4TB collection of WARC files from the Internet Archive, allowing us to explore them as historical resources.
This presentation argues that we can see two things in the Enchanted Forest.
First, we can explore how children connected with each other online and articulated a vision of community that was an alternative to the Web’s wide-open frontier. It argues that through GeoCities’s infrastructure, children created their own world that can offer insight into their interests, activities, and thoughts in the late 1990s; these can be seen through a combination of reading individual websites as well as leveraging digital techniques to explore the collection through hyperlinks. In this analysis we see the changing role of children, and how the Web was a contested space around free expression and identity formation. Some of this openness changed after widespread fears around online privacy, exploitation, and a high-profile American Federal Trade Commission (FTC) case. For a historian of childhood and youth, a demographic that has traditionally left behind little primary source material (save recollections filtered through the lens of adults), being able to tell this story through children and their families themselves is critical.
Secondly, the Enchanted Forest and GeoCities more broadly gives us a sense of the scale of data that we will soon confront as we enter the web age of historiography. To give a sense of this, in late 1997 approximately 200,000 GeoCities “homesteaders” were between the ages of 3 and 15. When we count the Enchanted Forest archive, we find that there were some 740,096 Uniform Resource Locators (URLs) in total, with 170,659,422 words. Beyond the stories here are the glimmers of a methodology necessary to uncover stories from the dawn of the Web age in the 1990s and beyond. I will also present our research method in this paper alongside substantive findings.
Niels Brügger, NetLab, Aarhus University, Ditte Laursen, The Royal Danish Library & Janne Nielsen, Aarhus University
Methodological reflections about establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015
To understand the growing importance of the web and the role the web plays today, it is essential to study the historical development of the web. A historical study of the characteristics of web is relevant both in itself and because it can serve as a baseline for other web studies, for instance by making it possible to determine whether a specific website at a given point in time was comparatively large or small, dynamic or static etc. Even though the global reach of web is, arguably, one of the central characteristics of web, web also has a strong national aspect in the everyday experience of the average web user – or at least, this is the case for Danish web users. This paper presents methodological findings from an analysis of the historical development of the national Danish web from 2005 to 2015, centered around the main research question: What has the entire Danish web looked like in the past, and how has it developed? The empirical basis for the presentation is the Danish web from 2005 to 2015 as it is archived in the national Danish web archive Netarchive. Our findings are centered on selected quantitative characteristics, e.g. size, structure and content, in order to do comparisons across the years (and later international comparisons). Specifically, we study how three different ways of approaching the archive each in their own way provide both limited and privileged access to the archived data. The three different ways are: 1) using the complete archive as a basis for the analysis, 2) using the full-text index as a basis for the analysis, and 3) using a created corpus as the basis for the analysis comprising only one version of each web entity. Posing the same research questions in relation to the development over 10 years of domain sizes and file sizes, number of domains and files, and mimetypes, we discuss advantages and disadvantages of the three methods.
Humanities Research Institute, Maynooth University
The web archiving of Irish election campaigns: a case study into the usefulness of the Irish web archive for researchers and historians
Election campaigns are a central component of any democratic electoral system. As a subject domain, election campaigns attract attention from across the social sciences and in particular from historians, memoirists, and scholars in journalism, politics, and communications. In the digital age, it seems judicious to argue that candidates’ websites and their accompanying digital resources are deserving of equal scrutiny to other published forms of election campaign literature. However, unlike most published forms of campaign literature, websites belonging to election candidates are capable of being altered spontaneously throughout a campaign. This dynamic may occur for several reasons, such as a response to comments and feedback from the electorate, the tactics of the political opposition, or in some cases, as a direct result of political scandals. Indeed, in the aftermath of an election, websites of election candidates often disappear to the sphere of the ‘HTTP 404 Not Found’ error. In addition, the content of government and political party websites is often modified to coincide with an election campaign, and notably, government departmental websites are prone to undergo change due to a turnover in government after an election. Thus, it may be argued that the web archiving of election campaigns is a planned response aimed at capturing a political campaign in flux. Indeed, it may be further argued that the web archiving of election campaigns and government websites is essential for the preservation of the historical, cultural, social and political record; analysis of democratic processes over periods of time, and the study of political and democratic history for future generations.
Many institutions began their web archiving endeavours with focused thematic crawls for web content during election campaigns. Yet, the usefulness of election campaign collections in a web archive for researchers and historians is relatively understudied. The first instance of a web archiving initiative in Ireland was conducted by the National Library of Ireland (NLI) in 2011, and coincided with the 2011 General Election. Since then, the NLI has taken great strides to secure a web archiving programme for the capture of Irish online social, cultural and political heritage. This paper presents a case study which examines the efficacy of the Irish web archive for the study of election campaigns and political history.
The Royal Danish Library
Data management of web archive research data
In general, researchers are required to ensure data management of their research data in order to make their results traceable and reproducible after the research has been finished. Such data management must therefore include precise and persistent identification of the data, where long term preservation is needed to obtain persistency. As for all digital material, this poses many challenges for research data from a web archive.
This paper will provide recommendations to overcome various challenges for web material data management. The recommendations are based on results from independent Danish research projects with different requirements to data management:
– A context research project focused on data management for web archive data found for a context of a literary work. This project required archived web references needing high precision on a par with traditional references for analogue material. Furthermore, the materials were found in different web archives.
– A corpora research project on data management for web archive data forming large portions of a web archive. The project required large corpora (collections) of archived web references as basis for analysis of a Nation’s Web Domain, i.e. corpora from a single web archive.
The need for precision and persistency in global references required by the context research project resulted in presentation of a recent new standard Persistent Web Identifier (here called pwid)[3,4]. In short, the suggested pwid includes four main elements: Unique web archive identification, unique reference to resource by http URI and harvest date/time, as well as precision of what is referenced (web element or web page). This standard can be used for single references for both the research cases.
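To make the structure of such an identifier concrete, the sketch below splits one of the pwid strings from this abstract's reference list into its parts. It is an illustration only and not the registered syntax: it assumes the three-part form `pwid:<archive>:<timestamp>:<URI>` seen in that example, and it leaves out the fourth element (the precision of what is referenced) because the example does not show how that element is written. The `Pwid` type and `parse_pwid` function are hypothetical names.

```python
from datetime import datetime
from typing import NamedTuple


class Pwid(NamedTuple):
    """Hypothetical container for the parts of a pwid reference."""
    archive: str        # web archive identification
    captured: datetime  # harvest date and time
    uri: str            # http URI of the archived resource


def parse_pwid(value):
    """Parse the assumed form pwid:<archive>:<timestamp>:<URI>."""
    # Split at most three times so colons inside the URI are preserved.
    scheme, archive, stamp, uri = value.split(":", 3)
    if scheme != "pwid":
        raise ValueError("not a pwid reference: " + value)
    captured = datetime.strptime(stamp, "%Y-%m-%d_%H.%M.%S")
    return Pwid(archive, captured, uri)


ref = parse_pwid(
    "pwid:archive.org:2015-09-27_21.15.59:"
    "http://kum.dk/fileadmin/KUM/Documents/Kulturpolitik/Forskning/"
    "KFU._Oversigt_over_uddelte_midler_2015.pdf"
)
print(ref.archive, ref.captured.isoformat(), ref.uri)
```

Keeping the archive, the harvest time and the URI as separate fields is what allows a reference to be resolved against one specific capture in one specific archive.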
Preservation of corpora is here only considered as corpora of references, since the data themselves are too big to be preserved in parallel. In the corpora research project the focus is how to specify and preserve a specification of a corpus.
The legal aspect of references is also crucial and is particularly challenging for large reference collections like corpora, where it is too time consuming to check whether there are personal sensitive data represented in a URI for a web reference. In the corpora research project different models were considered. The conclusion is that precision and persistency can only be achieved by having an index with a reference for each item as part of the corpus description, since alternative solutions with URI anonymisation or extraction algorithms pose too many risks of data loss.
The joint conclusion and recommendation is here to extend the pwid definition to allow corpus/collection references to a preserved collection definition in line with other web material. This extension simply consists in allowing a pwid URI to specify “collection” as a content specification (as for page and element) along with a collection identifier (instead of http URI).
Further benefits can be made from having a standardized way of specifying a corpus. Therefore this paper will also include a suggestion for such specifications using pwid URIs for collection items.
References
[1] Oversigt over uddelte midler fra Kulturministeriets Forskningsudvalg 2015 (English translation: Overview of granted resources from the research board, Ministry of Culture 2015), row 25. Persistent web id: pwid:archive.org:2015-09-27_21.15.59:http://kum.dk/fileadmin/KUM/Documents/Kulturpolitik/Forskning/KFU._Oversigt_over_uddelte_midler_2015.pdf
[2] Practical Data Management Project Description. Persistent web id: pwid:2016-11-21_08.53.33:https://en.statsbiblioteket.dk/data-management/practical-data-management
[3] Zierau, E., Nyvang, C., Kromann, T.H.: “Persistent Web References – Best Practices and New Suggestions”. In proceedings of the 13th International Conference on Preservation of Digital Objects (iPres) 2016, pp. 237-246. Persistent web id: pwid:archive.org:2016-10-12_14.15.31: http://www.ipres2016.ch/frontend/organizers/media/iPRES2016/_PDF/IPR16.Proceedings_4_Web_Broschuere_Link.pdf
[4] Provisional pwid URI scheme registration at IANA. Persistent web id: wpid:2016-12-01_11.32.42: http://www.iana.org/assignments/uri-schemes/prov/pwid
L3S Research Center, Hannover
Accessing web archives from different perspectives with potential synergies
Web archives constitute a valuable source for researchers from many disciplines. However, their sheer size, the typically broad scope and their temporal dimension make them difficult to work with. We have identified three approaches to access and explore Web archives from different perspectives: user, data and graph centric.
The natural way to look at the information in a Web archive is through a Web browser, just like users do on the live Web. This is what we consider the ‘user view’. The most common way to access a Web archive from a user’s perspective is the Wayback Machine, the Internet Archive’s replay tool to render archived webpages. Those pages are identified by their URL and a timestamp, referring to a particular version of the page. To facilitate the discovery of a page if the URL is unknown, different approaches to search Web archives by keywords have been proposed [1, 2, 3]. Another way for users to find and access archived pages is by linking past information on the current Web to the corresponding evidence in a Web archive [4, 5].
Besides accessing a Web archive by closely reading pages, like users do, the contents can be analyzed through distant reading, too. This ‘data view’ does not necessarily consider webpages as self-contained units with a layout and embeds, but the contents can be considered as raw data, such as text or image data. A question like “What persons appear together most frequently in a specific period of time?” is only one example of what can be analyzed from the archived Web. Typically this is not done on a whole archive, but only on pages from a specific time period as well as on specific data types or other facets that need to be filtered first. With ArchiveSpark we have developed a tool for building those research corpora from Web archives that operates on standard formats and facilitates the process of filtering as well as data extraction and derivation at scale in a very efficient manner [6].
The third perspective, besides the ‘user’ and ‘data’ views, is what we call the ‘graph view’. Here, single pages or websites, consisting of multiple pages, are considered nodes in a graph, without taking their contents into account. Links among pages are represented by edges between the nodes in such a graph. This structural perspective allows completely different kinds of analysis, like centrality computations with algorithms such as PageRank.
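A minimal sketch of the ‘graph view’ follows, assuming a tiny invented link graph in which each node is a website and each edge a hyperlink. It computes PageRank by plain power iteration; the damping factor of 0.85, the fixed iteration count and the even redistribution of rank from pages without outgoing links are conventional choices made here for illustration, not details taken from the authors' work.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site-level graph extracted from archived pages.
GRAPH = {
    "portal.example": ["news.example", "museum.example"],
    "news.example": ["museum.example"],
    "blog.example": ["museum.example", "news.example"],
    "museum.example": [],
}

ranks = pagerank(GRAPH)
for site in sorted(ranks, key=ranks.get, reverse=True):
    print(site, round(ranks[site], 3))
```

Note that page contents play no part in the computation: only the link structure is used, which is what distinguishes this view from the ‘user’ and ‘data’ views.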
We present the latest achievements from all three views as well as synergies among them. For instance, important websites that can be identified from the graph centric perspective may be of particular interest for the users of a Web archive. Furthermore, the ‘user view’ is often just a starting point for a much more comprehensive data study. Hence, those views can be considered as different zoom levels to look at the same Web archive data from different perspectives.
References:
[1] Holzmann, H., Anand, A.: Tempas: Temporal Archive Search Based on Tags. 25th International Conference Companion on World Wide Web, WWW 2016. Montreal, Quebec, Canada (2016).
[2] Holzmann, H., Nejdl, W., Anand, A.: On the Applicability of Delicious for Temporal Search on Web Archives. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016. Pisa, Italy (2016).
[3] Kanhabua, N., Kemkes, P., Nejdl, W., Nguyen, T.N., Reis, F., Tran, N.K.: How to Search the Internet Archive Without Indexing It. 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016. Hannover, Germany (2016).
[4] Holzmann, H., Runnwerth, M., Sperber, W.: Linking Mathematical Software in Web Archives. 5th International Congress on Mathematical Software, ICMS 2016. Berlin, Germany (2016).
[5] Holzmann, H., Sperber, W., Runnwerth, M.: Archiving Software Surrogates on the Web for Future Reference. 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016. Hannover, Germany (2016).
[6] Holzmann, H., Goel, V., Anand, A.: ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016. Newark, New Jersey, USA (2016).
University of Paris Nanterre
Using web archives for studying cultural heritage collaborative platforms
In the last few years, cultural institutions have launched several experiments in order to transform their registers into transparent, open and participative documents available on the web. As an example, the French Ministry of Culture has recently launched two projects. The JocondeLab website (http://jocondelab.iri-research.org) makes available a part of the national museums’ inventories in a multilingual version. This platform relies on the functionality of the semantic web by offering access to 300,000 notices through the structured and open format of DBPedia and the integration with the contents of Wikipedia. In 2015, the Ethnopole InOc Aquitaine started the PCILab project (online in October 2017) for sharing the French inventory of the intangible cultural heritage through Wikipedia. Focusing on intangible cultural heritage, we can cite other important projects in other countries. In Scotland, the Edinburgh Napier University launched the first wiki related to intangible heritage in 2010 and the portal is still active, managed by the Museums Galleries Scotland (http://ichscotland.org). In Finland, the National Board of Antiquities opened a participative platform in 2016 that today collects 80 living heritage practices (https://wiki.aineetonkulttuuriperinto.fi/). There are also crowdsourcing projects that are growing up outside the institutional context, such as the Wiki Loves Monuments project (http://www.wikilovesmonuments.org/), a photo contest that every year makes it possible to collect thousands of photos, which some institutions have now decided to integrate into official registers (for example the Official Inventory of Architectural Heritage of Catalonia).
All these platforms introduce new ways of collaborative management of cultural heritage through the creation of participative pages corresponding to the inventory records directly on Wikipedia or on ad hoc platforms. As an effect, cultural heritage specialists (archivists, anthropologists, etc.) are increasingly solicited to contribute to these collaborative platforms while citizens are also becoming familiar with these new possibilities of expression about cultural heritage. This communication aims at studying these new forms of collaborative management of cultural heritage based on the use of wiki platforms. Past studies on this topic are organized mainly around two poles: analyses of computer and technical solutions, on the one hand, and research on changes in the relationship between institutions and publics, on the other hand. By contrast, this study is meant to focus on cultural heritage and notably on the collaborative digital writing around heritage objects that takes shape on the web. Our ideal goal would be to study, through a historical perspective, how cultural heritage objects included in these inventories have evolved in the last few years as an effect of their opening on the web through wiki platforms. The objects will not be considered in relation to the inventory record, but as digital objects resulting from the editorialization processes involving heritage professionals, but also other users of the web.
In order to do this, we intend to carry out a diachronic qualitative analysis of digital writings concerning heritage objects by analyzing the process of editorialization of web pages that talk about the object through web archives and the history page of Wikipedia. However, this study has firstly to cope with some important methodological issues. Indeed, wiki pages are really complex objects for web archives. Changes in pages are rarely collected in the harvesting process. Captures of these websites are quite incomplete. Taking into account these limitations, this communication will investigate how web archives can be used to investigate the historical development of the collaborative platforms related to cultural heritage. Special attention will be paid to the possible combinations of information retrieved in the web archives and information retrieved in the history pages of the wiki, when it is available. The paper will summarize the perspectives of analysis that web archives realistically open in this field.
Digital Preservation Coalition
Preserving social media: applying principles of digital preservation to social media archiving
In a report released in 2013, the UK Data Forum projected growing importance for social media research and released a call for further development:
‘Through social media, millions of human interactions occur and are recorded daily, creating massive data resources which have the potential to aid our understanding of patterns of social behaviour […] . Social media analytics represent an opportunity to invest in extensive, large scale social research that is within a temporal frame that cannot be achieved through either snapshot surveys or interviews or via longitudinal panel research’ (UK Data Forum, 2013).
Through this statement, the UK Data Forum characterises a trend in social media research in social science and economics but also computer science, marketing, and a quickly broadening range of other disciplines (Weller and Kinder-Kurlanda, 2016). Researchers and research institutes across the UK (and the rest of the world) eagerly pursue new insights revealed by this novel form of data. This burst of activity in social media research mirrors analogous trends in the corporate sector. Commercial institutions increasingly look to social media data to collect and analyse information about demographic groups and individual consumers (ADMA, 2013).
In the rush to exploit this new source of data we must not overlook the importance of ensuring long-term access. The issues of archiving social media—an extension of web archiving—reflect many of the challenges addressed by digital preservation, including external dependencies, personal data and individual privacy, ownership, and scalability (DPC, 2016b). This is not surprising considering the capture of social media is essentially an act of digital preservation – taking action to ensure future access to otherwise temporary or vulnerable digital material. This shared enterprise has existed between web archiving and digital preservation for a long time. This relationship was recently publicly recognised with the award of the first ever Digital Preservation Coalition Fellowship for lifetime achievement in digital preservation to Brewster Kahle of the Internet Archive (DPC, 2016a).
However, despite this long relationship, the future of research using social media data depends on current professionals addressing the issues of long-term preservation. While the more immediate difficulties of access and use often dominate the conversation about social media data, the issues of long-term preservation also demand immediate attention. Most forms of social media content are vulnerable to loss and owned by platforms with no contractual or legal requirement to preserve individual users’ data.
The Digital Preservation Coalition Technology Watch Report ‘Preserving Social Media’ lays out these issues alongside case studies that demonstrate successful approaches to ensuring long-term access (Thomson, 2016). This paper presents an overview of that report with updated case studies and developments. The author invites scholars and researchers, in academe and industry, to take a long view of their data. Simultaneously, it encourages information practitioners (research data managers, librarians, archivists) to consider the needs of researchers (and future researchers). It lays out a path to collaboration to build a more sustainable approach to the preservation of social media.
- Association for Data-driven Marketing and Advertising (ADMA), 2013, ‘Best Practice Guideline: Big Data’, http://www.admaknowledgelab.com.au/compliance/compliance-help/general/data-and-privacy/codesand-guides/best-practice-guideline-big-data
- Digital Preservation Coalition (DPC), 2016a, ‘Digital Preservation Awards 2016 – Winners Announced!’, http://dpconline.org/advocacy/awards/2016-digital-preservation-awards
- Digital Preservation Coalition (DPC), 2016b, ‘Digital Preservation Handbook’, http://handbook.dpconline.org/digital-preservation/preservation-issues
- Thomson, S, 2016, Preserving Social Media, DPC Technology Watch Report, DOI: http://dx.doi.org/10.7207/twr16-01
- UK Data Forum 2013, ‘UK Strategy for Data Resources for Social and Economic Research’, http://www.esrc.ac.uk/files/news-events-and-publications/news/2013/uk-strategy-for-data-resources-forsocial-and-economic-research/
- Weller, K and Kinder-Kurlanda, K, 2016, ‘A Manifesto for Data Sharing in Social Media Research’, DOI: 10.1145/2908131.2908172, http://dl.acm.org/citation.cfm?doid=2908131.2908172
The British Library
Digging documents out of the archived web
As an increasing number of government and other publications move towards online-only publication, we are forced to move beyond our traditional Legal Deposit processes based on cataloguing printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloguing processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.
Nick Ruest, York University Libraries and Ian Milligan, University of Waterloo
Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform
In the absence of a national web archiving strategy, Canadian governments, universities, and cultural heritage institutions have pursued disparate web archival collecting strategies. Carried out generally through contracts with the Internet Archive’s Archive-It services, these medium-sized collections (estimated around 30-35TBs) amount to a significant chunk of Canada’s born-digital cultural heritage since 2005. While there has been some collaboration between institutions, notably via the Council of Prairie and Pacific University Libraries (COPPUL) in western Canada, most web archiving collecting has been taking place in silos. Researchers seeking to use web archives in Canada are thus limited not only to the Archive-It search portal, but also to exploring on a silo-ed collection-by-collection basis. Given the growing importance of web archives for scholarly research, our project aims to break down silos and generate a common search portal and derivative dataset provider for web archiving research in Canada.
Our Web Archiving for Longitudinal Knowledge (WALK) Project, housed at http://webarchives.ca and with our main activity via our GitHub repo at https://github.com/web-archive-group/WALK, has been bringing together Canadian partners to integrate web archival collections. Co-directed by a historian and a librarian, the project brings together computer scientists working on the warcbase project, doctoral students working on governance issues, and students running tests and usability improvements. Our workflow consists of:
– Signing Memoranda of Understanding (MOUs) with partner institutions;
– Gathering WARCs from partner institutions into ComputeCanada infrastructure;
– Using warcbase to generate scholarly derivatives, such as domain counts, link graphs, and files that can be loaded directly into network analysis software such as Gephi;
– Adapting the Blacklight front end to serve as a replacement for our current SHINE interface; this will allow built-in APIs, faceted search by institution, and inter-operability with Canadian university library catalogues;
– Finally, using multiple correspondence analysis, generating profiles of each web archive with an eye towards assisting curators in finding gaps between institutional coverage (i.e. several web archives are collecting the same domain in the absence of a national strategy).
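The derivative step in the workflow above can be illustrated with a minimal sketch. It is not warcbase code: it assumes page URLs and page-level links have already been extracted from the WARCs, and the function names (`domain_of`, `domain_counts`, `link_graph`, `to_edge_csv`) are hypothetical. It shows how domain counts and a domain-level link graph could be derived and written as a `Source,Target,Weight` edge list, a layout that network analysis tools such as Gephi can generally import.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(captured_urls):
    """Count captured pages per domain."""
    return Counter(domain_of(u) for u in captured_urls)


def link_graph(links):
    """Aggregate page-level (source, target) links into weighted domain edges.

    Links that stay within one domain are dropped (an assumption of this sketch).
    """
    edges = Counter()
    for src, dst in links:
        s, d = domain_of(src), domain_of(dst)
        if s and d and s != d:
            edges[(s, d)] += 1
    return edges


def to_edge_csv(edges):
    """Serialise domain edges as Source,Target,Weight rows, sorted by edge."""
    lines = ["Source,Target,Weight"]
    for (s, d), w in sorted(edges.items()):
        lines.append(f"{s},{d},{w}")
    return "\n".join(lines)
```

Collapsing pages to domains keeps the graph small enough to explore interactively, at the cost of hiding page-level structure.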
This presentation provides an overview of the WALK project, focusing specifically on the questions of interdisciplinary collaboration, workflow, dataset creation and dissemination. As web archiving increasingly happens at the institutional level, the WALK project suggests one way forward towards collaboration, collection development, and researcher access. After our presentation, we would like to facilitate a discussion on suggestions for the project, as well as possibilities for the model to work in other national environments.
University at Albany, SUNY
Automating access to web archives with APIs and ArchivesSpace
Web Archives are records too! The University at Albany, SUNY has been preserving the Albany.edu domain since 2012 to retain permanent public records that are now only created and disseminated on the web. Yet, until recently researchers in our reading room or on our website could access permanent records series that date as far back as the 1840s, but seemed to stop in the 2010s. Now, using open APIs from both Archive-It and ArchivesSpace, our web archives collections are made accessible to researchers together with their paper predecessors at scale.
This presentation will showcase how UAlbany is automating the creation and management of web archives records in ArchivesSpace using the Archive-It CDX servers and traditional archival description. We will look at what websites captured by crawls and stored in WARCs are records, how they are treated by Describing Archives: A Content Standard (DACS), and discuss the idea of web archives as containers.
The framing of web archives in archival description forces us to address the context of creation, use, and acquisition for these records. I will talk about how we can work to provide some of the provenance documentation users will need to use these collections for scholarly research.
The integration of ArchivesSpace and the Archive-It CDX API allows for efficient and automated workflows. Descriptive records for web archives can now be created and managed by non-technical users in ArchivesSpace and be continually updated via the API. This integration also allows us to leverage our current discovery and delivery tools for web archives. We will take a look at how UAlbany is enabling the discovery of web archives and their description exported from ArchivesSpace through its new public access system which is a mix of Drupal, XTF, and static page generation.
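As a rough illustration of this kind of integration, the sketch below turns CDX-style index lines into a small descriptive summary (first capture, last capture, number of captures). It makes several assumptions: the lines are space-delimited with a 14-digit timestamp in the second field, the function names are hypothetical, and the keys of the returned dictionary are illustrative rather than taken from the ArchivesSpace data model. Fetching the lines from a CDX server and posting the result to ArchivesSpace are deliberately left out.

```python
from datetime import datetime


def parse_timestamp(ts):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def summarise_captures(url, cdx_lines):
    """Build a small description of a URL's captures from CDX-style lines.

    Each non-empty line is assumed to carry the timestamp in its second field.
    Returns None when there are no captures.
    """
    stamps = sorted(
        parse_timestamp(line.split()[1])
        for line in cdx_lines
        if line.strip()
    )
    if not stamps:
        return None
    return {
        "title": url,
        "date_begin": stamps[0].date().isoformat(),
        "date_end": stamps[-1].date().isoformat(),
        "extent": f"{len(stamps)} captures",
    }
```

Because the summary is recomputed from the index, rerunning it after each crawl would keep the date range and extent current without manual editing.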
Placing web archives in public finding aids is a step forward, but not an end in itself. The use of ArchivesSpace to maintain descriptive and provenance metadata for web archives also provides opportunities for more advanced access and use. If metadata that describes web archives and their provenance is managed in ArchivesSpace, that data can be exported and reformatted with ArchivesSpace’s open JSON API, and possibly made publicly available to support computational uses.
Who, what, when, where, why, WARC: new tools at the Internet Archive
The Internet Archive has been archiving broad portions of the global web for 20 years, via a variety of technologies, strategies, and partnerships. This massive historical web dataset, currently totaling over 13 petabytes, offers unparalleled insight into the activity of web archiving and its resulting collections. A variety of recent efforts have aimed to make this collection richer, more discoverable, more interpretable, and to document the origins, profiles, and evolutions of archived content over time. This presentation will cover a number of new efforts to make the voluminous content in Internet Archive’s web archive more usable and discoverable. Initiatives that will be covered:
– Profiling: As part of internal data mining efforts, Internet Archive has been analyzing the scale, scope, and characteristics of web domains within the Wayback Machine global web crawls and making this information available through APIs.
– Search: An overview of the new Wayback Machine Site Search capabilities along with associated exploration tools.
– Collaborations: Insight into multi-institutional efforts, both big and small, to preserve web content that has no dedicated long-term steward.
– Capture: An overview of Brozzler, the new browser-based crawler, as well as other new tools for the improved collection of web content.
– Use: Overview of both technical tools, such as APIs and datasets, and social tools, such as partnerships with researchers, artists, and technologists, intended to encourage the use of archived web content in scholarship, education, and creative work.
While representing a number of different initiatives, taken together these projects will outline a number of new areas of potential continued innovation and expansion for all web archiving programs and will reassert the primacy of preserving and making accessible the valuable historical materials on the web.
Old Dominion University
Archive what I see now – personal web archiving with WARCs
As part of an NEH-funded grant, we are developing open-source tools to allow non-technical users to locally create and replay their own personal web archives. Our goals are two-fold: 1) to enable users to generate WARC files with tools as simple as the “bookmarking” or “save page as” approaches that they already know, and 2) to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of Wayback. Our innovation is in allowing individuals to “archive what I see now”. The user can create a standard web archive file (“archive”) of the content displayed in the browser (“what I see”) at a particular time (“now”).
We present the following tools:
* WARCreate – A browser extension for Google Chrome that can create a WARC of the currently loaded webpage and save it to the local disk. It can allow a user to archive pages behind authentication or that have been modified after user interaction.
* WAIL (Web Archiving Integration Layer) – A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and Wayback on the user’s personal computer.
* Mink – A browser extension for Google Chrome that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate or other tools. Mink also allows users to request that a webpage be archived by an on-demand archiving service, such as the Internet Archive or archive.is.
With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would also be available for viewing in the researcher’s browser using Mink.
These tools can be used to create local, private, personal collections of archived webpages. When archiving resources behind authentication, no credentials are sent to a third-party; all communication remains between the client and original web server. WARCreate just locally records those interactions and creates a local WARC. Users can choose to share their WARCs through any file sharing platform, which could then be indexed by a collaborator’s Wayback.
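For readers unfamiliar with the file format these tools produce, the following sketch serialises a single WARC/1.0 `resource` record using only the Python standard library. It is a simplified illustration of the record layout (version line, named header fields, blank line, content block, two trailing CRLFs) and does not reproduce what WARCreate itself writes; the function name `warc_resource_record` is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type="text/html"):
    """Serialise one WARC/1.0 'resource' record as bytes.

    `payload` is the raw content (bytes) of the archived resource.
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

Records built this way can simply be concatenated into one file, which is why a WARC can be created and stored locally without contacting any third party.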
Lozana Rossenova, Centre for the Study of the Networked Image & Rhizome
Dragan Espenschied & Ilya Kreymer, Rhizome
Containerized browsers and archive augmentation
We will demonstrate how these environments could be described and queried as part of the context of a recording, as well as ideas for permanently stabilizing and storing them. Additional ideas presented will include discussion of a common API based on Memento that could be used to describe browser environments, as well as the latest features available in Webrecorder for collaboration and augmentation of existing web archives.
Arquivo.pt, Fundação para a Computação Científica Nacional
Arquivo.pt API: enabling automatic analytics over historical web data
Arquivo.pt – the Portuguese Web Archive is a research infrastructure that enables search and access to files archived from the Web since 1996. Satisfying researchers’ needs, so that they can take advantage of the preserved historical web data and the search services provided, is a top priority.
Researchers can manually search and access preserved information through the publicly available graphical user interfaces. However, some researchers need to perform large-scale processing of historical data and must apply automatic analysis to handle the large volumes involved, for instance to identify named entities related to a given event among all preserved pages. Such research can also lead to the development of web applications that implement new, unforeseen use cases and added-value features for web archives.
Therefore, Arquivo.pt provides Application Programming Interfaces (APIs) that enable the automatic processing of the preserved historical data to facilitate the analysis and exploitation of historical data and the creation of innovative applications that enhance web archive usage.
The Arquivo.pt web archive enables URL, full-text and advanced search operators over archived content. It supports the Memento API to enable URL search, since it is the most widely used API for interoperability among Web archives, but also an OpenSearch-based API. OpenSearch is a collection of technologies that allow publishing of search results for syndication and aggregation in a standard and accessible format. Arquivo.pt provides an OpenSearch-based API to support automatic search and access over its preserved web resources. This API supports multiple query operators mostly designed for full-text search, such as: searching by terms or phrases, excluding certain terms, defining date ranges or restricting search to certain media types. Nonetheless, it can also be applied for URL search queries, such as finding the archived version closest to a certain timestamp or listing all archived versions for a given URL. The provided search responses are XML-based files (RSS 2.0). Each response contains multiple <item> tags, where each of them matches an archived Web resource and their corresponding details.
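A client consuming such a response might look like the minimal sketch below. The sample document is invented for illustration: only the RSS 2.0 envelope and the `<item>` elements come from the description above, while the URLs, the child elements shown and the function name `parse_items` are assumptions. A real response may carry additional, namespaced elements, which this code would return under their qualified tag names.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real response would come from the search API.
SAMPLE_RESPONSE = """<rss version="2.0">
  <channel>
    <title>hypothetical search results</title>
    <item>
      <title>Example page</title>
      <link>http://archive.example/wayback/19961013180344/http://example.pt/</link>
    </item>
    <item>
      <title>Another page</title>
      <link>http://archive.example/wayback/20010203040506/http://example.pt/news</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, mapping child tag names to their text."""
    root = ET.fromstring(rss_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]
```

Each dictionary then corresponds to one archived Web resource and can be passed on to whatever analysis follows.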
The OpenSearch-based API has been used, for instance, by Computer Science students to develop innovative web applications that automatically aggregate information from multiple web services and resources about public figures. The Memento API has been used to interoperate with external services, like the Memento Time Travel Portal or oldweb.today, with significant gains in the usage and dissemination of Arquivo.pt. Moreover, it enabled the integration of innovative functions such as the completion of missing elements on preserved pages by gathering them from external web archives.
This presentation will introduce the APIs supported by Arquivo.pt, detailing in particular the OpenSearch-based API and illustrating some of their use cases and benefits. It will be useful for researchers interested in temporal web analytics but also to web archivists interested in supporting APIs for their users. The provided APIs are work in progress, so feedback from users and peers is valuable for their improvement and consolidation as research tools.
Stanford University Libraries
Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture
The LOCKSS Program is one of the longest-running web archiving initiatives, though often not thought of as such, given its initial concern with the archiving of electronic journals and the comparative prominence of its distributed approach to digital preservation. The early inception of the program and its distinct content focus led over time to a divergence between the technologies if not the approaches used by LOCKSS relative to the web archiving mainstream. This has resulted in a monolithic LOCKSS software architecture, missed opportunities for the application of LOCKSS innovations outside of its historical domain, and increasingly duplicative engineering efforts applied to common challenges.
The LOCKSS software is now in the midst of a multi-year re-architecture effort that should redress these longstanding issues, to the benefit of both the LOCKSS Program and the web archiving community. By aligning with evolved best practices in web archiving and leveraging open-source software from the web archiving community and beyond, the LOCKSS Program will be able to do more, and more efficiently, and concomitantly help to bolster community stewardship and advance the state-of-the-art for the common web archiving tool stack. This approach reflects a recognition that the sustainability of the tools that enable both the LOCKSS Program and web archiving more generally depends upon an ongoing, robust community effort.
The fundamental aim of the re-architecture effort is to make the LOCKSS software more maintainable, extensible, and externally reusable. This will be accomplished by using existing open-source software solutions wherever possible and by re-implementing existing LOCKSS components as standalone web services. Examples of applications to be integrated include Heritrix, OpenWayback, Solr, Warcbase, and Warcprox. Among the LOCKSS-specific features to be made externally reusable are the audit and repair protocol, metadata extraction and querying services to support access to archived web resources via DOI and OpenURL, a new component for indexing WARCs into Solr, and an on-access format migration framework.
This session will highlight how the LOCKSS Program and LOCKSS software are evolving, and what opportunities that may present for the web archiving community.
Jefferson Bailey, Internet Archive & Naomi Dushay, Stanford University Libraries
WASAPI data transfer APIs: specification, project update, and demonstration
The interest and investment in application programming interfaces (APIs) within the web archiving community has continued to grow, with new grant proposals, local re-architecture efforts, and web service-oriented enhancements to existing platforms all taking an API-based approach to development. The enthusiasm for APIs stems from their potential to improve the interoperability of independently-developed tools throughout the web archiving lifecycle, standardize mechanisms for delivering data to users and systems, and facilitate a more flexible and extensible web archiving tool chain. Now in its second year, the IMLS-funded Web Archiving Systems API (WASAPI) project is keeping these goals in mind in the design and implementation of APIs specifically focused on web archive data transfer. These APIs will facilitate both preservation and use, and the larger project will provide a roadmap and blueprint for additional API development.
Through engagement with the web archiving community and its stakeholders, including sessions at previous IIPC meetings and a survey, the data transfer APIs have been elaborated to satisfy three principal use cases: replicating data from one repository to another, standardizing bulk and derivative data access by researchers, and rationalizing data hand-offs between capture and processing tools. These consultations have also provided insight on candidate APIs for follow-on work, most notably mechanisms for accessing other types of technical metadata associated with web archiving processes.
This panel will share the latest activities and progress on the development of the web archive data transfer APIs as well as serve as a community forum for further discussion on how the web archiving community can continue to advance APIs more broadly. The project team will detail the expected parameters and possible outputs of the draft API, report on the outcomes of an affiliated February U.S. National Symposium on Web Archiving Interoperability at the Internet Archive, and (timing of ongoing implementation efforts allowing) demonstrate the operation of the API for one or more of the aforementioned use cases.
Jack Cushman, Harvard University
Ilya Kreymer, Rhizome
Thinking like a hacker: security issues in web capture and playback
Securing any website requires thinking like a hacker: how could a curious or hostile user misuse features of this website to attack the site or other users?
This talk will show live demonstrations of exploits targeting simplified versions of our own websites, including threats such as access to user cookies, user account manipulation, and access to internal resources. We will also discuss effective strategies used by Perma.cc and Webrecorder to protect against these attacks while still providing resilient high fidelity web archiving services.
Mat Kelly, Old Dominion University
David Dias, Protocol Labs
A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS
Personal and private Web archives may contain content otherwise unpreserved by institutional or other public Web archives. Because of this, data redundancy becomes increasingly important. Replication of these Web archives through distributed peer-to-peer dissemination would facilitate the permanence of the preserved contents. Content in these archives, however, may contain sensitive or personally identifiable information that requires the information be encrypted or otherwise protected. To facilitate the permanence of personal Web archives that contain sensitive information, we introduce the ability for personal Web archivists to share their Web archive collections more securely while selectively regulating access to sensitive content.
In previous work we introduced InterPlanetary Wayback (IPWB), which integrates Web archive files (WARCs) with the InterPlanetary File System (IPFS) for peer-to-peer distribution and replay. The initial design of IPWB treated all content in a WARC equally, without regard to what that content was.
In this work we cater specifically to private and personal Web archives. We introduce the ability to encrypt Web archive content at the point of dissemination to serve as a first step toward access control of personal and private Web archives. We extend our initial prototype with a more current and thorough evaluation of IPWB to account for speed-ups recently introduced into IPFS.
We re-examine the relevance of de-duplication of personalized Web archive content, previously an advantage of using IPWB over conventional methods of sharing for WARC dissemination. We also establish a framework for temporally integrating captures from both the private and public live Web as available via IPWB. By providing the facility to integrate captures of private and personal Web archives with public Web archives, we hope to seed the basis for providing a more accurate and complete representation of the Web as it was.
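The de-duplication question can be made concrete with a toy content-addressed store. This is a sketch only: it uses a plain SHA-256 hex digest as the content key, whereas IPFS uses its own content identifiers and chunking, and the names `ContentStore` and `index_captures` are hypothetical. It shows that identical payloads collapse to one stored block, while any per-user variation (or per-user encryption) yields distinct blocks.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the hash of the payload."""

    def __init__(self):
        self.blocks = {}

    def put(self, payload):
        """Store payload bytes once per distinct content and return its key."""
        key = hashlib.sha256(payload).hexdigest()
        self.blocks.setdefault(key, payload)
        return key

    def get(self, key):
        """Return the payload stored under a key."""
        return self.blocks[key]


def index_captures(store, captures):
    """Map (uri, timestamp) to a content key for replay-time lookup."""
    return {(uri, ts): store.put(body) for uri, ts, body in captures}
```

With two captures of an unchanged page and one personalised capture, the index holds three entries but the store holds only two blocks. The more content is personalised or encrypted per user, the smaller this saving becomes.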
Library of Congress
Oh my, how the archive has grown: Library of Congress challenges and strategies for managing selective harvesting on a domain crawl scale
The Library of Congress has expanded its selective and event-based web archiving program activities in the last few years, increasing from 100 TB a year to almost 300 TB a year. In 2016, the Library’s web archives hit a major milestone of a petabyte of data collected since the program began in 2000. With this tremendous growth in a short amount of time, the Web Archiving team has had to employ new strategies for collecting and managing the Library of Congress web archives in recent years.
This talk will provide an update on Library of Congress web archiving activities. We’ll outline some of the challenges faced and lessons learned when managing selective harvesting on a large scale, while continuing to use contract crawling services provided by the Internet Archive. Strategies employed that will be discussed have included new approaches to permissions that have resulted in a refined approach, and necessary changes in our home-grown workflow management tool Digiboard to help simplify processes for staff nominating seeds and managing collections.
With increased flexibility in how we manage the outsourced crawling, the Library of Congress Web Archiving staff has also been exploring ways to meet increasing desires by curatorial staff for better quality and more comprehensive captures of selected sites, including deeper, longer crawls of very large sites, and RSS feed crawling to capture content that is changing more frequently. We’ll discuss how the Web Archiving team is approaching quality review at scale in new ways to address not only comprehensive captures, but to also meet contract requirements. With limited staff to do visual analysis of the archived content, and with the growth in content being collected, we’re relying more on crawl report data to recognize patterns and issues with the crawl results.
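One simple form such report-driven review could take is sketched below. It is purely illustrative and not a description of the Library's tooling: the input is assumed to be (host, status code) pairs already extracted from a crawl report, non-positive codes and codes of 400 or above are treated as failures, and the function name `flag_hosts` and its thresholds are hypothetical.

```python
from collections import defaultdict


def flag_hosts(rows, min_fetches=3, max_error_share=0.5):
    """Return hosts whose share of failed fetches exceeds a threshold.

    rows: iterable of (host, status_code) pairs.
    Hosts with fewer than `min_fetches` fetches are ignored as too small to judge.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for host, status in rows:
        totals[host] += 1
        if status <= 0 or status >= 400:
            errors[host] += 1
    return sorted(
        host
        for host, n in totals.items()
        if n >= min_fetches and errors[host] / n > max_error_share
    )
```

A list like this would point reviewers at the sites most likely to need visual inspection, rather than replacing that inspection.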
And with any archiving activity at scale, there are additional challenges faced when transferring, storing, and providing access to the Library’s growing web archive collections. We’ll discuss some of the recent post-crawl technical challenges faced as we continue to expand the program, and preserve and manage this large body of content. Additionally, recent opportunities to make Library of Congress web archive data sets available to researchers will be highlighted.
The British Library
The web archive in the Library: collection policy and web archiving – a British Library perspective
As libraries and other memory institutions develop in their understanding and sophistication to collect digital publications, what challenges and opportunities does this present for web archiving? What can we learn from our experience of web archiving? This presentation will look at three related questions, from the perspective of the experience, and also the aspirations, of collection development at the British Library:
- i) Web Archives as a discrete collection within the Library. How do curators and readers understand the “archive”, and how does this compare to understandings of archives and archival practice more generally?
- ii) Web Archives as a “library within a library”. Web archives can be understood, and approached, from the collections that can be derived from them – either along subject lines, or according to different format types (eg web pages, images, recorded sound etc). The concept of ‘special collection’ has been applied in this sense, to leverage the ability of web archivists to rapidly respond to events as they unfold. This raises further questions for how content within a web archive interacts, or fails to interact, with related content and collections held by a library. The relative size and age of a web archive also has implications for how its content is understood alongside related collections.
- iii) Web Archive as a methodology for collection management. Web Harvesting is notable as the only technology specifically referred to in The Legal Deposit Libraries (Non-Print Works) Regulations 2013, a document that is otherwise at pains to remain technologically-neutral. Some of the challenges that have faced web archives (eg rapidly changing content, lack of standardisation, very high volume) are now being encountered with digital publishing more generally. At the same time, researcher awareness and understanding of web archives has been growing. Many of the questions that are starting to be put to web archives might also be applied to other digital corpora.
For the British Library, these questions are becoming more urgent as the balance of collecting new publications shifts from print towards digital. Attention, and ambition, is beginning to move from more ‘conventional’ forms of digital publishing to more complex and emerging formats. In this context, the technologies and methods of web archiving, alongside the lessons learned, become more relevant to collection management across types of digital publication. At the same time, there is opportunity for learning to move in the other direction, as curatorial and archival concerns relating to the digital object also inform an understanding of web archives and web archiving.
National Library of the Netherlands
The Web Archive as a Rubik’s Cube: the case of the Dutch National Web
Web archiving is the process of selecting and harvesting data published on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research. Web archiving of the national web can be described as the preservation of websites which are matched to a certain geographic, cultural or linguistic location and which are not limited only to the national top level domain. The existence of the Dutch national web started with the publication of the first Dutch .com website in 1993. The Dutch .nl domain expanded soon to 5,66 million national domain names today, which is the fourth biggest national domain on the web. The web archiving efforts in The Netherlands were much more modest. Web archiving at the Dutch National Library began only in 2007. Another major Dutch web archiving project is Archipol, the Dutch political parties web archive which started already in 2000. In time, other web archives were formed, like the Rotterdam city web archive and the Frisian web archive.
In my lecture, I would like to evaluate these past initiatives, based on the results of my research at the National Library of the Netherlands, from the point of view of a future national web archive. What part of the Dutch web was captured by the various institutions and what was not? Where are the biggest gaps in our digital memory? What could be done better in future? All these bigger and smaller Dutch web archives are part of a common digital heritage, but the Dutch national web archive seems like a distorted configuration of a Rubik’s cube. These web archives were collected with a wide range of selection criteria in mind at different times, archived using various harvesting techniques, stored in different ways, and even vary in their policies on what to present and how to present it to the user. In short: as if turning a Rubik’s cube, I will show how to solve at least some sides of a future national web archive in The Netherlands.
Los Alamos National Laboratory
Using the Memento framework to assess content drift in scholarly communication
Scholarly articles increasingly link to so-called web-at-large resources such as project websites, online debates, presentations, blogs, videos, etc. Our research (reported in http://dx.doi.org/10.1371/journal.pone.0115253) found overwhelming evidence for this trend and showed the severity of link rot for such references. Our more recent study (currently under peer-review) provides unprecedented insight into the vast extent of content drift for these references. We speak of content drift when the content of a referenced resource evolves after the publication of the referencing article, in many cases, beyond recognition.
For our study, we extract more than one million URI references to web-at-large resources from three vast corpora with a total of 3.5 million scholarly articles published between 1997 and 2012. From this corpus, we identify all URI references for which Mementos exist that are verifiably representative of the state of the resource at the time the referencing article was published. We discover such representative Mementos in any of the 19 web archives that are covered by the Memento Aggregator. By using various standard text similarity measures, we compare the representative Mementos with their live counterparts on the web (if they still exist) and are therefore able to precisely quantify the extent of content drift in our scholarly articles.
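The comparison step described above can be illustrated with a minimal sketch. The abstract does not name the "standard text similarity measures" that were used, so the two shown here (Jaccard similarity on token sets and cosine similarity on term-frequency vectors) are assumptions chosen only because they are common; the function names and the tokenizer are likewise hypothetical, and the input is assumed to be plain text already extracted from the Memento and from the live page.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(memento_text, live_text):
    """Share of distinct tokens common to both texts (1.0 = same vocabulary)."""
    set_a, set_b = set(tokenize(memento_text)), set(tokenize(live_text))
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(memento_text, live_text):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(memento_text)), Counter(tokenize(live_text))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)
```

A low score for a reference whose live page still resolves would indicate content drift rather than link rot; where to place the threshold between "drifted" and "unchanged" is a judgment this sketch does not attempt to make.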
In this presentation we will detail our research methodology and outcomes and make the argument that action needs to be taken to address the issues of link rot and content drift in scholarly communication.
Stanford University Libraries
Understanding legal use cases for web archives
In the broader cataloging of access use cases for web archives, legal use cases are often gestured at but rarely unpacked. How are archived web materials being used as evidence in litigation? What has been the evolving treatment of web archives by courts and litigators? How well understood are the particular affordances and limitations of web archives in legal contexts? What are the relevant rules, precedents, and best practices for authentication of evidence from web archives? How could the web archiving community better support this category of users, and what might be considerations for doing so?
There are by now many examples of the actual application of web archives for legal use cases with which to address these questions. A preliminary literature review suggests a good deal of attention in both cases and law journal articles to issues of authentication but an opportunity for greater education on specific aspects of web archives affecting their reliability. Web archives can (and do) serve as exceptional resources for substantiating claims based on historical, public web content. However, archived websites are far less “self-evident” than many other types of documents that may be used as evidence; their fitness for purpose should ideally entail assessments of:
- canonicality (i.e., is this version of a webpage the same as would have been served to another user?);
- completeness (i.e., how to assess the relevance of content missing from the archive?);
- discreteness (i.e., are efforts being made to ensure that live web content is not leaking into the archival representation?); and
- temporal coherence (i.e., how to assess the reliability of an archival composite of objects collected at varying points in time?).
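As a rough illustration of the temporal coherence criterion, the sketch below measures how far the capture times of a page's embedded objects lie from the capture time of the page itself. It assumes that each capture is identified by a 14-digit `YYYYMMDDhhmmss` timestamp of the kind used in Wayback-style replay URLs; the function names are hypothetical, and the result is only one possible indicator, not an established legal or archival test.

```python
from datetime import datetime, timedelta


def parse_capture_time(timestamp):
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def temporal_spread(page_timestamp, embedded_timestamps):
    """Largest gap between the page capture and any embedded-object capture."""
    page_time = parse_capture_time(page_timestamp)
    gaps = (abs(parse_capture_time(ts) - page_time) for ts in embedded_timestamps)
    # A page with no embedded objects has a spread of zero.
    return max(gaps, default=timedelta(0))
```

For example, a page captured on 1 August 2016 that is replayed with a stylesheet captured on 15 September 2016 has a spread of 45 days, which a reader of the archival composite may wish to know before relying on its appearance.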
Drawing in particular on analyses and findings from the U.S. legal context as well as relevant research from the web archiving field, this session will explore trends and opportunities relating to web archives and legal use cases.
Library Innovation Lab, Harvard University
Instruments for web archive comparison in Perma.cc
Web archive comparison is a current area of exploration in Perma.cc. We’re working to expose, to our end users, the differences that develop over time between an archived version of a webpage and the version of a webpage that is served over the live web. Stated plainly, we’re building instruments that help Perma.cc users measure link rot in their archived links.
We’re building two instruments. The first samples a user’s corpus and returns summary statements like “36% of the links you preserved are now significantly different than the live versions of those links.” These summaries remind the user of the importance of web archives and provide a measure of the ephemerality of the web.
The second instrument helps users quickly spot visual differences between two web archives — usually the same URL archived at two different times in Perma.cc. This instrument relies on snapshot images (PNGs) of WARCs, thoughtful UX/UI, and some cleverness with common image manipulation libraries.
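A minimal sketch of the kind of pixel-level comparison the second instrument implies is given below. It is not Perma.cc's implementation: the abstract does not say which image libraries are used, so the sketch avoids them entirely and assumes each snapshot has already been decoded into rows of `(R, G, B)` tuples. The function names and the `tolerance` parameter are hypothetical.

```python
def diff_mask(snapshot_a, snapshot_b, tolerance=0):
    """Return a grid of booleans marking pixels that differ between snapshots.

    A pixel counts as changed when any colour channel differs by more
    than `tolerance`.
    """
    same_shape = len(snapshot_a) == len(snapshot_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(snapshot_a, snapshot_b)
    )
    if not same_shape:
        raise ValueError("snapshots must have identical dimensions")
    return [
        [
            max(abs(ca - cb) for ca, cb in zip(pixel_a, pixel_b)) > tolerance
            for pixel_a, pixel_b in zip(row_a, row_b)
        ]
        for row_a, row_b in zip(snapshot_a, snapshot_b)
    ]


def changed_fraction(mask):
    """Fraction of pixels marked as changed in a diff mask."""
    total = sum(len(row) for row in mask)
    changed = sum(cell for row in mask for cell in row)
    return changed / total if total else 0.0
```

The mask could be overlaid on either snapshot to highlight changed regions, while the fraction gives a single number suitable for a summary statement. Snapshots of different dimensions would need to be padded or cropped first, a step this sketch leaves out.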
During this talk, we will survey available methods of comparing web archives by discussing performance, UX, and implementation details. We will cover technological and communication challenges that are involved in conveying link rot effectively. We will ensure audience comprehension by illustrating how the methods can be woven into a larger web archiving application using Perma.cc as an example.
Alex Thurman, Columbia University Libraries
Helena Byrne, The British Library
Archiving the Rio 2016 Olympics: scaling up IIPC collaborative collection development
The Summer Olympics is the world’s largest international multi-sport athletic competition, with over 11,000 athletes from over 200 countries gathering in host city venues that undergo years of frenzied infrastructural preparation. The combination of the athletic events during the games with the broader (often controversial) economic, political, social and environmental aspects of Olympics planning ensures that each Olympiad generates an enormous amount of web content from all over the world, in many languages, produced by athletes, teams, national federations, fans, and media publications. The buildup to the Rio 2016 Summer Olympics and Paralympics was no exception (cf. 188 million Google search results for “Rio2016”).
Scholarly research from many disciplines (sports, cultural studies, history, urban planning, economics) involving the Olympics continues to flourish, but becomes ever more dependent on web resources; meanwhile the scale and international scope of Olympics web content poses a challenge to archiving as it extends beyond the mandate of any individual memory institution. National libraries have developed web archives devoted to their own nation’s Olympic teams and/or host city experience, but building an Olympics web archive with a multinational perspective requires collaborative collection development, and the International Internet Preservation Consortium (IIPC) has taken on this challenge.
In advance of the Winter Olympics of 2010 and 2014 and the Summer Olympics of 2012, participants from multiple IIPC member institutions contributed website nominations along with some descriptive metadata, and the sites were crawled pro bono by the Internet Archive, but the collections were not publicly accessible. By late 2014 the IIPC had decided to explore a more rigorous approach to collaborative collection development, establishing the Content Development Group (CDG) and subscribing to the Archive-It service allowing past and future joint IIPC collections to be available for public and researcher use.
From June to October 2016 the CDG carried out a project to archive the Rio 2016 Olympics and Paralympics, enlisting contributors from over 18 IIPC member institutions to nominate websites and provide descriptive metadata, resulting in a rich collection featuring content from over 125 countries in 34 languages. Participation exceeded expectations and the very high volume of seed nominations (over 4500 archived seeds and 3TB of data) required mid-project data budget adjustments and much iterative crawling: lessons learned will be implemented in future collections. Next steps involve making researchers aware of this and other IIPC Olympics collections to maximize their use and impact.
This presentation will highlight the challenges and accomplishments of the project, including the following aspects:
- Project planning
- Engagement strategy for enlisting participants (IIPC members and public)
- Collaborative tools and methods for gathering seed nominations and metadata
- Content analysis and seed crawl scoping
- Technical limitations and capture issues
- QA options (via crawl report analysis, via distributed visual analysis)
- Lessons learned: the challenge of scaling a model of highly curated thematic collection with a pre-set data limit to a fast-paced, high-seed-volume event
- Publicizing IIPC Olympics collection(s) and outreach to researchers.
Publications Office of the European Union (OP)
Creating a web archive for the EU institutions’ websites: Achievements and challenges
The web archive of the EU institutions contains the websites hosted on the europa.eu domain and subdomains. Its aim is to preserve EU web content in the long term and to keep it accessible for the public. The project, which is still in a pilot phase, is being carried out by the Historical Archives of the European Union and the Publications Office of the European Union, in close cooperation with the institutions participating in the project.
Since the end of 2013, the europa.eu domain and subdomains have been harvested quarterly by Internet Memory Research. The web archive can be consulted online, without any access restrictions.
The following topics will be addressed:
- The preservation process: seed selection, harvesting, quality control, storage and access
- Organization and management: finding the balance between central coordination and input of website owners; between outsourcing operations and in-house activities
- Challenges ahead
- Metadata policy
- Quality control procedures
- Long term preservation and stable formats
- Evolution of tools and techniques, technology watch
- Legal concerns: copyright, data protection
- Access: users’ needs, promotion of the archive
- Collection development: cooperation or links with other web archives, collaborative (event based) harvesting, social media.
At the end of the presentation, participants will be invited to share their questions, thoughts, suggestions and/or similar experiences.
Arquivo.pt, Fundação para a Computação Científica Nacional
Preserving websites of research & development projects
Most Research and Development (R&D) projects rely on their websites to publish valuable information about their activities and achievements, such as software used in experiments, test data sets, gray literature, news or dissemination materials.
However, these sites frequently become inactive after the project ends. For instance, only 7% of the project URLs for the FP4 work programme (1994-1998) were still active in 2015. The deactivation of these websites causes a permanent loss of valuable information to human knowledge, from both societal and scientific perspectives.
Arquivo.pt published a study describing a pragmatic methodology that enables the automatic identification and preservation of R&D project websites. It combines open data sets with free search services so that it can be immediately applied even in contexts with very limited resources available.
The “CORDIS EU research projects under FP7 dataset” provides information about R&D projects funded by the European Union during the FP7 work programme. It is publicly available at the European Union Open Data Portal. However, this dataset is incomplete regarding the project URL information.
We applied our proposed methodology to the FP7 dataset and improved its completeness by 86.6% with regard to project URL information. Using these 20 429 new project URLs as a starting point, we collected and preserved 10 449 947 web files, totalling 1.4 TB of information related to R&D activities.
We will present the main contributions of this study, describing the methodology used and the results obtained identifying and preserving R&D websites, as well as quantitative measurements about the ephemera of EU-funded project websites and their preservation by web archives.
All the outputs from this study are publicly available, including the CORDIS dataset updated with our newly found project URLs.
Peter Webster, Webster Research and Consulting
Chris Fryer, Parliamentary Archives
Jennifer Lynch, Parliamentary Archives
Understanding the users of the Parliamentary Web Archive: a user research project
In 2016 the Houses of Parliament Parliamentary Archives conducted a program of research with the users, present and prospective, of the Parliamentary Web Archive (PWA). The research was carried out by Webster Research and Consulting Ltd. It set out to answer the questions:
– Who are the users of the PWA?
– How did users understand the contents of the archive, and how its parts related to each other?
– What value did they place on different parts of the archive?
– How well did current discovery and access arrangements serve their needs, and what would they like to be able to do, but could not do so at present?
The paper falls into three parts. Firstly, it outlines the background to the project: the history of the PWA so far, and the reasons why the Parliamentary Archives wished to commission the research.
Secondly, it outlines the findings of the research in terms of the success of the current PWA integration with Parliament’s wider discovery services. It also makes some observations about the degree to which users fully understand the material with which they are presented. Also presented are the expressed priorities of users as to the next generation of access tools. While specific to the PWA, each of these has wider implications for the web archiving community at large.
The paper concludes with a reflection on the experience of the Archives in devising and implementing the next steps within a complex organisation.
Emily Maemura, University of Toronto, Nicholas Worby, University of Toronto Libraries, Christoph Becker, University of Toronto & Ian Milligan, University of Waterloo
Origin stories: documentation for web archives provenance
As more researchers are using web archives as sources of data, the validity of their findings relies on an understanding of how the web archive was created. In Big Data, Little Data, No Data, Borgman describes how, when researchers use data collected by other people at different times and places, this ‘data distance’ requires formal knowledge representations to communicate important aspects of how the data came to be. For web archives data, researchers may need to understand the technical processes, systems, and computing environments used, but they also need to know the human judgments and decisions that shape the archive. As Andy Jackson notes in a recent post on the UK Web Archive blog, documentation on these decisions is often lacking (http://britishlibrary.typepad.co.uk/webarchive/2015/11/the-provenance-of-web-archives.html).
Our research aims to address this gap. As a collaboration between a librarian, a historian, and researchers of systems for digital curation, we bridge between ‘using’ and ‘creating’ web archives by exploring how information about their origins impacts future use and analysis in research processes, to address questions like:
– What should web archivists document when they create web archives? Which individual curatorial choices made in web crawling and other processes of web archiving need to be documented to later enable a user of the web archives to understand their provenance? How is this done now, how can it be done? Which challenges and gaps arise?
– How do features of web archiving systems influence users of web archives in understanding the origin of the sources they use?
– What do web archive collections reveal about themselves to their users? How can a better understanding and documentation of these features and curatorial choices enable web archives to become more transparent to users?
We draw on a conceptual view of ‘Research Objects’ (http://www.researchobject.org/) to map out the different elements and decisions involved in the process of archiving. Building on a template we developed to study research with web archives, we extend this structure to investigate which aspects of web archives creation are relevant for users. We consider two cases of web archives collections from the University of Toronto Libraries (UTL):
(1) The Canadian Political Parties and Interest Groups Archive-It collection created in 2005 with little formal documentation about the creation process
(2) A newly developed collection, the Global Summitry Web Archive, that has been driven by researchers, with attention to documentation. Created in 2016, the collection was developed by the Munk School of Global Affairs with support from UTL and harvests websites of multilateral summits, meetings and conferences.
We use these cases to illustrate how the documentation of various choices in both the web archive systems and in the curatorial processes has influenced what the resulting web archives reveal to the users. We then discuss the gaps we found in these cases and suggest priorities for future development of web archiving infrastructure in order to open a discussion about how more transparency can be provided through this type of documentation.
Jackie Dooley, OCLC Research, Alexis Antracoli, Princeton University, Karen Stoll Farrell, Indiana University & Deborah Kempe, Frick Art Reference Library
Developing web archiving metadata best practices to meet user needs
OCLC Research established the Web Archiving Metadata Working Group early in 2016 in response to widespread needs expressed across the library and archives web archiving community for best practice guidelines (URL: oc.lc/wam). The group’s charge is to evaluate existing and emerging approaches to descriptive metadata for archived websites and to recommend best practices to meet user needs in order to improve discoverability and consistency.
Our methodology includes literature reviews focused on the needs and behavior of these users, both those written by and/or focused on web researchers and those emphasizing web archiving metadata needs and practices. We cast a wide net in selecting items to be reviewed, ranging from published articles and survey reports to blog posts and conference notes. From this we have learned about the specific metadata needs expressed by web archives users, including, for example, a variety of content types characterized in the literature as “provenance” data. We abstracted each reading and will publish those abstracts with our reports, together with a full bibliography.
We closely studied descriptive metadata rules from the library and archives communities, as well as local guidelines developed by nine U.S. libraries and archives—all of which revealed vast inconsistency of practice. Finally, we studied eleven web archiving tools to determine the metadata that each produces which might be used to automate metadata production; our tools report will include the evaluation grids.
The work is still very much in progress as of October 2016 but will be completed before the IIPC conference in March 2017. OCLC Research will issue three online, open-access reports: tools, user needs, and metadata best practices. Drafts of each report will be made available for review by members of the web archiving community prior to publication. We have been in touch with IIPC board members and other organizations throughout to ensure that this work does not duplicate other efforts. IIPC will be our first post-publication venue for open discussion on the outcomes of our research.
Netarchive, The Danish Royal Library
Less is more – reduced broad crawls and augmented selective crawls: a new approach to the legal deposit law
In 2005 a revised legal deposit law came into force: the Danish part of the internet became part of legal deposit. The task of implementing the collection of web content was assigned to the two national libraries, the Royal Library in Copenhagen and Statsbiblioteket in Aarhus. Netarchive was born.
Collecting and archiving the Danish web has followed the same strategies since 2005:
- 4 in-depth broad crawls per year, each in two steps (first step with a limit of 10 MB, second step with a limit of 2 GB)
- Selective crawls of about 100 frequently updated sites (mostly news sites), with frequencies ranging from 6 times a day to monthly and depths ranging from front pages only to 4 levels
- About three event crawls every year (events that increase activity on the web, e.g. parliamentary elections or the terror attack in Copenhagen in 2015)
A study of selected broad crawls has shown a certain redundancy: website owners do not delete much from their sites, they add content. Time to change strategies? Maybe we will not save much archive space, but reduced broad crawls will put less stress on website owners’ servers than now, and the crawls will probably be performed faster.
In 2016 Netarchive decided to modify the selection strategies under the motto: Less broad crawls, more selective crawls: Less is more.
The new strategies are structured as follows:
- Two broad crawls as before, and two broad crawls with a limit of 100 MB
- Focused crawls of ministries’ and administrative bodies’ websites, very big websites
- Selective and frequent crawls from all Danish national news sites, all regional and local Danish news sites, Special and experimental web sites, organizations and associations, political parties/politicians
- Focused crawls on Social Media
There is an extra gain in the “less is more” strategies. Due to the Danish data protection law, Netarchive is a restricted archive (access only to documented research). Netarchive is working on giving broader access, at least to parts of the archive.
The new organization of crawls will make it much easier to give access to selected domains, e.g. public bodies’ websites.
Our tool NetarchiveSuite allows us to organize crawls in harvest definitions, which create harvest jobs. Extractions from the archive can be done at job level. Under the old strategies we had developed a huge number of definitions, and websites were included in several different definitions. It was impossible to extract jobs from a specific definition in order to get the content of a specific domain or website.
The new strategies result in significantly fewer harvest definitions and, crucially, websites appear in only one harvest definition and consequently only in jobs from one and the same definition.
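The one-definition-per-website property can be sketched as a simple check. This is only an illustration and does not use the NetarchiveSuite API: it assumes the harvest definitions have been exported as a plain mapping from a definition name to the list of domains it covers, and all names and domains below are hypothetical.

```python
from collections import defaultdict


def definitions_by_domain(harvest_definitions):
    """Invert {definition name: [domains]} into {domain: {definition names}}."""
    index = defaultdict(set)
    for definition, domains in harvest_definitions.items():
        for domain in domains:
            index[domain].add(definition)
    return index


def overlapping_domains(harvest_definitions):
    """Domains that appear in more than one harvest definition."""
    index = definitions_by_domain(harvest_definitions)
    return {
        domain: sorted(definitions)
        for domain, definitions in index.items()
        if len(definitions) > 1
    }


# Hypothetical example: the first layout mixes one site into two definitions.
old_layout = {
    "broad": ["news.example.dk", "ministry.example.dk"],
    "selective-news": ["news.example.dk"],
}
new_layout = {
    "selective-news": ["news.example.dk"],
    "public-bodies": ["ministry.example.dk"],
}
```

With the first layout, `overlapping_domains(old_layout)` reports `news.example.dk` under both definitions; with the second it returns an empty result, which is the condition under which extracting the jobs of a single definition yields all archived content for a given site.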
The National Library of Spain
Building a collaborative Spanish Web Archive and non-print legal deposit
Under the umbrella of the non-print legal deposit legislation in Spain, enacted in October 2015, the regional libraries have been working with the National Library of Spain (BNE) to build a collaborative non-print legal deposit, in order to save and optimize resources. In the Spanish Administration, the regional libraries have their own competencies in legal deposit, so for online publications, working together on a shared infrastructure turned out to be the more efficient solution. Thanks to the Bibliothèque nationale de France, the BNE adopted the BCWeb tool so that web curators can manage their own web collections. It was adapted to the BNE infrastructure and made available to the regional web curators, who access the tool via the state network, a secure infrastructure that connects all the national and regional institutions in Spain. Using this shared network, the web curators designated by every regional library are managing their own web collections and working together to build the non-print legal deposit. For those publications that cannot be crawled on the web, because they are behind a paywall or under subscription, the same working group of regional web curators proposes to the BNE the publications they select to be deposited by the editors or distributors. A workflow to bring these publications under legal deposit is being built. The BNE is designing access to both the web archive and deposited digital publications in a common and safe environment that protects copyright, on the one hand, and provides access from all the regional libraries, on the other. Quality assurance is also carried out by the regional web curators using the BCWeb tool (called CWeb in Spain). Policies and handbooks are also shared with the regional libraries. National and regional web curators hold periodic working group meetings to provide support and guidance.
All this work is carried out within the scope of the “Legal deposit and digital heritage” Working Group of the Library Cooperation Council, which includes representatives of both the National and the regional libraries.
Karolina Holub, Inge Rudomino, National and University Library in Zagreb & Draženko Celjak, University of Zagreb University Computing Centre (Srce)
A glance at the past, a look at the future: approaches to collecting Croatian web
The National and University Library in Zagreb (NSK), as a memory institution responsible for collecting, cataloguing, archiving and providing access to all types of resources, recognized the significance of collecting and storing online content as part of the NSK’s core activities. In 1997, the new Croatian Library Law introduced an important amendment to the legal deposit provisions, stating that deposit libraries were obliged to collect and preserve online publications, along with other types of material, as legal deposit. The National and University Library in Zagreb established the Croatian Web Archive (HAW) in 2004, in collaboration with the University of Zagreb University Computing Centre (Srce), and developed a system for capturing and archiving Croatian web resources as part of the legal deposit.
The first web harvesting approach was selective: from 2004 to 2010 only selective harvesting of web resources was conducted, according to pre-established selection criteria. From identification to archived copy there are several steps, and the workflow is based on daily interaction between the ILS and the archiving system. Access is open: all archived content is publicly available and can be searched and browsed via HAW’s web site.
In 2011, in order to enlarge and improve the national collection of archived resources, harvesting of the whole national domain (.hr) started and continues annually using Heritrix. The workflow differs from the selective harvesting because the content is not bibliographically described and there is no interaction between ILS and archiving system.
The analysis of the data from five national domain harvests will be presented, along with findings on the quantity of web resources that disappeared over five consecutive years.
In addition, in 2011 NSK started to run thematic harvests of content of national importance. By now there are seven thematic collections available to researchers.
By mixing and matching all three approaches and tools, the Library attempts to cover, to the greatest extent possible, the contemporary part of the cultural and scientific heritage. The Croatian Web Archive is a publicly available service with more than 27 TB of archived content and more to come.
School of Advanced Study
Moving into the mainstream: Web archives in the press
This paper will examine how web archives have been covered by the press, particularly newspapers and popular periodicals, in the past decade in the US and the UK. It will explore the types of publication in which web archives are discussed, from titles aimed at those working in higher education (The Chronicle of Higher Education is a notable example), to magazines with an interest in science and technology (for example WIRED), to the tabloid press (including newspapers such as The Mirror and the Daily Mail in the UK).
Drawing on both the live web and web archives it will seek to trace the ways in which discourse about the archived web has changed over time, and how public awareness and understanding have developed as a result. For example, since 2013 there have been several stories (both political and cultural) which have seen coverage of web archives move from the technology sections of newspapers to the main news. In November 2013, The Guardian newspaper ran a story about the apparent deletion from the UK Conservative party’s website of a decade’s worth of political speeches. It introduced the Internet Archive to readers, along with the concept of robots.txt and the restrictions of legal deposit legislation at the British Library.(1) Less than three years later, when The Independent reported on changes to Melania Trump’s website, focusing on disputed claims about her education, the Internet Archive was simply introduced as a research resource, with no further contextual information.(2) Underlying assumptions about the degree of public knowledge about web archives would appear to have changed. The paper will also explore whether debates within Europe about ‘the right to be forgotten’ have influenced media coverage of web archives and consequently raised levels of awareness of digital archiving activity.
Finally, the paper will conclude by discussing how archiving institutions can respond to news stories of this kind and work with journalists to ensure both that they are represented accurately and that opportunities for public engagement are not missed.
- (1) ‘Conservative party deletes archive of speeches from internet’, The Guardian, 13 Nov. 2013 https://www.theguardian.com/politics/2013/nov/13/conservative-party-archive-speeches-internet (accessed 11 Oct. 2016).
- (2) ‘Melania Trump’s website: Nothing to see here, says Donald Trump’s wife’, The Independent, 29 July 2016 http://www.independent.co.uk/news/people/melania-trump-website-disappear-donald-trump-wife-university-degree-a7161676.html (accessed 11 Oct. 2016).
University of Mississippi
Keyword “Katrina”: a deep dive through Hurricane Katrina’s unsearchable archive
Since their inception, social media platforms have been valued as a critical resource for sharing news and information, especially during times of crisis; as such, several strategies for archiving that information are being put into place. No such strategies were in place for web-based blogs, the precursors to today’s social media, and an analysis of hundreds of blog archives from Hurricane Katrina suggests that an important part of that history has already been lost.
Last year I edited a Hurricane Katrina-related anthology of blog posts, online essays and other ramblings about what it was like living through the storm. “Please Forward: How blogging reconnected New Orleans after Katrina” (UNO Press, 2015) featured the writing of more than 75 “citizen journalists” (though most of them didn’t consciously use that term at that time, or even necessarily embrace that role). Much of this collective diary had already disappeared online; only some of it was still accessible via the Wayback Machine. Excavating the entries required a strategy that combined old-fashioned reporting, social media savvy and the software sophistication of the Internet Archive.
Kirkus Reviews called the resulting collection “powerful…a book that preserves testimony that might have disappeared amid the news cycles and Web overflow”; Slate technology writer Amanda Hess said “Please Forward” is “a haunting Internet history”; and Xavier Review recently determined it to be “ambitious and…necessary.”
The contributors to “Please Forward” were incredibly generous in allowing me to resurface and print their reflections from such a difficult part of their lives. I believed that those posts — and the original blogs they were excerpted from – deserved to be preserved and discoverable again in an online context. I worked with Jefferson Bailey’s team at Archive-It to “re-collect” much of the source material I’d mined for the anthology. The collection is still technically under construction, but you can see it here: https://archive-it.org/collections/7625.
I recently conducted a “Curating the Lost Web” workshop at the Society of Professional Journalists’ Excellence in Journalism conference in New Orleans, where I spoke about my research methods and emphasized the opportunities that projects like this present for journalists in particular. I also spoke about this during a lightning round presentation at the recent Dodging the Memory Hole 2016 Conference at UCLA. I would be eager to elaborate on that talk in a full 30-minute presentation at the 2017 IIPC Web Archiving Conference.
University of North Carolina – Chapel Hill
The unending lives of net-based artworks: Web archives, browser emulations, and new conceptual frameworks
Research into net-based artworks from the 1990s to the present is an undertaking divergent from much prior art historical scholarship. While most objects of art history research are stable and discrete analog works, largely in museum collections, net-based artworks are vital and complex entities, existing ‘live’ on artists’ websites, with older versions captured in online web archives like the Internet Archive.
Scholarship on these important artworks benefits by drawing on previous versions preserved in web archives, but utilizing these online resources raises critical methodological challenges. Not only must art history scholars contend with how multiple versions of a work change over time, but they must also address the ever-evolving environment of the web itself. The web of the mid-90s was a radically different environment than the web of today, and yet many historic net-based artworks can still be accessed using modern browsers.
Probing several works by notable net artist Alexei Shulgin as test cases, I investigate the methodological issues that arise when conducting art history research using web archives. I present several methods for analyzing web archival copies of these works, and for comparing these against current versions live on Shulgin’s website. These methods include critically analyzing the HTML code across various web archival copies of works, and using emulated browsers via the oldweb.today platform to recreate how previous users may have experienced the works in an older web environment.
However, these methods must also attend to the evolving and multiple nature of these artworks. The web archival and current copies of Shulgin’s works co-exist with each other not as distinct ‘old’ or ‘new’ versions, but as many facets of a still-living document. From the first, net-based artworks frustrate existing notions of exhibition, collection, curation, conservation, and the boundaries of the artwork. Art historians examining these works today similarly need to adapt dominant theoretical frameworks; to this end, I propose a new such framework for conceiving of these works. Drawing on the archival theory of Wolfgang Ernst and the records continuum model developed by Frank Upward, Sue McKemmish and others, I argue that net-based artworks are plural and heterogeneous documents characterized by dynamic lifecycles, simultaneously traversing archival deep storage and the living web.
I hope to demonstrate that this framework is not only generative of new readings for historic net-based artworks and accommodating of new methods, but can also usefully equip scholars approaching dynamic cultural heritage objects in web archives more broadly.
Francesca Musiani, Institut des sciences de la communication & Valérie Schafer, Centre national de la recherche scientifique
Do web archives have politics?
At the crossroads of two topics of the conference, “Approaches to Web archiving” and “Research methods for studying the archived web”, this presentation will address some of the political issues embedded in Web archiving. Our paper draws on approaches influenced by Science and Technology Studies, Infrastructure Studies and Media Studies, and on case studies selected within our field of expertise (Web archiving governance, Twitter archiving, Web archives of the 90s). This work builds on early-stage research, presented at the first RESAW conference, that called for opening the black boxes of Web archives and Web archiving (Schafer, Musiani & Borelli, 2016).
Revisiting Langdon Winner’s seminal paper, “Do artifacts have politics?” (1980), we assume as the starting point of our argument that “technical things have political qualities”. By “politics”, Langdon Winner meant “arrangements of power and authority in human associations as well as the activities that take place within those arrangements” (Winner, 1980: 123). This hypothesis invites us to study how the distributed, diffused and technology-embedded nature (DeNardis, 2014) of Web archiving “can embody specific forms of power and authority” (Winner, 1980: 121). Observing infrastructures, Web archives design and their stakeholders “coming together” entails looking into the scripts (Akrich, 1992) that perform role-sharing, the distribution of competencies and some delegation of rule enforcement to algorithms and automated devices. As noted by Star: “[…] Study an information system and neglect its standards, wires, and settings, and you miss equally essential aspects of aesthetics, justice, and change.” (1999: 337-339).
Our argument, “Web archives arrangements as form of order”, builds on examples such as the diversity of stakeholders, perimeters of crawling and selection of content (and their possible bias regarding e.g. the place of “amateur culture” or cultural minorities) or emergency collections such as the Paris Attacks collections, to show how Web archive collections might have been designed and built to produce a set of consequences prior to any of their professed uses. It will address the balance between new forms of action and some imposed frameworks (e.g. access to Web archives; design of search tools and interfaces, and the vision of Web archives and their uses that they provide; use of metadata; exploration of Web archives with Digital Humanities tools).
- Akrich, M. (1992). The De-scription of Technical Objects. In Bijker, W. & J. Law (eds.), Shaping Technology/Building Society. Studies in Sociotechnical Change, Cambridge, MA: MIT Press, 205-224.
- DeNardis, L. (2014). The Global War for Internet Governance. New Haven, CT: Yale University Press.
- Schafer, V., Musiani, F., & Borelli, M. (2016). Negotiating the Web of the Past: Web archiving, governance and STS. French Journal for Media Research, 6, http://frenchjournalformediaresearch.com/lodel/index.php?id=952
- Star, S. L. (1999). The Ethnography of Infrastructure. American Behavioral Scientist, 43 (3), 377-391.
- Winner, L. (1980). Do artifacts have politics? Daedalus, 121-136.
University of East Anglia
What can web link analysis reveal about the nature and rise of euroscepticism in the UK?
This proposed paper will utilise Web link analysis to gain a further understanding of the current nature of Euroscepticism in the United Kingdom. In that regard it is hoped that this paper will provide an illustration of how research on using Web archives can be revealing and useful to understanding contemporary political debates.
Britain’s relationship to and subsequent engagement in the process of European integration is one of the most important political, economic and social developments of the last 50 years. It is also highly controversial and heatedly debated, as the result of the referendum earlier this year on leaving the EU has shown. The rise of Eurosceptic ideas and their increasing popularity is now beyond doubt. It is clear that the Web has played a major role as a source of information and as a forum to present and discuss arguments. The question of ‘truth’ and the role of experts was one of the elements of a highly charged debate in the lead-up to the referendum, and remains so now as the country approaches Brexit, whatever form that will eventually take. The goal of gaining greater understanding of how people accessed information and how various sources of information on the Web were interconnected is an interesting and potentially revealing academic question. Were, for example, Eurosceptic websites very much a closed circle linking only to similar sites and thereby amplifying and reconfirming certain perceptions or ‘truths’? Or did they also link to counter-arguments and information, thereby potentially revealing the existence of a more rounded debate and discussion? Utilising the UK Web Archive, it is hoped that a link analysis of a range of Eurosceptic websites will illustrate some trends in this regard and offer some answers to these and other similar questions.
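The 'closed circle' question above can be made concrete with a simple measure: of all the links that a group of sites sends to other hosts, what share stays inside the group? The following is a minimal sketch only, not the author's method. It assumes the link data has already been extracted from the archive as (source URL, target URL) pairs, and the function names and the `.example` hosts are hypothetical.

```python
from urllib.parse import urlparse


def host(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def inward_link_share(links, group_hosts):
    """Share of cross-host links from group sites that point back into the group.

    links: iterable of (source_url, target_url) pairs (assumed input format).
    group_hosts: hosts classified by the researcher as belonging to the group.
    Returns a float between 0 and 1, or None if no cross-host links were found.
    """
    group = {h.lower() for h in group_hosts}
    internal = external = 0
    for source, target in links:
        src, dst = host(source), host(target)
        # Skip links from outside the group and links within a single site.
        if src not in group or src == dst:
            continue
        if dst in group:
            internal += 1
        else:
            external += 1
    total = internal + external
    return internal / total if total else None


# Hypothetical sample: two group sites and one outside site.
sample_links = [
    ("http://a.example/x", "http://b.example/"),      # group to group
    ("http://a.example/", "http://www.c.example/"),   # group to outside
    ("http://a.example/", "http://a.example/about"),  # same site, skipped
    ("http://c.example/", "http://a.example"),        # outside source, skipped
]
share = inward_link_share(sample_links, {"a.example", "b.example"})
```

A share close to 1 would be consistent with a closed circle, and a lower share with more outward linking. The result depends heavily on how sites are assigned to the group, which remains an interpretive decision.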
This paper outlines an experimental method for a critical study of web history.
I do so by taking advantage of the mental picture of desolated 90s home pages (unchanged live 90s home pages, in analogy to abandoned architecture) and the critical portrait of society that documentary film draws through storytelling.
Drawing on the theory of grammars of action, the vernacular language is automatically pulled into sight of modern research. Indeed, the greater implication of my experiment is that intervening in the assumptions upon which research into digital culture is conducted disturbs what research and practice take for granted, and with it their unconscious neglect of whatever lies outside the established scope of knowledge. By reexamining the structure of ‘desolated home pages’ in the light of contemporary practice and by acknowledging the limitations of existing assumptions, we can perhaps start considering other ways of accessing and categorizing websites and digital culture more broadly.
Shifting the attention from a user-centered history towards the history of web design, the question then is ‘What became of the homepage.gif, the background music, the background image or the visitor count?’ In refusing to see these dichotomous logics as incontestably ‘natural’ or ‘obvious’, more serious questions regarding processes of standardization, the consequences of homogeneous web aesthetics and the power struggles of the recent past can also be addressed. Foregrounding how the use of seemingly irrelevant and naive elements can become a global socio-political system highlights the (un-)consciously slow process of grammar infiltration that can today shape the language of the future web.
Sally Chambers & Peter Mechant, Universiteit Gent, Sophie Vandepontseele & Nadège Isbergue, Royal Library of Belgium
Aanslagen, Attentats, Terroranschläge: Developing a special collection for the academic study of the archived web related to the Brussels terrorist attacks in March 2016
In December 2016 a consortium, led by the Royal Library of Belgium and including institutions such as the National Archives of Belgium and Ghent University in collaboration with DARIAH-BE, was awarded funding from the Belgian Science Policy Office (BELSPO)’s Belgian Research Action through Interdisciplinary Networks (BRAIN) programme for the PROMISE (PReserving Online Multiple Information: towards a Belgian StratEgy) project. PROMISE is a 24-month project that aims to a) identify current best practices in web-archiving, b) investigate how they can be applied in the Belgian context, c) undertake a pilot web-archiving exercise in Belgium, d) experiment with providing access (and use) of the pilot Belgian web archive for scientific research and e) make recommendations for a sustainable web-archiving service for Belgium.
Within the context of the PROMISE project, this short paper will present the initial results of a case study on the experimental development of a special web-archive collection about the terrorist attacks in Brussels on 22 March 2016. The aim of the case study is to: a) design and document the process of creating the special web-archive collection, b) identify research questions across a range of humanities and social science disciplines that could be asked of the special web-archive collection and c) select appropriate methodologies to answer these research questions. It is intended that this case study would be developed as a partnership between the library and archive professionals in the Royal Library and National Archives of Belgium and a representative group of researchers who are interested in using the special collection.
As Belgium currently does not have a web-archive, the special collection will need to be put together from other existing web-archives. For example, on the live web, there are Wikipedia entries in 61 languages about the terrorist attacks in Brussels on 22 March 2016. As a basis for initiating the development of the special collection, we propose to use the Wikipedia pages in the 3 official languages of Belgium plus English that have already been archived in the Internet Archive:
For instance, the Internet Archive has archived the English version of this page 50 times between 22 March 2016 and 14 November 2016, including 15 snapshots on the day of the attacks:
We will also examine two similar special web-archive collections that have been created. The first is the special collection in the UK web-archive on the London Terrorist Attack on 7th July 2005 (https://www.webarchive.org.uk/ukwa/collection/100757/page/1). The second is the special collection created in the context of a project from the National Centre for Scientific Research (CNRS), ASAP (Archives Sauvegarde Attentats Paris), which investigates the digital reaction to the terrorist attacks in Paris and Saint-Denis in January 2015 (https://asap.hypotheses.org).
Université Sorbonne Nouvelle
The web as a memorial: Real-time commemoration of November 2015 Paris attacks on Twitter
In November 2015, Paris was subjected to a terrorist attack. People showed their solidarity through social media using symbols and specific hashtags.
People also offered their help to those trying to find refuge using the Twitter hashtag #PorteOuverte, which was used nearly 600,000 times. Moreover, the Twitter hashtags #RechercheParis and #SearchParis were used to try to find missing persons and to document the progress of the search.
Shortly after the attacks, Twitter became a way to express both grief and solidarity, and to commemorate the victims.
First, hashtags such as #JeSuisParis or #PrayForParis were used more than 6 million times according to Topsy.
Second, a number of users including news editors started using Twitter to share information and stories about the victims. Again, specific hashtags like #enmémoire and dedicated accounts were used to memorialize every victim of the attacks. As of January 5, 2016, 130 tweets had been posted using the @ParisVictims Twitter account.
In this paper we aim to study the particularity of the web as a forum for the commemoration of terror attacks, in particular the use of social media. We propose to explore how social media filter and construct this commemoration in real time. We will focus on the role of Twitter and on the relation between real-time events and online commemoration: how to pay tribute and express support to victims of terrorism? What role for the archive? What difficulties have we faced with regard to Twitter and data limitations?
For this study, we are using the tools and archives of the Institut national de l’audiovisuel Digital legal deposit (in partnership with the ASAP research project, CNRS) and Twitter data (around 6 million tweets) gathered during the November 2015 Paris attacks and the 2016 Nice attack (ANR ENEID research team, Post-Mortem Digital Identities and Memorial Uses of the Web, Université Sorbonne Nouvelle, Paris-III). We use natural language processing and text mining (R, Iramuteq) for corpus extraction and to identify co-occurring words and semantic information associated with Twitter hashtags.
- PANG B., LEE L. (2007). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2.
- RASTIER F. (2011). La mesure et le grain. Sémantique de corpus, Champion, Paris.
- RATINAUD P. (2009). IRaMuTeQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires.
- RATINAUD P. (2014). Visualisation chronologique des analyses ALCESTE: application à Twitter avec l’exemple du hashtag #mariagepourtous.
London School of Hygiene & Tropical Medicine
Lessons from lessons from failure with the UK Web Archive – the MMR Crisis, 1998-2010
In the early 2000s, British public health was plunged into crisis. The work of Wakefield and colleagues at the Royal Free hospital hinted at a link between the measles-mumps-rubella vaccine (MMR) and autism. After the publication of their article in The Lancet in February 1998, doubt was sown about the safety of the vaccine and childhood immunisation rates fell. While Wakefield’s research practices and ethics violations were exposed in 2004, public health continued to battle those who questioned the need for, and safety of, MMR.
One cannot write this history without internet archives for two reasons. First, the internet is regularly cited by contemporaries and in hindsight as a significant influence on public understanding of the crisis. Historians need to know what was available on the Web and how it evolved over time. Second, many of the URLs provided in print evidence have disappeared from the live Web. Thus, historians need access not just to the resources themselves but also to metadata such as crawl dates so that we can understand (roughly) when information was made available on the Web, when it was modified and when it ceased to be accessible.
The existence of such archives, however, does not solve the fundamental methodological issues of integrating documentary evidence from the historical Web into wider histories. These became clear when this author began to research the MMR crisis as part of a larger project on vaccination policy since the Second World War. The collection of Web data since 1996, coupled with the mass digitisation in recent decades of journals, newspapers and official documents, has left historians with far too much information to be interpreted by a single human. As a result, “traditional” historians are required to re-evaluate how they select and analyse their sources.
This research on MMR made use of several archives, including: the British Library’s SHINE interface; Archive.org; The (UK) National Archives’ archive of government websites; and the publicly available online archive of documents related to MMR on investigative journalist Brian Deer’s personal website. Previous experience working with the British Library and Institute of Historical Research – presented as “Naïve researchers do bad history – Lessons from failure with the UK web archive” at RESAW in 2015 – has made a very daunting project more manageable. Errors and problems have continued to occur, however; an inevitable part of using what is to many historians a “new” technology.
This paper, therefore, will outline how internet archives have enriched a “traditional” history project. In doing so, it makes clear that “histories of the Internet” cannot and should not be kept separate from other forms of contemporary history – and that the historical Web must be embraced as a source of evidence like any other. But it also calls for more training and greater awareness of the advantages and limitations associated with Web archives. Without this, there is a danger that this new well of information will be ignored, misused or rejected by those who do not understand both its importance and its complexity.
The Royal Danish Library
Capturing the web at large – a critique of current web citation practices
The ability to provide precise and persistent references is a cornerstone of good scientific practice, yet for web resources we are often not able to provide a satisfying level of bibliographical information. This means that research could be dismissed for lack of scholarly exactitude, or, ultimately, it might even prompt an unwillingness to include web materials – whether from the live web or web archives – in academic research.
The inadequacies of the prevailing standards for web archive references became evident from two recent research projects designed to challenge the Danish web archive, Netarkivet, in a joint effort by researchers from both computer science and the humanities (as described in the iPRES 2016 paper “Persistent Web References – Best Practices and New Suggestions”). The initial findings led to in-depth study of the extent of the problem of non-persistent and imprecise web referencing in humanistic research.
Through a qualitative as well as a quantitative study of a selection of recent research output, we identify the existing citation practices and ideals regarding web resources among Danish researchers and students. We document substantial link rot, as well as a significant number of references in danger of becoming lost due to inherited dependencies that are likely to be non-persistent, and offer a critique of the inconsistencies regarding web references among the most widely used academic publishers. Besides concluding that researchers need to embrace web archives, we discuss the types of information that need to be embedded in each individual web reference in order for them to reach a level of precision and persistence on a par with traditional references for analogue material.
In conclusion, the presentation will point to an urgent need for new referencing practices as well as an elaboration of what is needed in such a referencing practice.
The British Library
The web archive and the catalogue
The British Library has a long tradition of preserving the heritage of the United Kingdom, and processes for handling and cataloguing print-based media are deeply ingrained in the organisation’s structure and thinking. However, as an increasing number of government and other publications move towards online-only publication, we are forced to revisit these processes and explore what needs to be changed in order to avoid the web archive becoming a massive, isolated silo, poorly integrated with other collection material. We have started this journey by looking at how we collect official documents, like government publications and e-journals. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. Our current methods for combining curatorial expertise with machine-generated metadata will be discussed, leading to an outline of the lessons we have learned. Finally, we will explore how the ability to compare the library’s print catalogue data with the web archive enables us to study the steps institutions and organisations have taken as they have moved online.
The British Library
Resource not in archive: understanding the behaviour, borders and gaps of web archive collections
Since April 2013, the six UK Legal Deposit Libraries have been archiving the whole of the UK web domain, under the terms of Non-Print Legal Deposit Regulations. In addition to this, the UK Web Archive has archived selected websites on a permission-cleared basis since 2004. Our Web Archive collections contain c. 160TB of data, and billions of individual resources. This represents an incomparable wealth of information for the researcher on the subject of contemporary Britain, a large proportion of it offline and not available elsewhere.
While it is the stated aim of the UK Web Archive to collect the ‘whole of the UK web domain’, and despite the immense amount of data available, there are technical, legal and resource issues which mean that our collection has gaps and limitations. These are important for the researcher to understand when approaching our collections.
This paper will describe Web Archiving as a transformative process, which by necessity converts the dynamic, ephemeral and multi layered online environment into static representations for the purpose of archiving and access.
The paper will address:
Even state-of-the-art web crawlers have technical limitations and are currently unable to capture streaming media, deep web or database content requiring user input, interactive components based on programming scripts or content which requires plug-ins for rendering. This means that certain elements in some of the archived websites are not present.
Web Archiving is carried out under the auspices of the 2013 Regulations which stipulate that the Libraries may only archive web content published in the UK. The UK Web Archive therefore imposes a territorial boundary on its collection which does not exist on the live web. As the process of identifying “UK” web content cannot be fully automated, there are inevitable gaps in the collection.
Defining the parameters of the Crawl:
Web Archiving is expensive. Restrictions are imposed on the amount of data that is crawled from individual web domains. This enables capture of the majority of websites in their entirety; however, some larger websites will be incomplete unless the crawl parameters are overridden on a targeted basis.
Web Archiving works by taking a snapshot of a website at a particular point in time. We therefore miss some updates to websites and fail to capture websites if they are offline at the time of crawling.
Accessing an ‘instance’ of an archived website requires a reconstruction of all the elements necessary to render the website (html, images, style sheets etc) for playback. The playback software attempts to assemble the elements closest in time to the date of the instance; however, there can be temporal differences between the elements of an archived website, which set it apart from the live version at the time of archiving.
As we cannot, and do not attempt to, capture everything, the resulting Collection is shaped by our collection development policies. The paper will explore how collection development policies and the selection process influence the Web Archive Collection.
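The playback behaviour described above, in which each element of a page is taken from the capture closest in time to the requested instance, can be illustrated with a short sketch. This is an illustration under stated assumptions, not the UK Web Archive's actual playback code: it assumes capture dates are available as 14-digit timestamps (YYYYMMDDhhmmss), and the function names and sample data are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit capture timestamp format


def parse_ts(ts):
    """Parse a 14-digit capture timestamp into a datetime."""
    return datetime.strptime(ts, TS_FORMAT)


def closest_capture(requested, captures):
    """Return the capture timestamp nearest in time to the requested one."""
    target = parse_ts(requested)
    return min(captures, key=lambda ts: abs(parse_ts(ts) - target))


def temporal_drift(requested, resources):
    """For each resource, pick the nearest capture and report its drift.

    resources: dict mapping a resource name to its list of capture timestamps.
    Returns a dict mapping each name to (chosen timestamp, drift in seconds).
    """
    target = parse_ts(requested)
    result = {}
    for name, captures in resources.items():
        chosen = closest_capture(requested, captures)
        drift = abs((parse_ts(chosen) - target).total_seconds())
        result[name] = (chosen, drift)
    return result


# Hypothetical page: the HTML was captured at the requested moment, but the
# style sheet was only captured months earlier and several days later.
page = {
    "index.html": ["20160322150000"],
    "style.css": ["20160101000000", "20160401000000"],
}
drift = temporal_drift("20160322150000", page)
```

In this sample the style sheet shown at playback would come from a capture made more than nine days after the HTML, which is the kind of temporal difference a researcher may need to account for.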
University of Bristol
Tracing the virtual community of Hong Kong Britons through the archived web
Historians are adept at using traditional textual sources to track the existence of communities over time. However, the internet has opened up new forms of communication between groups separated by distance. This paper focuses on the example of the British expatriate community in Hong Kong, between 1980 and 2010, exploring the role online groups and forums play in reinforcing identity within a disparate group.
Coinciding with the rising use of the internet, the 1997 handover of Hong Kong to China saw a significant number of the British community in the territory leave. The ways these leavers continued to interact with Hong Kong, their scattered social groups, and those who remained can be traced through the websites that developed to enable a sense of virtual community.
The archived web provides an ideal tool to track the development of websites over time. It affords an insight into the history of sites that do not maintain a significant archive, allowing changes in size, usership, and focus to be probed. This paper will take a small number of websites as a case study, including: batgung.com, gwulo.com, and hkexpats.com. The continued success of gwulo.com, which provides a platform for crowd-sourced information on Hong Kong’s past, will be focused on as an example of the British community memorialising and processing the colonial legacies of the territory.
The use of web archives offers a new way for scholars to understand scattered communities in the late twentieth and early twenty-first centuries. Despite being dispersed from the physical site of their shared connection, these online groups and sites allow Britons who lived in Hong Kong to maintain a link to former colleagues and friends, and the territory itself, allowing the sense of community to endure.
University of Sussex
The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004
The Mass Observation Project (MOP) is a unique national life writing project about everyday life in Britain, capturing the experiences, thoughts and opinions of ‘ordinary’ people. Each year since 1981 the project has issued three ‘Directives’ (open questionnaires) to a panel of hundreds of volunteer writers nationally, known as ‘Observers’. Between 1991 and 2004, observers were periodically asked about the impact of information technology on reading, letter writing, and communications. Our research indicates that the advent of the home PC brought about a number of historically-specific changes in the way people scribed and composed their written communications. The processes through which people turned ideas into text were irreversibly recalibrated by the possibilities of saving, editing, copying and pasting, as well as the practical experience of creating text on screen. As computer resources moved from being predominantly accessible at work to being a staple part of the home, the lines between labour and leisure, business and pleasure and the personal and the professional were blurred, altering in turn the spatial configuration of the home. Ultimately, the advent of the home PC had a significant effect on the processes through which individuals composed a sense of self on a day-to-day basis. It introduced new tensions, possibilities and anxieties to the act of negotiating a ‘modern’ identity.
In the midst of this, networked technologies and the World Wide Web entered workplaces, schools, and homes, and everyday people started to create and consume online the web pages now archived by a range of national and international memory institutions. This paper uses the MOP archives to articulate the rich social, material, and spatial contexts in which these web pages were made and consumed. We argue that the reflexive and yet private nature of responses held in the MOP archive makes it an important window into the cultural and social contexts in which networked technologies were encountered. Moreover, they offer an unparalleled insight into how home computing raised new questions about the way the ‘personal’ was categorised at a moment when the lines between public and private were shifting.
We are aware that our proposal is unorthodox. Web archiving is a mature field, and yet the use of web archives by researchers is fragmentary at best. Historians, identified as the primary long-term beneficiaries of web archives, are trained to consult unpublished, personal, organisational, and official archives alongside the published material web archives represent. We argue, therefore, that our paper, the history it seeks to articulate, and the source base it uses represent the kind of work crucial to deepening the connections between contemporary historical research and the traditions of the web archiving community. By providing space to discuss the use of both networked and non-networked home computers together and alongside web preservation, web access, and web history, the web archiving community further facilitates the incorporation of web archives into learning, teaching, and research practices.
Brendan Power, Trinity College Dublin & Svenja Kunze, Bodleian Libraries, Oxford University
The 1916 Easter Rising web archive project
The Library of Trinity College Dublin, the University of Dublin, the Bodleian Library, University of Oxford, and the British Library undertook a project to identify, collect and preserve websites that can contribute to an understanding of the causes, course, and consequences of the 1916 Easter Rising. It aimed to preserve social, cultural, political, and educational online resources to enable critical reflection on this seminal event.
The range of websites included in the web archive reflects the varied ways in which the Irish and British states, cultural and educational institutions, as well as communities and individuals, approached the centenary events. The project arose out of a mutual desire to contribute to the 1916 Easter Rising centenary and to explore the possibility of establishing a themed special collection within the UK Web Archive. The Bodleian Library primarily collected UK websites and, since no legislation exists in the Republic of Ireland to ensure that the .ie domain is preserved, The Library of Trinity College Dublin collected websites within the .ie domain on a voluntary basis with the permission of the website owners.
The valuable partnership created between the three institutions on this project, utilising the Bodleian Library’s inclusion in non-print UK legal deposit legislation, the technical infrastructure and expertise of the British Library, and merging these with the wealth of material available in the .ie domain which The Library of Trinity College Dublin made available on a permissions basis, served to deepen and strengthen the historic bonds between the UK Legal Deposit Libraries. Whilst in the first instance the project was designed to produce a web archive collection, resulting in a corpus of 300+ targets, it was also a test case for effective collaboration between legal deposit libraries to enable the curation of evolving types of collections, and helped to explore how themed, curated web archive collections can be used to promote the value and potential of web archives to a wider audience. This development of more diverse collaborative networks can facilitate future-facing partnerships that are based on our collective commitment to devise innovative approaches to emerging digital opportunities.
This presentation will review the project and outline the problems and opportunities that emerged as the project progressed. In particular, it will highlight the challenges that arose from working across multiple jurisdictions, and the implications of different legislative frameworks for archive curation and collection building, and ultimately, for the content and structure of web archive collections.
The National Library of Ireland
‘Remembering 1916, Recording 2016’: community collecting at the National Library of Ireland
The National Library of Ireland (NLI) has been archiving the Irish web on a thematic basis since 2011. The National Web Archive reflects our collection development policy and the archived websites echo the political, cultural and creative life of twenty first century Ireland. This presentation will specifically examine the NLI’s work to record commemorations in Ireland in 2016 and how new community engagement projects were undertaken.
Firstly, 2016 marked two important centenaries in Ireland. The 1916 Easter Rising is considered a seminal event in Irish history as it marked the first steps taken towards an independent state. In addition, 2016 saw the continuation of the commemorations of the First World War, with particular emphasis on the centenary of the Battle of the Somme, in which so many Irish men lost their lives. These commemorations were considered an important milestone as they afforded Irish people, both at home and our considerable diaspora abroad, the opportunity to explore the events of 1916 and how they shaped modern Ireland.
The NLI has a long standing tradition of collecting, preserving and making accessible the record of commemorative events, and with this in mind the NLI undertook the largest web archiving project to date, “Remembering 1916, Recording 2016”. The project was a core part of the NLI’s commemorative programme for 2016 and the overall Ireland 2016 commemorations. It resulted in the collection of 455 websites that recorded the commemorations of both the centenary of the 1916 Easter Rising and the Irish involvement in World War One.
The web played a significant role in facilitating the commemorations. This was a unique aspect of the 2016 commemorations that hadn’t been previously encountered. Many interesting online exhibitions were launched by the National Cultural Institutions, schools and Universities. Similarly, many innovative social media accounts were created that documented the 1916 Easter Rising. These websites and social media accounts are vulnerable to loss, as their existence, in some cases, was temporary. This presentation will examine the process of collecting, preserving and making accessible these online commemorations.
How do we do it?: Collection development for Web Archives
Building thematic web archives is fraught with many decisions about what to collect and how to best identify scholarly and cultural production around a particular topic. The literature on collection development for web archives is relatively sparse as much of the research on web archive development has focused on technical aspects of capture, crawling, digital preservation and, to a lesser degree, access. As researchers begin to use web archives more regularly, we recognize the vitally important role that selection and collection development play in the ultimate usefulness and value of the collections that have been built. How do practices used in developing traditional library collections translate into the work of collecting the web, and where do we need to develop new methodologies and tools? How do we maximize the value of collections for research and integrate researcher needs into selection practices? What kinds of information about collecting practices do researchers need to have in order to generate valid and meaningful findings? Are there particular needs that a researcher has when working with a curated thematic collection, as opposed to larger domain or mass web collections?
This short paper provides an in-depth view of collection development strategies that have been used to develop the Human Rights Web Archive at Columbia University Libraries. This thematic web archive has been in development since 2008 and contains over 700 websites and 14 TB of data as of December, 2016. A review of practices, including curatorial research, user-focused research, and assessment strategies, reveals the multi-faceted approach that can be taken to scoping and defining a web collection. This review also considers the tensions between human curation and automated methods of identifying content, and the ongoing challenges of sharing curatorial practices with end users. This case study is grounded in a review of the existing literature on web archive collection development and the conceptual and practical approaches of current interest. By sharing the experience of the HRWA, we seek to contribute to ongoing dialog and development of effective collection building practices.
The British Library
A comparative analysis of URLs referenced in British publications relating to London 2012 summer Olympic & Paralympic Games
It is over four years since London hosted the Summer Olympic and Paralympic Games. The event was the focal point of numerous publications in the UK from 2003 to 2016. Using the British National Bibliography (BNB) list, this presentation will take a sample of British research publications and analyse what percentage of these books include websites and social media references in the notes and bibliography section. In addition, I will explore how much of this content is still available on the live web, how much of it is available in a web archive and what the most popular domains were. As we are currently in a period of transition from mostly print to digital, online content will play a much bigger role in future research projects. It is hoped that this presentation will help highlight the need to ensure that web content referenced is traceable long after the publication date.
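The two measurements named in this abstract (the share of publications that cite web content, and the most frequently cited domains) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the input shape (one list of cited URLs per publication) are hypothetical and are not taken from the presentation.

```python
from collections import Counter
from urllib.parse import urlsplit


def share_citing_web(bibliographies):
    """Percentage of publications whose references contain at least one URL.

    `bibliographies` is assumed to be a list with one entry per publication,
    each entry being the list of URLs found in its notes and bibliography.
    """
    if not bibliographies:
        return 0.0
    with_urls = sum(1 for urls in bibliographies if urls)
    return 100.0 * with_urls / len(bibliographies)


def domain_counts(urls):
    """Count how often each host is cited, treating 'www.' as optional."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            # No scheme or no host part: skip rather than guess.
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Hypothetical sample: four publications, two of which cite web content.
sample = [
    ["http://www.example.org/games", "https://example.org/legacy"],
    [],
    ["http://example.com/news"],
    [],
]
all_urls = [url for urls in sample for url in urls]
print(share_citing_web(sample))
print(domain_counts(all_urls).most_common(2))
```

Checking whether each URL still resolves on the live web or has an archived copy would need network requests and is left out of this sketch.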
SUNY Polytechnic Institute
Blogging on September 11, 2001: demonstrating a toolkit to facilitate scholarly analysis of objects in web archives
This presentation details and demonstrates a toolkit designed to (1) facilitate scholarly analysis of objects in Web archives, (2) distribute results of these analyses within the scholarly community, and (3) create repositories detailing processes and sharing data created and collected during these analyses.
The toolkit functions as an interface and analysis tool by allowing researchers to navigate among selected archived objects and display both indexed (calculated) and researcher-generated data describing archived objects. Objects can be analyzed using illustration, annotation, and/or categorization. Screengrabs are associated with archived objects through tagging and linking. Open-ended annotations and categorization of objects (using values from pre-existing and/or on-the-fly categorization and classification schemes) are supported. Datasets can be generated and exported for analysis using external tools, and ingested with full association with archived objects from analyses conducted using external tools.
Scholars can use the toolkit to prepare and distribute narratives describing their analyses. These narratives can contain stable links and citations to archived objects, including generating tables and figures highlighting and linking to archived objects meeting specific analytic criteria. Journal articles and conference presentations can be prepared and shared in standard formats, including support of integrated bibliographic referencing. Finally, the toolkit facilitates preparation for deposit in a repository of scholarly work by integrating all aspects of a project within a single framework, and simplifying the process of determining and selecting components of the research project to be included in the repository.
The presentation concludes with a demonstration of an analysis of archived Web objects. In this analysis (drawn from pages archived in the US Library of Congress September 11 Web Archive), blog posts from the days immediately following the September 11 Terrorist Attacks in the United States are examined and explored using a concordance strategy identifying major themes and shared vocabularies. The presentation highlights the use of the toolkit as an interface to the pages identified as objects of study; as a tool to facilitate the concordance analysis; as a platform to present data about the archived pages and to share narratives describing the results of the analysis; and as comprehensive documentation of the project, including Institutional Review Board submission, raw data, analysis tools and final reports, that could be deposited in an archive of scholarly research.
Università di Bologna & Universität Mannheim
Two different approaches for collecting, analysing and selecting primary sources from web archive collections
At the conference, I intend to present two different ways of collecting, analysing and selecting primary sources, when dealing with web archive materials for contemporary history research. Initially, I will offer an overview of the work I’ve conducted during my PhD at the International Centre for the History of Universities and Science (Univ. of Bologna) on identifying, assessing the reliability of, and selecting primary evidence on the recent past of academic institutions. The methodology I adopted in this work is mainly a combination of practices from traditional historical research (using archival information, conducting oral interviews, examining printed and digital newspapers) and methods from the field of internet studies (assessing the reliability of the Internet Archive, retrieving snapshots from National web archives, etc).
Next, I intend to present a completely different approach for collecting primary sources. At the University of Mannheim, I am currently working on a new solution for creating event-collections from large scale web archives in order to support research on the international relations of the United States.
As a matter of fact, the topic-specific collections that web archives are currently offering (on Archive-It for example) often share crucial limitations: a) they are small in number; b) the selection process is not always transparent; c) they generally offer only documents that are precisely related to an event but lack information on background stories as well as contextual clues. The latter in particular is a crucial issue for humanities and social science analyses. For these reasons, at the University of Mannheim we are currently developing a solution for creating event collections that identifies not only the core documents related to the event itself, but most importantly sub-groups of documents which describe related aspects. We do so by adopting a combination of methods from the fields of natural language processing (e.g. entity linking) and information retrieval, in order to perform an expansion of the collecting process that is informed by latently relevant concepts and entities from a knowledge base, whose presence in documents is interpreted as one of many indicators of relevance.
Presenting and comparing these two approaches will allow me to remark on how important it is for a historian to learn to deal critically with both the scarcity and the abundance of born-digital sources when studying the recent past.
Emily Maemura, University of Toronto, Christoph Becker & Ian Milligan, University of Waterloo
Data, process, and results: connecting web archival research elements
The emergence of web archives as key sources in the study of social and cultural phenomena troubles traditional conceptions of archival research. The application of novel computational methods requires a new understanding of how such research projects process web archival material to construct their findings. Three critical factors emerge: the interrogation of sources (understanding how web archives were created to judge their adequacy, appropriateness, and limitations); new computational methods (including which data or workflows can be reused); and the transparency of the research process (findings are dependent on the validity of computational methods and data adequacy). As part of an initiative that develops a stronger framework to understand methods and processes for studying the archived web, this workshop will introduce a framework we have developed to the RESAW community and provide a forum to raise and discuss methodological issues. The Research Object framework (Figure 1) provides a conceptual perspective for describing and analysing the methods used, enabling scholars to document their practices systematically and improve transparency of their methods. Documenting the research process in this way allows for more transparent choices to be made in the process, contributes to a better understanding of the findings and their provenance, and supports possible reuse of data, methods, and workflows. This 90-minute workshop will bring interested attendees together to document their workflows and discuss systematic, transparent workflows.
- It begins with a short, 10-minute introduction to the research object framework;
- Three groups, each led by one organizer, will use about 45 minutes to describe and place one of their own projects into the framework, using pre-printed large scale boards to facilitate this work.
- The organizers will facilitate a directed conversation about the application of methods, including suggestions to be made around standard practices within the web archival research community.
Following the workshop, we will invite the participants to continue the discussion and will share the results openly. The community engagement will provide invaluable ground for furthering the community’s understanding of research methods and developing shared vocabularies and ways of thinking about this new paradigm.
University of Michigan Library
Diving in: strategies for teaching with web archives
While web archiving has made significant gains as a practice, and web archives are being actively mined by researchers, classroom use has so far been limited. This limited use is in part due to the difficulty of navigating web archives, but also to the fact that teaching faculty and librarians have devoted less attention to articulating pedagogical approaches to leveraging web archives in the classroom. This workshop draws on a pedagogy of project-based learning and exploration to model multi-modal approaches to teaching with web archives in ways that encourage students to think critically about images, texts, politics, and culture. This workshop would invite a conversation about teaching with web archives from a broad range of disciplinary approaches, including composition and rhetoric, cultural studies, communications, visual culture, history, and literary studies. The emphasis for participants would be on developing approaches to teaching with web archives from their own disciplinary perspective.
This workshop would have three main elements: in the first section, the workshop leader would lay out a pedagogical framework for teaching with web archives that takes into account diverse learning goals and potential assignments. The major elements of the framework involve considering web archives as objects of analysis, sources of information, models for learning goals, and as sandbox spaces for exploration. In the second section, workshop participants would work through a series of exercises aimed at exploring web archives within the pedagogical framework laid out in the first section. In the final segment, workshop participants would discuss their findings and have an opportunity to gain insights from the other members of the group. The emphasis in this workshop would be on expanding the possibilities for designing assignments and in-class projects in which web archives would offer a space for student exploration and critical analysis.
The rationale for offering this as a workshop, as opposed to a paper, is that creating an opportunity for participants to try out a hands-on exploration of web archives within a pedagogical framework is key to developing new possibilities for teaching with these materials. While the workshop leader brings expertise in teaching and librarianship, the greatest gains from the session will come from diving into web archives collectively and with teaching in mind.
The British Library
The UK Web Archive SHINE dataset as a research tool
I propose a short 45-60 minute presentation and workshop demonstrating how researchers with no previous knowledge of web archives and no programming or specialist computing skills can obtain interesting and meaningful results.
This workshop is specifically aimed at researchers who may not have used web archives in their research and would like an introduction to how useful they can be.
I will give a very short outline of what collections are available as part of the UK Web Archive and outline some of the challenges there are in using them.
I will then work through some case studies of real research of the SHINE dataset and how it has resulted in demonstrable results or taken research in new directions.
There will be an opportunity for participation from the audience with their own questions and research areas.
Tommi Jauhiainen & Heidi Jauhiainen, The University of Helsinki, Petteri Veikkolainen, The National Library of Finland
Language identification for creating national web archives
The project “The Finno-Ugric Languages and The Internet” started at the beginning of 2013 as part of the Kone Foundation Language Programme. It is based at the University of Helsinki and is part of the international CLARIN cooperation. One of the main tasks of the project is to crawl the internet, using Heritrix, and gather texts written in small Uralic languages. The largest Uralic languages, Hungarian, Finnish, and Estonian, are outside the scope of the project. As a side project, funded by the Finnish National Library, we have collected the links identified as written in Finnish from outside the .fi-domain, which is the scope of the web archiving crawl done yearly by the National Library itself. The language identification, which is done during the initial crawling, analyzes only a small portion (three 100-character snippets) of a page. Those pages which were initially identified as Finnish were downloaded again and the whole text was analyzed. If the language identified for the whole page was still Finnish, the address of the page was added to the list used to seed the actual archival crawl run by the National Library. The National Library also provided us with the list of links going out from their .fi-domain crawl. We downloaded the pages behind those links and identified their languages as well. Since the first crawls we have improved the identifier software, and it can now also correctly analyze multilingual documents. The identifier runs as a separate service, which the crawler calls during the HTML or PDF extraction phase. The language identifier uses a state-of-the-art method developed within the project. We participated in the Discriminating Between Similar Languages shared task in 2015 and 2016. In 2015 our method ranked fourth in the closed track, surpassed only by methods based on support vector machines (SVMs). As of today, there is no language identifier software package available based on SVMs. This could be because all the SVM-based language identifiers have used off-the-shelf SVM tools, which would be difficult to incorporate into one package. In the 2016 shared task our method ranked second in the closed track, even though the number of participating teams increased from 9 to 17. We will publish the language identifier service and the modifications we have made to Heritrix (3.1) as open source before the end of the project (2018).
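The two-stage check described in this abstract (a quick decision from three 100-character snippets, then confirmation on the whole text before a URL is added to the seed list) can be sketched as follows. This is an illustration, not the project's software: the function names are hypothetical, the abstract does not say where in a page the snippets are taken from (even spacing is assumed here), and `toy_identify` is a deliberately crude stand-in for the project's identifier.

```python
def sample_snippets(text, count=3, length=100):
    """Return `count` snippets of `length` characters spread evenly over `text`.

    Even spacing is an assumption; the abstract only gives the number and
    size of the snippets.
    """
    if len(text) <= length:
        return [text]
    if count == 1:
        return [text[:length]]
    last_start = len(text) - length
    starts = [round(i * last_start / (count - 1)) for i in range(count)]
    return [text[s:s + length] for s in starts]


def two_stage_filter(pages, identify, target="fi"):
    """Keep URLs whose snippets AND whole text are identified as `target`.

    `pages` maps URL -> full page text; `identify` maps text -> language code.
    """
    seeds = []
    for url, text in pages.items():
        # Stage 1: cheap check on a small sample of the page.
        if identify(" ".join(sample_snippets(text))) != target:
            continue
        # Stage 2: confirm on the whole text before seeding the archival crawl.
        if identify(text) == target:
            seeds.append(url)
    return seeds


def toy_identify(text):
    """Stand-in identifier that counts a few function words. Illustration only."""
    words = text.lower().split()
    finnish = sum(w in {"ja", "on", "ei", "että"} for w in words)
    english = sum(w in {"the", "and", "is", "of"} for w in words)
    return "fi" if finnish > english else "other"


pages = {
    "http://example.org/a": "tämä on hyvä ja se on uusi " * 20,
    "http://example.org/b": "this is the page and more of it " * 20,
}
print(two_stage_filter(pages, toy_identify))
```

The second stage exists because a small sample can mislead: a page with Finnish text only at the sampled positions would pass the first check and be rejected by the second.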
Los Alamos National Laboratory
Robust links – a proposed solution to reference rot in scholarly communication
With the dynamic character of the web, resources disappear and their content changes frequently. This is problematic, especially when these resources are referenced in scholarly articles where we are used to stable citations. Our research (http://dx.doi.org/10.1371/journal.pone.0115253) confirms that authors increasingly reference such resources, for example, project websites, online debates, presentations, blogs, videos, etc. We therefore increasingly see cases of link rot and content drift – in combination, referred to as reference rot – in scholarly articles.
In this poster, we present robust links (http://robustlinks.mementoweb.org/spec/) – an approach to increase the level of persistence for the scholarly context. It consists of a) archiving referenced web resources utilizing existing web archiving infrastructure and b) using link decoration to convey additional information about a reference. To decorate a link, the URI of the archived copy as well as the archival datetime is added to the original URI of the reference. These three data points are sufficient to make a reference more robust. For example, if, some time after the publication of the referencing article, the referenced resource on the live web is subject to reference rot, a reader can refer to the archived copy. If, in addition, the web archive in which the archival copy was deposited becomes temporarily or permanently unavailable, the reader can still use the original URI and the archival datetime to check – manually or using the Memento infrastructure – for copies in other web archives.
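As a rough illustration of link decoration, the sketch below builds an HTML anchor that carries the three data points named above: the original URI, the URI of the archived copy, and the archival datetime. The attribute names `data-versionurl` and `data-versiondate` are used here as an assumption and should be checked against the Robust Links specification; the function name and the example URIs are hypothetical.

```python
from html import escape


def robust_link(original_uri, archived_uri, archived_at, text):
    """Return an anchor decorated with the archived copy and its datetime.

    The href keeps pointing at the original resource; the two data-
    attributes (names assumed, see the specification) carry the fallback.
    """
    return (
        f'<a href="{escape(original_uri, quote=True)}" '
        f'data-versionurl="{escape(archived_uri, quote=True)}" '
        f'data-versiondate="{escape(archived_at, quote=True)}">'
        f"{escape(text)}</a>"
    )


# Hypothetical reference with a hypothetical archived copy.
print(robust_link(
    "http://example.com/project",
    "https://archive.example.org/20151101/http://example.com/project",
    "2015-11-01",
    "Project website",
))
```

Because the original URI and the datetime travel with the link, a reader or a script can still look for a copy in another archive if the one named in the link is unavailable.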
We will demonstrate robust links by the example of a recently published paper (http://dx.doi.org/10.1045/november2015-vandesompel).
Shawn M. Jones, Herbert Van de Sompel, Lyudmila Balakireva, Martin Klein, Harihar Shankar, Los Alamos National Laboratory and Michael L. Nelson, Old Dominion University
Uniform access to raw mementos
Web archives crawl the web to create archived versions of web pages, hereafter called mementos. Most web archives augment these mementos when presenting them to the user, often for usability or legal purposes. Additionally, some archives rewrite links to allow navigation within an archive. This way the end user can visit other pages within the same archive from the same time period.
In many cases, access to the original, unaltered “raw” content is needed. For example, projects like oldweb.today and the Reconstruct feature available at timetravel.mementoweb.org require raw mementos to replay the original content as it existed at the time of capture, without any archive-specific augmentations. In addition, various research studies require the original HTTP response headers of the mementos.
Currently, this raw content is available at many archives using special URIs. In order to acquire the unaltered content a client must know with which archive it is communicating and which special URIs to use. This can create problems when an archive changes software or configuration, requiring such clients to be updated.
We seek to eliminate the need for these archive-specific or software-specific heuristics. We propose a uniform way to request these raw mementos, regardless of web archive software or configuration. Our proposal uses the Prefer HTTP request header and the Preference-Applied response header from RFC 7240.
For example, a client interested in the original content at the time of capture would issue an HTTP request to the desired memento URI with the Prefer header set to the value of “original-content”. If the archive can satisfy this request, it will return a memento containing the original content in the response and denote this by including the value “original-content” in the Preference-Applied header of its response.
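A client-side sketch of this exchange, using only the Python standard library, might look like the following. It builds the request without sending it; the function names and the memento URI are hypothetical, and the preference value is the one quoted above.

```python
from urllib.request import Request


def raw_memento_request(memento_uri, preference="original-content"):
    """Build a request for a memento that asks for the original content."""
    return Request(memento_uri, headers={"Prefer": preference})


def preference_applied(response_headers, preference="original-content"):
    """Check whether the archive reports that it honoured the preference.

    `response_headers` is any mapping with a .get() method. A plain dict is
    case-sensitive, unlike the header objects returned by real HTTP clients.
    """
    applied = response_headers.get("Preference-Applied", "")
    return preference in [value.strip() for value in applied.split(",")]


# Hypothetical memento URI; nothing is sent over the network here.
req = raw_memento_request(
    "http://archive.example.org/web/20160101000000/http://example.com/"
)
print(req.get_header("Prefer"))
print(preference_applied({"Preference-Applied": "original-content"}))
```

A client that finds the preference was not applied knows that the response may still contain archive-specific augmentations and can handle it accordingly.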
Our poster illustrates the proposed approach by means of an example and provides a list of raw-ness preferences that are currently being discussed by the community.
Frick Art Reference Library
NYARC discovery: promoting integrated access to web archive collections
NYARC Discovery (http://discovery.nyarc.org) is a research tool from The New York Art Resources Consortium (NYARC), consisting of the research libraries of the Brooklyn Museum, the Frick Art Reference Library of The Frick Collection, and The Museum of Modern Art. NYARC Discovery was launched earlier this year and continues to be developed to unite scholarly resources into a single search environment. NYARC Discovery was built on ExLibris Primo technology and was made possible through the generous support of a grant from The Andrew W. Mellon Foundation. Researchers can now discover NYARC’s Archive-It collections, along with books and e-books, journal titles, periodicals, auction catalogs, traditional archives, photoarchives, dissertations, images, and other freely available electronic resources, in an integrated search interface.
NYARC’s web archives include the consortium’s institutional website collections and six thematic collections pertaining to art and art history: art resources, artists’ websites, auction houses, catalogues raisonnés, New York City galleries and art dealers, and websites related to restitution scholarship for lost or looted art.
A search in NYARC Discovery defaults to include a single full-text search result from the NYARC web archive collections. Links to the full web archives result set are provided for further exploration on the Archive-It interface (www.nyarc.org/webarchive). Archived websites additionally receive individual catalog records in Discovery, Arcade (the online consortial library catalog for the three NYARC libraries), and OCLC’s Worldcat database, with records providing links to both the live and archived versions of the sites.
The next two phases of investigation for web archive collection content in NYARC Discovery will pertain to researcher usage and expanded web archive content offerings. NYARC will test the potential integration of additional web archive collections into the NYARC Discovery single search environment, as well as design and conduct usability studies to assess researcher access of the web archive collection content being made available via this new research tool.