Game Walkthroughs and Web Archiving

Project lead: Michael L. Nelson, Old Dominion University Department of Computer Science

Project partners: Los Alamos National Laboratory Research Library

 

Funding: 10,000 USD

Resources:

 

 


Brief description of the project

The prevailing idiom for accessing archived web pages is to operate on a single URL: either going to a web archive itself (e.g., web.archive.org, arquivo.pt) and entering a URL to look up, or following a link to an archived web page (e.g., references in Wikipedia frequently link to web archives). But this is not how users experience the web: actual people experience web sessions, a series of URLs generated by interacting with the site itself. Many of these URLs, in some combination: are complex and opaque (e.g., Google Maps, Amazon items), do not update in response to JavaScript interaction with the site (e.g., Twitter timelines, Google Docs), are personalized (e.g., Facebook & Instagram timelines), or are deep links within a site that users nonetheless consider a top-level destination (e.g., GitHub repositories, scholarly papers in a digital library). We will explore a proof of concept that uses gaming concepts and platforms to inform web archiving. This will involve creating game walkthroughs of automated and human-driven web archiving sessions to capture the look and feel of sites and of the sessions on those sites, and integrating the walkthroughs into playback systems such as pywb. We will also explore broadcasting browser-based crawling sessions (e.g., Webrecorder/Conifer, Brozzler) via gaming platforms such as Twitch, YouTube Gaming, and Facebook Gaming.

Goals, outcomes, and deliverables

The goal of this project is to explore possible synergy between gaming concepts, platforms, and technologies and those of web archiving. The idea of game walkthroughs for web archiving was first explored in 2013 (“Game Walkthroughs As A Metaphor for Web Preservation”), but only recently has there been a confluence of enabling technologies: browser-based web archiving and gaming as streaming entertainment. We do not know for sure that this synergy will result in new development directions for the web archiving community, so we are asking for support of a small exploratory project to investigate this possibility. We will deliver code, demos, and lessons learned to support:

  • Indexing a series of URLs in pywb that correspond to the appropriate segments in a video
  • Supporting archiving as entertainment: streaming automated and/or human-driven Webrecorder/Conifer and Brozzler sessions through game streaming platforms (e.g., Twitch, YouTube Gaming, Facebook Gaming).

How the project furthers the IIPC strategic plan

We will explore a new approach to web archiving, which will support the IIPC’s mission of “developing and sharing tools as well as best practices for Internet preservation.” The goals that this project will support are: A1 “identify and develop best practices…”, B1 “provide a forum for the sharing…”, B2 “develop and recommend standards”, B3 “facilitate the development…”, and B4 “raise awareness…”.

Regarding the strategic priorities, this project will support “Maintain and develop tools” by exploring the integration of existing tools (e.g., pywb, Webrecorder/Conifer, Brozzler) with popular gaming platforms. It will also support “Promote Usage of Web Archives” and “Advocate for Web Preservation” by 1) making web archiving session-based, and 2) making web archiving more visible to the public, aligning it with entertainment trends. Web archiving may never be as popular as Minecraft or Grand Theft Auto, but “citizen web archiving” should support passive observation of archiving as entertainment as well as individual capability for web archiving. We should be able to watch “archiving superstars” (both individuals and organizations) do their work, and to save, index, and share the results of that work.

Detailed description of the project

Work phases: We anticipate three main phases:

Phase 1: Evaluate the existing range of browser-based capture tools by establishing both automated and human-driven web capture sessions that generate standard WARC files. The goal is not fast headless browsing, but rather ordinary, observable browsing that can be broadcast and that produces video as a result. We will establish a ground truth data set of pages to be crawled that are challenging to crawl (i.e., require a browser to render all JavaScript), potentially interesting to observe (e.g., Instagram, Facebook), and involve logins, geoip, and other considerations that affect personalization of the returned representations. This data set will likely include archived pages as well, so that the crawling process can be repeated reliably in a way that the more interesting but more variable live web does not allow.
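
To make Phase 1 concrete, here is a minimal sketch of a scripted but observable capture session. It assumes a pywb collection running locally in record mode (so that pages visited through its record endpoint are written to WARC files) and uses Selenium to drive a visible, headful browser; the seed URLs, endpoint path, window size, and interaction pattern are illustrative assumptions rather than a finished design.

```python
# Sketch only: drive a visible browser through a hypothetical seed list while
# an assumed local pywb collection in record mode writes the traffic to WARCs.
import time

from selenium import webdriver

# Assumed pywb record endpoint (e.g., after `wb-manager init walkthrough` and
# starting pywb with recording enabled); adjust to the actual setup.
RECORD_PREFIX = "http://localhost:8080/walkthrough/record/"

# Hypothetical ground-truth seeds: hard to crawl, interesting to watch.
SEEDS = [
    "https://www.instagram.com/explore/",
    "https://twitter.com/search?q=web%20archiving",
]

driver = webdriver.Chrome()        # headful on purpose: the session is meant to be watched
driver.set_window_size(1280, 720)  # match the intended stream/video resolution

for url in SEEDS:
    driver.get(RECORD_PREFIX + url)  # fetch the page through the recorder
    for _ in range(5):               # simple "walkthrough" interaction: slow scrolling
        driver.execute_script("window.scrollBy(0, 600);")
        time.sleep(2)                # slow enough for viewers to follow

driver.quit()
```

The same headful browser window is what would be screen-captured or streamed in Phase 2.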

Phase 2: We will establish channels on Twitch, YouTube Gaming, Facebook Gaming, and any other relevant platforms for broadcasting content to a wide audience. We already have some experience with these platforms, but we have not yet tried integrating web crawlers and archiving tools with them. We will schedule and announce test crawls of our data set to our community (though the streams will be broadly available to all).
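
As one possible integration path, the sketch below pushes a screen capture of the desktop running the recording browser to Twitch's standard RTMP ingest via ffmpeg; the stream key, display identifier, and encoding settings are placeholders, the other platforms have analogous ingest URLs, and a desktop tool such as OBS Studio could fill the same role.

```python
# Sketch only: broadcast an X11 desktop (showing the capture session) to Twitch.
# Assumes a Linux machine with ffmpeg installed and the browser on display :0.
import subprocess

STREAM_KEY = "live_XXXXXXXX"  # placeholder: obtained from the Twitch dashboard

cmd = [
    "ffmpeg",
    "-f", "x11grab", "-video_size", "1280x720", "-framerate", "30",
    "-i", ":0.0",                       # grab the desktop showing the browser
    "-c:v", "libx264", "-preset", "veryfast",
    "-b:v", "3000k", "-maxrate", "3000k", "-bufsize", "6000k",
    "-pix_fmt", "yuv420p", "-g", "60",  # keyframe every ~2 s at 30 fps
    "-f", "flv", f"rtmp://live.twitch.tv/app/{STREAM_KEY}",
]
subprocess.run(cmd, check=True)
```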

Phase 3: Once we have videos of our capture sessions, along with all the constituent URLs that compose a session, we will work on indexing those URLs into open source replay engines such as pywb. Encoding the URLs in WARC files for easy indexing is a likely approach, but we will test and evaluate other methods as well. We will also explore representing the URLs not as full URLs but as URI keys, as introduced in the MementoMap project (which derives from an IIPC-funded project), so that the representations can be compressed and multitudes of URLs can be considered a “hit” when doing a URL lookup in pywb.
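
A minimal sketch of this indexing idea (not pywb's current behavior): each URL captured during a session is written as a CDXJ-style line whose JSON block carries two hypothetical extra fields, video and video_offset, pointing into the walkthrough recording. The surt package supplies the canonicalized key.

```python
# Sketch only: emit CDXJ-style lines that link captured URLs to moments in a
# session video; "video" and "video_offset" are hypothetical fields, not part
# of the standard CDXJ records that pywb indexes today.
import json

from surt import surt  # SURT canonicalization, as used by several archiving tools

def walkthrough_cdxj(url, timestamp, video_file, offset_seconds):
    """Return one CDXJ-style line tying a captured URL to a video offset."""
    fields = {
        "url": url,
        "video": video_file,             # hypothetical: which session recording
        "video_offset": offset_seconds,  # hypothetical: seconds into that recording
    }
    return f"{surt(url)} {timestamp} {json.dumps(fields)}"

# Prints a SURT-keyed line such as:
#   com,instagram)/... 20240101120000 {"url": ..., "video": "session-001.mp4", ...}
# A MementoMap-style wildcard key like com,instagram)/* could summarize many such lines.
print(walkthrough_cdxj("https://www.instagram.com/explore/",
                       "20240101120000", "session-001.mp4", 201))
```

pywb would still need to be taught to surface such fields at replay time; that is part of what this phase will evaluate.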

Measurement & evaluation: We envision this strictly as a proof of concept: can we do this and does the result look promising for further investigation? Will the IIPC community be intrigued when we share the results? Will anyone watch the channel(s) once they’re set up? We are aiming for the novelty of the approach to generate a “buzz” around the concept, but success will not depend on achieving a certain number of viewers or subscribers. We will focus on implementing the capabilities, documenting how others can set up their own archiving channels on the gaming platforms, and sharing our lessons learned at the IIPC General Assembly and other relevant conferences, publications, and social media.

Technologies: Relevant technologies will likely include pywb, Webrecorder/Conifer, Brozzler, Squidwarc, Twitch, YouTube Gaming, and Facebook Gaming. We may discover other technologies as we progress. We anticipate that all platforms (e.g., Twitch) will be free to use and that the associated tools will have open source licenses. All tools, scripts, utilities, etc. that we develop will be released as open source via our GitHub account.

Communications: Michele Weigle and Michael Nelson are already in daily communication on a number of teaching and research issues. They co-advise the identified student and meet at least weekly to review progress. Michele and Michael work with Martin on many existing projects and communicate several times a week. This will be a candidate Ph.D. research topic for Travis Reid, so work on it will be covered as part of our regular advising. We will coordinate with Martin / LANL to explore issues of off-site (i.e., not odu.edu or regional ISPs) crawling and archiving, for example differences in geoip while crawling and archiving sites (Norfolk, VA vs. Santa Fe, NM). We are not sure that LANL has an interest in gaming per se, but we do know that they have an interest in preserving web archiving sessions that capture the look and feel of sites that might otherwise be lost when replaying.

Risks: This is an inherently risky project, and that is why we are asking for a “seed grant”, the lowest level of grant possible: it is exploratory and might not work. The integration of indexing URLs pointing to videos via pywb might not go smoothly, and URL discovery could remain a problem. There could be unforeseen problems integrating Webrecorder et al. with Twitch et al. We have a student (Travis Reid) assigned to the project who is both an active gamer and a Ph.D. student studying web archiving, so we have experience in both camps. On the other hand, it is possible that one or more of the platforms will not want our content, perhaps because it is a boring “game”, because of copyright misunderstandings, or because some of the pages that end up being crawled violate their terms of service (porn, hate speech, etc.). We hope to mitigate the risk of offensive pages with our ground truth data set.

Expected impact: If it works, then we are on to something. We do not have to become popular “gamers” on these platforms for this project to be a success; we only need to prove that we, the web archiving community, can do our work in a publicly observable space in a manner that engages audiences that might not otherwise be aware of the Wayback Machine, pywb, Webrecorder, Heritrix, Brozzler, WARC files, etc. Those audiences are, however, familiar with walkthroughs, YouTube, etc., and they do have a nuanced sense of what it means to faithfully preserve an online experience. The web archiving community has long had marginal support for screenshots as page captures (e.g., pages captured as screenshots in both archive.is and web.archive.org), but these do not capture the dynamism of a session. A successful project will allow us to better understand the issues involved in integrating gaming and archiving, to pursue larger funding from the NSF or IMLS, and to energize the web archiving community by expanding web replay capability to include web site walkthroughs.

Project schedule of completion

Month 2: Complete the ground truth data set of URLs to crawl & archive; evaluate which archiving software we will use.

Month 4: Finish evaluating which game streaming platforms to use, documenting how easily they integrate with the archiving software, lessons learned in establishing a stream, how to download the resulting video, and each platform's retention policy (e.g., some are popularity-based and some allow indefinite retention). Perform private streaming sessions.

Month 6: Begin public streaming sessions, including “dueling” sessions from ODU and LANL. Start by gathering feedback from IIPC and other sympathetic audiences, and, if successful, advertise more broadly to generate “buzz”.

Month 9: Implement and evaluate prototypes for how the crawled URLs (e.g., from Google Maps or Twitter) correspond to time offsets in the resulting session videos, and how they can complement discovery and replay of archived pages in the conventional Wayback Machine idiom.
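
The replay side of such a prototype could be quite simple; under the same assumptions as the Phase 3 sketch, a lookup hit carrying the hypothetical video and video_offset fields can be turned into a W3C Media Fragments URI so that the player opens the session video at the moment the URL was captured.

```python
# Sketch only: resolve an index hit (with the hypothetical fields above) into a
# Media Fragments URI (#t=<seconds>) that seeks the walkthrough video.
def video_fragment(base_url, hit):
    """Build a link that opens the session video at the captured moment."""
    return f"{base_url}/{hit['video']}#t={hit['video_offset']}"

hit = {"video": "session-001.mp4", "video_offset": 201}
print(video_fragment("https://example.org/walkthroughs", hit))
# -> https://example.org/walkthroughs/session-001.mp4#t=201
```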

Month 13: After the project has completed, write the final report, submit a presentation for the IIPC GA, share code, data, etc. via GitHub, produce scholarly publications (e.g., JCDL, iPRES), and, informed by the lessons learned, submit a more ambitious funding proposal to the IMLS, NSF, etc.