11:00 – 11:20

Mark Phillips: Leveraging machine learning to extract content-rich publications from web archives

Mark Phillips, University of North Texas Libraries
Cornelia Caragea, University of Illinois at Chicago
Krutarth Patel, Kansas State University
Nathan Fox, University of North Texas

The University of North Texas (UNT) Libraries in partnership with the University of Illinois at Chicago were awarded a National Leadership Grant (IMLS:LG-71-17-0202-17) from the Institute of Museum and Library Services (IMLS) to research the efficacy of using machine-learning algorithms to identify and extract content-rich publications contained in web archives.

As more institutions collect web-published content into web archives, there has been growing interest in mining these archives to extract publications or documents that align with existing collections or collection development policies. The identified publications could then be integrated into existing digital library collections, where they would become first-order digital objects instead of content discoverable only by traversing the web archive or through a well-crafted full-text search. This project focuses on the first piece of this workflow: identifying the publications that exist and separating them from content that does not align with existing collections.

To operationalize this research, the project focuses on three primary use cases: extracting scholarly publications for an institutional repository from a university domain’s web archive (the unt.edu domain), extracting state documents from a state-level domain crawl (the texas.gov domain crawl), and extracting technical reports from the web presence of a federal agency (usda.gov content from the End of Term 2008 web archive).

The project is organized into two phases. The first increases our understanding of the workflows, practices, and selection criteria of librarians and archivists through ethnographic observations and interviews. The research from this first phase informs the second, in which we use novel machine learning techniques to identify content-rich publications collected in existing web archives.

For the first phase of research, we identified and interviewed individuals who have collected publications from the web, aiming for a group representative of the collection types in our three use cases: institutional repositories, state publications, and federal documents. The interviews and subsequent analysis have helped us better understand the mindset of these selectors and identify potential features to experiment with in our machine learning models as the project moves forward.

The machine learning phase of the research has focused on building a pipeline for running experiments over shareable datasets, created by the project team, that cover the three use cases. We have conducted experiments with both traditional machine learning approaches and newer deep learning and neural network models, and early results have identified where we can focus to improve our overall accuracy in predicting which publications should be reviewed for inclusion in existing collections of web-published documents.
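As an illustration only, the following sketch shows the kind of baseline experiment such a pipeline might run: a labeled sample of candidate documents, a handful of document-level features, and a traditional classifier that flags likely publications for review. The file name, feature names, and labels are hypothetical stand-ins, not the project's actual datasets or features.

    # Hypothetical baseline: flag candidate publications in a web-archive sample for review.
    # The dataset name and feature columns below are illustrative stand-ins.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Assumed columns: is_pdf, url_depth, num_pages, token_count, label (1 = publication)
    df = pd.read_csv("unt_edu_candidates.csv")
    features = ["is_pdf", "url_depth", "num_pages", "token_count"]

    clf = RandomForestClassifier(n_estimators=300, random_state=42)
    scores = cross_val_score(clf, df[features], df["label"], cv=5, scoring="f1")
    print(f"Mean F1 over 5 folds: {scores.mean():.3f}")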

We hope these findings will guide future work that empowers libraries and archives to leverage the rich web archives they have been building and to provide better access to the publications and documents embedded within them.

11:20 – 11:40

Jefferson Bailey & Maria Praetzellis: From Open Access to Perpetual Access: Archiving Web-Published Scholarship

Internet Archive

In 2018, the Internet Archive undertook a large-scale project to build as complete a collection as possible of open scholarly outputs published on the web and to improve the discoverability and accessibility of scholarly works archived as part of past global- and domain-scale web harvests. This project involved several areas of work: targeted harvests of known open access publications; archiving and processing of related identifier and registry services (CrossRef, ISSN, DOAJ, ORCID, etc.); partnerships and joint services with projects working in a similar domain (Unpaywall, CORE, Semantic Scholar); and development of machine learning approaches and training sets for identifying scholarly work in historical domain- and global-scale web collections. The project also identifies and archives associated research outputs such as blogs, datasets, code repositories, and other affiliated research objects. Technical development has included new crawling approaches, system and API development, near-duplicate analysis tools, and other supporting infrastructure.
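As one hedged illustration of how registry and partner services can seed targeted harvests (the project's actual ingest pipeline is not detailed here), the sketch below queries the public Unpaywall API for a DOI and, if an open access PDF location is reported, treats it as a candidate crawl seed. The DOI and contact email are placeholders.

    # Hypothetical seed lookup: ask Unpaywall whether a DOI has an open access PDF to harvest.
    import requests

    DOI = "10.1000/example-doi"      # placeholder DOI
    EMAIL = "archivist@example.org"  # Unpaywall requires a contact email parameter

    resp = requests.get(
        f"https://api.unpaywall.org/v2/{DOI}", params={"email": EMAIL}, timeout=30
    )
    resp.raise_for_status()
    record = resp.json()

    location = record.get("best_oa_location") or {}
    pdf_url = location.get("url_for_pdf")
    if pdf_url:
        print(f"Candidate crawl seed: {pdf_url}")
    else:
        print("No open access PDF reported for this DOI.")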

Project leads will talk about their work on web harvesting, indexing, and access, the role of artificial intelligence and machine learning in these projects, joint service provisioning, and their collaborative work and partnership development with libraries, publishers, and non-profit organizations furthering the open infrastructure movement. The project will demonstrate how adding automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of building specialized curation and ingest systems, the project has worked to identify the scholarly content already collected in general web collections, both those of the Internet Archive and of collaborating partners, and to implement automated systems that ensure at-risk scholarly outputs on the web are well collected and associated with the appropriate metadata.

Conceptually, the project demonstrates that the scalability and technology of “web archiving” can facilitate automated content ingest and deposit strategies for specific types or domains of resources (in this case scholarly publishing, but also datasets, nanopublications, audio-video, or other non-documentary resources) that have traditionally been collected via more bespoke and manual workflows. Repositioning web collecting as an extensible, default technical approach to acquisition for all types of content has the potential to reframe the practice of web archiving as crucial to all areas of digital library and archive activities.

11:40 – 12:00

Fernando Melo: Searching images from the Past with Arquivo.pt

Arquivo.pt – Fundação para a Ciência e a Tecnologia

Arquivo.pt is a research infrastructure that enables search and access to information preserved from the Web since 1996. On the 27th of December 2018, Arquivo.pt made publicly available an experimental image search prototype (https://arquivo.pt/images.jsp?l=en).

This presentation will consist of a brief overview of the workflow of the Arquivo.pt image search system, followed by a demo and a presentation of initial usage statistics. Arquivo.pt image search enables users to input a text query and receive a set of image results that were embedded in web-archived pages.

The workflow of Arquivo.pt image search consists of three main steps:

  1. Image extraction from ARC/WARC files;
  2. Image classification;
  3. Solr indexing.

In step 1, images are extracted from ARC/WARC files. The input is a set of ARC/WARC files and the output is a set of JSON image indexes.

Each image index holds information about a specific image, such as its source URL, title, crawl timestamp, and dimensions in pixels, as well as information about the page in which the image was embedded, such as the page URL, page timestamp, and page title. Arquivo.pt uses a Hadoop 3 cluster to process large collections of ARC/WARC files and a sharded MongoDB cluster to store the image indexes.
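As a rough illustration of step 1 (not the actual Arquivo.pt extractor, which runs on a Hadoop cluster), the sketch below reads a WARC file with the warcio library and emits one JSON record per image response; the field names are chosen here for illustration.

    # Minimal sketch of step 1: extract image records from a WARC file as JSON "image indexes".
    # Field names are illustrative; the real indexes also carry page-level context.
    import json
    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.gz", "rb") as stream:  # placeholder file name
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if not content_type.startswith("image/"):
                continue
            index = {
                "imgSrc": record.rec_headers.get_header("WARC-Target-URI"),
                "imgTstamp": record.rec_headers.get_header("WARC-Date"),
                "imgMimeType": content_type,
                "imgSizeBytes": len(record.content_stream().read()),
            }
            print(json.dumps(index))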

In step 2, the extracted images are passed to a GPU cluster and automatically scored for safety on a scale from 0.0 to 1.0 using neural networks. The input is the set of JSON image indexes from step 1, and the output is again a set of JSON image indexes, with a safe field added to each one. All images with a safe score below 0.5 are considered Not Safe for Work and are hidden in default image searches.
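A hedged sketch of this post-processing step is shown below; the neural network used by Arquivo.pt is not specified here, so score_safe_for_work is a hypothetical stand-in for whatever model produces the score.

    # Sketch of step 2: add a "safe" score to each line-delimited JSON image index.
    # score_safe_for_work() is a hypothetical stand-in for the GPU/neural-network classifier.
    import json

    def score_safe_for_work(image_index: dict) -> float:
        """Placeholder for the classifier; returns a score in [0.0, 1.0]."""
        raise NotImplementedError

    def add_safe_scores(in_path: str, out_path: str) -> None:
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                index = json.loads(line)
                index["safe"] = score_safe_for_work(index)
                # Downstream, images with safe < 0.5 are treated as NSFW and hidden by default.
                fout.write(json.dumps(index) + "\n")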

In step 3, the JSON image indexes obtained from step 2 are indexed using an Apache SolrCloud cluster. The input is a set of JSON image indexes, and the output is a set of Lucene image indexes (used by Solr). Once step 3 is complete, the new images are automatically searchable through the Arquivo.pt image search system.
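A minimal sketch of this indexing step might look like the following, posting line-delimited JSON image indexes to a Solr collection over its standard update endpoint; the Solr URL and collection name are assumptions, not Arquivo.pt's actual configuration.

    # Sketch of step 3: send JSON image indexes to a Solr collection's update handler.
    # The Solr URL and collection name ("images") are placeholders.
    import json
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/images/update"

    def post_batch(docs: list) -> None:
        requests.post(SOLR_UPDATE_URL, json=docs, timeout=60).raise_for_status()

    def index_into_solr(in_path: str, batch_size: int = 500) -> None:
        batch = []
        with open(in_path) as fin:
            for line in fin:
                batch.append(json.loads(line))
                if len(batch) >= batch_size:
                    post_batch(batch)
                    batch = []
        if batch:
            post_batch(batch)
        # Commit so the newly added images become searchable.
        requests.get(SOLR_UPDATE_URL, params={"commit": "true"}, timeout=60).raise_for_status()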

An API to enable automatic access to the image search prototype is under development and openly available for testing (https://github.com/arquivo/pwa-technologies/wiki/ImageSearch-API-v1-(beta)).

12:00 – 12:20

Sara Elshobaky & Youssef Eldakar: Identifying Egyptian Arabic websites using machine learning during a web crawl

Bibliotheca Alexandrina

Identifying Egyptian Arabic websites while crawling is challenging, since most of them are not in the ‘.eg’ domain or are not hosted in Egypt. Generally, a crawl begins with initial seeds of curated URLs from which all possible links are followed recursively. In such a crawl, using content language as a means of deciding what to include could lead to crawling Arabic websites that are not Egyptian, because most Arabic websites use the same Modern Standard Arabic form that all native speakers uniformly understand.

A human curator can often infer a website’s country of origin from the character of its home page. Clues for such a judgement include the topics discussed, calendar differences between the Levant, the Gulf, and other regions, and term usage. For example, the word “bank” is transliterated as-is in some countries, while the formal Arabic translation is used in others.

In the last few years, artificial intelligence, and especially machine learning, has made great strides in helping machines make sense of the context and meaning of data. Various machine learning algorithms can analyze a labeled training dataset in order to build a model; if the model is well designed and trained, it can provide accurate predictions for new, unseen input.

From that perspective, we worked to enhance the quality of Egyptian crawls using machine learning. We started by collecting a few seed URLs from the ‘.eg’ domain and another set from other Arab country domains (e.g., ‘.sa’, ‘.ly’, ‘.iq’). Home pages were harvested and their HTML content parsed to extract plain text. After several pre-processing and normalization phases, features were extracted from the text based on TF-IDF (Term Frequency – Inverse Document Frequency) weights. The extracted features and their labels were used to train a linear classifier, yielding a trained model that can identify whether a newly encountered Arabic website is Egyptian or not.

As a proof of concept, an initial experiment used the Arabic content of 300 URLs, equally divided between Egyptian and non-Egyptian websites. From that dataset, 90% of the URLs were used for training and 10% for testing. The resulting average F1-score is approximately 84%.
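As a hedged sketch of this kind of experiment (the authors' exact pre-processing, vectorizer settings, and choice of linear model are not specified), the snippet below trains a TF-IDF linear classifier on a 90/10 split and reports the F1-score, assuming the page texts and labels have already been collected.

    # Sketch of the described approach: TF-IDF features from home-page text + a linear classifier.
    # `texts` holds the extracted plain text of each page; `labels` marks Egyptian (1) vs. other (0).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_egyptian_classifier(texts: list[str], labels: list[int]):
        # 90/10 train/test split, as in the proof-of-concept experiment above.
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.1, stratify=labels, random_state=0
        )
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(X_train, y_train)
        print("F1 on held-out pages:", f1_score(y_test, model.predict(X_test)))
        return model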

In the future, we plan to enlarge the training dataset and experiment with alternative machine learning algorithms and parameters to improve classification accuracy. In addition, we hope to apply the same method to identify Egyptian websites in other languages.

12:20 – 12:30

Q&A

Machine learning projects