WAC 2021: Workshop: SolrWayback 2
Web archive discovery systems and scaling
Thomas Egense & Toke Eskildsen
The Royal Danish Library
This workshop will
- Present the challenges of building and maintaining a discovery system at web archive scale
- Explain concrete strategies for running Solr at different scales (the same strategies should work for Elasticsearch)
- Provide a forum for sharing experiences and problems with the scale of web archives. Bring your own challenges and we will solve them together!
Participants are expected to have an interest in the scaling of web archive discovery systems.
The Royal Danish Library has provided full-text search and discovery for the Danish Netarchive for several years, lately using SolrWayback. The archive contains 33 billion records, all of which are indexed and available online. Solr is the underlying search engine, and scaling has been both a design criterion and an ongoing challenge.
Indexing (using Web Archive Discovery, https://github.com/ukwa/webarchive-discovery) and searching (using Solr, https://solr.apache.org/) each have their own issues, which can easily compound into larger problems: a setup that works well at a certain size is no guarantee of a working system at 10× that size.