International and national legislation, along with local and institutional policies, often has a profound impact on what web content can be archived and made accessible for research use at cultural heritage institutions.
No two members of the International Internet Preservation Consortium (IIPC) face the same challenges and legal situations. Legal frameworks differ: some members are awaiting legislation, while others operate under legislation or other legal doctrines, such as fair use (in some countries), that permit or mandate web archiving. In the absence of legislation, or where the legal framework is unclear, many web archiving organizations follow a permissions-based approach.
Many IIPC organizations are able to (or mandated to) collect and preserve web content created in their countries and by their countries' citizens. Some of the challenges for these organizations include:
- Access to content. Some may only allow researchers to use the archives on the library premises, in some cases because of privacy laws or concerns. There is hope that some may extend the concept of “premises” to include other branches or partner libraries, to allow for broader access.
- Laws covering some types of electronic content do not always extend to websites.
- Identifying what falls within the scope of a domain. Some examples: In France, the .fr domain is an obvious indicator, but in reality only one third of French websites are on .fr. Fortunately, French law does not specify that only .fr content must be collected, so the library can preserve content on other domains that is produced by French citizens. To find this content, the library researches where the creator of a site lives and follows a written framework that can be shared if its decision to archive is challenged. Denmark follows similar practices, archiving content that is from or about Denmark and designed for a Danish audience.
- Legal deposit law, in some cases, allows institutions to request passwords and technical information for subscription content or other material that cannot be collected by automatic harvesting. In some cases, publishers can also deposit files directly instead of having the institution harvest them. Different access conditions may apply, depending on whether the content was freely accessible online or not.
For IIPC members following a permissions-based approach, challenges include:
- Lack of response from site owners. Members seeking permission reported a 30-50% response rate. Site owners are generally not denying permission; they simply do not respond to attempts to contact them.
- Patchy, unbalanced collections as a result of permissions not granted.
- Determining whether third-party rights need to be secured.
- The tremendous effort required to contact site owners and notify or obtain permission can sometimes overwhelm staff resources.
- Risk assessments and fair use analysis may allow some organizations to do more, but some are hesitant to go down this path and instead take a more cautious approach.
Robots.txt is a file that websites use to provide instructions to crawlers. Robots exclusions are a well-known Internet convention, but do they have any legal meaning?
In web archiving, many organizations respect robots.txt instructions, but doing so can interfere with archiving in a number of ways. Robots.txt can block entire sites or specific parts of sites. Sometimes style sheets and images are blocked, and these elements are important when trying to document the look and feel of a website. Some IIPC members obey robots.txt except when it comes to inline images and stylesheets, so that the website is better represented. Others who are seeking permission bypass robots.txt so that the archived sites are as complete as possible.
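The "obey robots.txt except for inline images and stylesheets" approach can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only an illustration under stated assumptions, not any member's actual crawler configuration: the function name `may_archive`, the extension list, the user-agent token `example-archive-bot`, and the sample robots.txt are all hypothetical.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical list of file types treated as "look and feel" resources.
REQUISITE_EXTENSIONS = (".css", ".png", ".jpg", ".jpeg", ".gif", ".svg")


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this illustrative policy permits capturing `url`."""
    path = urlparse(url).path.lower()
    # Images and stylesheets are captured even when robots.txt excludes them.
    if path.endswith(REQUISITE_EXTENSIONS):
        return True
    # Everything else follows the site's robots.txt rules.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks a private area and the stylesheets.
ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /css/
"""
```

With this sample file, `may_archive(ROBOTS, "example-archive-bot", "https://example.org/css/site.css")` returns `True` because of the stylesheet exception, while the same call for `https://example.org/private/report.html` returns `False`.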
Site owners wishing to be archived should inspect their robots.txt files to ensure that they are preservation-friendly and do not restrict archival crawlers from visiting.
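A site owner could run a check along the following lines to see whether a robots.txt file excludes particular crawlers. This is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `blocked_crawlers` and the user-agent tokens are hypothetical placeholders, and the real token for a given archive's crawler should be taken from that archive's own documentation.

```python
from urllib.robotparser import RobotFileParser


def blocked_crawlers(robots_txt, user_agents, url):
    """Return the user agents that robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, url)]


# Hypothetical robots.txt: one archival crawler is excluded from the whole
# site, while all other crawlers are only kept out of /admin/.
SAMPLE = """\
User-agent: example-archive-bot
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Hypothetical crawler tokens to test against the home page.
CRAWLERS = ["example-archive-bot", "another-archive-bot"]
blocked = blocked_crawlers(SAMPLE, CRAWLERS, "https://example.org/")
```

Here `blocked` contains only `example-archive-bot`, which suggests that the first rule group would need to be removed or relaxed before that crawler could visit the site.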
IIPC members are working to share information and provide more legal resources for web archivists, not only within the membership but more broadly on this website.
As a first step, we’ve gathered information about existing legal deposit legislation (and where we know legislation does not exist) in countries represented within IIPC. This is a living document, and we expect it to be updated regularly to include additional information as it is discovered.
Some of the additional ideas we’re working on are:
- Document best practices
- Provide a visual map of where legislation supporting web archiving exists and where it does not
- Share permission letters and formal licenses with other members
- Solicit a white paper or other further study on the use and abuse of robots.txt. The IIPC is interested in gathering more data on this topic, particularly for lawyers working on policy issues related to web archiving and the use of robots.txt.
- Document why archivists might not follow robots.txt exclusions, and why IIPC members do not believe robots.txt is being used as a proxy for copyright.