When scientific publications were published in paper format, libraries played an important role in ensuring that knowledge was not lost. Because copies were distributed in so many libraries, there was no risk of information being lost even if a publisher went out of business or a library closed. However, like everything else, scientific content has been digitized and the implications for preservation have also changed.
Organizations have devised systems that provide options for preserving digital material. However, recently published research shows that many digital documents are not consistently showing up in the archives that are meant to preserve them. That puts us at risk of losing academic research, including science paid for with taxpayer money.
Reference tracking
The work was carried out by Martin Eve, a developer at Crossref. This is the organization that organizes the DOI system, which provides a permanent pointer to digital documents, including almost all scientific publications. If updated properly, a DOI will always resolve to its document, even if the document is moved to a new URL.
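As a rough illustration of how a DOI acts as a pointer, the sketch below turns a DOI into a URL on the public doi.org resolver, which then redirects to wherever the document currently lives. The helper name `doi_to_url` is hypothetical, and the example uses the DOI of Eve's paper cited at the end of this article.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Accept common prefixes that people paste in front of a DOI.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode unusual characters, but keep the slash that
    # separates the DOI prefix from its suffix.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.31274/jlsc.16288"))
```

The address a reader ends up at can change over time; the DOI itself, and therefore this resolver URL, stays the same.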
But there are also ways to deal with documents disappearing from their expected locations, which can happen if a publisher goes out of business. There are a number of so-called “dark archives” that are not available to the public but should contain copies of everything that has been assigned a DOI. If something goes wrong with a DOI, the dark archive is triggered to open access, and the DOI is updated to point to the copy in the dark archive.
However, for this to work, a copy of everything that is published has to be in the archives. So Eve decided to check whether that was the case.
Using the Crossref database, Eve retrieved a list of over 7 million DOIs and then checked whether the documents could be found in archives. He included well-known archives, such as the Internet Archive at archive.org, as well as some dedicated to academic works, such as LOCKSS (Lots of Copies Keeps Stuff Safe) and CLOCKSS (Controlled Lots of Copies Keeps Stuff Safe).
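The kind of check described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Eve's actual code: the DOIs and archive holdings are made up, and `archive_count` is a hypothetical helper that counts how many archives hold a given DOI.

```python
def archive_count(doi, holdings):
    """Return how many archives list this DOI among their holdings."""
    return sum(1 for dois in holdings.values() if doi in dois)


# Made-up holdings for illustration only; real archives are not queried.
holdings = {
    "Internet Archive": {"10.1234/a", "10.1234/b"},
    "LOCKSS": {"10.1234/a"},
    "CLOCKSS": {"10.1234/a", "10.1234/c"},
}

# Made-up sample standing in for the list of DOIs drawn from Crossref.
sample = ["10.1234/a", "10.1234/b", "10.1234/c", "10.1234/d"]

counts = {doi: archive_count(doi, holdings) for doi in sample}
unarchived = [doi for doi, n in counts.items() if n == 0]

print(counts)
print(unarchived)
```

In this toy data, one work sits in all three archives, two sit in a single archive, and one is not preserved anywhere.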
Not well preserved
The results were… not very good.
When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had the majority of their content stored in multiple archives. (The cutoff was 75 percent of their content being in three or more archives.) Fewer than 10 percent had at least half of their content in two archives. And a full third seemed to be doing no systematic archiving at all.
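To make those cutoffs concrete, here is a minimal sketch of how a publisher could be sorted into these groups from a list of per-work archive counts. It is an illustration rather than the study's method: the function names and category labels are hypothetical, and it reads "two archives" as "two or more," which is an assumption.

```python
def share_with_at_least(counts, k):
    """Fraction of a publisher's works held in at least k archives."""
    if not counts:
        return 0.0
    return sum(1 for n in counts if n >= k) / len(counts)


def classify(counts):
    """Sort a publisher into a group (labels are hypothetical)."""
    # Cutoff from the article: 75 percent of content in three or more archives.
    if share_with_at_least(counts, 3) >= 0.75:
        return "majority in three or more archives"
    # Assumption: "two archives" is read as "two or more archives".
    if share_with_at_least(counts, 2) >= 0.5:
        return "at least half in two or more archives"
    # Nothing from this publisher was found in any archive.
    if share_with_at_least(counts, 1) == 0:
        return "no archiving detected"
    return "partial archiving"


# Each list holds the number of archives containing each of a publisher's works.
print(classify([3, 3, 3, 1]))
print(classify([2, 2, 0, 0]))
print(classify([0, 0, 0]))
```

The checks run from the strictest cutoff down, so a publisher is always placed in the best group it qualifies for.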
At the individual publication level, less than 60 percent were present in at least one archive, and more than a quarter did not appear to be in any archive at all. (The remaining 14 percent were published too recently to be archived or had incomplete records.)
The good news is that major academic publishers seem to be doing a reasonably good job of getting material into their archives. Most of the unarchived issues come from small publishers.
Eve acknowledges that there are limitations to this research, primarily in that there may be additional archives that he has not checked. There are also notable dark archives that he didn’t have access to, and others like Sci-hub, which infringes the copyrights of commercial publishers’ materials in order to make them available to the public. Finally, individual publishers may have their own archiving systems in place to prevent the loss of publications.
Should we worry?
The risk here is that access to some academic research may eventually be lost. As Eve says, knowledge expands because we can build on a foundation of facts that can be traced back through a chain of references. As we begin to lose those connections, that foundation becomes much shakier. Archiving comes with its own set of challenges: it costs money, it has to be organized, and a consistent means of accessing archived materials has to be established.
But we have failed to some extent on the first step. “A key point is that there is no consensus on who should be responsible for archiving scholarship in the digital age,” Eve writes.
A somewhat related issue is ensuring that people can find archived materials. This is the problem that DOIs were designed to solve. Often a manuscript's authors will place a copy somewhere like the arXiv/bioRxiv or the NIH's PubMed Central (this kind of archiving is increasingly mandated by funding bodies). The problem here is that the archived copy may not include the DOI that is meant to ensure it can be located. That doesn't mean it can't be identified by other means, but it definitely makes finding the right document much more difficult.
In other words, if you can’t find the paper, or aren’t sure if you’re looking at the correct version of the paper, it could be just as bad as not having a copy of the paper at all.
This does not mean that we have already lost important research documents. But Eve’s paper does a valuable job by underscoring that the risks are real. For most academics, the print version of a journal is irrelevant, and we are entering an era where digital-only academic journals are proliferating. It is long past time that clear standards were established to ensure that digital versions of research are as durable as printed copies.
Journal of Librarianship and Scholarly Communication, 2024. DOI: 10.31274/jlsc.16288