Archivists Scramble to Save Digital Era

Archivists adjust strategies to confront changing world.

July 10, 2001 -- Not too long ago, America's culture was recorded mainly on paper, vinyl and film, and deposited at the Library of Congress and the U.S. Copyright Office.

Nowadays, the seeds of future history may be flowing in huge volume from Web designers and content providers directly into people's homes, and the Library of Congress and other archivists are trying to create new systems to ensure the information is saved.

It's a big job, they say, but also a big opportunity.

"We are given an opportunity through the digital medium to create libraries and disseminate knowledge in ways never possible before, and if we make the wrong steps, we can lose not only this opportunity, but also our cultural heritage that's in digital form," says Brewster Kahle, founder of the Internet Archive.

Kahle's Archive is working with the Library of Congress and private industry to preserve a record of the Internet for future generations, and has been saving parts of the Web for five years. Kahle says he knows of "no good [archived Internet] collection pre-1996" and he feels that's a shame. But, he adds, it is not unprecedented in history.

Planning a Preservation System

Still, archivists don't want to fall behind this time, if they can help it.

In December, Congress appropriated $100 million to the Library of Congress to develop a national program to preserve digital information. The effort is about more than just money, say archiving professionals.

"The first stage of the plan is to make sure we understand the issues related to long-term preservation," says Laura Campbell of the Library of Congress' Digital Infrastructure Program. "What are the considerations to making sure this content isn't lost to future generations?"

To that end, the Library is consulting with prominent publishers of Internet and digital content, attempting to reach agreement on preservation standards and work out deals for archiving information, says Winston Tabb, the associate librarian for library services.

In some cases, the Library is discussing with publishers of copyrighted professional journals, news sites and Webzines how it can assist with archiving their content. Currently, publishers can assert their copyrights, store all the information themselves and forbid local duplication and storage by libraries and archives.

Vulnerable Data?

At the same time, the Library is conducting what may be the first independent scientific tests on the stability of digital storage formats, starting with compact discs, so that it knows how long it can expect information saved in such formats to last.

"We don't see anyone out there doing this kind of study, so it does fill an important niche," says Marc Roosa, director of preservation for the Library of Congress.

Such information is important to the Library, the U.S. Copyright Office, and other libraries that hold growing and aging collections of music CDs, CD-ROMs, DVDs and data stored on writable CDs, and that may need to duplicate those collections to save the data.

Internet History Saved, Lost

Experts say the Internet is an even stronger example of the here-today-gone-tomorrow aspect of the digital age. Preservation efforts must begin now, they say; archivists often cite the conventional wisdom that the average Web page lasts about two months.

In recent years, Kahle says, the Internet Archive has taken snapshots of all legally accessible parts of the Internet every two months, so that people in the future will have some idea how it looked and what surfing the Web circa 1996 to 2001 was like.

Areas of the Web designated as off limits by site creators, or password protected, are among the areas not saved during the Internet Archive's automated sweeps. (Some parts are forbidden to archivists.)
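For readers curious how an automated sweep might honor a site creator's off-limits designation, here is a minimal sketch in Python. It assumes the designation is expressed through a robots.txt exclusion file, which the article does not specify; the file contents, the `example-archiver` agent name and the `may_archive` helper are all hypothetical, and this is not a description of the Internet Archive's actual software.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_archive(url: str, user_agent: str = "example-archiver") -> bool:
    """Return True if the exclusion rules above permit saving this URL."""
    parser = RobotFileParser()
    # Feed the rules in directly rather than fetching them over the network.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, "->", "archive" if may_archive(url) else "skip")
```

Under these assumed rules, the crawler would save the home page and skip anything under `/private/`. Password-protected pages are a separate case: a crawler without credentials simply never receives them.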

So far, the archive has collected 40 terabytes of information, about double the amount of digital space it would take to store the Library of Congress' collection of text-based material, Kahle says. The archive plans to transfer its data to new hard drives every five years so that the information it contains is not lost to decaying digital bits.
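A transfer to new drives protects the data only if the copy matches the original bit for bit. One common way to check that, sketched below in Python, is to compare cryptographic digests before and after the move. The article does not say how the archive verifies its transfers, so the use of SHA-256 and the helper names `digest` and `copy_is_intact` are assumptions made purely for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a block of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def copy_is_intact(original: bytes, copy: bytes) -> bool:
    """Hypothetical check: the copy is intact if both digests match."""
    return digest(original) == digest(copy)


if __name__ == "__main__":
    page = b"<html><body>Campaign 2000 home page</body></html>"
    good_copy = bytes(page)
    # Simulate bit decay by flipping one bit in the first byte.
    bad_copy = bytes([page[0] ^ 0x01]) + page[1:]

    print("good copy intact:", copy_is_intact(page, good_copy))
    print("decayed copy intact:", copy_is_intact(page, bad_copy))
```

In practice an archive would store each digest alongside the file and recompute it after every migration, so that a single flipped bit shows up as a mismatch instead of passing unnoticed.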

The digital material accessible through his site's "way-back machine" includes a two-terabyte archive of sites related to the 2000 presidential election, frozen as they were and stored periodically as the election campaign and its post-election aftermath progressed. Some pages from the 1996 election also are accessible. Most of the pages from both elections are no longer available elsewhere on the Web in their original form.

However, reconstructing records of the Web gets harder the farther back in time you go, meaning some historical information has already been lost.

"If somebody were to try to write a dissertation today about the Web in 1994, say, they would be hard-pressed to find the kind of archival primary materials that they'd want," says Patsy Baudoin, e-journal archiving project manager at the Massachusetts Institute of Technology libraries.