Friday, August 31, 2007

Data decay: even computers forget

In 1971, a young student at the University of Illinois typed the Declaration of Independence and saved it on a Xerox mainframe computer. The student was Michael Hart, and the electronic document was the beginning of Project Gutenberg.

Project Gutenberg’s aim is to create, store, and distribute electronic copies of books that are in the public domain: that is, works that are out of copyright, or that for some other reason do not have copyright attached. From Project Gutenberg or any one of its many affiliate sites, you can download Shakespeare, the Bible, Franz Kafka, or the works of Plato. Works that might not have survived the ravages of time in real libraries can live on in Hart's virtual library.

The project takes its name from the Gutenberg press, developed by Johannes Gutenberg around 1450. Gutenberg's printing machine used moveable type, an innovation that allowed the mass production of printed material, such as Bibles, pamphlets, and books.

Thanks to digital media, knowledge is more accessible than ever. With just a few fragments, I was able to find a nursery rhyme from my childhood. The fragment was:
“Hush little baby, don’t say a word;
Momma’s gonna buy you a mockingbird”

I found regional variants and learned that the version I knew was a rare version of the song.
Nothing is forgotten now. No piece of information is out of reach.
At least, that’s how it seems.

But is digital storage really eternal? Will generations a thousand years from now be reading Kafka online, or finding Mockingbird with a Google search (if Google even exists)?

Physical storage ages and decays

Digital storage, on CD or DVD or thumb drive or hard drive, is subject to the same forces of nature as old-fashioned storage such as paper: aging, decay, and disintegration. Data is stored on hard drives magnetically: tiny regions of magnetic material on the surface of the drive can be switched one way or another, making long strips of magnetic material that are used to store zeros and ones. These zeros and ones, these bits of “binary data,” are the building blocks of all digital storage.
It works like this: in eight-bit storage, eight zeros means “zero.” Seven zeros and a one means “one.” Six zeros, a one, then a zero means “two,” and so on. You can build it up from there to sophisticated programs, renditions of the Mona Lisa, and copies of Franz Kafka’s Metamorphosis.
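The counting scheme above can be sketched in a few lines of Python. This is only an illustration: the function names to_eight_bits and text_to_bits are my own, and the text example assumes plain ASCII characters, one byte per character.

```python
def to_eight_bits(number: int) -> str:
    """Return a number from 0 to 255 as a string of eight zeros and ones."""
    if not 0 <= number <= 255:
        raise ValueError("eight bits can only hold values from 0 to 255")
    return format(number, "08b")


def text_to_bits(text: str) -> list[str]:
    """Encode text as ASCII bytes (an assumption) and show each byte as bits."""
    return [to_eight_bits(byte) for byte in text.encode("ascii")]


print(to_eight_bits(0))  # 00000000
print(to_eight_bits(1))  # 00000001
print(to_eight_bits(2))  # 00000010
print(text_to_bits("Kafka"))
```

Each letter of "Kafka" becomes one group of eight bits, and a whole book is simply a very long run of such groups.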
The problem with magnetic storage is that the Earth itself has a huge magnetic field (this is why compasses can find the North Pole: they align themselves to the Earth’s magnetic field). Over time, that weak but ever-present field gently tugs at the magnetic strips, coaxing them all into aligning in the same direction. It always wins in the end, erasing all data. This is why it is always risky passing laptops through x-ray machines at airports. The x-ray machines need to be specially adjusted for such devices. Even then, some degradation occurs.
It is also why old video cassette tapes and audio music tapes are unreadable: they also rely on magnetic storage. The North Pole and the South Pole wipe them all clean.

Time also erases optical media

But you don’t have to keep all your information on such an unreliable medium. More and more information is now stored on optical media: the CD-ROM and the DVD. Optical media use lasers to read markings on a disc, and are immune to the ravages of the Earth’s and other magnetic fields. Unfortunately, they don’t last forever either.
Optical media are subject to bleeding ink and degradation due to temperature and decay. Jerome Hartke of Media Science conducted an experiment on the longevity of CDs. Instead of just waiting around for time to erase all, he sped up the aging process by keeping the CDs in storage for 200 hours at a temperature of 85 degrees Celsius (185 degrees Fahrenheit) and 85 percent humidity.
Hartke found considerable damage to the CDs, but it varied from disc to disc. By one measure, half of all discs with recorded information had at least some defect. Unused discs fared worse: over time, the ability of many of these discs to be modified to accept new data was lost.
Hartke noted that this kind of deterioration was not the only threat to CD information, and possibly not even the most important. A major cause of CD deterioration, he said, was “overconfidence in the robust construction and error correction of CD-R media.”
People think that CDs are more reliable, and more durable, than they actually are. And that belief in the toughness and durability of CDs is a bigger risk to CD health than any of the ravages of time.
Why do people have such faith in the CD? In part, it is because the machines that read them have such excellent error-correction technology, protecting us from seeing their imperfections. All discs generate thousands of read errors, due to defects, noise, and high data densities. These errors are caught and corrected using a standard called the Cross-Interleaved Reed-Solomon Code (CIRC), creating the illusion, for us, of pure, error-free storage. But over time errors accumulate, and cracks appear. The errors eventually reach a critical mass at which the disc is no longer readable. This is the fate of every CD.
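To see how error correction can hide damage until it suddenly cannot, here is a minimal sketch in Python. It does not implement CIRC, which is far more sophisticated. It uses a simple triple-repetition code with majority voting, chosen only because it fits in a few lines, and the function names are my own.

```python
def encode(bits: list[int]) -> list[int]:
    """Store every bit three times in a row."""
    return [bit for bit in bits for _ in range(3)]


def decode(coded: list[int]) -> list[int]:
    """Recover each bit by majority vote over its group of three copies."""
    return [
        1 if sum(coded[i:i + 3]) >= 2 else 0
        for i in range(0, len(coded), 3)
    ]


def flip(coded: list[int], positions: list[int]) -> list[int]:
    """Return a copy of the stored bits with the given positions damaged."""
    damaged = list(coded)
    for position in positions:
        damaged[position] ^= 1
    return damaged


data = [1, 0, 1, 1]
stored = encode(data)

# One damaged copy in a group is outvoted, so the reader never notices.
print(decode(flip(stored, [4])) == data)

# Two damaged copies in the same group win the vote, and the data is wrong.
print(decode(flip(stored, [3, 4])) == data)
```

The first check succeeds and the second fails: the stored data looks perfect right up to the point where the accumulated damage outnumbers the good copies.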
There are now longer-lived storage products, such as Plasmon UDO (ultra density optical) storage, which lasts for about 50 years. That’s long by digital standards, but is the blink of an eye historically. The Rosetta Stone, ancient writing carved into stone, has survived for more than two thousand years.

Hardware obsolescence

Even if information survives for centuries, will it be any use?
The fast rate of change of technology has created a whole new problem: obsolete devices. When hardware standards change, the storage items that were used with them become obsolete. When CD music became available, hundreds of millions of vinyl records became obsolete. This didn’t happen overnight: people still had record players to play them, and even now there is a niche market for vinyl playing equipment. But over time, more and more vinyl record players break, or are thrown out. And if the niche market for the players disappears, finding the equipment will become more difficult.
Elvis and the Beatles have successfully made the transition to CD and again to MP3 and the iPod, but what about the Goddards? Jigsaw? Chicken Shack? The Ozark Mountain Daredevils? How much music was ultimately lost forever in the transition?

Vinyl records are a “contact medium:” the needle has to actually come into contact with the record. The needle follows a groove that spirals inward, and tiny bumps in the groove make the needle vibrate at different frequencies. Every time the needle follows that track, the needle gets worn down a little, and so does the record.
A vinyl record can’t be played an infinite number of times. There are recordings all over the world that have not been converted to digital format. Eventually, it will be too late.
At least vinyl records can be played, even without a specialized vinyl record player: it is possible to rig a makeshift player with a pin or regular needle (as the playing needle) and a roll of paper (as the amplifier). By holding the needle to the record and spinning it, a semblance of the original recording can be heard. This is not the case for floppy disks.
In the early days of computing, data was almost universally stored on floppy disks (although some computers, such as my ZX81, used music cassettes). You inserted them into the floppy disk drive, wrote your files to the floppy disk, and took it out. I had piles of them next to my computer.
At first, they were large and were actually floppy: they were soft and flexible. Over time, as technology improved, they became smaller and smaller. While 8-inch floppy disks were the standard in 1970, by 1980 smaller ones were available, only 5 ¼ inches in diameter, and by 1990 the standard size was 3 ½ inches across.
Each of these improvements in floppy disk technology meant that new floppy disk reading machines, or “disk drives,” needed to be built and distributed, and data that was on old formats needed to be transferred to new formats. During changeover periods, computers would often have two drives, one for a larger format and one for a smaller format. This lulled people into complacency about data stored on larger floppy formats. Eventually, they found themselves in a world where the older format was no longer supported, and retrieving the data was, if not impossible, problematic. Today, computers are sold with no floppy drives at all.

Software obsolescence

When it was first released, Visicalc was revolutionary. It was the original spreadsheet. The idea was not patented, and before long, Lotus, Microsoft, and others created their own spreadsheets. Visicorp, the company that made Visicalc, no longer exists, so getting new copies, an update, or product support for your existing copy is impossible. Luckily, the fame and widespread use of Visicalc means that modern computers can run a “Visicalc emulator,” a program that simulates Visicalc so perfectly that old files can still be used.
To get a glimpse of the effort needed to access information in old formats, take the case of Terry, who wanted to retrieve 800 poems that were originally written on an Olympia Carrera word processor. The word processor itself (a dedicated machine that only ran word processing software) is long gone. All that remained were the floppy disks, which were written in a format unique to the Olympia Carrera. Terry needed to do the following to get the poems:
1) install a floppy drive to read the correct size of floppy disk;
2) install a program called 22DISK;
3) install a program called ANADISK;
4) build a custom diskette specification using ANADISK.
The job was successful, but it shows how much trouble you can be in when the technology moves on.
Formats for other programs, especially custom-designed programs, may be impossible to decode. This is due to the complexity of digital files, which may need arbitrarily sophisticated algorithms to read.

A ticking time bomb

If the specifications for any file format are lost, then all information stored in that format will also be lost. Decoding a PDF file, for instance, would be impossible without the specifications. Such specs typically run from hundreds to thousands of pages.
The problem of information being lost in old data formats has alarmed the British National Archives. The chief executive of the National Archives, Natalie Ceeney, describes the situation as a “ticking time bomb.”
They have over 580 terabytes of data in unsupported file formats, and have teamed up with Microsoft to create a range of systems to retrieve the information using emulators. Even when the emulators are built, transferring that amount of data to a new format is a formidable challenge, and will require large scale automated tasks.
The irony of the National Archives teaming up with Microsoft is that companies with proprietary software (such as Microsoft) have made the problem worse. The specifications for proprietary software and file formats are not freely available, because they are the commercial property of the companies that own them. This hinders the ability of others to make compatible applications.
Microsoft often gets the blame for this situation, partly because file formats for old versions of Microsoft software are not always supported by new versions. However, IBM, Novell, and other companies have done the same thing.
Product support cycles (the time that a company is willing to help you keep the software working) are usually between five and ten years. After that, you’re on your own. In many cases, the software itself disappears, and all that is left are the files: files that were once considered important enough to save or archive, now with no way of being retrieved.

Drowning in information

But what if all that information was lost anyway? It seems like we are swimming, sometimes drowning, in information today. Digital information may not be more durable, but it is easy to distribute and easy to copy. It is the new revolution.
Perhaps Project Gutenberg, the British National Archives, archive.org, and institutions like them are the key to immortal knowledge.
That may be the case, but the vast amount of information around us blinds us to the fact that the information is still transient. It is as if we were given a giant newspaper rack to store all our daily newspapers. Rather than throwing them out, we could keep them all in a room at the back of our house. However, newspapers still decay at the same rate.

Still, it is true that it is now easier to keep information alive. Sure, the information has to stay on the run, moving from storage to storage and home to home. Deletion and permanent erasure stalk it, one step behind, waiting for a slip-up. But with diligence, it is possible to keep the data alive. Storage is replaceable and cheap.
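One form that diligence can take is checking that a file survived each move intact. The following is a minimal sketch in Python, assuming a SHA-256 checksum is an acceptable fingerprint for the file. The function names and the file names in the demonstration are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 fingerprint of a file, read in small chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> bool:
    """Copy a file to its new home and confirm the copy matches the original."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    return sha256_of(destination) == expected


if __name__ == "__main__":
    # Hypothetical demonstration using throwaway files in a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        old_home = Path(folder) / "old_drive.txt"
        new_home = Path(folder) / "new_drive.txt"
        old_home.write_text("Hush little baby, don't say a word")
        print(copy_and_verify(old_home, new_home))
```

Keeping the fingerprint alongside the file also lets you re-check it years later, before the old copy is thrown away.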

Information half-life

Once, people wrote letters to each other. Now they write emails. Let’s say that letters have some kind of mortality function with a half-life of six months. It’s not unreasonable. Most letters are disposed of within a month or two of arrival. A few are treasured for decades. Emails, too, have a mortality function, but their half-life is a lot longer. Excluding spam, I would guess their half-life would be around two years, considering losses due to changing or deleting email accounts, changing computers, forgetting passwords, accidental erasure and clearing of old files. That means that, even though the data will disappear eventually, on average it lives longer.
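To put rough numbers on this, here is a small sketch in Python. It assumes the mortality function is a simple exponential decay, which is my simplification, and it uses the half-lives guessed above (six months for letters, two years for emails). The function names are my own.

```python
import math


def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of items still alive after a given time, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)


def mean_lifetime(half_life_years: float) -> float:
    """Average lifetime under exponential decay: half-life divided by ln 2."""
    return half_life_years / math.log(2)


LETTER_HALF_LIFE = 0.5  # years, the guess used in this article
EMAIL_HALF_LIFE = 2.0   # years, the guess used in this article

for label, half_life in (("letters", LETTER_HALF_LIFE), ("emails", EMAIL_HALF_LIFE)):
    left = surviving_fraction(2, half_life)
    average = mean_lifetime(half_life)
    print(f"{label}: {left:.2%} left after two years, average life {average:.2f} years")
```

Under these assumptions, two years is four half-lives for letters, so only one in sixteen survives, while half of all emails are still around.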
Information lifetime doesn’t tell the whole story. Part of the increase in information lifetime is the storage of items that would have previously been instantly destroyed, or at least disposed of in the near future. Communication between co-workers and friends was once almost entirely verbal, and therefore transient. Now, as we transfer daily communication to digital media, we leave a trail with more and more of our daily interactions. Our idle morning chit-chat threatens to outlive us.
Before computers, a paperback novel would have a brief moment of glory. Thousands of copies would sit in displays in thousands of bookshops, stacked one on the other. Thousands of hands, thousands of readers, thousands of bedroom shelves. One by one, the copies would meet their ultimate fate: they would be broken, or burned, or thrown into the trash. They would be shredded by playing children, or dropped in a puddle. Some would be stored in boxes in attics and sheds, eventually being warped by the elements and rendered unreadable.
Only the few survived: the outstanding works that entered a second printing, and a third. The books that had instant fame, or a cult following. For them, immortality beckoned, but most are gone. The vast majority of books published in the nineteenth century, for example, have completely disappeared.
Project Gutenberg can now grant immortality not just to the chosen few, but to unlimited numbers of works. But is this a good thing? Can we really be the custodians of an ever-increasing store of mediocre information? Perhaps there is merit in letting things die: a sort of information natural selection. Perhaps our wisdom on Wikipedia or Yahoo Answers, or in tens of millions of blogs, journals, and newspapers, is best forgotten for the most part.
So we have a curious situation: On the one hand, information is constantly decaying, and is threatened by changes in technology and media storage standards. On the other hand, the world is experiencing an explosion in the total amount of stored information, much of which is of questionable value. From this, two conclusions can be drawn:
1) We need to focus on what information we want to store. We need to decide how important each data stream is, and how long it needs to be stored.
2) For the information that we decide to keep, we need to be diligent about keeping it.

When I say “we”, I am referring to us as individuals, as organizations, and as a society. For instance, as individuals, how many emails do we wish to keep? At your current rate, how much email data will you have in twenty years’ time, and how will you manage it? This is a question that is better asked sooner rather than later.
Plato’s work has survived for two and a half millennia. It is likely to survive just as long again, but if it is buried under a mountain of mediocrity, perhaps nobody will bother to read it.



Further reading for the interested:
Media Science: longevity of data
Problem of media obsolescence
Techworld take on the storage issue
A history of Visicalc by its creator
Webmasterworld discussion: digital data a ticking time bomb (the thread that got me thinking about and researching the issue, and eventually led to this article)
