Friday, August 31, 2007

Data decay: even computers forget

In 1971, a young student at the University of Illinois typed the Declaration of Independence and saved it on a Xerox mainframe computer. The student was Michael Hart, and the electronic document was the beginning of Project Gutenberg.

Project Gutenberg’s aim is to create, store, and distribute electronic copies of books that are in the public domain: that is, works that are out of copyright, or that for some other reason have no copyright attached. From Project Gutenberg or any one of its many affiliate sites, you can download Shakespeare, the Bible, Franz Kafka, or the works of Plato. Works that might not have survived the ravages of time in real libraries can live on in Hart's virtual library.

The project takes its name from the Gutenberg press, developed by Johannes Gutenberg around 1450. Gutenberg's printing machine used moveable type, an innovation that allowed the mass production of printed material, such as bibles, pamphlets, and books.

Thanks to digital media, knowledge is more accessible than ever. With just a few fragments, I was able to find a nursery rhyme from my childhood. The fragment was:
“Hush little baby, don’t say a word;
Momma’s gonna buy you a mockingbird”

I found regional variants and learned that the version I knew was a rare one.
Nothing is forgotten now. No piece of information is out of reach.
At least, that’s how it seems.

But is digital storage really eternal? Will generations a thousand years from now be reading Kafka online, or finding Mockingbird with a Google search (if Google even exists)?

Physical storage ages and decays

Digital storage, on CD or DVD or thumb drive or hard drive, is subject to the same forces of nature as old-fashioned storage such as paper: aging, decay, and disintegration. Data is stored on hard drives magnetically: tiny alignments of magnetic material on the surface of the drive can be switched one way or another, making long strips of magnetic material that are used to store zeros and ones. These zeros and ones, these bits of “binary data,” are the building blocks of all digital storage.
It works like this: in eight-bit storage, eight zeros means “zero.” Seven zeros and a one means “one.” Six zeros, a one, then a zero means “two,” and so on. You can build up from there to sophisticated programs, renditions of the Mona Lisa, and copies of Franz Kafka’s Metamorphosis.
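As a rough sketch of that counting scheme, the short Python snippet below writes a few numbers as eight-bit patterns and then does the same for a piece of text. The function names are hypothetical, invented for this illustration, and the use of ASCII to turn letters into numbers is an assumption: the counting scheme itself does not depend on it.

```python
def to_eight_bits(value):
    """Write a number from 0 to 255 as a string of eight zeros and ones."""
    if not 0 <= value <= 255:
        raise ValueError("eight bits can only hold values from 0 to 255")
    return format(value, "08b")


def text_to_bits(text):
    """Turn text into eight-bit patterns, one per character (ASCII assumed)."""
    return " ".join(to_eight_bits(byte) for byte in text.encode("ascii"))


for number in (0, 1, 2):
    print(number, to_eight_bits(number))

print(text_to_bits("Kafka"))
```

Every extra bit doubles the number of values that can be written, which is how such a simple scheme scales up to pictures and novels.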
The problem with magnetic storage is that the Earth itself has a huge magnetic field (this is why compasses can find the north pole: they align themselves to the Earth’s magnetic field). Over time, that weak but ever-present field gently tugs at the magnetic strips, coaxing them all into aligning in the same direction. It always wins in the end, erasing all data. This is why it is always risky passing laptops through x-ray machines at airports. The x-ray machines need to be specially adjusted for such devices. Even then, some degradation occurs.
It is also why old video cassette tapes and audio music tapes are unreadable: they also rely on magnetic storage. The North Pole and the South Pole wipe them all clean.

Time also erases optical media

But you don’t have to keep all your information on such an unreliable medium. More and more information is now stored with optical media: the CD-ROM and the DVD. Optical media use lasers to read markings on a disc, and are immune to the ravages of the Earth’s and other magnetic fields. Unfortunately, they don’t last forever either.
Optical media are subject to bleeding ink and degradation due to temperature and decay. Jerome Hartke of Media Science conducted an experiment on the longevity of CDs. Instead of just waiting around for time to erase all, he sped up the aging process by keeping the CDs in storage for 200 hours at a temperature of 85 degrees Celsius (185 degrees Fahrenheit) and 85 percent humidity.
Hartke found considerable damage to the CDs, but it varied from disc to disc. By one measure, half of all discs with recorded information had at least some defect. Unused discs fared worse: over time, the ability of many of these discs to be modified to accept new data was lost.
Hartke noted that this kind of deterioration was not the only threat to CD information, and possibly not even the most important. A major cause of CD deterioration, he said, was “overconfidence in the robust construction and error correction of CD-R media.”
People think that CDs are more reliable, and more durable, than they actually are. And that belief in the toughness and durability of CDs is a bigger risk to CD health than any of the ravages of time.
Why do people have such faith in the CD? In part, it is because the machines that read them have such excellent error-correction technology, protecting us from seeing their imperfections. All discs generate thousands of read errors, due to defects, noise, and high data densities. These errors are caught and corrected using a standard called the Cross-Interleaved Reed-Solomon Code (CIRC), creating the illusion, for us, of pure, error-free storage. But over time errors accumulate, and cracks appear. The errors eventually reach a critical mass, and the disc is no longer readable. This is the fate of every CD.
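CIRC itself is far too intricate to show here, but the general idea of error correction, and of how it eventually gives out, can be sketched with a much simpler stand-in: a repetition code that stores every bit three times and takes a majority vote when reading. This is not how a CD player works. It is only an illustration, and the function names are hypothetical.

```python
def encode(bits):
    """Store each data bit three times in a row."""
    return [bit for bit in bits for _ in range(3)]


def decode(stored):
    """Recover each data bit by majority vote over its three copies."""
    return [1 if sum(stored[i:i + 3]) >= 2 else 0
            for i in range(0, len(stored), 3)]


def flip(stored, positions):
    """Return a copy of the stored bits with the given positions damaged."""
    damaged = list(stored)
    for position in positions:
        damaged[position] ^= 1
    return damaged


data = [1, 0, 1, 1]
stored = encode(data)

# One damaged copy is outvoted by the other two, so the reader sees no error.
print(decode(flip(stored, [4])) == data)      # True

# Two damaged copies of the same bit win the vote, and the data is now wrong.
print(decode(flip(stored, [3, 4])) == data)   # False
```

The first case is the comfortable illusion of error-free storage. The second is what happens once the damage outruns the redundancy.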
There are now longer-lived storage products, such as Plasmon UDO (ultra density optical) storage, which lasts for about 50 years. That’s long by digital standards, but is the blink of an eye historically. The Rosetta Stone, ancient writing carved into stone, has survived for more than two thousand years.

Hardware obsolescence

Even if information survives for centuries, will it be any use?
The fast rate of change of technology has created a whole new problem: obsolete devices. When hardware standards change, the storage items that were used with them become obsolete. When CD music became available, hundreds of millions of vinyl records became obsolete. This didn’t happen overnight: people still had record players to play them, and even now there is a niche market for vinyl playing equipment. But over time, more and more vinyl record players break, or are thrown out. And if the niche market for the players disappears, finding the equipment will become more difficult.
Elvis and the Beatles have successfully made the transition to CD and again to MPEG and iPod, but what about the Goddards? Jigsaw? Chicken Shack? The Ozark Mountain Daredevils? How much music was ultimately lost forever in the transition?

Vinyl records are a “contact medium”: the needle has to actually come into contact with the record. The needle follows a groove that spirals inward, and tiny bumps in the groove make the needle vibrate at different frequencies. Every time the needle follows that track, the needle gets worn down a little, and so does the record.
A vinyl record can’t be played an infinite number of times. There are recordings all over the world that have not been converted to digital format. Eventually, it will be too late.
At least vinyl records can be played, even without a specialized vinyl record player: it is possible to rig a makeshift player with a pin or regular needle (as the playing needle) and a roll of paper (as the amplifier). By holding the needle to the record and spinning it, a semblance of the original recording can be heard. This is not the case for floppy disks.
In the early days of computing, data was almost universally stored on floppy disks (although some computers, such as my ZX81, used music cassettes). You inserted them into the floppy disk drive, wrote your files to the floppy disk, and took it out. I had piles of them next to my computer.
At first, they were large and actually floppy: they were soft and flexible. Over time, as technology improved, they became smaller and smaller. While 8-inch floppy disks were the standard in 1970, by 1980 smaller ones were available, only 5 ¼ inches in diameter, and by 1990 the standard size was 3 ½ inches across.
Each of these improvements in floppy disk technology meant that new floppy disk reading machines, or “disk drives,” needed to be built and distributed, and data that was on old formats needed to be transferred to new formats. During changeover periods, computers would often have two drives, one for a larger format and one for a smaller format. This lulled people into complacency about data stored on larger floppy formats. Eventually, they found themselves in a world where the older format was no longer supported, and retrieving the data was, if not impossible, problematic. Today, computers are sold with no floppy drives at all.

Software obsolescence

When it was first released, Visicalc was revolutionary. It was the original spreadsheet. The idea was not patented, and before long Lotus, Microsoft, and others had created their own spreadsheets. Visicorp, the company that made Visicalc, no longer exists, so getting new copies, an update, or product support for your existing copy is impossible. Luckily, the fame and widespread use of Visicalc means that modern computers can run a “Visicalc emulator,” a program that simulates Visicalc so perfectly that old files can still be used.
To get a glimpse of the effort needed to access information in old formats, take the case of Terry, who wanted to retrieve 800 poems that were originally written on an Olympia Carrera word processor. The word processor itself, a dedicated machine that only ran word processing software, is long gone. All that remained were the floppy disks, which were written in a format unique to the Olympia Carrera. Terry needed to do the following to get the poems:
1) install a floppy drive to read the correct size of floppy disk;
2) install a program called 22DISK;
3) install a program called ANADISK;
4) build a custom diskette specification using ANADISK.
The job was successful, but it shows how much trouble you can be in when the technology moves on.
Formats for other programs, especially custom-designed programs, may be impossible to decode. This is due to the complexity of digital files, which may need arbitrarily sophisticated algorithms to read.

A ticking time bomb

If the specifications for any file format are lost, then all information stored in that format may be lost with them. Decoding a PDF file, for instance, would be all but impossible without the specifications. Such specs typically run from hundreds to thousands of pages.
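To see why the specification matters so much, consider a minimal sketch in Python. The same four bytes are read under three different, made-up "specifications," and each one gives a different answer. The byte values and the three interpretations are chosen purely for illustration and do not come from any real file format.

```python
import struct

# Four bytes as they might sit on a disk, with nothing to say what they mean.
raw = bytes([0x00, 0x00, 0x80, 0x3F])

# Hypothetical spec A: one 32-bit floating point number, little-endian.
as_float = struct.unpack("<f", raw)[0]

# Hypothetical spec B: one unsigned 32-bit integer, little-endian.
as_integer = struct.unpack("<I", raw)[0]

# Hypothetical spec C: two unsigned 16-bit integers, little-endian.
as_pair = struct.unpack("<HH", raw)

print(as_float)    # 1.0
print(as_integer)  # 1065353216
print(as_pair)     # (0, 16256)
```

Nothing in the bytes themselves says which reading is the right one. A real file format piles thousands of such decisions on top of each other.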
The problem of information being lost in old data formats has alarmed the British National Archives. The chief executive of the National Archives, Natalie Ceeney, describes the situation as a “ticking time bomb.”
They have over 580 terabytes of data in unsupported file formats, and have teamed up with Microsoft to create a range of systems to retrieve the information using emulators. Even when the emulators are built, transferring that amount of data to a new format is a formidable challenge, and will require large scale automated tasks.
The irony of the National Archives teaming up with Microsoft is that companies with proprietary software (such as Microsoft) have made the problem worse. The specifications for proprietary software and file formats are not freely available, because they are the commercial property of the companies that own them. This hinders the ability of others to make compatible applications.
Microsoft often gets the blame for this situation, partly because file formats for old versions of Microsoft software are not always supported by new versions. However, IBM, Novell, and other companies have done the same thing.
Product support cycles (the time that a company is willing to help you keep the software working) are usually between five and ten years. After that, you’re on your own. In many cases, the software itself disappears, and all that is left are the files. Files that were once considered important enough to save or archive, now with no way of being retrieved.

Drowning in information

But what if all that information was lost anyway? It seems like we are swimming, sometimes drowning, in information today. Digital information may not be more durable, but it is easy to distribute and easy to copy. It is the new revolution.
Perhaps Project Gutenberg, the British National Archives, archive.org, and institutions like them are the key to immortal knowledge.
That may be the case, but the vast amount of information around us blinds us to the fact that the information is still transient. It is as if we were given a giant newspaper rack to store all our daily newspapers. Rather than throwing them out, we could keep them all in a room at the back of our house. However, newspapers still decay at the same rate.

Still, it is true that it is now easier to keep information alive. Sure, the information has to stay on the run, moving from storage to storage and home to home. Deletion and permanent erasure stalk it, one step behind, waiting for a slip-up. But with diligence, it is possible to keep the data alive. Storage is replaceable and cheap.

Information half-life

Once, people wrote letters to each other. Now they write emails. Let’s say that letters have some kind of mortality function with a half-life of six months. It’s not unreasonable. Most letters are disposed of within a month or two of arrival. A few are treasured for decades. Emails, too, have a mortality function, but their half-life is a lot longer. Excluding spam, I would guess their half-life would be around two years, considering losses due to changing or deleting email accounts, changing computers, forgetting passwords, accidental erasure and clearing of old files. That means that, even though the data will disappear eventually, on average it lives longer.
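The comparison can be put into numbers with a small sketch. It assumes the simplest possible mortality function, in which a fixed share of the surviving items is lost in every period (exponential decay), and it uses the guessed half-lives from above: six months for letters and two years for emails. Both the model and the function names are assumptions made for illustration.

```python
import math


def surviving_fraction(years, half_life_years):
    """Share of items still around after the given time, under exponential decay."""
    return 0.5 ** (years / half_life_years)


def mean_lifetime(half_life_years):
    """Average lifetime implied by a half-life under exponential decay."""
    return half_life_years / math.log(2)


for label, half_life in (("letters", 0.5), ("emails", 2.0)):
    print(label,
          "surviving after 2 years:", surviving_fraction(2, half_life),
          "mean lifetime in years:", round(mean_lifetime(half_life), 2))
```

Under these assumptions about 6 percent of letters are still around after two years, against half of the emails. In both cases the surviving share keeps shrinking toward zero.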
Information lifetime doesn’t tell the whole story. Part of the increase in information lifetime is the storage of items that would have previously been instantly destroyed, or at least disposed of in the near future. Communication between co-workers and friends was once almost entirely verbal, and therefore transient. Now, as we transfer daily communication to digital media, we leave a trail with more and more of our daily interactions. Our idle morning chit-chat threatens to outlive us.
Before computers, a paperback novel would have a brief moment of glory. Thousands of copies would sit in displays in thousands of bookshops, stacked one on the other. Thousands of hands, thousands of readers, thousands of bedroom shelves. One by one, the copies would meet their ultimate fate: they would be broken, or burned, or thrown into the trash. They would be shredded by playing children, or dropped in a puddle. Some would be stored in boxes in attics and sheds, eventually being warped by the elements and rendered unreadable.
Only the few survived: the outstanding works that entered a second printing, and a third. The books that had instant fame, or a cult following. For them, immortality beckoned, but for most, they are gone. The vast majority of books published in the nineteenth century, for example, have completely disappeared.
Project Gutenberg can now grant immortality not just to the chosen few, but to unlimited numbers of works. But is this a good thing? Can we really be the custodians of an ever-increasing store of mediocre information? Perhaps there is merit in letting things die: a sort of information natural selection. Perhaps our wisdom on Wikipedia or Yahoo Answers, or in tens of millions of blogs, journals, and newspapers, is best forgotten for the most part.
So we have a curious situation: On the one hand, information is constantly decaying, and is threatened by changes in technology and media storage standards. On the other hand, the world is experiencing an explosion in the total amount of stored information, much of which is of questionable value. From this, two conclusions can be drawn:
1) We need to focus on what information we want to store. We need to decide how important each data stream is, and how long it needs to be stored.
2) For the information that we decide to keep, we need to be diligent about keeping it.

When I say “we”, I am referring to us as individuals, as organizations, and as a society. For instance, as individuals, how many emails do we wish to keep? At your current rate, how much email data will you have in twenty years’ time, and how will you manage it? This is a question that is better asked sooner rather than later.
Plato’s work has survived for two and a half millennia. It is likely to survive just as long again, but if it is buried under a mountain of mediocrity, nobody might bother to read it.



Further reading for the interested:
Media Science: longevity of data
Problem of media obsolescence
Techworld take on the storage issue
A history of Visicalc by its creator
Webmasterworld discussion: digital data a ticking time bomb (the thread that got me thinking about and researching the issue, and eventually led to this article)

9 comments:

J-Dog said...

Outstanding post - thanks. BTW - You do NOT need to save this comment.

Dave said...

Thanks j-dog. You've flattered my ego too much-- I can't bring myself to delete your comment! :-)

Michael Anissimov said...

How about we just use configurations of carbon nanotubes to hold info? They are very strong and would not degrade.

Nice post!

Chris Williamson said...

What's so great about digital technology is the 'potential' to last forever. Before digital technology, there was absolutely no way a piece of information could last forever. I'm curious about the possibility of a human mind gaining the ability to last forever...that'd be kinda cool!

conrad said...

Actually, it isn't so outstanding, since it's historically incorrect. Gutenberg didn't invent the printing press, nor moveable type, and not by hundreds of years. The exact origins are unknown, but the general area is (i.e., east Asia).

This is just a weird white-man belief that is completely incorrect (there isn't even any argument). The wiki entry provides a good summary -- the main question is where Gutenberg got the idea from. I've no idea why so many people believe it. It reminds me of Australia, where people think Aborigines only lived in deserts. Another weird thing that only white people believe.

Dave said...

Conrad, thanks for the correction: I've changed "invented the printing press" to "developed the printing press."
Also:
* It's a little churlish to dismiss the article based on a minor factual error in a single sentence, that was effectively an aside;
* your implication that "white people" have a notably large number of misconceptions (presumably in comparison to other people) is not only racist and false, but is absurd;
* your faith in wikipedia is misplaced, especially considering your dislike of misinformation. Incidentally, wikipedia is not "wiki". Wiki is now a generic term, wikipedia is a specific entity.
but again, thanks for the feedback.

conrad said...

Duftus,

do your own experiment to see whether my claim is false or not, and go down the street and ask people where Aborigines lived. Try this with white people. A: The desert. Now go down the street, ask the same of black people. A: Everywhere.

Now go down the street again and ask some white people who invented the printing press: A: (1) Don't know; (2) Gutenberg. Now go find some Japanese/Chinese people. A: We did.

So my claim is neither racist nor false -- it's simply a description of the current situation. These are outright historical revisionist lies that you are propagating, that have been used to oppress non-whites. We wouldn't want to think East Asians are smart and industrious, would we? And we wouldn't want to think that blacks actually lived in all the decent and most habitable areas either. I remember learning this garbage in high school, and you just get sick of it after a while. It's so well ingrained in people now that even smart liberal guys like you evidently believe it.

Odile S said...

Conrad,

It's hard to be confronted with racism, isn't it? I'm white but still confronted with it. It disgusts me.

Dave,

Are you ok? Making a mistake doesn't mean you support racism. Obviously you don't or else you simply could have deleted Conrad's post.

btw, Conrad,

who invented something... Very well possible that today a child invents a printing machine. To find out that it already exists somewhere else.
Gutenberg did invent it. He was not the inventor? O, okay.

Odile S said...

Back to the post,

even computers forget. I can see how former co-students would use this as an argument for a 'the human mind is a machine' theory.
Same property does not mean same though.
I love that you gave me the link to Kafka and Plato, Dave.
Thanks.