In 2010, the Library of Congress and Twitter announced a historic and incongruous partnership: Together, they would archive and preserve every tweet ever posted, creating a massive store of short-form thoughts. It was odd: a 210-year-old institution partnering with a four-year-old startup, cataloging the internet’s ephemeral #brunchtweets. It was also fascinating: equal parts futuristic and anachronistic. I imagined library scribes copying tweets by hand onto vellum or cranking feeds through a printing press. The news actually frightened some folks: Does this mean my future grandkids will read my live-tweets of Parks and Recreation?
Yet, however dubious the task seemed back then, no one doubted the Library of Congress would get the work done. If Twitter could handle a few million tweets a day, surely the largest library in the world could, too.
But as it turns out, it couldn’t. Six years after the announcement, the Library of Congress still hasn’t launched the heralded tweet archive, and it doesn’t know when it will. No engineers are permanently assigned to the project. So, for now, staff regularly dump unprocessed tweets into a server—the digital equivalent of throwing a bunch of paperclipped manuscripts into a chest and giving it a good shake. There’s certainly no way to search through all that they’ve collected. And, in the meantime, the value of a vast tweet cache has soared. This frustrates researchers, who had hoped to mine the archive for insights about language and society—and who currently have to pay heavy licensing fees to Twitter for its data.
The library has been handed a Gordian knot, an engineering, cyber, and policy challenge that grows bigger and more complicated every day—about 500 million tweets a day more complicated. Will the library finally untie it—or give in and cut the thing off?
“This is a warning as we start dealing with big data—we have to be careful what we sign up for,” said Michael Zimmer, a professor at the University of Wisconsin-Milwaukee who has written on the library’s efforts. “When libraries didn’t have the resources to digitize books, only a company the size of Google was able to put the money and the bodies into it. And that might be where the Library of Congress is stuck.”