Culturomics: Mining 'Genome' of Words for Cultural Trends
Harvard and Google discover cultural trends with database of digitized books.
Dec. 17, 2010 -- Ever wonder how many new words enter the English language each year? Or what fame is like for celebrities today versus during the 1800s? Or which words are most deeply ingrained in our collective subconscious?
The billions of words in all the books ever published can provide an interesting window into how the winds of culture change direction. But plowing through centuries of primers, paperbacks and other publications is hard work.
To ease the burden, a team of Harvard researchers worked with Google to devise a way to quantitatively analyze human culture.
In a paper published Thursday in the journal Science, the researchers unveiled their insights on "culturomics," along with some fascinating findings about what we've stored in our collective memory.
"People have been wondering about human culture and human society and about what makes the fabric of our society way before the sciences were there," said Jean-Baptiste Michel, a key researcher on the study and a postdoctoral fellow in Harvard's department of psychology. "But interestingly enough, I don't think there was ever a quantitative way to get at those questions. We felt that it was very interesting that the tool that would enable one to ask questions about human culture quantitatively and in a reproducible manner [would] deliver interesting insights on culture."
Realizing that Google was already handling the data-acquisition side of such a quantitative tool by digitizing books, Michel and his Harvard colleague Erez Lieberman Aiden approached the search giant for help.
Soon enough, the academics and the engineers joined forces to create and analyze a dataset of words that would resemble a "fossil record" of human culture.
The dataset, which can be downloaded and searched with a Google tool launched Thursday, is the largest data release in the history of the humanities, the researchers said. If the words were arranged in a straight line, they would extend to the moon and back 10 times.
The database includes more than 500 billion words, based on the full text of about 5.2 million books (which is about 4 percent of all books ever published).
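For a rough sense of scale, the figures above can be combined with simple arithmetic. The short Python sketch below uses only the numbers quoted in this article; the function names are illustrative. Because the word count is given as "more than" 500 billion and the other figures are approximate, the results are ballpark estimates: about 130 million books ever published, and an average of roughly 96,000 words per scanned book.

```python
# Figures as quoted in the article (all approximate).
TOTAL_WORDS = 500e9         # "more than 500 billion words"
BOOKS_SCANNED = 5.2e6       # "about 5.2 million books"
SHARE_OF_ALL_BOOKS = 0.04   # "about 4 percent of all books ever published"


def implied_total_books(scanned, share):
    """Estimate how many books have ever been published, given the share scanned."""
    return scanned / share


def average_words_per_book(words, books):
    """Average number of words per scanned book."""
    return words / books


if __name__ == "__main__":
    total = implied_total_books(BOOKS_SCANNED, SHARE_OF_ALL_BOOKS)
    avg = average_words_per_book(TOTAL_WORDS, BOOKS_SCANNED)
    print(f"Implied books ever published: {total:,.0f}")
    print(f"Average words per scanned book: {avg:,.0f}")
```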
Aiden said the earliest books date back to the early 1500s, but the vast majority of the data are from the last 200 years.
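Because the amount of text differs so much from one period to another, a raw count of how often a word appears can mislead. One common way to compare years is to divide a word's count by the total number of words recorded for that year. The Python sketch below illustrates that idea only; it is not the researchers' code. The `(word, year, count)` row layout is an assumption made for illustration, not the format of the released dataset, and the sample counts are invented.

```python
from collections import defaultdict


def relative_frequency(rows, word):
    """Return {year: share of that year's words that are `word`}.

    `rows` is an iterable of (word, year, count) tuples. This layout is a
    hypothetical one chosen for the example, not the real dataset's format.
    """
    totals = defaultdict(int)  # all words counted in each year
    hits = defaultdict(int)    # occurrences of the target word in each year
    for w, year, count in rows:
        totals[year] += count
        if w == word:
            hits[year] += count
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented toy counts: the raw count of "telegraph" rises between the two
# years, but its share of all words falls because far more text exists
# for the later year.
sample_rows = [
    ("telegraph", 1850, 30), ("other", 1850, 970),
    ("telegraph", 2000, 50), ("other", 2000, 99_950),
]

freq = relative_frequency(sample_rows, "telegraph")
for year, share in freq.items():
    print(year, share)
```

In the toy data, the word appears more often in absolute terms in the later year yet makes up a much smaller share of that year's text, which is why a normalized measure is the more useful one for tracking trends.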