Did Netflix Out a Customer? Your Private Details May Be Fodder for a Contest
Suggesting movies is risky business. User picks reveal more than you may think.
Feb. 7, 2010 -- My university, which is in Philadelphia, recently sent an e-mail to faculty and staff reiterating its privacy policy.
Specifically, it said that the Pennsylvania Breach of Personal Information Act requires us to notify a person if we disclose personal information.
Personal Information is defined as "the first name or initial and last name in combination with one or more of the following nonpublic unencrypted pieces of information: a Social Security number, a driver's license number or state identification card, financial account number, credit card or debit card number accompanied by the applicable passwords or security codes."
The Netflix Contest and Lawsuit
This is a laudable policy, but almost immediately after reading this e-mail, I read about a contest that the DVD movie rental company Netflix conducted over the last few years.
The contest was intended to elicit from the general public algorithms that would enable the company to improve its suggestions for future selections. To do this, Netflix released a huge trove of data about users' picks of past movies and their ratings of these movies.
Since it wanted a better way of determining other movies these users might like or dislike, the company announced a $1 million prize. The prize would be awarded to that group of researchers whose predictions about a different trove of movie ratings data involving these same users were most accurate.
The users were anonymous, identified only by number. No names, Social Security numbers, driver's license numbers or financial account figures were released, so the contest complied with Pennsylvania's policy on privacy, undoubtedly a common one across the country.
Netflix also took other measures to anonymize the information, but this did not prevent the company from being sued recently as part of a class action suit by a subscriber for violation of her privacy. An unnamed, in-the-closet lesbian mother has alleged that, by not adequately anonymizing the data set, Netflix outed her and thereby caused her economic and psychological harm.
The explanation: It turns out that people were able to identify specific users by matching their Netflix reviews and ratings with some signed ones the users had posted on the Internet Movie Database.
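To make the idea concrete, here is a minimal sketch of that kind of matching. The user numbers, screen names, titles, and ratings are all invented for illustration, and the overlap-counting rule is an assumption of this sketch, not a description of how anyone actually linked the two data sets.

```python
# Toy illustration of linking "anonymous" ratings to signed public reviews.
# Every ID, name, title, and rating below is made up for this example.
anonymous_ratings = {
    1001: {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)},
    1002: {("Movie A", 3), ("Movie E", 5), ("Movie F", 4)},
    1003: {("Movie B", 2), ("Movie C", 1), ("Movie G", 5)},
}

public_reviews = {
    "jane_doe": {("Movie A", 5), ("Movie C", 4), ("Movie D", 1)},
    "john_roe": {("Movie E", 5), ("Movie F", 4)},
}


def best_match(signed, anonymous, min_overlap=2):
    """Return the anonymous ID sharing the most (title, rating) pairs.

    Returns None when even the best candidate shares fewer than
    min_overlap pairs with the signed reviews.
    """
    scores = {uid: len(signed & rated) for uid, rated in anonymous.items()}
    uid, score = max(scores.items(), key=lambda item: item[1])
    return uid if score >= min_overlap else None


links = {
    name: best_match(reviews, anonymous_ratings)
    for name, reviews in public_reviews.items()
}
print(links)  # {'jane_doe': 1001, 'john_roe': 1002}
```

Even in this tiny example, two or three shared ratings are enough to single out one numbered user, which is why removing names alone offers little protection.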
Nevertheless, the lawsuit maintains that in releasing the large data set Netflix had violated the very strict Video Privacy Protection Act, which was passed when the movie choices of Supreme Court nominee Robert Bork were obtained from a video store.
Another Netflix Contest Even More Revealing?
The $1 million winner was picked in the first contest, and now Netflix is said to be planning a second contest to further improve its prediction algorithms.
In this contest, it will provide users' individual ratings of movies but anonymize the users by "only" providing their birth dates, zip codes, and genders. "Only" is in quotes since this information is even more revealing than that released in the earlier contest.
A look at the numbers hints at why. If, as a first approximation, we assume that people live to age 75, then we have about 27,375 (75 x 365) possible birth dates. Since there are approximately 43,000 5-digit zip codes, and 2 genders, the so-called multiplication principle says that there are about 27,000 x 43,000 x 2 possible sets of birth dates, zip codes, and genders.
This product equals about 2.3 billion, a number far greater than the 300 million population of the U.S. Since the number is so much greater than the population, it's not surprising that many Americans are uniquely defined by their birth date, zip code and gender.
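For readers who want to check the arithmetic, here is a short sketch. It uses only the rough figures quoted above (75 years, 43,000 zip codes, 2 genders, 300 million people); the variable names are mine.

```python
# Back-of-the-envelope count of (birth date, zip code, gender) combinations.
# All inputs are the column's rough approximations, not census figures.
YEARS = 75
DAYS_PER_YEAR = 365
ZIP_CODES = 43_000
GENDERS = 2
US_POPULATION = 300_000_000

birth_dates = YEARS * DAYS_PER_YEAR               # 27,375 possible birth dates
combinations = birth_dates * ZIP_CODES * GENDERS  # multiplication principle
rounded = 27_000 * ZIP_CODES * GENDERS            # rounded figure used in the text

print(f"{combinations:,}")                 # 2,354,250,000
print(f"{rounded:,}")                      # 2,322,000,000
print(round(rounded / US_POPULATION, 1))   # 7.7 combinations per person
```

Either way the total is about 2.3 billion, or nearly eight possible combinations for every American.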
Think of 2.3 billion baskets, each with a different set of these three numbers printed on the side. Further imagine that each of the 300 million Americans is placed in the appropriate basket. Surely many Americans will find themselves alone in their very own basket and thus uniquely identified.
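The basket picture can be turned into a rough estimate. The sketch below assumes, purely for illustration, that each person is equally likely to land in any of the baskets; under that assumption the chance that a given person shares a basket with nobody else is (1 - 1/baskets) raised to the power (people - 1).

```python
import math

# Rough figures from the column; the uniform-placement assumption is
# an idealization made only for this sketch.
baskets = 27_000 * 43_000 * 2   # about 2.3 billion combinations
people = 300_000_000            # approximate U.S. population

# Probability that none of the other (people - 1) individuals lands in
# the same basket as a given person. log1p keeps the tiny 1/baskets
# term from being lost to floating-point rounding.
p_alone = math.exp((people - 1) * math.log1p(-1 / baskets))

# The familiar shortcut exp(-people / baskets) gives nearly the same answer.
shortcut = math.exp(-people / baskets)

print(round(p_alone, 2))   # 0.88
print(round(shortcut, 2))  # 0.88
```

Under this idealized assumption, close to nine in ten people would sit alone in their baskets.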
Of course, this is a great simplification. Birth dates are not evenly distributed throughout the last 75 years. Some zip codes contain a lot of people, others not many. Some contain a disproportionate share of young people, others of old people, and so on. Even if the age rather than the birth date were revealed, many Americans still would be uniquely identified.
Still, one can check empirically using census data or make a priori probability arguments to conclude that a substantial majority of Americans would be uniquely identified by the proposed contest.
Prediction and Privacy
The point is that it's very difficult to release any information about a person or group that won't, to a sufficiently curious and diligent researcher or advertiser, sometimes reveal private aspects of that person. Bits of information are rarely orphans and are becoming increasingly linked in unpredictable ways.
Consider the intricate interconnections of Twitter World or the Hall of Mirrors that is Facebook. Even a couple's unusual pair of names (say Waldo and Gertrude) might be enough for a savvy sleuth to uncover all sorts of information about them.
Recommending Books and Movies Is Risky Business
Of course, the difficulty of preserving our privacy isn't an argument for not taking reasonable precautions to do so. Unlike the first contest, the second one, if Netflix goes through with it in its rumored form, would seem to be an intentional violation of privacy.
In any case, the issue of coming up with better predictions of what customers would like is a general one. Although Amazon has not outsourced its algorithms, for example, it is naturally very interested in suggesting books a particular reader would like based on his or her past choices.
This brings me to an anecdote about a bookstore I visited years ago. I asked the clerk if he knew where Wittgenstein's Tractatus might be located, and he pointed me to the automotive section where, sure enough, there it was. The problem was that Wittgenstein's book is a seminal work in 20th-century philosophy.
Conclusions: Recommending books and movies is a risky business; knowing what books or movies a person likes is often quite revealing; and safeguarding people's privacy will become increasingly difficult, even with the best of intentions.
John Allen Paulos, a professor of mathematics at Temple University in Philadelphia, is the author of the best-sellers "Innumeracy" and "A Mathematician Reads the Newspaper," as well as (just out in paperback) "Irreligion: A Mathematician Explains Why the Arguments for God Just Don't Add Up." His "Who's Counting?" column on ABCNews.com appears the first weekend of every month.