You want to buy a rug or a calendar online. Or maybe you want to subscribe to some Internet publication. You fill in your address and credit card number, but the web site, hungry for information about its visitors, goes on to ask you your age, your income, your hobbies, or any of a number of other questions that are of questionable relevance to its business.
What many, if not most, people do in such situations is lie. You claim to be 97, to make $4 million a year, and to have a keen interest in ice fishing in Honolulu.
The companies making these requests sometimes have their reasons. They want to target their ads, improve their services, and respond to customers' desires, and for all of this it helps to have a rough idea of who is buying their products.
Recently Rakesh Agrawal and Ramakrishnan Srikant, two IBM researchers in California, have developed a simple program that might make customer duplicity less appealing. Based on the realization that companies often want data that is roughly accurate in the aggregate but not necessarily personally revealing, their program partially reconciles companies' desire for information with individuals' desire for privacy.
Here's how it handles a nosy question: You answer it honestly, the program generates a random number that is either added to or subtracted from your answer, and only the resulting masked number is sent to the company. Using straightforward statistical techniques, the company can still recover approximate averages and correlations from the numbers submitted, and this is often sufficient for its purposes.
For example, someone indicates that she is 46 years old, and the program adds to or subtracts from this age some random number between zero and 20, so that the number the company receives might be 60 or 33 or any other number between 26 and 66. Likewise with income figures. Someone says he makes $120,000 a year, and some number between zero and $50,000 is added to or subtracted from this figure before it is submitted to the company.
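The masking step can be sketched in a few lines of Python. This is only an illustration, not the researchers' program: the function name `perturb`, the choice of a uniformly distributed offset, and the simulated customers are all assumptions made for the example.

```python
import random
import statistics

def perturb(true_value, spread, rng):
    """Mask an answer by adding a random offset between -spread and +spread.

    Assumption: the offset is drawn uniformly; the column does not say
    which distribution the actual program uses.
    """
    return true_value + rng.uniform(-spread, spread)

rng = random.Random(0)  # fixed seed so the run is repeatable

# A 46-year-old's submitted age lands somewhere between 26 and 66.
masked_age = perturb(46, 20, rng)

# Simulated customers (made-up data): honest ages, then masked ages.
true_ages = [rng.randint(18, 80) for _ in range(10_000)]
masked_ages = [perturb(age, 20, rng) for age in true_ages]

# The offsets average out to about zero, so the two means end up close.
true_mean = statistics.mean(true_ages)
masked_mean = statistics.mean(masked_ages)
print(f"true mean {true_mean:.2f}, masked mean {masked_mean:.2f}")
```

No single masked age says much about the person who submitted it, yet the average over many customers stays close to the honest average.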
There is a trade-off. The numbers are more valuable and accurate if the random fudge factor added to or subtracted from the correct answer is small. The smaller the fudge factor, however, the less protection afforded the individual. Another problem is that the customer has to trust the company not to record the individual random number used to mask the exact answer.
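A rough way to put numbers on this trade-off, assuming the fudge factor is drawn uniformly between minus and plus some maximum `spread` (an assumption for illustration; the helper name `standard_error_of_mean` is hypothetical too): the extra uncertainty the masking adds to an average grows in proportion to the spread and shrinks with the square root of the number of customers.

```python
import math

def standard_error_of_mean(spread, n_customers):
    """Extra uncertainty in an average of n masked answers.

    Assumes each offset is uniform on [-spread, +spread], whose
    standard deviation is spread / sqrt(3).
    """
    noise_std = spread / math.sqrt(3)
    return noise_std / math.sqrt(n_customers)

for spread in (5, 20, 50):
    err = standard_error_of_mean(spread, 10_000)
    print(f"offsets up to +/-{spread}: average of 10,000 answers "
          f"is blurred by roughly {err:.3f}")
```

A wider spread hides each individual better, but the company needs more customers to pin down the average to the same precision.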
Variations are possible in the distribution of the numbers added or subtracted. The distribution can be adjusted to accommodate customers' residual lying, in somewhat the same way that people, knowing that others generally want to appear younger and richer than they are, interpret statements about age and income.
The mathematics comes in when the company tries to reconstruct the true averages and correlations from all the more or less false individual numbers. The program uses Bayes' theorem, a powerful result in probability (one that was implicit in January's column on terrorists and privacy), to help in this reconstruction.
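For the curious, here is a simplified sketch of how Bayes' theorem can be used for this kind of reconstruction. It is not the researchers' code: the whole-number offsets, the assumed age range of 0 to 120, the made-up customer data, and the names `noise_prob` and `reconstruct` are all choices made for illustration. The idea is to start with a flat guess about how ages are distributed, ask for each masked number how likely each true age is given that guess, and use the answers to form a better guess, repeating a number of times.

```python
import random
from collections import Counter

SPREAD = 20                  # offsets are whole numbers from -20 to +20 (assumed)
CANDIDATES = range(0, 121)   # ages the true answers might take (assumed range)

def noise_prob(offset):
    """Probability that a particular offset was picked (uniform, assumed)."""
    return 1 / (2 * SPREAD + 1) if abs(offset) <= SPREAD else 0.0

def reconstruct(masked_values, iterations=30):
    """Estimate the distribution of true ages by repeated Bayes updates."""
    counts = Counter(masked_values)
    n = len(masked_values)
    estimate = {a: 1 / len(CANDIDATES) for a in CANDIDATES}  # flat first guess
    for _ in range(iterations):
        updated = dict.fromkeys(CANDIDATES, 0.0)
        for w, count in counts.items():
            # Bayes' theorem: P(true age = a | masked value = w) is
            # proportional to P(offset = w - a) * P(true age = a).
            weights = {a: noise_prob(w - a) * estimate[a] for a in CANDIDATES}
            total = sum(weights.values())
            if total == 0:
                continue  # masked value impossible under the assumed range
            for a, weight in weights.items():
                updated[a] += (count / n) * (weight / total)
        estimate = updated
    return estimate

rng = random.Random(1)
# Made-up customers whose ages cluster around 30.
true_ages = [round(rng.triangular(18, 80, 30)) for _ in range(5_000)]
masked_ages = [age + rng.randint(-SPREAD, SPREAD) for age in true_ages]

estimate = reconstruct(masked_ages)
estimated_mean = sum(a * p for a, p in estimate.items())
true_mean = sum(true_ages) / len(true_ages)
print(f"true mean {true_mean:.1f}, reconstructed mean {estimated_mean:.1f}")
```

The result is an estimate of how ages are spread across all customers, not a guess at any one customer's age.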
Have You X-ed? Another Example