AI Knows a Lot About ‘Anonymous’ Data Feeds—How Do We Find Out?

By Marty Graham, Contributor

The ways to give away personal data grow by the day. Almost every smartphone app collects data ranging from geographic location to what you look for in a restaurant or partner. The Internet of Things, one of the newer data gatherers, adds to the haul: home security systems, voice-activated assistants such as Siri, cars, and pretty much anything else you can switch on from a smartphone are gathering data about your habits and tastes.

A recent study found that visiting the home pages of the 100 most popular websites left 6,200 cookies on a PC, with 83 percent of those coming from third-party hosts. As a result, Google, Facebook, Twitter, and Amazon may know more about us than we tell our friends.

That’s not all bad. Big data is the basis for artificial intelligence (AI), and the more data that goes in, (in theory) the more accurate—and useful—the output. For example, huge data sets from medical records are being used to discover patterns that will change the way medicine is practiced. And all that data can help create better services, care, and offerings.

But individual control over data collection is virtually nonexistent. Individuals who refuse to have their data mined effectively exclude themselves from the major social networks and the most basic internet functions, including Google’s suite of services (such as Gmail and Google Docs) that are widely used by colleagues and friends. They also cut themselves off from smartphone apps that are not encrypted. Whether it’s healthcare data or Google Analytics, privacy is a thorny issue.

The Data-Privacy Paradox

Although data collectors promise that individual anonymity will be protected by de-identification, it’s entirely possible to cobble together enough overlaps in the data to identify those who have theoretically disappeared through anonymization, says Peter Swire. Swire, a law professor at Georgia Tech and one of the early privacy advocates, worked with President Bill Clinton on health records privacy and served as an advisor to President Barack Obama.

“There’s a whole science built around re-identification,” Swire explained. “There’s no requirement that anyone reveal the data they’ve gathered to the people whose data was gathered, and there aren’t agreed-upon standards, just the missive to anonymize data.”

Kamalika Chaudhuri, a leader in privacy research, spends her time creating better ways to anonymize data. Her research is funded by the National Science Foundation and the Defense Advanced Research Projects Agency (DARPA), among others. For Chaudhuri, there is a delicate balance between privacy, accuracy, and the size of the data set.

“If you have a lot of data, you can get both privacy and accuracy. You can draw statistically accurate conclusions without removing privacy,” she says. “But with a small sample, you can’t draw accurate conclusions without taking away the anonymity.”

To paraphrase the Fundamental Law of Information Recovery, overly accurate answers to too many questions will destroy privacy in a spectacular way. The law was formulated by Microsoft’s Cynthia Dwork, whose mission is to find ways to gather data without identifying people, and a colleague. (The paper, written in 2014, has already been cited by scholars 776 times and counting.)

Health systems in particular struggle with the conflict between privacy and accuracy. Working with the University of California, San Diego Medical School, Chaudhuri has been looking for ways to learn from data while protecting patient privacy. She said that it’s easy to get to individual identities. “People’s data is always very unique,” she explained. “With just three identifying elements, you can find the person.”
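
A rough way to see Chaudhuri's point is to count how many records in a "de-identified" table are still one of a kind once a few ordinary fields are combined. The sketch below uses made-up records and a hypothetical choice of three fields (ZIP code, birth year, sex); it is only an illustration, not a method described by the researchers quoted here.

```python
from collections import Counter

# Made-up records: names are gone, but three everyday fields remain.
records = [
    {"zip": "92093", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "92093", "birth_year": 1984, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "92037", "birth_year": 1990, "sex": "F", "diagnosis": "flu"},
    {"zip": "92037", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
    {"zip": "92122", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
]


def unique_fraction(rows, keys):
    """Return the share of rows whose combination of `keys` appears only once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singled_out = sum(
        1 for row in rows if combos[tuple(row[k] for k in keys)] == 1
    )
    return singled_out / len(rows)


# One field alone singles out nobody in this toy table...
one_field = unique_fraction(records, ["sex"])
# ...but three fields together single out three of the five people.
three_fields = unique_fraction(records, ["zip", "birth_year", "sex"])
print(one_field, three_fields)
```

Anyone holding a second list that pairs those same three fields with names (a voter roll, say) could attach a name to each singled-out record, and with it the diagnosis.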

Chaudhuri shares the widely held tenet that whoever gathers and holds the data set is obliged to protect the privacy of the people in it. Releasing the data in a form that disguises the people in the data set, she said, should be the goal of collection.

“It’s not always possible to release the data set so you can both protect privacy and have the data be useful,” she said, citing health records as a perfect example. “But you can build a model based on the data so you can release the model and still preserve privacy.”

Privacy Semantics

For the last decade, governments and non-governmental organizations have been engaged in the conversation about how to better protect people’s privacy. While there’s agreement on big ideas, such as anonymization, we are just now beginning to see consensus around regulation and techniques. Enforceable standards for privacy and control over individual identity will fall into place for the first time this year in Europe.

In May, the European Union’s General Data Protection Regulation (GDPR) went into effect, two years after it was approved. The regulation claims enforceability over every company that has access to the private data of European Union residents, regardless of where the company is located. Individual users gain the right to see the data gathered on them as well as “the right to be forgotten,” the ability to have their data erased.

Of course, determining what qualifies as a privacy violation can come down to splitting hairs. Consider the notion of differential privacy. “Differential privacy is where you would find out the same thing about me if I wasn’t in the data set. That is not a violation of privacy,” Chaudhuri says. “But if I was not in the data set, they would not have been able to draw the conclusion, that is a violation of your privacy.”

For example, a medical study on sexually transmitted diseases using anonymized data may conclude that men between 19 and 26 are more likely to contract them. But if the study resulted in the conclusion that men of a particular ethnicity who live on military bases—particularly pilots, let’s say—were more vulnerable, that would violate the privacy of this subset of men because the data included traits that identified them, and without them, the outcome would have been different.
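
One common way to make that idea concrete is to answer a counting question with a little random noise added, so the published number looks much the same whether or not any single person is in the data. The sketch below is a minimal illustration of that approach (known as the Laplace mechanism), assuming a hypothetical list of ages and a hypothetical privacy setting `epsilon`; it is not the method used in any study mentioned here.

```python
import random


def private_count(rows, predicate, epsilon, rng):
    """Count matching rows, then add Laplace noise with scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so noise
    of this scale hides any one person's presence (epsilon-differential
    privacy for a single counting query).
    """
    true_count = sum(1 for row in rows if predicate(row))
    # The difference of two exponential draws follows a Laplace distribution.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


def is_young(age):
    """Hypothetical study question: is this person between 19 and 26?"""
    return 19 <= age <= 26


# Made-up ages; the last entry is the person whose presence we toggle.
ages = [19, 22, 24, 25, 31, 38, 44, 52, 23]

rng = random.Random(42)
with_me = private_count(ages, is_young, epsilon=0.5, rng=rng)
without_me = private_count(ages[:-1], is_young, epsilon=0.5, rng=rng)
print(f"noisy count with me: {with_me:.1f}, without me: {without_me:.1f}")
```

A smaller `epsilon` means more noise: stronger privacy, rougher answers. On a large data set the same amount of noise is a tiny fraction of the count, which is one way to read Chaudhuri's remark that plenty of data lets you have both privacy and accuracy.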

Private industry is increasingly focused on how to get what it wants from the data it gathers without crossing that fine privacy line. Chaudhuri says one solution is to add noise to the data set, mixing in some inaccurate information along with the accurate, something Google does.

“If Google tells you they’re going to collect a lot of your data, but don’t worry, they won’t use it or share it, you are not necessarily going to trust the company,” she explained. “What Google has done is build a system called RAPPOR for collecting data in a noisy form. People are not going to trust them with the raw data.”
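
The idea of collecting data "in a noisy form" goes back to a survey technique called randomized response, which Google's system builds on. The sketch below shows only that basic idea, with a hypothetical yes/no question and made-up answers; it is a simplification and not Google's actual implementation.

```python
import random


def randomize(truth, p_truth, rng):
    """Report the true answer with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_rate(reports, p_truth):
    """Undo the noise in aggregate.

    Expected share of "yes" reports = p_truth * true_rate + (1 - p_truth) * 0.5,
    so solve that equation for true_rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)
# Made-up population: 30 percent would truthfully answer "yes".
true_answers = [i < 3000 for i in range(10000)]
# Each person's device scrambles the answer before sending it.
reports = [randomize(answer, 0.5, rng) for answer in true_answers]
estimated = estimate_rate(reports, 0.5)
print(f"estimated share of yes answers: {estimated:.3f}")
```

No single report can be taken at face value, since any "yes" may be a coin flip, yet the collector can still recover the overall rate with reasonable accuracy from thousands of reports.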

The Subtle Creepiness of Data Collection

Even with oversight, the average web user is still at risk. Just months before Cambridge Analytica was found to have grabbed 87 million people’s personal data from Facebook, the social media giant turned in an audit of its privacy practices (required by a court consent decree) that deemed the company’s privacy protections sufficient and effective, even as Cambridge Analytica was harvesting that data.

But even before the Facebook leak, consumers were wary of companies collecting their data, threatening to take their business elsewhere after a breach. Today, many consumers fear their data is used to sell them more things or influence their opinions without their consent. In fact, according to the RSA Data Privacy and Security Report, 41 percent of respondents admitted to intentionally falsifying personal information and data when signing up for products and services online.

In some cases, public use of the data is just plain creepy. Recent messages from Netflix and Spotify left some of their customers alarmed. In December 2017, Netflix tweeted: “To the 53 people who’ve watched A Christmas Prince every day for the past 18 days: Who hurt you?” Spotify ran ads based on its user data that were both quirky and creepy, including: “Dear person who played ‘Sorry’ 42 times on Valentine’s Day, what did you do?”

Did those consumers agree to let those companies use their data? Yes. Did they thoroughly understand and consider what could happen? Likely not. Did they agree to have their personal information resold? Even as anonymized data? Privacy experts say it’s not clear if those agreements were even valid.

“It’s arguably not a valid agreement to have 37 pages of privacy and use policy where what you do is hit ‘I agree,’” says Mark Halverson, a co-chairman of the Institute of Electrical and Electronics Engineers committee working on ethical and privacy issues in artificial intelligence.

One federal agent, who prefers not to be identified, said he sits in meetings where people talk about protecting their personal data. When he hears that, he pulls out his smartphone and sets it on the table. His response: “If you carry one of these, you’ve already conceded that point.”