Incomplete or inaccurate personal data (corrupted, out-of-date, useless)
Deloitte Analytics conducted a survey to test how accurate the commercial data used for marketing, research, and product management is likely to be. The survey found that:
More than two-thirds of survey respondents stated that the third-party data about them was only 0 to 50 percent correct as a whole. One-third of respondents perceived the information to be 0 to 25 percent correct.
Whether individuals were born in the United States tended to determine whether they were able to locate their data within the data broker’s portal. Of those not born in the United States, 33 percent could not locate their data; conversely, of those born in the United States, only 5 percent had missing information. Further, no respondents born outside the United States and residing in the country for less than three years could locate their data.
The type of data on individuals that was most available was demographic information; the least available was home data. However, even when demographic information was available, it was often inaccurate and incomplete, with 59 percent of respondents judging their demographic data to be only 0 to 50 percent correct. Even seemingly easily available data types (such as date of birth, marital status, and number of adults in the household) varied widely in accuracy.
Nearly 44 percent of respondents said the information about their vehicles was 0 percent correct, while 75 percent said the vehicle data was 0 to 50 percent correct. In contrast to vehicle data, home data was considered more accurate, with only 41 percent of respondents judging their home data to be 0 to 50 percent correct.
Only 42 percent of participants said that their listed online purchase activity was correct. Similarly, less than one-fourth of participants felt that the information on their online and offline spending and the data on their purchase categories were more than 50 percent correct.
While half of the respondents were aware that this type of information about them existed among data providers, the remaining half were surprised or completely unaware of the scale and breadth of the data being gathered.
Report: CDC used bad data to judge DC water safety (incomplete)
In 2000, problems with Washington, DC’s drinking water began when officials switched the disinfectant used to purify it. The switch was supposed to make the water cleaner, but it also increased corrosion of the city’s lead pipes, raising the amount of lead in the water.
City officials knew about the problem but failed to quickly warn residents, according to the report. In January 2004, The Washington Post exposed the issue. In response to a public outcry, DC sought help from the Centers for Disease Control and Prevention (CDC) in evaluating the impact of the high lead levels.
The CDC study was reassuring. It found that the high lead levels were not noticeably harming city residents. But the congressional investigation says the CDC study was based on “fundamentally flawed and incomplete data.”
Predictive policing (biased, incomplete)
Police are increasingly using predictive software. This is a concern because bias in criminal justice prediction models is difficult to identify, partly because police data aren’t collected uniformly and partly because the data police track reflect longstanding institutional biases along income, race, and gender lines.
While police data are often described as representing “crime,” that’s not quite accurate. Crime itself is a largely hidden social phenomenon that happens anywhere a person violates a law. What are called “crime data” usually tabulate specific events that aren’t necessarily lawbreaking—like a 911 call—or that are influenced by existing police priorities, like arrests of people suspected of particular types of crime, or reports of incidents seen when patrolling a particular neighborhood.
Neighborhoods with lots of police calls aren’t necessarily the places where the most crime is happening. They are, rather, the places that receive the most police attention, and where that attention focuses can often be biased by gender and racial factors.
A recent study by the Human Rights Data Analysis Group found that predictive policing vendor PredPol’s purportedly race-neutral algorithm targeted black neighborhoods at roughly twice the rate of white neighborhoods when trained on historical drug crime data from Oakland, California. Similar results were found when analyzing the data by income group, with low-income communities targeted at disproportionately higher rates compared to high-income neighborhoods. This was despite the fact that estimates from public health surveys and population models suggest that illicit drug use in Oakland is roughly equal across racial and income groups. If the algorithm were truly race-neutral, it would spread drug-fighting police attention evenly across the city.
Similar evidence of racial bias was found by ProPublica’s investigative reporters when they looked at COMPAS, an algorithm predicting a person’s risk of committing a crime, used in bail and sentencing decisions in Broward County, Florida, and elsewhere around the country. These systems learn only what they are presented with; if those data are biased, their learning can’t help but be biased too.
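The way biased records perpetuate themselves can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of PredPol or COMPAS: the two neighborhoods, the starting counts, the patrol budget, and the function name simulate_patrol_feedback are all hypothetical. Both neighborhoods have the same underlying offense rate, but one starts with twice as many recorded incidents. Patrols are allocated in proportion to the historical record, and incidents are recorded only where patrols are sent.

```python
def simulate_patrol_feedback(recorded, true_rate, patrols_total, rounds):
    """Allocate patrols in proportion to past recorded incidents.

    recorded:      initial recorded-incident counts per neighborhood (hypothetical)
    true_rate:     expected incidents observed per patrol, per neighborhood
    patrols_total: number of patrols available each round
    rounds:        number of allocation rounds to simulate
    """
    recorded = list(recorded)
    history = []
    for _ in range(rounds):
        total = sum(recorded)
        # Patrol share follows the historical record, not the true rate.
        patrols = [patrols_total * r / total for r in recorded]
        # Each patrol observes incidents at the neighborhood's true rate.
        new_incidents = [p * rate for p, rate in zip(patrols, true_rate)]
        recorded = [r + n for r, n in zip(recorded, new_incidents)]
        history.append(patrols)
    return recorded, history


# Hypothetical inputs: identical true rates, unequal starting records.
recorded, history = simulate_patrol_feedback(
    recorded=[200, 100],
    true_rate=[0.5, 0.5],
    patrols_total=100,
    rounds=10,
)
share_a = recorded[0] / sum(recorded)
print(f"Share of recorded incidents in neighborhood A: {share_a:.3f}")
print(f"Final patrol allocation: {[round(p, 1) for p in history[-1]]}")
```

In this sketch the first neighborhood keeps about two-thirds of both the patrols and the recorded incidents in every round, even though the true rates are identical. The new data never corrects the initial imbalance, because the data are collected only where the earlier data sent the patrols.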
Google Flu Trends (incompatible)
Launched in 2008 in the hopes of using information about people’s online searches to spot disease outbreaks, Google Flu Trends would monitor users’ searches and identify locations where many people were researching various flu symptoms. In those places, the program would alert public health authorities that more people were about to come down with the flu.
But the project failed to account for periodic changes in Google’s own search algorithm. In an early 2012 update, Google modified its search tool to suggest a diagnosis when users searched for terms like “cough” or “fever.” On its own, this change increased the number of searches for flu-related terms. Google Flu Trends interpreted the resulting data as predicting a flu outbreak twice as big as federal public health officials expected, and far larger than what actually happened. This is a clear case of bad data: the search counts were shaped by a factor (Google’s own interface change) other than the one being measured (flu incidence).
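The failure mode can be sketched in a few lines. This is an illustrative toy, not Google’s actual model: the historical numbers, the helper fit_ratio, and the inflation factor of 2.0 are all assumptions chosen for the example. A model is calibrated on the historical relationship between search volume and flu cases, and then the measuring instrument changes so that the same level of illness produces more searches.

```python
def fit_ratio(searches, cases):
    """Least-squares slope through the origin: cases ~ k * searches."""
    numerator = sum(s * c for s, c in zip(searches, cases))
    denominator = sum(s * s for s in searches)
    return numerator / denominator


# Hypothetical calibration data: weekly flu-related searches and confirmed cases.
history_searches = [1000, 2000, 3000, 4000]
history_cases = [100, 200, 300, 400]
k = fit_ratio(history_searches, history_cases)

# Hypothetical week with 250 true cases.
true_cases = 250
searches_before = 2500  # search volume under the old interface

# Assumption: after the interface change, the same illness level
# generates twice as many flu-related searches.
inflation = 2.0
searches_after = searches_before * inflation

estimate_before = k * searches_before
estimate_after = k * searches_after

print(f"Estimate before the change: {estimate_before:.0f} cases")
print(f"Estimate after the change:  {estimate_after:.0f} cases")
```

The model itself is unchanged and fit its historical data perfectly, yet after the interface change its estimate is twice the true number of cases. One common guard against this kind of drift is to recalibrate periodically against an independent measurement, such as official surveillance counts.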