
Implications of Privacy on Data Cleansing

April 27, 2009

Earlier this month, the LA Times reported on online maps made available by the LAPD that showed crime locations incorrectly:

The distortion — which the LAPD was not aware of until alerted by The Times — illustrates pitfalls in the growing number of products that depend on a computer process known as geocoding to convert written addresses into points on electronic maps.
In this instance, [the map] is offered to the public as a way to track crimes near specific addresses in the city of Los Angeles. Most of the time that process worked fine. But when it failed, crimes were often shown miles from where they actually occurred.
Unable to parse the intersection of Paloma Street and Adams Boulevard, for instance, the computer used a default point for Los Angeles, roughly 1st and Spring streets.

Highest crime rate in L.A.? No, just an LAPD map glitch – Los Angeles Times

Data collection is often a process prone to errors and omissions. Some values may be missing because they were never fed into the system in the first place. Some kinds of data, such as addresses, may appear in different forms because of spelling errors. Reports written by hand, like the ones at the source of crime reports, may be hard to decipher. On top of that, further processing, like the geocoding mentioned above, may introduce additional errors.
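The default-point failure mode from the story above is easy to reproduce. A minimal sketch, where the lookup table, addresses, and coordinates are all made up for illustration (this is not the LAPD's actual geocoder):

```python
# Stand-in for a citywide default point (illustrative coordinates).
CITY_DEFAULT = (34.05, -118.25)

# Toy geocoding table; real geocoders parse addresses, of course.
KNOWN_ADDRESSES = {
    "100 Main St": (34.02, -118.28),
    "200 Spring St": (34.05, -118.24),
}

def geocode(address):
    """Return coordinates, falling back to the city default on failure."""
    return KNOWN_ADDRESSES.get(address, CITY_DEFAULT)

reports = ["100 Main St", "Paloma St & Adams Blvd", "200 Spring St"]
points = [geocode(a) for a in reports]
# The unparsed intersection lands on the default point, potentially
# miles from the actual scene -- the distortion the LA Times found.
```

Every address the parser cannot resolve piles up on the same default point, which then looks like a crime hot spot.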

Data analysts usually take these kinds of errors into account. Quite often, the first step in analysis is data cleansing, which aims to find such errors and correct them (or at least remove them). When the data is available in raw form, data cleansing is easier: data points can be evaluated individually for errors. LA Times reporters could identify errors because they looked at specific addresses and could see that they made no sense.
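With raw records in hand, a consumer can run such a cleansing pass directly. A hedged sketch, assuming the bogus records share a telltale default coordinate (all addresses and coordinates below are made up):

```python
from collections import Counter

# Raw records: (address, geocoded point). Illustrative data only.
records = [
    ("100 Main St", (34.02, -118.28)),
    ("Paloma St & Adams Blvd", (34.05, -118.25)),
    ("Unparsed intersection", (34.05, -118.25)),
    ("200 Spring St", (34.05, -118.24)),
]

# A single point absorbing many distinct addresses is a telltale sign
# of a default-point fallback rather than a genuine hot spot.
point_counts = Counter(point for _, point in records)
suspicious = {pt for pt, n in point_counts.items() if n >= 2}

cleansed = [(addr, pt) for addr, pt in records if pt not in suspicious]
# `cleansed` keeps only the records with distinct, plausible locations.
```

The key point is that this check needs the individual rows: it compares each record against the others, something no aggregate view would allow.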

It is not always practical to rely on the data provider to take care of data cleansing. In this case, for example, the LAPD will surely take care of the errors found by the LA Times, but

A spokesman for the LAPD said the department will add a disclaimer to its site once it’s cleared by the city attorney.

It seems like the responsibility to check the data falls on the data consumer. In the case described above, anyone who consumes the data can judge its quality and apply cleansing, since the data is available in raw form.

Now let’s imagine a similar case, but with a little twist: what happens when the data is shared in a restricted way, due to privacy concerns? Many research papers on privacy study how data can be used to learn patterns or aggregates while keeping individual records masked. Various approaches have been proposed: randomizing the input before processing, adding noise to the outcome of the processing, blurring data by generalizing values so they are not too specific, and so on. All these approaches have at least one thing in common: there is no longer access to the raw data. In turn, this may impair the data consumer’s ability to cleanse the data.
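For instance, an "adding noise to the outcome" scheme might release only perturbed counts. A rough sketch of output perturbation (the data, the noise scale, and the mechanism are illustrative, not a calibrated differential-privacy implementation):

```python
import random

# Sensitive raw records: (block, incident) pairs. Illustrative data.
raw = [("Block A", 1)] * 40 + [("Block B", 1)] * 3

def noisy_count(records, block, scale=2.0):
    """Release a count perturbed with Laplace-style noise.

    The raw records stay with the data holder; the consumer sees only
    this noisy aggregate and can no longer inspect individual rows.
    """
    true_count = sum(1 for b, _ in records if b == block)
    # The difference of two exponentials is Laplace-distributed noise.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

release = {b: noisy_count(raw, b) for b in ("Block A", "Block B")}
# `release` is all a consumer ever gets -- no raw rows left to cleanse.
```

Whatever default-point artifacts or transcription errors are buried in `raw` pass straight through into the released aggregates, invisible to the consumer.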

Privacy research is still in the process of figuring out how various computations can take place on top of sensitive data. However, the output of each process is only as good as its input. Without the ability to identify bogus records, any analysis could lead to false conclusions. The problem here is a little different from that of privacy-preserving aggregate calculation: it is unclear whether it will be possible to tell the difference between a unique but correct record (a legitimate outlier) and a bogus record without scrutinizing the fine details.
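To see the difficulty, consider a "blurring" release that generalizes values into coarse buckets. A toy sketch with made-up numbers: one dataset contains a genuine rare event, the other a bogus record, and the blurred outputs are identical.

```python
def blur(values, bucket=100):
    """Generalize each value down to a coarse bucket -- a simple
    generalization-style privacy measure; parameters are illustrative."""
    return sorted(v // bucket * bucket for v in values)

with_outlier = [10, 11, 9, 10, 250]  # 250 is a real but rare event
with_error = [10, 11, 9, 10, 230]    # 230 is a data-entry artifact

# After generalization, the legitimate outlier and the bogus record
# fall into the same bucket; a consumer of the blurred data cannot
# tell the two datasets apart.
assert blur(with_outlier) == blur(with_error)
```

The fine detail that would separate the two cases is exactly what the privacy mechanism strips away.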