Cyrus Stoller

When does data become identifiable?

Issues regarding data privacy have become increasingly important as tech companies continue to develop new ways of monetizing the information they have gathered about their users. In response, many new privacy regulations are being instituted around the world, from CCPA in California to GDPR in Europe. At the heart of these laws is what constitutes personally identifiable information.

tl;dr: The legal definition of personal information is vague and difficult to apply.

CCPA defines “personal information” as “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” GDPR, in turn, defines “personal data” as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

In some cases, determining identifiability is obvious. For example, if a user provides their social security number, it’s clearly personal information, because each number is associated with a specific individual. On the other hand, if a user fills out a multiple choice survey that is submitted anonymously without any attached metadata, then it’s reasonably safe to assume that it’s not personally identifiable (presuming there are a sufficient number of survey-takers). But most cases fall somewhere in between these two extremes.

In this post, I’ll highlight a distinction that seems to be missing between these two extremes. The current definitions are not specific enough about what counts as personal information. This means that some types of data may not receive necessary protections (leaving that data vulnerable) and others may receive unnecessary protection (potentially impeding beneficial innovation). All of this is to say that identifiability isn’t black and white. We need to shift our mindset to evaluate identifiability on a spectrum.

Note: The harms of data misuse or lack of protection are well documented. This post is focused on exploring how to answer the question of whether given data is identifiable.

Inferred data

Data can be used to infer information that you’d reasonably expect to be kept private, including sensitive data (e.g., health status and credit-worthiness) and information about race, sex, religion, etc. (which are protected classes in the American legal system). This is important because sensitive data, “suspect class” data, or other seemingly benign data can be used in discriminatory ways. Here are some examples.

In other words, data can implicitly be used to identify characteristics about a user that they may not have intended or wanted to share. In isolation a single data point may not be useful in identifying someone, but when taken together, a constellation of data points can create a recognizable pattern. This should not be surprising anymore as we have all become accustomed to ad targeting being eerily in sync with our interests. Read more here.
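To make the "constellation of data points" idea concrete, here is a minimal sketch in Python. The records and the `unique_fraction` helper are hypothetical and invented purely for illustration; the point is only that attributes which single out nobody on their own can single out most people once they are combined.

```python
from collections import Counter

# Hypothetical records, made up for illustration: (zip_code, birth_year, gender)
RECORDS = [
    ("94110", 1985, "F"),
    ("94110", 1985, "M"),
    ("94110", 1990, "F"),
    ("94114", 1985, "F"),
    ("94114", 1990, "M"),
    ("94114", 1990, "M"),
]


def unique_fraction(records, columns):
    """Return the fraction of records that no other record matches on `columns`."""
    keys = [tuple(record[c] for c in columns) for record in records]
    counts = Counter(keys)
    singled_out = sum(1 for key in keys if counts[key] == 1)
    return singled_out / len(records)


if __name__ == "__main__":
    # Each step adds one more attribute to the combination being checked.
    for columns in ([0], [0, 1], [0, 1, 2]):
        print(columns, round(unique_fraction(RECORDS, columns), 2))
```

In this toy data set, zip code alone singles out no one, zip code plus birth year singles out a third of the records, and all three attributes together single out two thirds.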

Example of the challenge of identifiability: voice data

With the rise of personal assistants like Siri, Alexa, Google Assistant, and Portal, we are handing over more voice data than ever before. Should a short voice recording (that is unlabeled and unmorphed) be considered personally identifiable? If this kind of short audio recording were shared with the broader organization collecting the data (e.g., Apple, Amazon, Google, or Facebook) or obtained by a hacker, it would be hard to match it to a specific person. However, if I heard a short recording of my own voice, I would likely be able to identify it as my own.

On the other hand, a longer audio clip where users provide context clues that can be used to identify them would be a different story. It seems clear that the broader organization collecting the data (e.g., Apple, Amazon, Google, or Facebook), or a hacker, would be able to identify the user who created the recording with minimal effort.

It seems inappropriate to take a one-size-fits-all approach to these two voice recordings. Both voice files can be personally identified, given the right contextual information and technological capabilities. The challenge is that big tech companies, non-tech companies, and other potential adversaries have different capabilities to identify people based on their voices. Because of this, regulations should strive to incorporate nuance between these different types of identifiability.

Proposed distinction: P vs NP-style data

Describing these different types of information reminded me of the distinction between what computer scientists refer to as P and NP problems.

In layman’s terms, the solution to a P problem can be found in polynomial time. In other words, given a set of inputs, I can use an efficient algorithm to find a solution. In the case of personally identifiable data, that would mean that, if I am given a data sample, I can run a reasonably straightforward model to determine who the data belongs to. In its simplest form, this would mean looking at the unique identifier associated with it. In a slightly more complicated example, this may mean computing a faceprint from an image and then comparing that to a database of faceprints to identify the person in the photo. There may be some margin for error, but this process should never devolve into a brute-force exercise.
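As a rough illustration of what "P-style" identification could look like, here is a minimal Python sketch of the two cases described above. Everything in it is an assumption made for illustration: the `user_id` field, the directory, the tiny two-dimensional "faceprints", and the 0.9 similarity threshold are hypothetical, and a real faceprint system would be far more involved.

```python
import math


def identify_by_id(record, directory):
    """Simplest case: the record carries a unique identifier we can look up."""
    return directory.get(record.get("user_id"))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_by_faceprint(probe, enrolled, threshold=0.9):
    """Compare a probe faceprint to every enrolled faceprint in one linear pass.

    Returns the best-matching name at or above `threshold`, or None.
    """
    best_name, best_score = None, threshold
    for name, faceprint in enrolled.items():
        score = cosine_similarity(probe, faceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    directory = {"u1": "Alice"}  # hypothetical identifier-to-person table
    enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}  # hypothetical faceprints
    print(identify_by_id({"user_id": "u1"}, directory))
    print(identify_by_faceprint([0.9, 0.1], enrolled))
    print(identify_by_faceprint([0.7, 0.7], enrolled))
```

Both functions do a bounded amount of work per sample: a single lookup, or one pass over the enrolled faceprints. The margin for error shows up as the threshold, and a probe that matches nobody well enough simply returns no result.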

On the other hand, with an NP problem, I can identify whether a proposed solution is valid in polynomial time. The key distinction here is that for an NP problem, I cannot necessarily find a solution in polynomial time, but I can verify that a solution is valid in polynomial time. For example, I can easily tell when a jigsaw puzzle has been solved correctly, but I cannot easily solve a jigsaw puzzle by simply looking at the pieces. Alternatively, presume that a model can detect that the same person is speaking in two recordings. Then, given two recordings, you can quickly determine whether they match. But if you only have one recording, it’s much harder to determine who is speaking.
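The asymmetry between verifying a match and finding one can be sketched in Python as follows. This is a loose analogy, not a statement about computational complexity: the `same_speaker` comparison, the two-number "voice embeddings", and the tolerance are all hypothetical stand-ins for whatever model an organization might actually have.

```python
import math


def same_speaker(sample_a, sample_b, tolerance=0.1):
    """Hypothetical verifier: two embeddings count as one speaker if they are close."""
    return math.dist(sample_a, sample_b) <= tolerance


def verify_claim(recording, claimed_name, references):
    """Cheap check: does the recording match the reference for one named person?"""
    reference = references.get(claimed_name)
    return reference is not None and same_speaker(recording, reference)


def identify(recording, references):
    """Search: can only return people for whom a reference recording exists."""
    return [name for name, ref in references.items() if same_speaker(recording, ref)]


if __name__ == "__main__":
    recording = (0.21, 0.79)  # hypothetical embedding of an unlabeled clip
    references = {"alice": (0.2, 0.8)}  # contextual data one organization holds

    print(verify_claim(recording, "alice", references))
    print(identify(recording, references))
    print(identify(recording, {}))  # same clip, but no contextual data
```

The same recording is trivially attributable for an organization that holds a matching reference and effectively anonymous for one that holds none, which is the sense in which this kind of data depends on who has the context.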

While both types of information may be personally identifiable, I would argue that in my analogy, personally identifiable information in the “P-style problem” space should be granted a heightened level of privacy protection. In other words, whether under CCPA or GDPR, consumers should be able to expect that this type of personally identifiable information will be deleted upon request. On the other hand, if data is “NP-style personally identifiable,” I think that companies should be granted more slack if they lack the necessary contextual information and technological capability to make the match.

When thinking about the risk profile for these different types of information, it seems clear that “P-style personally identifiable information” can be easily linked to a specific person even with limited sophistication. This is particularly worrisome since this type of data may be divulged to the open internet as part of a data breach. An individual with a vendetta could then weaponize this data relatively anonymously, meaning that those who suffered harm would be less likely to find a remedy through the justice system.

On the other hand, for “NP-style personally identifiable information” to be used, an organization would need to be able to contextualize that data. For a given data set, as a practical matter, only a limited number of companies and state actors have these capabilities. Often, these organizations will be large companies that have a large corpus of data, but there may also be small companies that have the necessary data for contextualization (e.g., a data set about a company’s employees).

The current critiques of big tech companies regarding data protection are warranted, but identifiability isn’t only contingent on having large data sets. As discussed earlier, an organization only needs to have the right contextual information to identify the data.


Currently, personally identifiable information is generally treated as a single classification. How to treat P-style data within this framework is clear; the question is how to approach NP-style data. With greater nuance, regulations can be more impactful and enforceable, giving individuals more control of their data.

In the absence of clear guidance, a likely default is for companies to adopt the least restrictive approach, since they have a clear incentive to minimize risk by labeling as little data as possible as “personally identifiable” and thereby avoiding regulatory requirements.

I’d love to hear your thoughts on whether this distinction would be helpful in crafting your internal data practices.

Category: Reflection