Why Theoretical Computer Scientists Aren’t Worried About Privacy

There has been a lot of news recently about government surveillance of citizens. The two biggest stories pervading my news feeds are the protests in Turkey, which have resulted in particular oppression of social media users, and the recent revelations about the US National Security Agency’s widespread “backdoor” access to industry databases at Google, Verizon, Facebook, and others. The facts appear to be in flux, as some companies have denied their involvement in the program, but regardless of the truth, the eye of the public has landed firmly on questions of privacy.

Barack Obama weighed in on the controversy as well, and was quoted as saying:

You can’t have 100% security and 100% privacy, and also zero inconvenience.

I don’t know what balance the US government hopes to strike, but what I do know is that privacy and convenience are technologically possible, and we need not relinquish security to attain it.

Before I elaborate, let me get my personal beliefs out of the way. I consider the threat of terrorism low compared to the hundreds of other ways I can die. I should know, as I personally have been within an $ \varepsilon$ fraction of my life for all $ \varepsilon > 0$ (when I was seven I was hit by a bus, proclaimed dead, and revived). So I take traffic security much more seriously than terrorism, and the usual statistics will back me up in claiming one would be irrational to do otherwise. On the other hand, I also believe that I only need so much privacy. So I don’t mind making much of my personal information public, and I opt in to every one of Google’s tracking services in the hopes that my user experience can be improved. Indeed it has, as services like Google Now will, e.g., track my favorite bands for me based on my Google Play listening and purchasing habits, and alert me when there are concerts in my area. If only it could go one step further and alert me of trending topics in theoretical computer science! I have much more utility for timely knowledge of these sorts of things than I do for the privacy of my Facebook posts. Of course, ideologically I’m against violating privacy as a matter of policy, but this is a different matter. One can personally loathe a specific genre of music and still recognize its value and one’s right to enjoy it.

But putting my personal beliefs aside, I want to make it clear that there is no technological barrier to maintaining privacy and utility. This may sound shocking, but it rings true to the theoretical computer scientist. Researchers in cryptography have experienced this feeling many times, that their wildest cryptographic dreams are not only possible but feasible! Public-key encryption and digital signatures, secret sharing on a public channel, zero-knowledge verification, and many other protocols have been realized quite soon after being imagined. There are still some engineering barriers to implementing these technologies efficiently in large-scale systems, but with demand and a few years of focused work there is nothing stopping them from being used by the public. I want to use this short post to describe two of the more recent ideas that have pervaded the crypto community and provide references for further reading.

Differential Privacy and Fully Homomorphic Encryption

There are two facts which are well known in theoretical computer science that the general public is not aware of. The first is about the privacy of databases:

There is a way to mine information from a database without the ability to inspect individual entries in the database.

This is known as differential privacy. The second is no less magical:

There are secure encryption schemes which allow one to run programs on encrypted data and produce encrypted results, without the ability to decrypt the data.

This is known as fully homomorphic encryption. 

The implications of these two facts should be obvious: search engines need not know our queries but can still fetch us search results and mine our information to serve ads, Facebook need not have access to our personal data but may still accurately predict new friends, grocery stores can even know what products to place side by side without knowing what any individual customer has purchased. Banks could process our transactions without knowing the amounts involved, or even the parties involved. Perhaps most importantly, governments can have access to databases (in the form of differentially private queries) and mine for the existence of threats without violating any individual user’s privacy. If they get an indication of a terrorist threat, then they can use the usual channels (court orders) to get access to specific individual data.

It’s easy to argue that these techniques will never become mainstream enough for individuals to benefit from them. Indeed, we’ve had cryptography for many years, but few average users actively encrypt their communication, for lack of convenience. And then there are questions of policy: why would any company relinquish the ability to work directly with user data? And the cost of rearchitecting existing services to use these technologies would be enough to dissuade most business leaders.

But the point of all this is that these are problems of policy that could in principle be solved without waiting for governments and corporations to get their act together. With enough demand for such services and with enough technologically-minded entrepreneurs (I’m looking at you, Silicon Valley), it would be just a matter of time before the world was differentially private. Mathematics cannot be revoked or legislated away.

Fully Homomorphic Encryption

A fully homomorphic encryption scheme is a normal encryption scheme (two functions “enc” and “dec” to encrypt and decrypt) with one additional function, which we’ll call “eval.” Very roughly, eval accepts as input the text of a program and a ciphertext, and produces as output a ciphertext such that the following diagram commutes:

[Commutative diagram: $ \textup{dec}(\textup{eval}(f, \textup{enc}(m))) = f(m)$]

That is, $ m$ is our message, and $ \textup{eval}$ runs $ f$ on the encrypted version of our message. In practice this happens by lifting two operations, multiplication and addition, from plaintexts (which are usually number-representations of letters) to ciphertexts (again usually numbers). Once this is done one can simulate the functionality of an arbitrary circuit on the encrypted data without decrypting it. Those readers who have been following our category theory series will recognize these sorts of diagrams as being functorial. [Actually, at the time of this writing we have yet to look at functors, but we will soon!] So perhaps a better term would be “functorial encryption.”
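To make this lifting of addition and multiplication concrete, here is a deliberately insecure toy sketch in Python, written in the spirit of the integer-based scheme mentioned below. Every parameter is illustrative and far too small to be secure, and a real scheme also needs Gentry’s bootstrapping step to evaluate deep circuits; this is only meant to show the shape of the idea. It encrypts single bits so that adding ciphertexts XORs the underlying bits and multiplying them ANDs the bits.

```python
import random

# Toy "somewhat homomorphic" scheme over the integers (symmetric variant).
# Deliberately insecure and only good for very shallow circuits; every
# parameter here is illustrative.

def keygen(bits=64):
    # Secret key: a large random odd integer.
    return random.getrandbits(bits) | (1 << (bits - 1)) | 1

def enc(p, m):
    # Encrypt a single bit m as  c = p*q + 2*r + m,  with small noise r.
    assert m in (0, 1)
    q = random.getrandbits(128)
    r = random.randrange(1, 1 << 8)
    return p * q + 2 * r + m

def dec(p, c):
    # While the noise is much smaller than p,  (c mod p) mod 2  is the bit.
    return (c % p) % 2

# Adding ciphertexts XORs the underlying bits; multiplying them ANDs the
# bits. Noise grows with every operation, which is why real schemes need
# Gentry's bootstrapping trick to evaluate arbitrarily deep circuits.
p = keygen()
a, b = 1, 0
ca, cb = enc(p, a), enc(p, b)
assert dec(p, ca + cb) == a ^ b   # homomorphic addition = XOR
assert dec(p, ca * cb) == a & b   # homomorphic multiplication = AND
```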

I should emphasize: a fully homomorphic encryption scheme has the ability to run any computable function on the encrypted data. There is no loss of functionality in hiding the data from whoever runs the program. The main use of this is to maintain privacy while deferring large computations to the cloud. We do this all the time, e.g. with every search query, and it also applies to big websites like Reddit, which operates entirely on Amazon Web Services.

Fully homomorphic encryption was first envisaged by Rivest, Adleman (two of the inventors of RSA), and Dertouzos in the late seventies, mainly because the RSA encryption scheme is close to being homomorphic: one can multiply ciphertexts, but not add them. In 2009, Craig Gentry published the first working fully homomorphic scheme, based on the mathematical theory of ideal lattices, and later that year he (with a group of other researchers) came up with a second system that is arguably as simple as RSA; it operates on integers with modular arithmetic.
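The multiplicative property of textbook (unpadded) RSA is easy to see for yourself. The sketch below uses a tiny toy key for illustration only; real keys are thousands of bits long, and real RSA deployments add padding that deliberately destroys this property.

```python
# Textbook (unpadded) RSA is multiplicatively homomorphic:
# enc(a) * enc(b) mod n decrypts to a * b mod n. The tiny key below is
# for illustration only.
p, q = 61, 53
n = p * q                             # modulus, 3233
e = 17                                # public exponent
d = pow(e, -1, (p - 1) * (q - 1))     # private exponent (Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 42, 7
product_of_ciphertexts = (enc(a) * enc(b)) % n
assert dec(product_of_ciphertexts) == (a * b) % n  # multiplied without decrypting
# There is no analogous way to *add* two RSA ciphertexts, which is why
# textbook RSA is only partially homomorphic.
```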

Gentry has produced a lot of research on homomorphic encryption since then, but the interested reader should probably start with his tutorial paper describing the arithmetic-based system. From there, there are existing implementations in Python (using Sage) and C++, both of which are freely available on GitHub.

Differential Privacy

The main idea of differential privacy is that one can add noise to statistical queries in a way that protects the identities of the individual records. Slightly more rigorously, a randomized algorithm $ f$ is said to be $ \varepsilon$-differentially private if for all pairs of datasets (inputs) $ D_1, D_2$ which differ in a single record, and all possible sets of outputs $ y$ of $ f$, the probability that $ f(D_1)$ lands in $ y$ is at most $ e^{\varepsilon}$ times the probability that $ f(D_2)$ does (and vice versa, by symmetry). In other words, adding, removing, or changing any one record barely changes the distribution of the output, so the output reveals essentially nothing about any individual record. The choice of $ e$ as the base is arbitrary, but it makes the analysis nicer.
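The canonical example of such an algorithm, not spelled out in this post but standard in the differential privacy literature, is the Laplace mechanism for counting queries: since adding or removing one record changes a count by at most one, adding Laplace noise with scale $ 1/\varepsilon$ to the true count gives an $ \varepsilon$-differentially private answer. A minimal Python sketch, with illustrative names and made-up data:

```python
import random

def laplace_noise(scale):
    # The difference of two independent Exponential(rate = 1/scale) draws
    # is Laplace-distributed with the given scale.
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)

def private_count(records, predicate, epsilon):
    # Counting queries have sensitivity 1: adding or removing one record
    # changes the true count by at most 1, so Laplace(1/epsilon) noise
    # makes the released answer epsilon-differentially private.
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)

# Example (hypothetical data): roughly how many users are over 40?
ages = [23, 45, 31, 67, 52, 38, 29, 41]
print(private_count(ages, lambda age: age > 40, epsilon=0.1))
```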

The motivation for differential privacy came from two notable events in which companies released “anonymized” data that was later partially de-anonymized because it was too specific. The first was the million-dollar Netflix Prize contest to develop a better recommendation algorithm, and the second was the release of the Massachusetts Group Insurance Commission medical database. As a result, many companies are very strict about how they handle user data, and information sharing in the medical community is practically nonexistent.

There are many known differentially private algorithms, and they’re much stronger than one would imagine at first. One can run random forests of decision trees, network trace analysis, query-click analysis, and certain forms of clustering, and one can solve a whole host of combinatorial optimization problems. For a gentle introduction to differential privacy, see Christine Task’s lecture video, A Practical Beginner’s Guide to Differential Privacy. There is also an influential survey by Cynthia Dwork of Microsoft Research. These go into much more detail about the abilities and limitations of differential privacy than I could here.

If there’s one thing to take away from this discussion, it’s that efficient protocols for ensuring privacy are out there waiting to be implemented in software. So while we complain and listen to others complain about governments violating our liberties (and indeed, this discussion is extremely important to have), let’s do a little mathematics, do a little computer science, and figure out how to make privacy the standard of measure in software.

Until next time!

23 thoughts on “Why Theoretical Computer Scientists Aren’t Worried About Privacy”

  1. Thanks for a great introduction to recent advances in cryptography.

    As to the practical impact of these things, I think it’s limited to the relatively narrow community of people who think that they know what they’re doing. I, for one, would never trust good old-fashioned RSA-based encryption with 65536-bit keys, unless I could invest the considerable effort to convince myself that I trust the entire computing stack that I use for encryption (this, IMO, precludes the use of off-the-shelf code to a large extent; Ken Thompson’s “Trusting trust” is a good example of why). The least I’d do is encrypt on a non-networked computer, preferably one that I helped design (but that’s admittedly overly paranoid), and then type the ciphertext by hand into a networked machine; of course this is more inconvenience than I’m willing to accept on a daily basis.

    A layman’s common sense correctly tells him that encryption software does nothing for him (if he runs PGP on Windows, Mac OS X or Linux, there’s probably more than one key logger sending the plain text to Sony, the US government, a few Russian spammers and other interested parties.)

      • I think that Yossi was saying that you may not be able to review the Linux source code and find that keylogger. Ken Thompson explains how to put in a Trojan horse into a C compiler such that you can’t find it by reading the source code in the article referenced (which I’ve linked below). It’s a toy example, but I think it’s at least thought-provoking. So, when Jeremy says that there are “efficient protocols for ensuring privacy… out there waiting to be implemented in software”, I think that Yossi’s point is that implementing these protocols in a way that the resulting binary can be trusted is a non-trivial problem.

        http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf

      • This is an excellent point I hadn’t considered. But if we can’t trust the software, then it seems the problem goes deeper than what theoretical computer scientists can speak to.

      • “Trusting trust” is why you use multiple compilers in multiple languages, and multiple assemblers, and multiple debuggers to examine the code. It ramps up the effort considerably (I think — naively each compiler is tasked with spotting the trojan code in all the other compilers, though of course there might be a way to embed the same trojan code in all of the compilers in the same way).

      • Yup, when the people who design the hardware and software work against you, no encryption can save you. But again, this is nothing that theoretical computer scientists can help with. To us the problem is solved.

  2. Theoretical computer scientists are missing the point. The issue is not whether we have the technology. The issue is that people cannot or will not use it. We’ve had the technology to do end-to-end encrypted email since forever, yet such systems are not widely used today, because for most people the cost of learning about and doing key management is perceived not to be worth the benefits of encryption.

    One can respond to this conclusion in two possible ways. One way is to try to educate the public about cryptography. I imagine that is what Jeremy is trying to do here. Although I admire the sentiment, I have little optimism that this approach will work. It has not worked in the past. I can’t think of a single example where it has ever worked. Any encryption technology that relies on operator awareness ends up being brittle and fragile. A classic example is HTTPS, which has utterly failed to prevent phishing attacks, large-scale spying from the Iranian government, and various other bad things that it was designed to prevent.

    The other response is to design cryptographic protocols to minimize opportunities for human error. Academics do not normally consider this criterion to be important (perhaps because it is hard to model rigorously), but empirical evidence demonstrates that it is the only one that matters. SSH avoids key management entirely by resorting to trust-on-first-use (TOFU). Cryptographers will scream that TOFU is insecure. Well sure, it is insecure, but it’s far better than the secure alternative, because it takes the human operator out of the loop. For 99% of the public, delegating that decision to the computer improves overall security, because it makes the difference between using or not using SSH. Indeed, SSH has completely supplanted the telnet protocol, which is much more than you can say of, for example, HTTPS vs. HTTP. Another great example is iMessage, which is so secure that even the feds have trouble breaking it (http://news.cnet.com/8301-13578_3-57577887-38/apples-imessage-encryption-trips-up-feds-surveillance/). The vast, vast majority of iMessage users are completely unaware that they are using any encryption whatsoever.

    Education and outreach is a great idea in general. But we should not rely on it to drive deployment. Successful deployment of cryptography, security, and privacy technologies requires complete transparency and zero user interaction. Security is unlike any other endeavor in that a 99.99% success rate against attacks is actually a complete failure. Human error is the weakest link, and as explained above, I do not think education efforts can shore it up. The only alternative is to remove it from the equation entirely.

    • When I say this is from the point of theoretical computer scientists I’m using a bit of rhetoric. Of course theoretical computer scientists are worried about privacy. I’m just trying to make it clear that theoretical computer scientists consider privacy as a mathematical problem, and from that perspective it’s largely solved: the right definitions have been found, the important theorems are proved, and efficient algorithms are known. I would be surprised if anyone considered this common knowledge, esp. w.r.t. homomorphic encryption.

      • Even then I still think it’s a little premature to declare the problem solved. Fully homomorphic encryption, while it technically qualifies as efficient in the sense of polynomial time, is not quite efficient enough for routine use. I am confident that things will improve, but it will take some work to get there. There is also some uncertainty about the validity of the underlying security assumptions, as is always the case when new assumptions are introduced. The new assumptions are not as simple as prior crypto assumptions such as RSA, and it will take some work to analyze them.

  3. Good day,

    In my personal opinion, adoption is driven by the maturity, usability, and availability of the technology. At this time, technologies like fully homomorphic encryption (FHE) are not yet easy to use. Encrypted e-mail is much easier to use, as e-mail clients can help you set it up (not all of them, but it’s much closer to usability than FHE).

    When I was designing the Sharemind privacy-preserving database (which achieves similar results to fully homomorphic encryption), my main goal was to make the technology accessible to developers and end users. We have developer tools and real applications, and we are looking for more.

    Check out how to privately predict satellite collisions:
    https://www.youtube.com/watch?feature=player_embedded&v=bAp_aZgX3B0

    We can also privately process genomic data analysis:
    http://bioinformatics.oxfordjournals.org/content/29/7/886

    If you want to play around with a tools demonstration, check out
    http://sharemind.cyber.ee/news-blog/sharemind-sdk-v201305-released

    The SDK currently uses the previous generation of our technology, but it gives a nice idea of what can be done.

    Full disclosure – I am the designer of Sharemind, but I strongly believe that it can improve the state of the art in online privacy.

    Dan Bogdanov

  4. Unfortunately, this is the wrong viewpoint: Privacy work is about “practically securing the implications of communication”, which sets it apart from secure communication, which is “securing the properties of the communication itself” through methods of cryptography, steganography, identity hiding, and other well-known theoretical computer science topics. There is nothing theoretical computer science can do if you don’t secure your communication and/or have no choice in doing so. However, privacy research would tell you to dial random numbers, use multiple cell phones, proxy forward calls, and so on, to obfuscate your digital trace.

    • Very true, but I think there’s something to be said for what can be done in theory, especially since many of those who dictate policy seem to be unaware of the possibilities. And I’m certainly being facetious by taking a mathematical viewpoint.

  5. Hi Jeremy. Thanks for the interesting article, which I found recently in the Naked Capitalism links. I have a question and a suspicion. The question is whether, even if the methods you describe were in use, isn’t the individual still at the mercy of the programs that run on his or her encrypted data? If a program selects a set of data, won’t it then be decrypted to be investigated by humans who presume you’re a suspect because the program said so? I guess it’s better than anyone being able to look you up, but still unsettling.
    My paranoid suspicion is that these methods, however technologically feasible, will never be implemented, because part of the system’s purpose is to intimidate the population.

    • As with other encryption schemes, these make theoretical guarantees that they can’t be broken, predicated on widely believed but not mathematically proven assumptions. So the runner of the program has no way to *feasibly* decrypt the data, but of course if they could test all possible encryption keys they could decrypt it.

  6. Imagine the implications this may have for SaaS computing. Companies with sensitive data could realize the benefits of cloud services without the worry of having their data compromised – at least no more so than they currently do in their own data centers.

    • Cloud services are in fact one of the major motivations for fully homomorphic encryption. As I wrote in an earlier comment, it’s too early to declare victory in any practical sense. Right now fully encrypted SaaS is somewhere around a billion times slower than unencrypted computation, and uses about a million times more space. Since the vast majority of SaaS providers charge by CPU time and/or space usage, you can imagine that an extra factor of a million or a billion renders the service uncompetitive.

      • It does have a polynomial size blowup (iirc it’s O(n^3), so “million” and “billion” are not very useful…), but the typical way this stuff works is that someone first proves the blowup is better than exponential, and sooner rather than later it is improved to become feasible (an n log(n) blowup).

  7. There are encrypted email services in the world, but it looks like (see the Lavabit case) in the USA they have been compromised by the government. So the technology isn’t what matters here…
