Encrypting Solr/Lucene indexes

Encrypting an index

In data-sensitive situations, Lucid has fielded requests to encrypt the index, on the assumption that if the data center is somehow compromised and the raw Solr/Lucene index is copied somewhere, an encryption scheme will keep the data secure. Consider, for instance, data that falls under HIPAA regulation, financial data, or crime-investigation data. This is one of those requests that seems reasonable at a high level, but quickly becomes unmanageable while providing a false sense of security.

This really falls into two phases: transmission and indexing.

Transmission

The first concern is that the indexing process requires that the documents be sent to the data center. If the data is hosted in the cloud, there is often additional concern that someone could eavesdrop on the conversation and acquire sensitive data. This is a problem faced in any data-transmission process, and is best approached by techniques already in place to secure network communications, starting with https and progressing from there. The transmission issue is outside of Solr's purview.

Indexing

The second part of the equation is what actually gets stored on disk by Solr. A brief review of the concerns is in order. A basic Solr index is made up of a series of files; the two most interesting here are the terms and the stored data. Both can be used to reconstruct a document.

Terms are stored in the classic inverted list: for each term there is a list of documents and, optionally, the positions within those documents. It's quite possible, though tedious, to reconstruct a document from the list of terms in the index. The result may not be easily readable by a human, because the terms have been through the entire analysis chain; think of stemming, synonym substitution, and any of the other transformations that chain applies. While the reconstructed text can be hard to read at a glance, it should be considered complete; the sensitive information is available.
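To make the reconstruction concrete, here is a toy sketch (not Lucene's actual file format) of how a document's analyzed token stream can be rebuilt from an inverted index that records term positions. The index contents below are illustrative, with terms already stemmed:

```python
# Toy inverted index: term -> list of (doc_id, position) postings.
# Rebuilding a document is just collecting its postings and sorting
# the terms by position.

def reconstruct(inverted_index, doc_id):
    """Rebuild the analyzed token sequence for one document."""
    positions = []
    for term, postings in inverted_index.items():
        for d, pos in postings:
            if d == doc_id:
                positions.append((pos, term))
    return [term for _, term in sorted(positions)]

# Postings as they might look after stemming ("running" -> "run", etc.)
index = {
    "run":  [(1, 0), (2, 3)],
    "fast": [(1, 1)],
    "race": [(2, 0)],
}
print(reconstruct(index, 1))  # ['run', 'fast']
```

Tedious at real-corpus scale, but entirely mechanical, which is the point: the terms alone leak the document.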

The stored data is much easier to make sense of. Any field in Solr marked as stored="true" is placed verbatim in the *.fdt files. This is a literal copy of the input: no transformations are applied, and all the stored data for a document is kept together. For an index that stores no data, these files will be empty.
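For reference, the stored/indexed distinction is controlled per field in the schema. In this illustrative schema.xml fragment (field names are made up), only the first field's raw text lands in the *.fdt stored-data files:

```xml
<!-- "notes" is searchable AND kept verbatim in the stored-data files;
     "notes_searchonly" is searchable but never stored on disk as-is. -->
<field name="notes"            type="text_general" indexed="true" stored="true"/>
<field name="notes_searchonly" type="text_general" indexed="true" stored="false"/>
```

Not storing sensitive fields removes the easy reconstruction path, though the terms discussed above remain.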

Naturally, given these two ways to reconstruct documents from an index, the question arises: "Can't we encrypt both?" Sure, we can. It's just that doing so leads to some surprising results, often making the resulting index next to useless. Here's why.

A decent encryption algorithm will not produce, say, the same first portion of ciphertext for two tokens that start with the same letters, so wildcard searches won't work. Consider "runs", "running", "runner". A search on "run*" would be expected to match all three, but wouldn't unless the encryption were so trivial as to be useless. Similar issues arise with sorting, and "More Like This" would be unreliable. Many other features of a robust search engine would be impacted; an index with encrypted terms would be useful only for exact matches, which usually results in a poor search experience.

There are more subtle problems as well. Building an encryption/decryption scheme into your Solr installation requires that the individual tokens be encrypted. A static key-based algorithm will produce the same outputs for given inputs. But since what's being indexed are individual words (tokens), a statistical analysis of a reasonably-sized corpus should map fairly directly to a list of words by frequency. The fundamental problem is that encryption works well by convolving a large dataset to increase "entropy". When you break data into teeny bits and encrypt those, you have no entropy. In particular, if you save term counts in documents, Zipf's Law says there will be a simple curve of word frequency, and this makes words easier to guess. That doesn't mean the process is necessarily easy, but it's do-able.
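A minimal sketch of that frequency-analysis attack, using a trivial stand-in for a deterministic token cipher and a made-up corpus: because each plaintext token always maps to the same ciphertext, ciphertext frequencies mirror plaintext frequencies, and Zipf's Law tells the attacker what the most frequent words probably are.

```python
from collections import Counter

def enc(token):
    """Trivial deterministic stand-in for per-token encryption."""
    return "".join(chr(ord(c) + 1) for c in token)

corpus = "the cat sat on the mat the cat ran".split()
cipher_counts = Counter(enc(t) for t in corpus)

# The most frequent ciphertext almost certainly corresponds to the most
# frequent word in the language -- here, "the".
top_cipher, count = cipher_counts.most_common(1)[0]
print(top_cipher == enc("the"), count)  # True 3
```

With a real corpus the attacker would align the observed ciphertext frequency curve against published word-frequency tables; no key recovery is required to start guessing content.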

Short form

The short form is that this is one of those ideas that doesn't stand up to scrutiny. If you're concerned about security at this level, it's probably best to consider other options, from securing your communications channels, to using an encrypting file system, to physically divorcing your system from public networks. And of course you should never, ever let your working Solr installation be directly accessible from the outside world; consider what the following request would do: http://server:port/solr/update?stream.body=<delete><query>*:*</query></delete>

 


3 Comments

  • Lance Norskog

    Right- the fundamental problem is that encryption works well by convolving a large dataset to increase "entropy". When you break data into teeny bits and encrypt those, you have no entropy. In particular, if you save term counts in documents, Zipf's Law says there will be a simple curve of word frequency, and this makes words easier to guess.

  • Erick Erickson

    OK, I dressed up the "encrypt an index" article, is it worth publishing? Expanding? It's not a very deep one, but maybe it'll do.

  • Gordon S

    I get that encryption isn't going to work - but what about cryptographic signatures or applying an HMAC to blocks of data? That wouldn't provide confidentiality, but would provide integrity, and a degree of tamper-proof/evident protection?
