Can Hashicorp Vault be run as a local process instead of a central server?

Published on 2020-05-20
I've been learning about Hashicorp Vault lately. I left my job at AWS a few weeks ago, so I'm now wading through the ocean of open source tools that have popped up since 2012 when I joined Amazon. I've been focusing on key management and client-side encryption for the last couple of years at work, so this is a fun investigation for me.
I can see why Vault is popular. Compared to most secret management software, it's a joy to use.1 The documentation is clean, the APIs are friendly, and it supports a long list of integrations. It feels well made.
But, there's one area of Vault that makes me nervous, and that is the model of a centralized key management instance. Does this really, really have to be a centralized server?
HSMs are fickle beasts (even software ones), and encryption doesn't leave a lot of room for error. If you make a little oopsie with the data it's gone, so you have to be very, very careful. You need to develop an entire culture of carefulness. Mistakes are going to happen, and that's when you get to test your preparation.2
Vault is a single server application
Vault runs on a single server.3 That's it.
This isn't a bad thing! Quite the opposite in many cases. Simplicity is good, and it neatly solves the nasty issues that pop up with data consistency. You usually don't need to grab secrets from your backend that often, so a single server is probably more than enough. You configure a location to store encrypted secrets (file, database, S3) and Vault gates access.
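To make the shape of that concrete, here's a minimal single-server config sketch. The paths and port are examples, not recommendations; the point is that one stanza picks the storage backend and the rest of the access model stays the same.

```hcl
# Minimal single-server Vault config (sketch; paths and port are examples).
# "file" keeps encrypted secrets on local disk; swapping this stanza for
# "s3" or a database backend changes storage without changing how access works.
storage "file" {
  path = "/opt/vault/data"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true   # fine for a local experiment, never for production
}
```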
But past a certain scale, centrality starts to cause problems. Now, I haven't administered Vault before. But I've spent time building and administering fault tolerant distributed secret management infrastructure, and we bent over backwards4 to avoid single points of failure.
Secret management is easy to get wrong. There are a few reasons I'm not in love with putting all your secrets on the same host. It's tempting because it gives you a single host to try to do a great job hardening. But you're putting a lot of risk on your defenses, the people administering them, and the fine engineers at Hashicorp. If any of that goes wrong you've managed to leak secrets not just for one service but for all services.
You're also obviously shouldering significant availability risk. Maintenance is going to be touchy.
So, set up a few Vaults. One for the really secret stuff, one for operational stuff. Lower the blast radius. The instances you do run, run them in High Availability (HA) mode. All reasonable practices, but you're still pinning your hopes on a single server. If your read traffic outscales the largest host you have access to, you're done. You can no longer use Vault without upgrading to the Enterprise plan to get hot read replicas for true horizontal scaling on the read path.
But does it have to be a single instance application?
Vault acts as sort of a window into encrypted data. It holds keys, a storage backend holds data, and between the two a third party can be granted access to secrets.
Let's imagine that my web host needs the database password, and my central Vault instance is hosed. Can I spin up a local Vault instance and get secrets out of the backend?
Yep, I can. Should I? Let's dig in.
I spent the weekend playing with this for fun, and that's the topic of this post. I'm going to walk through what I tested, and discuss whether you can talk Vault into running as a local process per-host instead of a centralized one.
Warning! I'll jump right to the punchline; resistance to memory profiling is just not part of the Vault security model. This is not supported by Hashicorp, and it's a bad idea. But aren't bad ideas what weekends are for?
A normal centralized Vault deployment
Let's review a normal Vault deployment -- the kind of thing you'd get if you followed the Hashicorp documentation.
Okay, this is a normal pattern in secret management. Vault doesn't persistently store data5; it controls access to data by gating the ability to decrypt it. Vault acts like a virtual HSM, and it provides robust access controls to the data it is protecting with a TON of flexibility6. By choosing between a bunch of different plugins for auth, encryption, and storage you can twist Vault into whatever configuration you want.
You want to use an on-prem HSM that speaks PKCS11 as your seal mechanism but then use Oracle cloud as your persistent data store fronted by AliCloud auth? Go for it. Vault has your back.
On top of that, you can enable many types of secrets engines, which means Vault can store anything from simple key-value pairs to instance credentials.
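For a sense of the range, here's a sketch of enabling two very different secrets engines on the same server. It assumes VAULT_ADDR is set and you hold a token with the appropriate policy.

```shell
# Two ends of the secrets-engine spectrum on one server (sketch).
vault secrets enable -path=kv kv-v2    # static key-value pairs
vault secrets enable aws               # dynamic AWS credentials
```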
A distributed Vault deployment
Okay, let's just not run a central Vault at all! We'll run a new Vault every single time we start a new host that might need secrets.
How do you unseal the vault? Don't you need to do that operator unseal with the Shamir keys? Nope, use awskms and auto-unseal. It also gets you a nice noisy and immutable audit log via CloudTrail every time someone unseals Vault. If you use an encrypted S3 bucket or other encrypted AWS resource as the backend with your own KMS CMK, then you also get audit logs for read or write operations Vault performs.
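The auto-unseal piece is one config stanza. A sketch, with a placeholder key alias (not a real key):

```hcl
# Auto-unseal via AWS KMS (sketch; the key alias is hypothetical).
# Every unseal becomes a KMS Decrypt call, which shows up in CloudTrail.
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal"
}
```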
At this point each host can launch its own mini-Vault server and request secrets to its heart's content. Access to secrets will be gated by the token you allow the read node to auth with, so the instance will be prevented from loading things it shouldn't. (Or that's the theory, at least.)
How do we store secrets if there isn't a central Vault?
Vault wasn't really storing the data, right? So we run admin nodes. Run them wherever it makes sense. Do it wearing your secret ceremonial robes in your underground bunker on a fresh laptop. Make an admin instance inside your VPC fronted by a bastion requiring 2FA and keep the token in the belly of a dragon. Update it from your dev machine willy nilly and don't document the process. The point is to make a new Vault instance, ephemeral or permanent, and login with an elevated role.
What about data consistency?
Okay, so let's say I fire up my admin Vault on my secret bunker laptop, and I add a secret to secret/foo that the read-only node will have access to. When the read-only node comes online, it auto-unseals and logs in with its token.
This node can now read the secret that I put in there with my admin credentials! Does that mean it worked? Not so fast. If I hop back onto the admin role on my laptop and update the secret it won't update on the node!
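The staleness looks roughly like this, assuming both Vault processes point at the same storage backend (the secret value is a placeholder):

```shell
# On the admin laptop (elevated role):
vault kv put secret/foo password=hunter2

# On the read node (its own local Vault, same backend):
vault kv get -field=password secret/foo    # sees the secret

# Back on the laptop, rotate the value:
vault kv put secret/foo password=hunter3

# On the read node, without restarting its Vault process:
vault kv get -field=password secret/foo    # still the old value
```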
Well, that's either a deal-breaker or not a big deal depending on why we're running Vault in the first place.
You can bounce the thing; it will pick up changes as soon as it restarts. If you just wanted to call it once at startup, you're in business. If you need to ask it for fresh secrets frequently, you're in a bit of a pickle unless you're cool with just shooting the Vault process at some known interval. (Built-in caching! Look at that!)
How unsafe is this?
There's a big difference between allowing untrusted hosts access to the Vault process via TCP and letting a less-trusted user on the host. Vault is just software after all, and secrets are hard to hide in memory.
Again, Hashicorp explicitly states in their security model that Vault is not hardened against memory analysis, and recommends that you disable core dumps when running in production.
Is memory analysis actually a practical threat?
Yup. It's trivial for a user with elevated permissions.
That'll do it.
Let's be clear, this is not a bug! I just ran gcore against the poor thing using sudo. My point isn't that Hashicorp did something wrong, but that you should be careful storing secrets in software.
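For the record, the stunt is about two commands (a sketch; gcore ships with gdb, and the output path is an example):

```shell
# Dump the Vault process's memory, then grep for a known plaintext secret.
pid=$(pgrep -x vault)
sudo gcore -o /tmp/vault-core "$pid"
sudo strings "/tmp/vault-core.$pid" | grep imalittleteapot
```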
In case it wasn't already clear that running Vault on untrusted hosts and relying on software permissions is a bad idea, I hope that underlines the matter.
The good news is that the process does not seem to be loading secrets that it hasn't handled, which is exactly what I'd hope. I reset Vault and reran the core dump without having recently handled imalittleteapot, and it was not present.7
I want to draw your attention to that coredump8. I've done it a few times, and based on the timing you'll get slightly different results. This time I happen to have captured a helpful JSON object that has a number of useful fields. The one I'm really interested in is force_no_cache. That could be what we need to run Vault without it keeping anything in memory!
After a bit of doc plumbing I was able to enable the kv secrets engine with force_no_cache set to true and repeat my little stunt.
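The re-mount in question is roughly this (a sketch; assumes an existing kv mount at secret/ that you're willing to blow away):

```shell
# Re-mount kv with caching disabled at the mount level.
vault secrets disable secret
vault secrets enable -path=secret -force-no-cache kv

# The mount's tune config should now report force_no_cache as true.
vault read sys/mounts/secret/tune
```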
No luck. It still dumps the key, although the flag does now reflect that it should not be cached. I suppose we captured the flag in quite a few places, so this might take more hardening than is reasonable if I'm going to keep tossing sudo at it. It appears that my dreams of running tiny Vaults everywhere are well and truly dashed.
If this distributed model sounds attractive you might just consider using something like Mozilla SOPS or the AWS Encryption SDK and cut out the moving pieces entirely. Download a new encrypted file when something changes. If it's infrequent, consider polling at an interval that matches the warmness you need from your data.
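With SOPS the whole flow collapses to a file you encrypt once and decrypt on the host (a sketch; the KMS key ARN is a placeholder):

```shell
# Encrypt once, from wherever you manage secrets:
sops --encrypt --kms "arn:aws:kms:us-east-1:111122223333:alias/app-secrets" \
    secrets.yaml > secrets.enc.yaml

# On the host, at startup or on a poll interval:
sops --decrypt secrets.enc.yaml
```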
I've seen a couple of posts discussing alternatives to Vault lately. My take is slightly different. Vault's ease of use means that it isn't hard to get up and running, but taking on secret management is a huge decision that I suspect many teams aren't ready for when they pull that trigger.
If you're going to use Hashicorp Vault, it seems to me that it should be for its power and flexibility, not for its simplicity. If you need a strongly consistent centralized secret store, Vault seems great. The point I want to drive home is that running centralized secret management infrastructure is never simple.
If you can avoid moving pieces like a stateful web host, avoid them. Vault might be essential for a complicated multi-cloud or hybrid cloud deployment. But if you're running single-cloud on AWS, consider whether you need all that flexibility. You might just let the S3 team deal with availability and durability, and let them delegate security over to the KMS team.
Most teams are not really interested in competing with S3 in a battle of nines, nor KMS in a battle of literal vaults and world class security practices. Running world-class secret management is hard and expensive, and it's probably not what your business is good at.
Vault has made it easy to get started with secret management, but that's just the beginning.
I once spent many days in a hot and loud closet "secure server room" playing with Chinese HSMs while using Google Translate to figure out what in the world they were doing. This seems great! It's possible secret management software is not the most joy-inducing category.
I spent two years on the AWS KMS team. Talk about a central service! It sits smack dab underneath nearly every single AWS service. That is one stressful oncall rotation. To their credit, I never saw a single serious outage or instance of data loss.
High Availability (HA) deployments are still centralized; it's a warm standby model. Vault gets some friends, and keeps them in the loop. If Vault doesn't show up to work one of her buddies steps in and puts on the Vault hat. There's also a brand new Raft consensus model (fancy!) and a centralized read-replica configuration for Enterprise customers. Hashicorp is with me; Vault needs horizontal scaling. But these are all still centralized-ish as far as I can tell. The brand new Raft storage engine from Vault reads to me like distributed durability via consensus, and a great fleet of warm standby hosts, but a single leader serving traffic. I haven't gone deep on Raft though, and it looks awesome.
I've had to measure redundancy in "number of top-of-rack routers" because there was only one per rack, therefore redundancy is measured in racks, therefore you have three racks (slash top of rack routers) at a minimum. "Is that too many HSMs" is a separate problem.
Though it does hold it in memory for some amount of time. I haven't figured out how to control this.
Is flexibility good? All that flexibility leaves you with a number of overlapping security controls, and it's up to you to make smart choices. On a philosophical level that makes me a little nervous, but it's not a bad thing. It's a whole lot of room for users to make poor choices, and it smells a little like JWT or the expansive TLS 1.2 cipher list. There are great choices in there, but be careful not to blow your foot off. It's hard to argue with the results though, and Hashicorp provides reasonable guidance in their documentation.
You might argue that since I only proved you can dump recently handled secrets that this doesn't really matter -- the host was going to have access to the secrets anyway, right? Well, maybe. I was lazy and I got a secret I knew was in there. If the master secrets are extractable, and I assume they are for someone determined enough, then we have real problems on our hands.
I didn't notice this until reviewing my own post.
A coredump is not the most traditional path to discover relevant documentation, but I'll take it.