How One Server Can Run Whitelisted Commands on Another Without Sharing SSH Keys

Sharing root SSH keys was the obvious answer. I picked something smarter.

The Problem: Two Droplets, One Deployment Pipeline

Tacavar runs on two DigitalOcean droplets. The primary (147.182.209.111) handles public traffic, TLS termination, and the Hermes gateway. The secondary (157.245.80.230) runs internal services: model inference, background job queues, and observability. They share a private network (10.116.0.x), which means traffic between them never leaves DigitalOcean's backbone. That is a security win, but it is not enough.

The deployment flow looks like this: a new build passes tests on the primary, the primary pushes the artifact to the secondary, and the secondary needs to reload its service to pick it up. The naive approach is to give the primary an SSH key with sudo access on the secondary. The problem is that sudo access is not granular. Once the primary has that key, it can run any command on the secondary. It can read /etc/shadow, wipe the database, or install a persistence mechanism. There is no way to restrict it to “only restart this one service.”

Even if you trust the primary completely today, you are one supply-chain breach, one leaked secret, one misconfigured CI variable away from total compromise. The security model is all-or-nothing, and that is a model that fails catastrophically.

Why SSH Key Sharing Is a Blast Radius Disaster

SSH key sharing fails on three specific axes: granularity, auditability, and revocation.

Granularity: An SSH key authenticates a user, not a command. If the key has sudo access, the holder can do anything sudo allows. There is no native way to say “this key may only run systemctl restart app and nothing else.” You can hack around it with forced commands in authorized_keys, but those break when you need to run multiple different commands, and they are fragile enough that most teams abandon them.
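For reference, a forced command is a single line in authorized_keys. The key and command here are illustrative, and the restrict option assumes OpenSSH 7.2 or newer:

# ~/.ssh/authorized_keys on the target — this key can run exactly one command
command="systemctl restart app",restrict ssh-ed25519 AAAA... deploy@primary

One key, one command: the moment you need a second action, you are minting and distributing another key pair, which is exactly the fragility that drives teams to give up on the approach.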

Auditability: SSH logs tell you that a key was used and from which IP. They do not tell you what command was executed, what the output was, or whether it succeeded. If you need to reconstruct an incident timeline, you are left guessing. The secondary's shell history might help, but it is not a reliable audit trail and it is trivial for an attacker to clear.

Revocation: Rotating a shared SSH key is painful. You need to generate a new key pair, copy the public key to every target server, update every CI pipeline and cron job that uses the private key, and then pray you did not miss one. In practice, most teams never rotate shared keys. The same key that was generated two years ago is still sitting in authorized_keys, and nobody knows which services depend on it.

The result is a security posture that looks fine on paper and collapses the moment one component is compromised.

The Whitelist Dispatcher Pattern

Instead of SSH, we built a lightweight RPC dispatcher on the secondary that accepts only explicitly whitelisted commands. The primary sends an HTTP POST to a private endpoint with a command name and a payload. The secondary validates the command against a hardcoded whitelist, executes it via a subprocess with strict timeouts, and returns the output as JSON.

Here is the shape of it:

# dispatcher.py — runs on the secondary server
import subprocess

WHITELIST = {
    "restart_app": ["systemctl", "restart", "app"],
    "reload_caddy": ["systemctl", "reload", "caddy"],
    "health_check": ["curl", "-sf", "http://localhost:8100/healthz"],
}

def dispatch(command: str, payload: dict) -> dict:
    # Reject unknown commands before any subprocess is spawned.
    # payload is carried in the request but unused by these commands.
    if command not in WHITELIST:
        return {"status": "rejected", "reason": "command not in whitelist"}

    cmd = WHITELIST[command]
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=30,  # strict timeout: a hung command cannot wedge the dispatcher
            check=False,
        )
    except subprocess.TimeoutExpired:
        return {"status": "error", "reason": "command timed out after 30s"}

    return {
        "status": "ok" if result.returncode == 0 else "error",
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode,
    }

The primary sends:

curl -X POST http://10.116.0.3:8900/dispatch \
  -H "Authorization: Bearer *** \" \
  -d '{"command": "restart_app", "payload": {}}'

If the command is not in WHITELIST, the request is rejected before any subprocess runs. If the token is wrong, the request is rejected at the auth layer. If the payload is malformed, the request is rejected at validation. The secondary never executes arbitrary input.
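For concreteness, here is a minimal sketch of that outer layer in front of dispatch(). The framework choice (Flask), the DISPATCH_TOKEN variable name, and the module wiring are illustrative assumptions, not the production code:

# server.py — hypothetical HTTP layer wrapping dispatch(); sketch only
import hmac
import os

from flask import Flask, jsonify, request

from dispatcher import dispatch

app = Flask(__name__)
TOKEN = os.environ["DISPATCH_TOKEN"]  # injected at deploy time, never committed

@app.route("/dispatch", methods=["POST"])
def handle():
    # Auth layer: constant-time comparison, rejected before the body is parsed.
    auth = request.headers.get("Authorization", "")
    if not hmac.compare_digest(auth, f"Bearer {TOKEN}"):
        return jsonify({"status": "rejected", "reason": "bad token"}), 401

    # Validation layer: malformed payloads never reach the whitelist check.
    body = request.get_json(silent=True)
    if not isinstance(body, dict) or not isinstance(body.get("command"), str):
        return jsonify({"status": "rejected", "reason": "malformed payload"}), 400

    return jsonify(dispatch(body["command"], body.get("payload", {})))

if __name__ == "__main__":
    app.run(host="10.116.0.3", port=8900)  # private interface only, never 0.0.0.0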

The key difference from SSH: the secondary defines what it will do, and the primary can only request one of those predefined actions. The security boundary is explicit, not implicit.

The Audit Log That Doubles as Documentation

Every dispatched command is logged with a timestamp, the originating IP, the command name, the return code, and a truncated version of the output. The log is append-only and shipped to the primary's monitoring stack within seconds.
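A sketch of what writing one of those records might look like; the field names, truncation length, and log path are assumptions for illustration:

# audit.py — hypothetical append-only audit record writer; sketch only
import json
import time

def audit(remote_ip: str, command: str, result: dict) -> None:
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "remote_ip": remote_ip,
        "command": command,
        "returncode": result.get("returncode"),
        # Truncate output so one noisy command cannot bloat the log.
        "stdout": (result.get("stdout") or "")[:512],
        "stderr": (result.get("stderr") or "")[:512],
    }
    # Append-only by convention: the file is only ever opened in "a" mode,
    # and a log shipper forwards it to the primary's monitoring stack.
    with open("/var/log/dispatcher/audit.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")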

This is not just an audit trail. It is documentation. A new engineer can read the whitelist and know exactly which remote operations the primary is allowed to trigger. They do not need to grep through cron tabs or decode a tangle of SSH forced commands. The whitelist is the contract between the two servers, and it is enforced at runtime.

When something breaks, the logs show exactly what was dispatched, when, and with what result. There is no ambiguity about whether the primary sent a bad command or the secondary failed to execute a valid one. The separation of concerns is clean: the primary decides when to act, the secondary decides what it is willing to do.

Adding New Commands Explicitly

The whitelist is intentionally hardcoded, not dynamic. Adding a new command requires editing dispatcher.py, reviewing the command for safety, and redeploying the secondary. This is friction by design.

Dynamic whitelists — where an admin panel or API can add new commands at runtime — defeat the purpose. If an attacker compromises the primary and the whitelist is mutable, they can simply add rm -rf / to the allowed list and execute it. A hardcoded whitelist means that even full compromise of the primary does not grant arbitrary command execution on the secondary. The attacker is limited to the predefined set, and that set does not include anything dangerous.

The deployment overhead is minimal. In practice, we add a new command to the whitelist once every few months. The thirty-second edit plus redeploy is cheaper than the ongoing risk of a dynamic system.
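The edit itself is a one-line, easily reviewable diff; the reload_workers entry here is hypothetical:

 WHITELIST = {
     "restart_app": ["systemctl", "restart", "app"],
     "reload_caddy": ["systemctl", "reload", "caddy"],
     "health_check": ["curl", "-sf", "http://localhost:8100/healthz"],
+    "reload_workers": ["systemctl", "reload", "workers"],
 }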

What Happens If the Secondary Server Gets Compromised

The whitelist protects the secondary from a compromised primary. It does not protect the primary from a compromised secondary, but that is a different threat model, and the architecture accounts for it.

The dispatcher endpoint is bound to the private network interface (10.116.0.3:8900). It is not reachable from the public internet. The bearer token is rotated via environment variables that are injected at deploy time, not checked into source control. Even if an attacker gains access to the secondary, they cannot use the dispatcher to attack the primary because the dispatcher is a request handler, not a client. It does not initiate connections back to the primary.

If the secondary is fully compromised, the attacker can read the dispatcher token and the whitelist. That is bad, but it is localized. They cannot use that token to pivot to the primary because the primary does not expose a matching endpoint. The blast radius is one droplet, not two.

The Result

We have been running this pattern for months across the Tacavar stack. Deployments are fully automated. The primary triggers service restarts, config reloads, and health checks on the secondary without any shared SSH keys. The audit log has never been needed for an incident, but it has been used twice to debug deployment failures where a command returned a non-zero exit code and the JSON response made the root cause obvious in seconds.

The security model is not perfect — no model is — but it is explicit, auditable, and bounded. The primary cannot do more than the whitelist allows. The secondary cannot be commanded by anything that is not on that list. The contract is in code, not in configuration, and code is easier to review than a tangle of SSH keys and sudoers files.

If you are running more than one server and you are sharing SSH keys between them, you are one leaked secret away from a very bad day. There is a better way. Define what each server is willing to do for the others, enforce it at the boundary, and log every crossing.

You built it. We optimize it.


Tacavar designs secure automation pipelines for distributed infrastructure. See our DevOps capabilities at tacavar.com.

