AlvorAdvisory

HPC Security · NIST SP 800-223

Security that respects the speed of the machine.

A supercomputer is not an endpoint, and the controls that protect a fleet of laptops will quietly wreck it. A monitoring agent on every node desynchronises the collectives a parallel job is built on. An inline firewall has nothing to inspect on a kernel-bypass fabric and only adds the latency the fabric exists to remove. We secure high-performance computing and research workloads the way they are actually built: the scheduler, the fabric, the parallel filesystem, and the login plane, with the strong-scaling that justifies the machine kept intact.


NIST SP 800-223 reference architecture
SLURM · MPI · RDMA fabric
Strong-scaling preserved

The tension

A supercomputer is not an endpoint.

Enterprise security is built for a world of endpoints: a laptop you can put an agent on, a connection you can terminate and inspect, a user at a keyboard with cycles to spare. An HPC system breaks every one of those assumptions. It is a single tightly-coupled instrument, thousands of nodes wide, where the whole point is to keep the processors saturated and the data moving without a single avoidable detour. The controls that keep an office estate safe do not transplant onto it. They have to be redesigned around how the machine actually works.

What enterprise security assumes

An endpoint it can install an agent on
Spare cycles to run that agent in
A network path it can inspect inline
One user, on one machine, in one session
Latency it can afford to measure in milliseconds

What an HPC system actually is

Thousands of nodes that exist to run flat out
Every cycle accounted for against an allocation
A kernel-bypass fabric with nothing to inspect
Hundreds of users sharing nodes and a filesystem
Collectives that break when one rank is late by microseconds

Why generic security fails here

Five ways a standard control breaks a cluster.

None of this is an argument against securing HPC. It is an argument for knowing where the controls go. Each of these is a real way a well-meaning enterprise control, applied unchanged, takes throughput off the machine or leaves the actual exposure untouched.

OS noise

A monitoring agent that wakes on its own schedule desynchronises the whole machine.

Tightly-coupled jobs advance in lockstep through collective operations, an MPI_Allreduce or a barrier, and the slowest rank gates every other one. A host agent that steals a few microseconds on a single node, on its own cadence rather than the application's, ripples into idle time across thousands of ranks. The effect is worse at scale, not better: the per-node noise stays constant while the cost of staying synchronised grows. This is the OS-jitter problem the field has measured since the early 2000s, and it is the first reason you cannot simply push the enterprise endpoint agent down to every compute node.

Petrini, Kerbyson & Pakin, “The Case of the Missing Supercomputer Performance”, SC 2003.
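
To make the scaling argument concrete, here is a minimal sketch of a toy model, not a measurement. It assumes a bulk-synchronous job in which every step ends on a barrier, and that each rank is independently delayed by an agent wake-up with a small fixed probability per step. The function name and every number are illustrative assumptions, not figures from the cited paper.

```python
def expected_step_time(ranks, compute_us, noise_us, hit_prob):
    """Expected time of one barrier-synchronised step in a toy noise model.

    Assumptions (illustrative only): each rank computes for compute_us, and
    independently, with probability hit_prob, loses noise_us to a background
    agent. The barrier completes only when the slowest rank arrives.
    """
    # Probability that at least one rank is delayed during this step.
    p_any_delayed = 1.0 - (1.0 - hit_prob) ** ranks
    return compute_us + noise_us * p_any_delayed


if __name__ == "__main__":
    for ranks in (1, 64, 1024, 16384):
        t = expected_step_time(ranks, compute_us=100.0, noise_us=10.0, hit_prob=0.001)
        print(f"{ranks:>6} ranks: {t:7.2f} us per step, efficiency {100.0 / t:.1%}")
```

The per-node noise is identical in every row. Only the rank count changes, and with it the chance that some rank is late on any given step, which is why an agent that looks harmless on one node can cost a measurable share of a large job.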

Kernel bypass

The fast path never touches the kernel, so an inline appliance has nothing to stand in.

RDMA over InfiniBand, RoCE, or HPE Slingshot moves data from node to node without copying it through the host operating system. That kernel-bypass is the entire point: it is where the latency went. Put a stateful firewall, a TLS-inspection appliance, or deep packet inspection into that path and you are either inspecting traffic that routes around you or reintroducing exactly the overhead the fabric was designed to remove. Segmentation on a high-speed fabric is a design problem, solved in the topology and the scheduler, not a box you rack inline.

The I/O tax

Encrypting every read and write taxes the exact path strong-scaling depends on.

A parallel filesystem exists to feed thousands of clients at once. Layer per-operation encryption or inline content inspection into that path and throughput is the first thing to go, on precisely the workloads that justify the cluster. The data still has to be protected. It just belongs where it does not sit between the compute and the bytes: at the storage controller, on self-encrypting media, and in a disciplined staging design rather than in the hot read-write loop.

Agent sprawl

An agent on every node is licence cost times the node count, and noise times the node count.

What is routine across a fleet of office laptops becomes a different proposition at thousands of identical nodes that exist to run at full tilt. The cost multiplies, the jitter multiplies, and the telemetry volume multiplies, until the monitoring is competing with the science for the machine it is meant to protect. Visibility on HPC has to be earned with a lighter footprint: out-of-band collection, sampled and eBPF-based instrumentation, and telemetry drawn from the fabric and the scheduler rather than a heavyweight agent fighting the kernel on every host.

Shared trust

One static key on a shared login node is the intrusion the sector has already lived through.

Hundreds of users share login nodes and a common filesystem, and for years they authenticated with static SSH keys that travelled freely between sites. In 2020 a wave of intrusions swept European academic supercomputing centres: stolen credentials, lateral movement across the shared estate, and clusters quietly turned to mining cryptocurrency. The lesson was not to add an agent. It was to move to short-lived SSH certificates, multi-factor on the access zone, and isolation designed on the assumption that a login account will eventually be compromised.
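
As a rough illustration of what short-lived means in practice, the sketch below builds an `ssh-keygen` command that signs a user's public key with a certificate authority key for a limited window. The flags are standard OpenSSH options (`-s`, `-I`, `-n`, `-V`). The helper name, the paths, the `@hpc-login` identity suffix, and the eight-hour lifetime are assumptions chosen for the example, and a production CA would normally sit behind an authenticated, multi-factor issuing service rather than a script.

```python
import shlex


def short_lived_cert_command(ca_key, user_pubkey, username, validity="+8h"):
    """Build (but do not run) an ssh-keygen command that signs a user key.

    All argument values are hypothetical; only the ssh-keygen flags are real.
    """
    return [
        "ssh-keygen",
        "-s", ca_key,                    # CA private key used to sign
        "-I", f"{username}@hpc-login",   # certificate identity, recorded in sshd logs
        "-n", username,                  # principal the certificate is valid for
        "-V", validity,                  # validity interval; "+8h" expires 8 hours from now
        user_pubkey,                     # public key to sign
    ]


cmd = short_lived_cert_command("/path/to/user_ca", "id_ed25519.pub", "alice")
print(shlex.join(cmd))
```

Signing `id_ed25519.pub` this way produces `id_ed25519-cert.pub` alongside it. Once the window closes the certificate stops working, so a stolen copy has a bounded lifetime where a static key does not.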

Performance-aware by design

A control you cannot benchmark is a control you cannot trust.

Securing HPC is rarely a question of whether a control is worth having. It is a question of where it goes. Every control sits somewhere relative to the critical path, and the work is to place each one so that it protects the system without standing between the compute and its throughput. We baseline the machine first, design the controls to sit off the hot path, and prove the throughput held afterwards. If a control costs you strong-scaling, it is either the wrong control or it is in the wrong place.

1. We benchmark before and after. A control ships when the regression on your real workloads is measured and accepted, never assumed away.
2. Security moves off the critical path: DPU and SmartNIC offload, out-of-band collection, encryption at the controller rather than in the I/O loop.
3. The scheduler does the isolating it already knows how to do, through cgroups, per-job constraints, and hardened prolog and epilog, rather than an agent fighting the kernel on every node.
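
The before-and-after check in the first point can be sketched as a simple gate. This is a minimal illustration under stated assumptions rather than a description of any particular tooling: the function names, the 1% tolerance, and the sample runtimes are all hypothetical, and a real gate would use the site's own workloads and an agreed threshold.

```python
from statistics import median


def regression(baseline_s, candidate_s):
    """Fractional slowdown of the candidate runs relative to the baseline.

    Each argument is a list of wall-clock runtimes in seconds from repeated
    runs of the same workload; the median damps run-to-run variation.
    """
    base = median(baseline_s)
    return (median(candidate_s) - base) / base


def passes_gate(baseline_s, candidate_s, max_regression=0.01):
    """True when the measured slowdown is within the accepted tolerance."""
    return regression(baseline_s, candidate_s) <= max_regression


# Hypothetical runtimes before and after enabling a control.
before = [412.0, 409.5, 411.2]
after = [413.1, 410.9, 412.6]
print(f"regression: {regression(before, after):.2%}, ship: {passes_gate(before, after)}")
```

The gate returns a measured number alongside the decision, so the tolerance is something the centre accepts explicitly instead of assuming.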

How the work runs

Wherever your cluster is, we pick it up from there.

HPC security follows the same four delivery tracks as the rest of the practice, with a performance baseline built into every one. Each track ends on a decision that stays yours: carry on with us, take it in-house, or stop where you are.

01 Assess

Baseline the machine, then map the gaps.

We benchmark the workloads that matter and walk the four zones looking for the soft spots: the flat management network, the static keys, the data-transfer nodes nobody segmented. You get a threat model, a performance baseline, and a ranked list of what to fix first.

HPC threat model
Performance baseline
Prioritised gap register

02 Architect

Design the zones before anyone reconfigures a node.

We design the target architecture on NIST SP 800-223: the zone boundaries, the identity model, the scheduler isolation, and the research enclave for regulated work, with every control placed against its cost on the critical path.

Zoned target architecture
Control-to-critical-path map
Research enclave design

03 Build

Stand it up with your HPC team, not around them.

We implement alongside the people who run the cluster, with throughput as a release gate: a control ships when the benchmark confirms the science still runs as fast. Nothing is marked done on a closed ticket alone.

Controls implemented
Throughput regression gate
Validated on the benchmark

04 Operate

Keep it secure as the machine and the science change.

Allocations turn over, nodes get added, new workloads arrive. We keep the monitoring jitter-aware and out of band, re-benchmark on a schedule, and re-test the zone boundaries so the posture does not drift as the cluster grows.

Jitter-aware monitoring
Scheduled re-benchmark
Boundary re-testing

The reference architecture

Four zones, each with a job and a threat model of its own.

We design to the NIST SP 800-223 reference architecture, which treats an HPC system as four zones rather than one flat trusted estate. Each zone gets the controls that fit it, and the boundaries between them are where the security actually lives. It is also how a single login account stops being a path to the whole machine.

Access zone

The front door

Login nodes, data-transfer nodes, and the science portals. The exposed surface, designed as a Science DMZ so the bulk data path stays fast while the front door itself moves to multi-factor, short-lived SSH certificates, and federated research identity.

SSH certificate authority
CILogon
Open OnDemand
Globus / DTN

Management zone

The crown jewels

Provisioning, scheduling, monitoring, and identity, plus the out-of-band BMC and Redfish plane that can power and reimage the whole machine. We isolate it, harden the SLURM control path and its MUNGE trust, and keep it off any network a running job can reach.

SLURM control path
MUNGE
BMC · IPMI · Redfish
Provisioning

Compute zone

The hot path

The compute nodes and the high-speed interconnect, where throughput is the whole point. Segmentation by design rather than inline appliance, per-job isolation through the scheduler, and confidential-computing isolation where a sensitive workload genuinely needs it.

InfiniBand · Slingshot · RoCE
cgroup job isolation
SEV-SNP · TDX
Node attestation

Data zone

What feeds it

The parallel and campaign storage that keeps the processors fed. Protection that stays out of the I/O path: controller-level and self-encrypting-media encryption, project-scoped access, root-squash on the exports, and staging that keeps regulated data where it belongs.

Lustre
IBM Storage Scale
BeeGFS
WEKA · VAST

Research compliance

The regimes that bind research, mapped to the cluster without breaking it.

Research runs under obligations a generic compliance program rarely meets cleanly: export control, controlled unclassified information, human-subjects data, and the research-security expectations now written into federal funding. The usual mistake is to drag the whole centre into scope to satisfy one regulated grant. We map the obligation to the architecture instead, most often through a research enclave that carves the regulated workload out of the open-science estate, so the controls land where the sensitive data is and the rest of the machine stays fast and open.

Every engagement maps back to one control set, so the work also evidences the ISO 27001, SOC 2, and NIST CSF posture the wider institution answers to.

NIST SP 800-171 · CMMC 2.0: Controlled Unclassified Information in defence-funded research, and the certification its supply chain now turns on.
NIST SP 800-53 · FISMA: Federal research systems and the moderate or high baselines a funded program is held to.
ITAR · EAR: Export-controlled research, and the deemed-export problem a foreign national on a shared cluster creates.
HIPAA · NIH dbGaP: Human-subjects data and controlled-access genomic datasets, with their own use and storage conditions.
NSPM-33: The research-security program expectations now written into the terms of federal funding.