Zero Trust Architecture in the Cloud

Trust management is one of the core challenges in server infrastructure. The traditional approach (defining a secure perimeter and automatically trusting everything inside it) is no longer considered safe. Now, every entity in a system has to decide whether another entity can be trusted and to what degree.
Architecture has to be built under the assumption that any device or workload could be compromised. In practice, this means trust can no longer be granted by default merely on the grounds that a device or connection exists within the perimeter.
This idea became the foundation of the architectural model known as Zero Trust.
Identity as the New Perimeter
The old paradigm that “if you’re inside, you can be trusted” has now become a liability. In modern cloud architecture, implicit trust creates ideal conditions for attackers. A single compromised device with access inside the network is often enough to enable lateral movement and privilege escalation.
The only way to defend against this is to completely rethink the criteria that justify trust. In a Zero Trust architecture, network location alone is no longer enough. Instead, identity becomes the central factor that is verified on every request.
A system that relies on this type of architecture requires continuous discovery and monitoring. Every connected device, from a laptop to a smart lightbulb, has to be spotted and isolated before it can expand the attack surface. And since equipment fleets tend to grow fast, scalability has to be built into cloud services from the get-go.
The same logic applies to users. Identity remains central here as well, but the set of parameters that gets evaluated depends on the context of the request and on behavioral signals. It’s no longer enough to know who is connecting (an employee, a member of the maintenance staff, a contractor’s representative, or similar). The system also needs to consider how they were authenticated and what level of access should be assigned to them.
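As a rough illustration of that last point, the sketch below derives an access level from who is connecting, how they authenticated, and one context signal. The user kinds, authentication methods, level names, and the rule about unmanaged devices are all hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass

# Hypothetical assurance ranking: stronger authentication scores higher.
AUTH_STRENGTH = {"password": 1, "password+totp": 2, "hardware_key": 3}

# Hypothetical ceiling per kind of user: the most access that kind may ever get.
ROLE_CEILING = {"employee": 3, "maintenance": 2, "contractor": 1}

LEVELS = {0: "deny", 1: "read-only", 2: "standard", 3: "privileged"}


@dataclass(frozen=True)
class Request:
    user_kind: str        # who is connecting
    auth_method: str      # how they proved it
    managed_device: bool  # context signal, e.g. from a device inventory


def access_level(req: Request) -> str:
    """Return the access level for this single request."""
    strength = AUTH_STRENGTH.get(req.auth_method, 0)
    ceiling = ROLE_CEILING.get(req.user_kind, 0)
    level = min(strength, ceiling)
    # Assumed rule: unmanaged devices are capped at read-only, whoever the user is.
    if not req.managed_device:
        level = min(level, 1)
    return LEVELS[level]
```

The same employee ends up with different access depending on how they logged in and from which device, which is the behavior the paragraph above describes.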

From Role-Driven to Attribute- and Policy-Driven Access
In corporate environments, access control is still commonly implemented through ordinary RBAC (Role-Based Access Control). It remains popular because it’s easy to implement and simple to audit, and the necessary tools are widely available.
Architecturally speaking, though, it becomes less flexible as systems grow more complex. Real-world needs rarely match rigid role definitions, which leads to a proliferation of roles like manager-sales-readonly-eu-prague. Without creating these specialized roles, organizations struggle to grant precise permissions.
That’s why organizations have started adopting ABAC (Attribute-Based Access Control), which makes decisions based on attributes of the user, the environment, and the resource.
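To make the contrast concrete, here is a minimal ABAC-style check that expresses the intent behind a role like manager-sales-readonly-eu-prague as one rule over attributes. The attribute names and the rule itself are illustrative assumptions, not a standard schema.

```python
def can_access(user: dict, resource: dict, action: str, env: dict) -> bool:
    """Hypothetical rule: sales managers may read sales records from their
    own region, and only from a compliant device."""
    return (
        action == "read"
        and user.get("department") == "sales"
        and user.get("title") == "manager"
        and resource.get("type") == "sales_record"
        and user.get("region") == resource.get("region")
        and env.get("device_compliant") is True
    )


# Example attributes; a new office or region needs new data, not a new role.
prague_manager = {"department": "sales", "title": "manager", "region": "eu"}
eu_record = {"type": "sales_record", "region": "eu"}
allowed = can_access(prague_manager, eu_record, "read", {"device_compliant": True})
```

Because the region is compared rather than hard-coded, the same rule covers every region without multiplying role names.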
PBAC (Policy-Based Access Control) is another model increasingly used in Zero Trust architectures. Here, access logic is governed by centralized policies. However, even though this approach aligns well with the Zero Trust concept, it increases management complexity.
In the end, you have to consciously balance security and operational cost.
Security Between Services in Cloud Architecture
Up to this point, the discussion has mostly focused on the traditional interaction between users and devices.
In cloud infrastructure, though, you’re more likely to encounter coordination between different systems with no human involvement. Microservices communicate with each other directly without the need for an intermediary. Background tasks process data from queues, and a pod in Kubernetes requests the secrets it needs to operate.
For Zero Trust, each of these interactions is treated as a full-fledged access attempt rather than an internal “safe” action. Even if services reside inside the same VPC or Kubernetes cluster, that alone is not considered proof of legitimacy. In traditional infrastructure, the logic was that requests that came from a private subnet could be trusted. In the cloud, such an assumption no longer holds; in fact, it’s considered dangerous.

A private network can contain anything: from a forgotten test service to a compromised container. This creates a dormant threat that can materialize at any moment and that has to be brought under control in advance rather than when discovered.
When building Zero Trust, inter-service interaction is generally based on three simple principles:
- Identity for every service: A hostname or IP address is not sufficient. Instead, this has to be a cryptographically verifiable entity: for example, a SPIFFE ID.
- Authentication on every call: It’s not enough to verify that the network connection exists. The system also needs to confirm whether the caller really is who they claim to be and whether they have the right to perform the specific action. The request can also be verified against expected scenarios.
- Encryption of traffic even within the private network: In modern cloud architecture, workloads are dynamic (created, moved, destroyed). As a result, unencrypted traffic becomes a weak point. To reduce the risk of interception and spoofing, mutual TLS authentication has to be applied.
The key idea behind all three is that every service in the cloud must behave as an ordinary user, proving its right to perform specific actions and access particular corporate resources.
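The first two principles can be sketched in a few lines. The example assumes that the caller's SPIFFE ID has already been extracted from a verified SVID during the mTLS handshake (that step is not shown), and the trust domain, workload paths, and action names are hypothetical.

```python
from urllib.parse import urlsplit

TRUST_DOMAIN = "example.org"  # hypothetical trust domain

# Hypothetical per-call allow-list of (caller SPIFFE ID, action) pairs.
ALLOWED_CALLS = {
    ("spiffe://example.org/ns/prod/sa/checkout", "payments:charge"),
    ("spiffe://example.org/ns/prod/sa/reporting", "payments:read"),
}


def parse_spiffe_id(value: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path) or raise ValueError."""
    parts = urlsplit(value)
    if parts.scheme != "spiffe" or not parts.netloc or not parts.path.strip("/"):
        raise ValueError(f"not a valid SPIFFE ID: {value!r}")
    if parts.query or parts.fragment:
        raise ValueError("a SPIFFE ID must not carry a query or fragment")
    return parts.netloc, parts.path


def authorize_call(caller_id: str, action: str) -> bool:
    """Check one call. caller_id is assumed to come from an already-verified SVID."""
    try:
        trust_domain, _ = parse_spiffe_id(caller_id)
    except ValueError:
        return False  # a bare hostname or IP address is not an identity
    if trust_domain != TRUST_DOMAIN:
        return False
    return (caller_id, action) in ALLOWED_CALLS
```

Note that the check runs per action, so a service that may read payments is still refused when it tries to charge.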
How Access Rules Are Determined
In modern Policy-as-Code implementations, the separation between PDP (Policy Decision Point) and PEP (Policy Enforcement Point) remains relevant. Services do not make authorization decisions on their own. Instead, they delegate them to an external engine.
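The split can be shown with a small sketch. Here the PDP is just a local function so that the example stays self-contained; in a real deployment it would be a call to an external engine, which this sketch does not model. The policy and the service names are hypothetical.

```python
from typing import Callable

# A PDP takes a description of the request and returns allow (True) or deny (False).
Decision = Callable[[dict], bool]


def example_pdp(request: dict) -> bool:
    """Hypothetical policy: only the 'reporting' service may read invoices."""
    return request == {"subject": "reporting", "action": "read", "resource": "invoices"}


class PolicyEnforcementPoint:
    """PEP: intercepts a call and enforces whatever the PDP decides."""

    def __init__(self, decide: Decision):
        self._decide = decide

    def handle(self, subject: str, action: str, resource: str) -> str:
        request = {"subject": subject, "action": action, "resource": resource}
        if not self._decide(request):
            return "403 Forbidden"
        return f"200 OK: {action} {resource}"


pep = PolicyEnforcementPoint(example_pdp)
```

The enforcement code contains no business rules of its own, so the policy can change without touching or redeploying the service.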
In most cases, the role of such an engine is played by OPA (Open Policy Agent) paired with the declarative language Rego. This combination is widely adopted across the CNCF ecosystem, from Kubernetes to Envoy.
An alternative to Rego is Cedar, an open authorization language originally developed by AWS, now under the wing of the CNCF Sandbox. It was designed with formal verification in mind and supports modeling of RBAC, ABAC, and even ReBAC (Relationship-Based Access Control) approaches.
ReBAC, in particular, is another frequently used paradigm, where decisions are derived from a graph of relationships between resources, groups, and subjects. A notable implementation of ReBAC is SpiceDB, an open source database developed by AuthZed. Its creators drew inspiration from Google’s Zanzibar, a global authorization system that’s used across numerous Google services. SpiceDB is particularly convenient when systems need to answer questions such as, “Can user A perform action B on object C?”
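The idea of deriving a decision from a relationship graph can be illustrated with a toy check function. This is a simplified sketch loosely modeled on the Zanzibar idea of relationship tuples; it is not SpiceDB's schema language or API, and all object, relation, and user names are hypothetical.

```python
# Relationship tuples in the form (object, relation, subject).
TUPLES = {
    ("doc:roadmap", "viewer", "group:product#member"),
    ("group:product", "member", "user:alice"),
    ("doc:roadmap", "owner", "user:bob"),
}

# Hypothetical rule: anyone holding a listed relation also holds the key relation.
IMPLIED_BY = {"viewer": ["owner"]}


def check(obj: str, relation: str, user: str, _seen=None) -> bool:
    """Answer the question: does `user` have `relation` on `obj`?"""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:  # guard against cycles in the graph
        return False
    seen.add((obj, relation))

    for t_obj, t_rel, t_subject in TUPLES:
        if (t_obj, t_rel) != (obj, relation):
            continue
        if t_subject == user:
            return True
        # A subject such as "group:product#member" points at another relation.
        if "#" in t_subject:
            sub_obj, sub_rel = t_subject.split("#", 1)
            if check(sub_obj, sub_rel, user, seen):
                return True

    return any(check(obj, stronger, user, seen)
               for stronger in IMPLIED_BY.get(relation, []))
```

In this toy graph, alice can view the roadmap through group membership and bob through ownership, even though neither has a direct "viewer" tuple.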
All of the tools listed above make it possible to manage security policies in the same way as program code. In fact, policies have become just as important a development artifact as the application’s source code itself. At the same time, traditional methods of administration, such as logging, should not be forgotten. Recording every PDP decision makes sense, since this is precisely what will prove vital for monitoring and incident tracking in the end.
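A decision record does not need to be elaborate to be useful. The sketch below builds one structured log line per decision; the field names are assumptions rather than the format of any specific engine, and the policy version field is there so that a decision can be traced back to the policy revision that produced it.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pdp.decisions")  # hypothetical logger name


def decision_record(subject: str, action: str, resource: str,
                    allowed: bool, policy_version: str) -> str:
    """Build one JSON log line describing a single PDP decision."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "subject": subject,
        "action": action,
        "resource": resource,
        "decision": "allow" if allowed else "deny",
        "policy_version": policy_version,  # e.g. the commit the policy was built from
    }, sort_keys=True)


def log_decision(subject: str, action: str, resource: str,
                 allowed: bool, policy_version: str) -> None:
    """Emit the record through the standard logging machinery."""
    logger.info(decision_record(subject, action, resource, allowed, policy_version))
```

Denials are logged in the same shape as approvals, so both can be queried together during an incident.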
Secrets Management
DevOps engineers have long struggled with building a system capable of securely storing credentials and delivering them confidentially. Again and again, teams fall into the same pattern: storing secrets in an environment-variable file or in commands inside a Dockerfile. For the information security team, this is a predictable failure, one that is bound to occur sooner or later.
Centralized secret stores such as HashiCorp Vault are a common option. This model enables the use of dynamic, short-lived secrets. The value of such secrets to an attacker is greatly diminished, since they will often have already “gone stale” by the time the attacker is in a position to use them. The downside is that, to access the contents of Vault, you need a secret of your own, which has to be stored somewhere and delivered safely. This is what is usually called the “secret zero” problem.

If you use a SPIFFE ID as the basis, you effectively eliminate the “secret zero problem.” No embedded secrets are required, as everything is proven through an SVID (SPIFFE Verifiable Identity Document). A local agent issues a cryptographically verifiable identity based solely on the properties of the workload.
It’s now sufficient to present this SVID directly to the secrets store. Vault will verify the signature, map the SPIFFE ID to a specific policy, and issue a short-lived token. Again, the value of any stolen token is greatly reduced since its validity window is now much smaller. There is no longer a “secret zero” problem in such a chain, which is consistent with the Zero Trust paradigm.
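The flow just described (verified identity in, policy lookup, short-lived token out) can be modeled in a few lines. This is a conceptual sketch only: it does not call Vault or use its API, SVID signature verification is assumed to have happened already, and the SPIFFE ID, policy name, and TTL are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from SPIFFE ID to (policy name, token TTL in seconds).
POLICY_BY_SPIFFE_ID = {
    "spiffe://example.org/ns/prod/sa/billing": ("billing-db-read", 300),
}


@dataclass(frozen=True)
class Token:
    value: str
    policy: str
    expires_at: float

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at


def issue_token(verified_spiffe_id: str, now: Optional[float] = None) -> Token:
    """Issue a short-lived token for an identity whose SVID has already
    been verified (verification itself is out of scope here)."""
    if verified_spiffe_id not in POLICY_BY_SPIFFE_ID:
        raise PermissionError(f"no policy mapped to {verified_spiffe_id}")
    policy, ttl = POLICY_BY_SPIFFE_ID[verified_spiffe_id]
    issued_at = time.time() if now is None else now
    return Token(secrets.token_urlsafe(32), policy, issued_at + ttl)
```

Nothing in the workload's configuration has to be kept secret: the token is generated on demand and is useless five minutes later.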
This pattern can be implemented in several ways. For instance:
- Previously, JWT auth with SPIRE OIDC Discovery Provider was a common combination.
- Third-party plugins such as vault-auth-spire (now archived) could serve as an alternative.
- Modern Vault Enterprise versions now feature a built-in SPIFFE method as well as a native SPIFFE secrets engine, which allows Vault itself to issue SVIDs (a corresponding license is required). In some scenarios the feature can eliminate the need for a separate SPIRE deployment, but bear in mind that it does not replace a full-fledged workload attestation model in complex server infrastructure.
Things aren’t quite so ideal in practice, though. If you have the open-source Vault, you will have to use not only JWT auth + OIDC Provider but also a properly configured SPIRE Federation API. This works well within a single cluster or cloud. Across environments, however, it becomes more complex, as you’ll have to take care of ingress, TLS, and DNS.
Observability in Zero Trust Systems
In addition to the classic combination of metrics, traces, and logs, a Zero Trust system gains a so-called fourth dimension: authentication and authorization events. Every PDP decision, as well as every successful or denied access to a secret, becomes a signal that can be used to detect problems. In cloud-native environments, OpenTelemetry has become the de facto standard tooling.
One of the main challenges for administrators is tying together disparate data (decision logs, Vault audit logs, distributed traces, and so on) into a single picture. Achieving this typically requires the installation, configuration, and maintenance of a whole stack of different tools (Prometheus + Grafana, ELK/OpenSearch, Tempo/Jaeger, SIEM, and so on).
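At its core, building that single picture is a join on a shared key. The sketch below assumes that a trace ID has been propagated into every source and that each event carries `trace_id` and `ts` fields; both field names and the source labels are assumptions, and real log formats will need to be normalized first.

```python
from collections import defaultdict


def correlate(*sources: tuple[str, list[dict]]) -> dict[str, list[dict]]:
    """Group events from several sources by trace_id, ordered by timestamp.

    Each source is a (name, events) pair. Events without a trace_id are
    skipped because they cannot be tied to a request.
    """
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for source_name, events in sources:
        for event in events:
            trace_id = event.get("trace_id")
            if trace_id is None:
                continue
            by_trace[trace_id].append({**event, "source": source_name})
    return {tid: sorted(evts, key=lambda e: e["ts"]) for tid, evts in by_trace.items()}
```

With this in place, a denied secret read and the PDP decision that preceded it show up side by side under the same trace. The approach only works if the trace ID actually reaches every component, which is usually the harder part.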
A useful piece of advice here is to look at what managed services your cloud provider offers. In many cases, these can cover most of the requirements listed above. But keep in mind that none of these aspects should be neglected.
Zero Trust is not about maintaining a constant level of trust. It’s about continuous re-evaluation based on input signals from behavioral analytics. A new geolocation, an atypical request volume, or privilege escalation can all serve as triggers for a response.
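One simple way to picture continuous re-evaluation is as a score that is recomputed whenever new signals arrive. The weights, thresholds, and response names below are made up for illustration; real values would have to be tuned for each system.

```python
# Hypothetical signal weights.
SIGNAL_WEIGHTS = {
    "new_geolocation": 30,
    "atypical_request_volume": 40,
    "privilege_escalation": 60,
}
STEP_UP_THRESHOLD = 30   # at or above this, ask the user to re-authenticate
REVOKE_THRESHOLD = 70    # at or above this, end the session


def reevaluate(signals: set[str]) -> str:
    """Map the signals currently observed on a session to a response."""
    score = sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)
    if score >= REVOKE_THRESHOLD:
        return "revoke_session"
    if score >= STEP_UP_THRESHOLD:
        return "require_reauthentication"
    return "continue"
```

A single unusual signal leads to a step-up check, while a combination of signals ends the session. Trust is therefore a function of the latest observations rather than a property granted once at login.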
Proxy Solutions Instead of VPN
Traditional VPNs (Virtual Private Networks), where users gain broad access to a network segment after connecting, fit poorly with Zero Trust. This is precisely the “trusted zone” mentioned above that we are trying to distance ourselves from.
The ZTNA (Zero Trust Network Access) strategy takes a different approach. Rather than granting access to a network segment, it grants it to a specific resource, and only after identity and context have been verified on each request. From a software standpoint, this is typically achieved through applications that implement the request-proxying concept. Fortunately, both commercial and open-source solutions are now available on the market.
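The difference between the two models is easy to see when both checks are written side by side. The subnet, user names, and resource names are hypothetical, and the two boolean inputs stand in for whatever identity and device checks a real proxy would perform.

```python
import ipaddress

CORPORATE_SUBNET = ipaddress.ip_network("10.0.0.0/16")  # hypothetical

# Hypothetical ZTNA grants of (user, resource) pairs.
GRANTS = {
    ("alice", "wiki.internal"),
    ("alice", "git.internal"),
    ("bob", "wiki.internal"),
}


def vpn_style_allow(source_ip: str, resource: str) -> bool:
    """Perimeter model: anything from the subnet may reach any resource.
    The resource argument is deliberately ignored."""
    return ipaddress.ip_address(source_ip) in CORPORATE_SUBNET


def ztna_allow(user: str, identity_verified: bool,
               device_compliant: bool, resource: str) -> bool:
    """ZTNA model: every request is checked for identity, context,
    and the specific resource being requested."""
    return identity_verified and device_compliant and (user, resource) in GRANTS
```

In the first function, a single address on the right subnet opens the payroll system along with everything else. In the second, bob can reach the wiki but not the Git server, and only while his identity and device checks pass.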

When it comes to commercial options, Cloudflare Access stands out as a service based on Cloudflare’s edge network. It’s more than a separate proxy; in fact, it’s a whole platform that combines an implementation of ZTNA with a SWG (Secure Web Gateway), CASB (Cloud Access Security Broker), and optionally RBI (Remote Browser Isolation). It’s particularly suitable for those organizations that are already on Cloudflare and want to replace VPN quickly without breaking anything. The main downside is vendor lock-in.
Similar capabilities are available on platforms like Zscaler and Netskope. These are also full-fledged platforms where ZTNA is just one of the elements.
As for self-hosted setups, here are the solutions worth mentioning:
- Teleport provides access to SSH, RDP, databases, and Kubernetes through a proxy, with session recording and short-TTL certificates.
- HashiCorp Boundary is a typical representative of an IAP (Identity-Aware Proxy), which first authenticates the user and only then grants access to the target system.
- Pomerium is another IAP, but it’s more oriented toward access to web applications and systems via HTTP/HTTPS. In this respect, it’s somewhat closer to a reverse proxy.
Conclusion
Zero Trust isn’t a product that can be purchased off the shelf. It’s an architectural approach, and its adoption almost always happens over multiple painful stages.
Building it on top of legacy code without major changes is practically impossible. For instance, the DevOps team may need to adapt its practices, and developers often have to rewrite applications.
In practice, though, partial implementations are common. At best, such an implementation covers one layer (such as the network), while identity and observability are left behind. In such cases, proving the correctness of access at the application level remains just as challenging.
For new cloud systems, the Zero Trust approach should be treated as a baseline architectural norm. However, when it comes to existing services, adoption requires enough time for thorough planning and phased execution. Such a transition simply can’t be completed in a couple of sprints, even with the most advanced and cooperative team.