@MikeZappa87
Contributor

@MikeZappa87 MikeZappa87 commented Nov 21, 2025

SPIRE Delegated Identity API Integration for ztunnel

Overview

This document describes the design and implementation of SPIRE integration in ztunnel using the Delegated Identity API. The implementation supports two attestation modes: Selector-based and PID-based, each with different security and efficiency trade-offs.

Background

Current ztunnel Certificate Management

The existing ztunnel certificate management uses SecretManager to cache certificates by Identity (SPIFFE ID). When multiple pods share the same service account, they share a single cached certificate, reducing CA calls and memory usage.
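
The sharing behaviour described above can be sketched as follows. This is a minimal illustration, not the actual SecretManager: `IdentityCache`, `Certificate`, the `ca_calls` counter, and the simulated CA call are all hypothetical stand-ins, and the real implementation is asynchronous.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for ztunnel's `Identity` (a SPIFFE ID).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity(pub String);

/// Hypothetical stand-in for a workload certificate.
#[derive(Clone, Debug, PartialEq)]
pub struct Certificate {
    pub serial: u64,
}

/// Minimal cache keyed by identity. `ca_calls` counts simulated CA requests.
#[derive(Default)]
pub struct IdentityCache {
    certs: HashMap<Identity, Certificate>,
    pub ca_calls: u64,
}

impl IdentityCache {
    /// Returns the cached certificate, calling the simulated CA only on a miss.
    pub fn fetch(&mut self, id: &Identity) -> Certificate {
        if let Some(cert) = self.certs.get(id) {
            return cert.clone();
        }
        // Cache miss: this is where a real implementation would call the CA.
        self.ca_calls += 1;
        let cert = Certificate { serial: self.ca_calls };
        self.certs.insert(id.clone(), cert.clone());
        cert
    }
}
```

Two pods that present the same `Identity` hit the same entry, so the second `fetch` returns the first certificate without another CA call.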

SPIRE Delegated Identity API

SPIRE's Delegated Identity API allows a trusted delegate (ztunnel) to request certificates on behalf of workloads. The API supports two attestation methods:

  1. Selectors: Identify workloads by Kubernetes namespace + service account (not used in this PR because it is SPIRE-specific)
  2. PID: Identify workloads by their process ID for stronger attestation

Design

Attestation Modes

In PID mode, each workload is attested individually using its container process ID. This approach:

  • Provides stronger security through per-workload attestation
  • Issues each pod its own certificate from SPIRE
  • Increases SPIRE server load and memory usage
  • Has SPIRE verify the actual running process, not just Kubernetes metadata
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Pod A (PID 1)  │     │  Pod B (PID 2)  │     │  Pod C (PID 3)  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Certificate A  │     │  Certificate B  │     │  Certificate C  │
│ (SPIRE call #1) │     │ (SPIRE call #2) │     │ (SPIRE call #3) │
└─────────────────┘     └─────────────────┘     └─────────────────┘

CompositeId Design

Motivation

The original SecretManager used Identity as the cache key. To support PID-based attestation while maintaining backward compatibility with the existing CaClientTrait interface, we introduced CompositeId<RequestKeyEnum>.

Structure

pub struct CompositeId<RequestKeyEnum> {
    id: Identity,           // The SPIFFE identity (ns/sa)
    key: RequestKeyEnum,    // Distinguishes individual workloads
}

pub enum RequestKeyEnum {
    Identity(Identity),     // For backwards compatibility
    Workload(WorkloadUid),  // For PID mode: key by workload UID
}

Trade-offs

This design was chosen to maintain backward compatibility with CaClientTrait:

#[async_trait]
pub trait CaClientTrait: Send + Sync {
    async fn fetch_certificate(
        &self, 
        id: &CompositeId<RequestKeyEnum>
    ) -> Result<tls::WorkloadCertificate, Error>;
}

Benefits:

  • Single interface works for both SPIRE modes and the original CA client
  • No breaking changes to existing code paths
  • SecretManager can track per-workload state when needed

Consequences:

  • In PID mode, SecretManager caches by CompositeId, resulting in one cache entry per pod even if they share the same identity
  • This is intentional—each workload must be individually attested
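
To make the cache-key consequence concrete, here is a minimal sketch. It assumes a simplified, non-generic `CompositeId` and a `RequestKey` enum modelled on the structure above; the helper functions `by_identity`, `by_workload`, and `distinct_entries` are hypothetical and exist only for illustration.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity(pub String);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct WorkloadUid(pub String);

/// Simplified version of the request key shown above.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RequestKey {
    Identity(Identity),
    Workload(WorkloadUid),
}

/// Simplified, non-generic version of `CompositeId` (an assumption of this sketch).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CompositeId {
    pub id: Identity,
    pub key: RequestKey,
}

/// Builds a key for the original, identity-only path.
pub fn by_identity(id: &Identity) -> CompositeId {
    CompositeId { id: id.clone(), key: RequestKey::Identity(id.clone()) }
}

/// Builds a key for PID mode, where each workload gets its own entry.
pub fn by_workload(id: &Identity, uid: &str) -> CompositeId {
    CompositeId {
        id: id.clone(),
        key: RequestKey::Workload(WorkloadUid(uid.to_string())),
    }
}

/// Counts how many distinct cache entries a set of requests would occupy.
pub fn distinct_entries(requests: &[CompositeId]) -> usize {
    requests.iter().collect::<HashSet<_>>().len()
}
```

With these definitions, two requests built with `by_identity` for the same identity collapse to one entry, while `by_workload` requests for two different pod UIDs occupy two entries even though the identity is identical.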

PID Verification Flow

In PID mode, ztunnel performs the following steps:

  1. Receive certificate request with WorkloadUid
  2. Query Container Runtime (CRI) for the container's PID
  3. Call SPIRE with the PID for attestation
  4. Re-verify PID after SPIRE returns (guards against PID reuse attacks)
  5. Return certificate to caller
async fn get_cert_by_pid(&self, pid: i32, wl_uid: &WorkloadUid) -> Result<...> {
    // 1. Get certificate from SPIRE using PID
    let certs = self.get_cert_from_spire(DelegateAttestationRequest::Pid(pid)).await;
    
    // 2. Re-verify PID hasn't changed (TOCTOU protection)
    if let Some(pid_client) = &self.pid {
        let fetched_pid = pid_client.fetch_pid(wl_uid).await?;
        if fetched_pid.into_i32() != pid {
            return Err(Error::UnableToDeterminePidForWorkload(...));
        }
    }
    
    Ok(certs?)
}
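
The re-verification step can be exercised in isolation with a small synchronous sketch. Everything here is an assumption made for illustration: `PidSource` stands in for the CRI lookup, `MockRuntime` is a test double, `AttestError` is a hypothetical error type, and the `attest` closure stands in for the SPIRE call. The real code path is asynchronous.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AttestError {
    UnknownWorkload,
    PidChanged { expected: i32, found: i32 },
}

/// Hypothetical PID lookup; in the PR this role is played by a CRI query.
pub trait PidSource {
    fn fetch_pid(&self, workload_uid: &str) -> Option<i32>;
}

/// Test double whose PIDs can change between lookups.
#[derive(Default)]
pub struct MockRuntime {
    pids: RefCell<HashMap<String, i32>>,
}

impl MockRuntime {
    pub fn set_pid(&self, workload_uid: &str, pid: i32) {
        self.pids.borrow_mut().insert(workload_uid.to_string(), pid);
    }
}

impl PidSource for MockRuntime {
    fn fetch_pid(&self, workload_uid: &str) -> Option<i32> {
        self.pids.borrow().get(workload_uid).copied()
    }
}

/// Looks up the PID, runs `attest`, then re-checks that the PID is unchanged.
pub fn issue_with_recheck<P, F>(
    pids: &P,
    workload_uid: &str,
    attest: F,
) -> Result<String, AttestError>
where
    P: PidSource,
    F: FnOnce(i32) -> String,
{
    let pid = pids
        .fetch_pid(workload_uid)
        .ok_or(AttestError::UnknownWorkload)?;
    // Stand-in for the SPIRE call; the workload may restart while this runs.
    let cert = attest(pid);
    let current = pids
        .fetch_pid(workload_uid)
        .ok_or(AttestError::UnknownWorkload)?;
    if current != pid {
        // The certificate was attested against a PID that no longer belongs
        // to this workload, so it is discarded.
        return Err(AttestError::PidChanged { expected: pid, found: current });
    }
    Ok(cert)
}
```

If the closure passed as `attest` changes the workload's PID through `set_pid`, the call returns `AttestError::PidChanged` and the certificate is never handed back to the caller.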

Comparison Summary

Aspect                   Selector Mode                    PID Mode
Attestation Granularity  Per identity (ns/sa)             Per workload (pod)
Certificate Sharing      Yes: same identity shares cert   No: each pod gets own cert
SPIRE Calls              1 per unique identity            1 per pod
Memory Usage             Lower                            Higher
Security Level           Standard                         Enhanced
Cache Key                CompositeId with Identity key    CompositeId with WorkloadUid key

Configuration

# Enable SPIRE integration
spire_enabled: true

# Choose attestation mode
spire_mode: "ByPid"  # or "BySelectors"

# SPIRE socket path
spire_socket_path: "/run/spire/sockets/agent.sock"

# Timeout for SPIRE operations  
spire_timeout: "30s"
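
As a rough sketch of how the `spire_mode` and `spire_timeout` values above might be validated, the following uses only the standard library. The `SpireMode` enum and `parse_timeout` helper are hypothetical names, not taken from the PR; the sketch accepts only the two mode strings shown in the config and only whole-second timeouts such as "30s".

```rust
use std::str::FromStr;
use std::time::Duration;

/// Hypothetical enum mirroring the two `spire_mode` strings in the config.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpireMode {
    ByPid,
    BySelectors,
}

impl FromStr for SpireMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ByPid" => Ok(SpireMode::ByPid),
            "BySelectors" => Ok(SpireMode::BySelectors),
            other => Err(format!("unknown spire_mode: {}", other)),
        }
    }
}

/// Parses a whole-second timeout such as "30s" (other units are rejected).
pub fn parse_timeout(s: &str) -> Result<Duration, String> {
    let secs = s
        .strip_suffix('s')
        .ok_or_else(|| format!("expected a value like \"30s\", got {:?}", s))?;
    secs.parse::<u64>()
        .map(Duration::from_secs)
        .map_err(|e| format!("invalid timeout {:?}: {}", s, e))
}
```

Matching is case-sensitive in this sketch, so "bypid" is rejected rather than silently mapped to a mode.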

Future Considerations

  1. Certificate Caching with Per-Pod Attestation: In PID mode, we should cache and reuse certificates by Identity while still attesting every pod individually. This would reduce SPIRE server load and memory usage—multiple pods with the same identity would share one certificate after each pod passes local PID verification. The first pod triggers a SPIRE call; subsequent pods with the same identity only require local PID verification before reusing the cached certificate.

  2. Collaborate with SPIRE/SPIFFE Community: Work with the SPIRE and SPIFFE community to improve the Delegated Identity API and related interfaces to better support delegated attestation use cases like ztunnel's.

  3. Consider a different trait for attested workloads instead of modifying fetch_certificate.
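
The first consideration above (share certificates by identity while still verifying every pod) could look roughly like this. It is a sketch of a possible future design, not something the PR implements: `SharedCertCache`, the `verify_pid` callback, and the string-typed identities and certificates are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical cache that shares one certificate per identity.
#[derive(Default)]
pub struct SharedCertCache {
    certs: HashMap<String, String>, // identity -> certificate
    pub spire_calls: u32,           // simulated SPIRE requests
}

impl SharedCertCache {
    /// Every pod must pass `verify_pid` locally; only the first pod for an
    /// identity triggers a (simulated) SPIRE call.
    pub fn fetch<V>(
        &mut self,
        identity: &str,
        workload_uid: &str,
        verify_pid: V,
    ) -> Option<String>
    where
        V: FnOnce(&str) -> bool,
    {
        // Per-pod check: never hand out a cached certificate unverified.
        if !verify_pid(workload_uid) {
            return None;
        }
        if let Some(cert) = self.certs.get(identity) {
            return Some(cert.clone());
        }
        // First verified pod for this identity: call SPIRE once and cache.
        self.spire_calls += 1;
        let cert = format!("cert-{}-{}", identity, self.spire_calls);
        self.certs.insert(identity.to_string(), cert.clone());
        Some(cert)
    }
}
```

In this shape, a pod that fails local verification gets nothing, even when a certificate for its identity is already cached, and SPIRE load grows with the number of identities rather than the number of pods.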

@istio-testing istio-testing added do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 21, 2025
@istio-testing
Contributor

Hi @MikeZappa87. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MikeZappa87 MikeZappa87 force-pushed the mzappa/spirepid branch 2 times, most recently from 2880b7c to 5fa989e Compare November 26, 2025 04:25
@MikeZappa87 MikeZappa87 marked this pull request as ready for review December 5, 2025 23:02
@MikeZappa87 MikeZappa87 requested a review from a team as a code owner December 5, 2025 23:02
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Dec 5, 2025
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 18, 2025
@istio-testing
Contributor

PR needs rebase.


@Dimss

Dimss commented Jan 7, 2026

Hey @MikeZappa87, we (RH) are interested in this feature. Can I help push it forward somehow?
It looks like this PR needs a rebase.
In addition, can you shed some light on the supported attestation methods?
In this PR you mention Selector mode, but that PR was closed.
In the current PR you do not mention Selector mode, only PID mode.
To clarify: does the current PR include support for both Selector and PID attestation modes, or only PID mode?

@MikeZappa87
Contributor Author

@Dimss feel free to message me and Arndt on the Istio Slack to discuss. We went to the Istio community sync before the holidays and came away with a couple of action items.

I removed the selector approach to reduce friction with this PR. Selector mode is a SPIRE-specific implementation and does not exist in the SPIFFE specification. Right now, Arndt is working on the SPIFFE broker API spec, which is what I believe the Istio community would want: the current implementation is SPIRE-specific because a SPIFFE API does not exist yet.

Slack thread: https://istio.slack.com/archives/C049TCZMPCP/p1765304313250799

@tjons

tjons commented Jan 17, 2026

@MikeZappa87 hey Mike! This is going to be super helpful... Anything I can do to help move this along? I've got a few commits on SPIRE, do you need help pushing anything on that side forward?

@MikeZappa87
Contributor Author

The Istio community doesn't like the SPIRE-specific Delegated Identity API and wants the SPIFFE broker endpoint API. We are working with the SPIFFE community to get that moving. Reach out to me on the Istio Slack and I can add you to the chat.

Labels

needs-ok-to-test needs-rebase Indicates a PR needs to be rebased before being merged size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


4 participants