Version: v0.25.0 (Latest)

Operations

The EDK containers ship with the observability and operational endpoints expected of any production Java service: health, readiness, version metadata, Prometheus metrics, structured logging, distributed tracing through OpenTelemetry, and tamper-evident audit events. This page covers how those surfaces work in practice and what an operator typically wires into them.

Health, Readiness, and Version Metadata

Each container exposes:

  • GET /health: liveness. Returns 200 OK while the container is alive and responding to requests. Used by Kubernetes liveness probes and container orchestrators.
  • GET /ready: readiness. Returns 200 OK only when the role-specific database pool is healthy, the KMS dependency is reachable where that service uses KMS, and the tenant resolver cache has loaded its initial state. The platform checks its platform database; tenant runtime services check the tenant workload database. Used by Kubernetes readiness probes and load-balancer health checks.
  • GET /version: release metadata. Returns the container version, the Git commit SHA, the build timestamp, and the IDK version range it was built against. Useful for verifying which delivered version is actually running after a rolling deploy.

Readiness is deliberately more stringent than liveness. A container with a healthy /health but a failing /ready is alive but not yet able to serve traffic. The deployment template uses this to keep traffic off a freshly started replica until its dependencies are confirmed reachable.
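The relationship between the two probes can be sketched as a small classifier. This is a minimal Python sketch, not part of the EDK: the function names and state labels are hypothetical, and the only behaviour it assumes is what is stated above, namely that each endpoint answers 200 OK when its condition holds.

```python
import urllib.error
import urllib.request


def probe(base_url: str, path: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of GET base_url + path, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # a non-2xx answer still carries a status code
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def classify(health_status: int, ready_status: int) -> str:
    """Map the two probe results onto the states described above."""
    if health_status != 200:
        return "down"
    if ready_status != 200:
        return "alive-not-ready"  # keep traffic off this replica
    return "serving"


def container_state(base_url: str) -> str:
    """Probe one container and classify it (base_url is deployment-specific)."""
    return classify(probe(base_url, "/health"), probe(base_url, "/ready"))
```

A replica that reports "alive-not-ready" for longer than its expected start-up time is usually waiting on one of the readiness dependencies listed above.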

OpenTelemetry

The EDK telemetry module is wired into every container. Spans propagate through the command transport layer, so a single trace ID covers the API gateway, the issuer or verifier protocol handler, the attribute pipeline phases, the KMS signing call, and the webhook dispatch.

The container reads OpenTelemetry configuration from the standard environment variables: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, OTEL_SERVICE_NAME (defaulted by the container to enterprise-issuer, enterprise-verifier, and so on), and the standard sampler configuration (OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG). When OpenTelemetry is not configured, the telemetry module is a no-op.
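A pre-flight check of these variables can catch a mistyped header list before a rollout. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and treating a set OTEL_EXPORTER_OTLP_ENDPOINT as the signal that export is configured is an assumption, since the exact condition the telemetry module uses is not documented here. The key=value,key=value header syntax follows the OpenTelemetry environment variable specification.

```python
import os
from typing import Mapping


def parse_otlp_headers(raw: str) -> dict:
    """Parse OTEL_EXPORTER_OTLP_HEADERS ("k1=v1,k2=v2") into a dict."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        headers[key.strip()] = value.strip()
    return headers


def otel_summary(env: Mapping[str, str] = os.environ) -> dict:
    """Summarise the OTEL settings without exposing header values."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    raw_headers = env.get("OTEL_EXPORTER_OTLP_HEADERS", "")
    return {
        # Assumption: an endpoint is what makes export "configured".
        "export_enabled": bool(endpoint),
        "endpoint": endpoint or None,
        # None here means the container's own default name applies.
        "service_name": env.get("OTEL_SERVICE_NAME"),
        "header_names": sorted(parse_otlp_headers(raw_headers)),
    }
```

Only header names are reported, because the values typically carry collector credentials.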

Metrics are exposed on /metrics in Prometheus exposition format. The metrics set includes the standard JVM metrics (heap, GC, threads), HTTP server metrics (request rate, latency histograms by route, error rates), command execution metrics (per command id), pipeline metrics on the issuer (phase duration, source duration, deferral rates), DCQL query metrics on the verifier, and KMS call metrics on every runtime container.

W3C Trace Context propagation is on by default. The traceparent and tracestate headers flow through the EDK transport layer, so a trace started at the API gateway or upstream caller stays continuous through the issuer/verifier/AS/KMS chain.
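When following a request across containers, it helps to extract the trace ID from a captured traceparent header and search for it in the tracing backend or the logs. The sketch below parses the version 00 layout defined by the W3C Trace Context specification (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags); the class and function names are illustrative and not part of the EDK.

```python
import re
from typing import NamedTuple

# version-traceid-parentid-flags, all lowercase hex (version 00 layout)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Parse and validate a four-field traceparent header value."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a valid traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The specification forbids version ff and all-zero identifiers.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"forbidden value in traceparent: {header!r}")
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit means "sampled"
    return TraceParent(version, trace_id, parent_id, sampled)
```

The trace_id returned here is the value to look for in the trace_id field of the structured logs.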

Structured Logging

Logging is JSON to stdout by default. Each log entry carries the standard severity, message, and exception fields plus EDK-specific structured fields:

  • tenant_id: the resolved tenant on the call, when applicable.
  • correlation_id: the cross-request correlation identifier.
  • command_id: the EDK command id, when the log entry was emitted inside command execution.
  • trace_id and span_id: the OpenTelemetry trace context.
  • principal_id: the authenticated principal, when applicable.

For sensitive operations (a credential issuance, an OID4VP presentation, a federation handshake), the log message itself is deliberately abstract; the structured fields carry the operational detail. This keeps the log stream useful for debugging without leaking credential subject claims, federation user attributes, or other PII into a downstream log aggregator.
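Because the operational detail lives in the structured fields, debugging usually means filtering on those fields rather than searching message text. A minimal Python sketch of that filter follows; the field names tenant_id, command_id, and trace_id come from the list above, while the function name and any sample values used with it are hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **wanted: str) -> Iterator[dict]:
    """Yield JSON log entries whose structured fields equal all of `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything on stdout that is not a JSON entry
        if not isinstance(entry, dict):
            continue
        # An entry matches only if every requested field is present and equal.
        if all(entry.get(key) == value for key, value in wanted.items()):
            yield entry
```

For example, `filter_logs(open("issuer.log"), tenant_id="acme", trace_id="4bf92f3577b34da6a3ce929d0e0e4736")` would yield every entry for one tenant within one trace, assuming the container's stdout was captured to a file named issuer.log.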

Audit

Audit is a separate stream from the operational log. Every command execution, authorization decision, authentication event, and admin REST mutation emits a structured audit event with the tenant, the principal, the command id, the result, and the relevant business identifiers. The audit subsystem ships with sensitive-data redaction (configurable per command), multiple output formats (JSON, CEF, OCSF), and tamper evidence via hash chaining plus signed checkpoints.

The default audit sink is a Postgres-backed event store (PostgresDatabaseEventStore). A read REST surface (/api/v1/audit/events) lets platform administrators query by tenant, principal, command, time range, and result. For long-term retention, the audit pipeline can replicate events to an external SIEM through the SSF (Shared Signals Framework) module or through a generic event transmitter.

Audit signing is per-tenant and optional. When enabled, each event is signed with a tenant-scoped audit key on the KMS (tenant, audit, audit-checkpoint), and periodic signed checkpoints make tampering with stored events detectable. Signing is off by default; enable it per tenant through audit.events.signing.enabled.
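To see why hash chaining makes stored events tamper-evident, consider the generic technique in isolation. The sketch below is not the EDK's audit format: the record layout, the SHA-256 choice, and the function names are assumptions made purely for illustration. Each record's hash covers the previous record's hash, so changing any stored event invalidates every hash from that point on.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for an empty chain


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> None:
    """Append an event, linking it to the hash of the record before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "hash": chain_hash(prev, event)})


def first_tampered_index(chain: list) -> int:
    """Return the index of the first record that fails to verify, or -1."""
    prev = GENESIS
    for index, record in enumerate(chain):
        if record["hash"] != chain_hash(prev, record["event"]):
            return index
        prev = record["hash"]
    return -1
```

A chain alone does not stop someone with write access from recomputing every hash after an edit. That gap is what a signed checkpoint closes: a signature over a recent chain hash, made with a key the attacker does not hold, pins the history up to that point.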

Backup and Restore

Postgres backup and restore is the principal data-protection story. Back up both database roles: the platform database and the tenant workload database with its per-tenant schemas. The EDK does not run its own scheduled backups; the deployment uses whatever backup tooling the operator runs against Postgres in general (pg_basebackup, managed-service snapshots, WAL archiving).

For a clean tenant export (for compliance or for moving a tenant to a different deployment), the tenant export REST emits a self-contained JSON document with all of a tenant's CRUD-managed entities: tenant registration, public-endpoint bindings, integrations, credential designs, attribute supplier registrations, federation providers, DCQL queries, trust source bindings, signing key aliases. Re-importing the document on a target deployment recreates the tenant configuration.

Tenant export does not include credential subject data, presentation records, audit history, or KMS-held key material. Subject data and audit history follow the standard data-retention policy; key material does not move (the new deployment generates new keys for the tenant under its own KMS).

KMS key material is the special case. The provider backends (AWS KMS, Azure Key Vault, HSMs) have their own key backup and recovery stories. The EDK does not export private key material out of the provider backend. For the software keystore provider, the backup is a copy of the keystore file; for everything else, the backup is whatever the backend supports.

Image Distribution and Delivered Versions

Sphereon publishes the enterprise images through authenticated Docker registry repositories under nexus.sphereon.com/edk-docker. The customer deployment consumes the delivered image tags and uses the Enterprise Development Kit Deployment repository for Compose, Helm, gateway examples, Postman, and provisioning scripts. Registry credentials are issued under the commercial agreement during onboarding; customers pull via standard Docker auth:

docker login nexus.sphereon.com
docker pull nexus.sphereon.com/edk-docker/enterprise-issuer:<enterprise-version>

A typical pull-and-pin pattern in the customer's deployment:

image: nexus.sphereon.com/edk-docker/enterprise-issuer:<enterprise-version>

In production, deploy the exact version tag Sphereon supplied for the delivery, never :latest. To verify an image, compare three sources that each identify the product version, the source revision, and the included module versions: the metadata at /version, the image OCI labels, and the SBOM published alongside each image.
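After a rolling deploy, the /version metadata of every replica can be compared against the pinned tag. The following Python sketch shows only the comparison step. The JSON key name "version" is an assumption, because the field names of the /version response are not specified here, and the replica names are placeholders; adjust both to the actual response and deployment.

```python
def mismatched_replicas(
    metadata_by_replica: dict,
    expected_version: str,
    version_key: str = "version",  # assumed key name, not confirmed
) -> list:
    """Return the replicas whose reported version differs from the pin."""
    return sorted(
        name
        for name, metadata in metadata_by_replica.items()
        if metadata.get(version_key) != expected_version
    )
```

An empty result means every replica reports the pinned version. A replica missing the key is reported as a mismatch rather than silently passing.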

Helm Deployment

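Sphereon delivers the edk-enterprise Helm chart alongside the images; the chart coordinates are provided during onboarding. Create the registry pull secret in the target namespace before installing. The multi-line commands on this page use the PowerShell backtick for line continuation; in a POSIX shell such as bash, replace each trailing backtick with a backslash: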

kubectl create namespace edk
kubectl -n edk create secret docker-registry sphereon-nexus `
  --docker-server=nexus.sphereon.com `
  --docker-username=<user> `
  --docker-password=<token> `
  --docker-email=<email>

The edk-enterprise chart is published to Sphereon's authenticated Nexus repository; add it as a Helm repository with the credentials issued during onboarding. Then install or upgrade:

helm repo add sphereon <nexus-helm-repository-url> --username <user> --password <token>
helm upgrade --install edk-enterprise sphereon/edk-enterprise `
  --namespace edk `
  --set global.imageTag=<enterprise-version> `
  --set "global.imagePullSecrets[0]=sphereon-nexus"

The chart defaults to global.imageRegistry=nexus.sphereon.com/edk-docker and deploys enterprise-platform, enterprise-tenant-kms, enterprise-did, enterprise-tenant-as, enterprise-issuer, enterprise-verifier, and the optional admin-console. KMS has no public ingress by default. The runtime services expose only their public protocol paths through public ingress and keep administrative paths behind internal ingress. The admin console is routed at /admin-console on the platform host.

The chart keeps gRPC internal. Platform and tenant-KMS expose the gRPC receiver on the cluster network; DID, tenant-AS, issuer, verifier, and tenant-KMS use the platform route for configuration and platform-managed calls, and DID, tenant-AS, issuer, and verifier use the KMS route for signing and key operations. Never publish gRPC on the external gateway.

Operator Hardening Checklist

A production-ready deployment of the EDK enterprise containers ticks the following:

  • Network isolation. KMS internal-only. Admin REST on every runtime container behind the internal ingress with bearer-JWT auth. Public ingress carries protocol paths and .well-known URLs only. NetworkPolicies on Kubernetes scope inter-container traffic to the actual call graph.
  • TLS. Public ingresses terminate TLS at the gateway. Internal communication uses mTLS (mesh) or service JWT (in-process), per the topology choice.
  • Secrets. No secret in YAML, env var, or image. Every secret is a ${secret:...} reference resolved through the configured backend.
  • JWT validation. Admin REST requires a bearer JWT with the right scopes. JWT issuer URL configured per environment. JWKS refresh schedule sized to expected key rotation cadence.
  • Postgres. TLS in transit for both the platform database and the tenant workload database. Backups verified by periodic restore, including schema-per-tenant restore checks. Connection pools sized to each container's actual throughput, with the pool max below the relevant Postgres target's max_connections divided across replicas.
  • OpenTelemetry. Wired to the deployment's collector. Sampling configured to the desired retention budget.
  • Metrics. Prometheus or compatible scraper subscribed to /metrics on every container. Alerts on the standard SLIs: error rate, latency p99, KMS call failure rate, webhook dispatch backlog, Postgres connection pool exhaustion.
  • Logs. Centralised aggregation. Retention sized for the deployment's compliance requirements.
  • Audit. Sink configured (Postgres by default, replication to SIEM if applicable). Signing enabled per tenant where required.
  • Image hygiene. Pull only the delivered enterprise tags from Sphereon's authenticated registry. Verify image metadata, signatures or registry attestations when provided, and the SBOM against the operator's dependency policy.
  • Capacity probes. Load test the deployment against expected traffic before going live, sized to the tenant count and credential volume of the actual workload.
  • Failover. Platform and tenant workload database failover tested. KMS failover (where applicable) tested. The deployment template's readiness probes correctly remove a replica from rotation when its dependencies fail.
  • Runbook. Document the deployment's tenant onboarding flow, key rotation procedure, federation provider rotation procedure, and incident response for a credential-compromise event (typically: revoke through the audit pipeline, rotate the affected tenant signing key, re-issue affected credentials).
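The pool-sizing rule in the Postgres item is simple arithmetic and worth writing down once. The sketch below is a minimal Python illustration; the function name and the reserved parameter (connections held back for administration and maintenance) are assumptions, not EDK settings. Count every replica of every container that opens a pool against the same Postgres target, including surge replicas that exist briefly during a rolling deploy.

```python
def pool_max_per_replica(
    max_connections: int,
    replicas: int,
    reserved: int = 0,
) -> int:
    """Largest per-replica pool size that cannot exhaust max_connections."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    usable = max_connections - reserved
    if usable < replicas:
        raise ValueError("not enough connections for one per replica")
    # Integer division keeps replicas * pool_max within the usable budget.
    return usable // replicas
```

For example, with max_connections at 200, 10 connections reserved, and 6 replicas sharing the target, the ceiling is 190 // 6 = 31 connections per replica, for at most 186 in total.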

When Something Goes Wrong

A few standard diagnostics:

  • /credential returning 202 more often than expected on the issuer container points at attribute sources failing to respond within the syncWaitWindow. Check the issuer's pipeline metrics for per-source latency, then raise the syncWaitWindow on the slow source, move it to a later phase, or accept the deferred flow.
  • Wallet metadata fetch returning the wrong host URL points at a missing or stale tenant_public_endpoint binding. Verify the binding through the tenant admin REST; the runtime URL resolver is fail-closed by default, so an absent binding produces a refusal rather than a silent fallback.
  • Webhook deliveries failing for one consumer while the dispatcher itself is healthy point at a per-destination circuit breaker having opened. Check the webhook delivery status REST for the affected destination; the circuit breaker auto-closes after the cool-down window once the destination recovers.
  • Federation provider connectivity errors are surfaced by the TenantIdpConnectivity test endpoint. Re-run the connectivity test after the upstream IdP recovers; the AS keeps the cached JWKS and discovery document until the test passes again.
  • Cross-replica config not propagating points at the Postgres LISTEN/NOTIFY bridge being interrupted. The TTL fallback covers this within the configured cache TTL; if a permanent block exists (network policy, Postgres permission), the event subsystem's health metric surfaces it.