Fallback Chain Design for Kubernetes Clusters

This page implements one concrete task inside fallback chain configuration: a headless Python orchestrator that reads live Kubernetes health signals during a stateful restore drill, selects a deterministic recovery tier, and exits with a strict POSIX status code the drill runner can branch on. Stateful workload recovery on Kubernetes needs deterministic failure routing — a linear retry loop introduces cascading validation blackouts and leaves DBAs and SREs blind to partial restoration states. The fallback chain here operates as a directed execution graph that decouples control plane reconciliation from data plane validation, so verification proceeds through isolated execution tiers without ever risking a split-brain promotion. Each tier’s outcome maps to a status tier that feeds downstream error categorization frameworks, the drill runs inside the boundaries established by sandbox provisioning automation, and a “green” verdict only ever means the restore converged and passed a read-only probe inside the windows defined by your RTO and RPO mapping.

Architecture and Execution Model

Figure. Finite state machine logic that selects a Kubernetes fallback tier from cluster health signals before running application validation gates.

The orchestrator prioritizes deterministic tier selection, explicit timeout enforcement, and strict error propagation. It queries the Kubernetes API for PersistentVolumeClaim phase transitions, etcd member health, and CSI driver readiness, then advances to the furthest tier the Kubernetes cluster currently qualifies for — falling back toward control-plane reconciliation as health degrades. A bound volume implies a healthy control plane, so the volume-attachment tier is evaluated first. Selection logic is isolated from the drill runner: the script returns strict exit codes and structured, line-oriented logs, and makes no decisions about alerting or promotion itself. Network isolation activates automatically to quarantine the drill namespace before any degraded tier engages, preventing cross-namespace contamination of the production data plane.

Prerequisites

Python 3.8+ (the IntEnum ordering and typing usage are stable from 3.8 onward).
The official Kubernetes client, installed into the automation environment:
bash
```
pip install "kubernetes>=29.0"
```
In-cluster or kubeconfig access to the target cluster. Run the orchestrator from a Job inside the isolated drill namespace (in-cluster) or from a bastion with a scoped kubeconfig — never against the production control plane.

A dedicated drill ServiceAccount with a Role/RoleBinding granting read on pods, PVCs, StatefulSet status, and ConfigMaps, plus create on NetworkPolicy and pods/exec within the drill namespace:

yaml

rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims", "configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "statefulsets/status"]
    verbs: ["get"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["create"]

An etcd snapshot status ConfigMap written by the backup stage after etcdctl snapshot status matches the backup manifest and cluster UUID. The orchestrator reads this flag rather than trusting the snapshot at run time — validating a corrupted snapshot against itself proves nothing.

Production Implementation

The orchestrator collects four health signals, selects a FallbackTier, executes it under a bounded timeout, and gates promotion on a read-only transaction against the restored StatefulSet. It returns a deterministic status for CI/CD or DR automation hooks: any tier that fails to converge, or a validation gate that rejects the restore, fails the run.

python

#!/usr/bin/env python3
"""Kubernetes DR fallback-chain orchestrator.

Selects a recovery tier from live cluster health signals, executes it under a
bounded timeout, and gates StatefulSet promotion on a read-only validation query.

Exit codes (consumed by the DR drill runner):
    0  a fallback tier converged and the read-only validation gate passed
    1  every eligible tier failed, or the validation gate rejected the restore
    2  usage / configuration / cluster-access error -> abort the drill
"""
import logging
import sys
import time
from enum import IntEnum
from typing import Any, Dict

from kubernetes import client, config
from kubernetes.client.rest import ApiException
from kubernetes.stream import stream

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("dr_fallback_orchestrator")


class FallbackTier(IntEnum):
    VOLUME_ATTACHMENT_SYNC = 1
    CONTROL_PLANE_RECONCILE = 2
    DEGRADED_STATEFULSET = 3


class DrillOrchestrator:
    def __init__(self, namespace: str, statefulset: str, timeout: int = 300):
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()
        self.core_v1 = client.CoreV1Api()
        self.apps_v1 = client.AppsV1Api()
        self.net_v1 = client.NetworkingV1Api()
        self.namespace = namespace
        self.statefulset = statefulset
        self.timeout = timeout

    # --- health signal collection -------------------------------------------
    def collect_state(self) -> Dict[str, Any]:
        return {
            "pvc_bound": self._all_pvcs_bound(),
            "csi_driver_healthy": self._csi_driver_healthy(),
            "api_server_ready": self._api_server_ready(),
            "etcd_snapshot_verified": self._etcd_snapshot_verified(),
        }

    def _all_pvcs_bound(self) -> bool:
        claims = self.core_v1.list_namespaced_persistent_volume_claim(self.namespace)
        return bool(claims.items) and all(
            c.status.phase == "Bound" for c in claims.items
        )

    def _csi_driver_healthy(self) -> bool:
        pods = self.core_v1.list_namespaced_pod(
            "kube-system", label_selector="app=csi-driver"
        )
        return bool(pods.items) and all(
            p.status.phase == "Running" for p in pods.items
        )

    def _api_server_ready(self) -> bool:
        try:
            self.core_v1.get_api_resources()
            return True
        except ApiException:
            return False

    def _etcd_snapshot_verified(self) -> bool:
        # The backup stage writes verified=true only after `etcdctl snapshot
        # status` matches the manifest checksum and the cluster UUID.
        try:
            cm = self.core_v1.read_namespaced_config_map(
                "etcd-snapshot-status", self.namespace
            )
        except ApiException:
            return False
        return (cm.data or {}).get("verified", "false").lower() == "true"

    # --- tier selection ------------------------------------------------------
    def evaluate_transition(self, state: Dict[str, Any]) -> FallbackTier:
        # Advance to the furthest tier the cluster qualifies for; a bound volume
        # implies a healthy control plane, so evaluate the volume tier first.
        if state.get("csi_driver_healthy") and state.get("pvc_bound"):
            return FallbackTier.VOLUME_ATTACHMENT_SYNC
        if state.get("etcd_snapshot_verified") and state.get("api_server_ready"):
            return FallbackTier.CONTROL_PLANE_RECONCILE
        self._apply_egress_isolation()
        return FallbackTier.DEGRADED_STATEFULSET

    def _apply_egress_isolation(self) -> None:
        policy = client.V1NetworkPolicy(
            metadata=client.V1ObjectMeta(
                name="dr-drill-egress-deny", namespace=self.namespace
            ),
            spec=client.V1NetworkPolicySpec(
                pod_selector=client.V1LabelSelector(),
                policy_types=["Egress"],
                egress=[
                    client.V1NetworkPolicyEgressRule(
                        to=[
                            client.V1NetworkPolicyPeer(
                                namespace_selector=client.V1LabelSelector(
                                    match_labels={
                                        "kubernetes.io/metadata.name": "kube-system"
                                    }
                                )
                            )
                        ]
                    )
                ],
            ),
        )
        try:
            self.net_v1.create_namespaced_network_policy(self.namespace, policy)
            logger.warning("Applied egress quarantine before degraded tier")
        except ApiException as exc:
            if exc.status != 409:  # already applied on a retried drill is fine
                raise

    # --- tier execution ------------------------------------------------------
    def _wait_statefulset_ready(self) -> bool:
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            sts = self.apps_v1.read_namespaced_stateful_set_status(
                self.statefulset, self.namespace
            )
            desired = sts.spec.replicas or 0
            ready = sts.status.ready_replicas or 0
            if desired and ready == desired:
                return True
            time.sleep(5)
        return False

    def execute_tier(self, tier: FallbackTier) -> bool:
        logger.info("Executing fallback tier: %s", tier.name)
        # Every tier ultimately converges the StatefulSet; the isolation policy
        # for the degraded tier has already been applied during selection.
        return self._wait_statefulset_ready()

    # --- application validation gate -----------------------------------------
    def validation_gate(self) -> bool:
        pod = f"{self.statefulset}-0"
        command = [
            "psql", "-tAc",
            "SET default_transaction_read_only = on; SELECT 1;",
        ]
        try:
            resp = stream(
                self.core_v1.connect_get_namespaced_pod_exec,
                pod, self.namespace,
                command=command,
                stderr=True, stdin=False, stdout=True, tty=False,
                _request_timeout=self.timeout,
            )
        except ApiException as exc:
            logger.error("Validation exec failed on %s: %s", pod, exc)
            return False
        return resp.strip().endswith("1")


def main() -> int:
    if len(sys.argv) != 3:
        logger.error("Usage: dr_fallback_orchestrator.py <namespace> <statefulset>")
        return 2

    namespace, statefulset = sys.argv[1], sys.argv[2]
    try:
        orch = DrillOrchestrator(namespace, statefulset)
    except config.ConfigException as exc:
        logger.error("Cluster access error: %s", exc)
        return 2

    state = orch.collect_state()
    logger.info("Cluster health signals: %s", state)
    tier = orch.evaluate_transition(state)

    if not orch.execute_tier(tier):
        logger.critical("Tier %s did not converge within %ss", tier.name, orch.timeout)
        return 1

    if not orch.validation_gate():
        logger.critical("Read-only validation gate rejected the restore")
        return 1

    logger.info("Tier %s converged and validation passed; drill green", tier.name)
    return 0


if __name__ == "__main__":
    sys.exit(main())

Figure. Each health-signal guard selects exactly one ordered fallback tier; all three converge on the same read-only validation gate, whose verdict maps to the POSIX exit codes the drill runner branches on.

Step-by-Step Execution Walkthrough

The orchestrator assumes the isolated drill namespace already exists and the restore has been staged into it. From there the drill proceeds through deterministic steps:

Confirm the etcd snapshot flag. Before the orchestrator runs, the backup stage must verify snapshot integrity and write the result. Use etcdctl snapshot status and compare the reported hash and cluster UUID against the backup manifest, then patch the etcd-snapshot-status ConfigMap with verified=true.

Clear orphaned volume state, if any. If CSI drivers report attachment timeouts, strip finalizers from orphaned PersistentVolume objects so the claims can rebind before the volume tier is evaluated:

bash

# Identify stuck PVCs in the drill namespace
kubectl get pvc -n dr-drill-sandbox \
  -o jsonpath='{range .items[?(@.status.phase=="Pending")]}{.metadata.name}{"\n"}{end}'

# Strip finalizers from orphaned PVs to force CSI detachment
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":[]}}' --type=merge

# Verify CSI driver attachment status
kubectl get csistoragecapacities -A
kubectl describe pod -n kube-system -l app=csi-driver | grep -A5 "AttachVolume"

Run the orchestrator against the drill namespace and target StatefulSet:

bash

python3 dr_fallback_orchestrator.py dr-drill-sandbox app-db; echo "exit=$?"

Branch on the exit code. 0 promotes the restored instance, 1 triggers rollback (scale the StatefulSet to zero and restore the previous PVC snapshot), and 2 aborts the drill on a malformed invocation or lost cluster access — mapped explicitly in the orchestration wrapper below.

Verification and Expected Output

A clean run logs the collected signals, the selected tier, convergence, and a green line, and exits 0:

text

2026-07-05 04:12:01 | INFO | dr_fallback_orchestrator | Cluster health signals: {'pvc_bound': True, 'csi_driver_healthy': True, 'api_server_ready': True, 'etcd_snapshot_verified': True}
2026-07-05 04:12:01 | INFO | dr_fallback_orchestrator | Executing fallback tier: VOLUME_ATTACHMENT_SYNC
2026-07-05 04:12:31 | INFO | dr_fallback_orchestrator | Tier VOLUME_ATTACHMENT_SYNC converged and validation passed; drill green

A degraded run applies the egress quarantine, converges the fallback tier, but fails the read-only probe and exits 1:

text

2026-07-05 04:12:01 | WARNING | dr_fallback_orchestrator | Applied egress quarantine before degraded tier
2026-07-05 04:12:01 | INFO | dr_fallback_orchestrator | Executing fallback tier: DEGRADED_STATEFULSET
2026-07-05 04:14:41 | CRITICAL | dr_fallback_orchestrator | Read-only validation gate rejected the restore

The exit code is the contract the drill runner reads:

0 — a tier converged and the validation gate passed. Proceed to promotion.
1 — no eligible tier converged, or the gate rejected the restore. Roll back and quarantine.
2 — missing arguments or lost cluster access. Abort the drill.

Failure Modes and Troubleshooting

Symptom	Cause	Remediation
Orchestrator selects `DEGRADED_STATEFULSET` unexpectedly	The `etcd-snapshot-status` ConfigMap is missing or `verified` is not `true`	Confirm the backup stage patched the ConfigMap after `etcdctl snapshot status`; check the drill ServiceAccount has `get` on ConfigMaps
`ApiException: 403 Forbidden` on `NetworkPolicy` create	Drill `Role` lacks `create` on `networking.k8s.io/networkpolicies`	Add the verb to the `Role`; a degraded tier must be able to quarantine before it runs
PVCs stuck `Pending`, volume tier never selected	CSI attachment deadlock from a stale `VolumeAttachment` finalizer	Strip PV finalizers as in the walkthrough; confirm CSI driver minor-version parity with the target cluster
`_wait_statefulset_ready` times out at the boundary	`spec.replicas` exceeds available nodes, or a pod is `FailedScheduling`	Inspect `kubectl describe pod`; raise the `timeout` only after confirming the scheduler can place all replicas
Validation gate returns `False` on a healthy DB	`psql` not on the pod image, or the probe pod name differs from `<statefulset>-0`	Match the exec command to the container’s client binary; confirm the ordinal-0 pod hosts the primary
`409 Conflict` logged then run continues	A prior drill left the egress policy in place	Expected on retries — the handler treats `409` as idempotent; clean up policies in the namespace teardown

Maintain etcd version parity between backup artifacts and the target cluster to prevent schema migration failures during snapshot restoration, and validate CSI driver compatibility with the target Kubernetes minor version before initiating volume reconciliation. Implement exponential backoff in the wrapper for API retries so concurrent drills do not saturate the control plane, and prefer kubectl wait --for=condition=Ready --timeout=... over custom polling where a single object suffices.

Integration Notes

The orchestrator is built for headless execution. A thin shell wrapper turns its exit code into a rollback decision and an alert:

bash

#!/usr/bin/env bash
set -euo pipefail

python3 dr_fallback_orchestrator.py dr-drill-sandbox app-db
case $? in
  0) echo "[$(date -u)] restore validated; promoting" ;;
  1) kubectl scale statefulset app-db -n dr-drill-sandbox --replicas=0
     curl -s -X POST "$PAGERDUTY_WEBHOOK" -d '{"event":"dr_restore_rollback"}'
     exit 1 ;;
  *) echo "[$(date -u)] orchestrator misconfigured; aborting"; exit 2 ;;
esac

Wire that wrapper into whichever scheduler owns the drill:

Airflow — invoke it from a KubernetesPodOperator (or a BashOperator that inspects returncode); a non-zero exit fails the task and short-circuits the downstream promotion task, keeping the DAG run history as the audit trail.
Celery — wrap the call in a task that raises on non-zero so the broker records the failure, giving event-driven drills — triggered when a fresh backup lands — low-latency dispatch.
cron — schedule the wrapper directly on the bastion; because it returns strict POSIX codes, cron/systemd OnFailure handlers can route rollback alerts without extra glue.

Because the exit code is the same contract used by database-native validators such as the checksum validation pipeline, a single orchestration wrapper can compose logical-integrity checks and this Kubernetes cluster-level fallback chain into one gated drill. For exact status-field semantics during phase transitions consult the Kubernetes PersistentVolumeClaim API; for version-specific snapshot constraints see the official etcd recovery procedures.

Frequently Asked Questions

Why evaluate the volume-attachment tier before the control-plane tier?

A bound PersistentVolumeClaim with a healthy CSI driver is strictly stronger evidence than a verified etcd snapshot: it implies the control plane reconciled far enough to attach storage. Evaluating the volume tier first means the orchestrator advances to the furthest tier the Kubernetes cluster genuinely qualifies for, rather than settling for control-plane reconciliation when the data plane is already available. The chain only falls back toward the control-plane and degraded tiers as those stronger signals disappear.

Why quarantine egress before the degraded tier instead of at drill start?

The volume and control-plane tiers run against a restore that already reconciled cleanly and is validated read-only, so blanket egress denial would add no safety and could break the validation exec. The degraded tier, by contrast, brings up a StatefulSet whose data plane state is uncertain — that is exactly when an accidental write or replication callout to production is most dangerous. Applying the NetworkPolicy at that transition confines the blast radius to the moment it matters.

Why gate on a read-only transaction rather than pod readiness alone?

A ready StatefulSet only proves the container passed its liveness and readiness probes; it says nothing about whether the restored data is queryable. Running SET default_transaction_read_only = on; SELECT 1; proves the database engine accepted a session and executed a statement against the restored volume without permitting any mutation of the drill data. It is the minimal probe that distinguishes "the pod started" from "the backup restored into a usable database."

What is the difference between exit code 2 and exit code 1?

Exit 2 is an abort condition — wrong argument count or lost cluster access — where the orchestrator never judged any restore. Exit 1 is a verdict: it selected and executed a tier and either the StatefulSet failed to converge within the timeout or the validation gate rejected the restore. Runners should treat 2 as "fix the invocation" and 1 as "roll back the drill namespace and escalate."

Fallback Chain Configuration — the parent workflow this Kubernetes orchestrator plugs into as its per-cluster tier-selection stage.
Sandbox Provisioning Automation — how the isolated drill namespace and its storage are instantiated before the chain runs.
Smoke Test Routing Logic — where a green fallback verdict hands off to service-level probes across the restored topology.
Point-in-Time Recovery Targeting — selecting the recovery point the restored StatefulSet is validated against.
Error Categorization Frameworks — mapping each tier’s failure into an orchestrator action.

This script is one component of the broader Fallback Chain Configuration workflow, itself part of Restore Drill Orchestration & Environment Isolation.