Validation Model Selection

Validation is not a binary property of a backup — an artifact can be byte-identical to its source and still fail to boot, or restore cleanly yet violate a foreign-key constraint that only surfaces under live query load. Selecting a validation model is the engineering decision of how deeply each artifact must be proven before the orchestrator will promote it, and that depth trades directly against compute cost and drill wall-clock time. This section of Core DR Architecture & Validation Fundamentals closes the gap between “the restore did not error” and “the recovered system is provably correct,” by defining a tiered decision matrix that binds data criticality, infrastructure constraints, and the remaining recovery budget to a concrete verification depth for every backup that enters the pipeline.

The selected model is a runtime input, not a static configuration. A checksum validation pipeline proves transport and at-rest integrity in seconds, but proving that a restored database answers production queries correctly requires materializing it and driving synthetic load — an operation that can consume a large share of the RTO/RPO envelope. The model chosen for a given artifact therefore depends on its criticality tier, the time budget still available, and whether the target can be safely provisioned inside sandbox provisioning automation. When a model fails, the outcome is mapped onto the shared error categorization framework so that a checksum mismatch and a failed smoke test escalate through the same severity contract. For DBAs, SREs, and Python automation engineers, model selection is the knob that decides how much recovery confidence each drill buys and how much of the recovery budget it spends to buy it.

Architecture and Execution Workflow

Figure. A state machine resolves each artifact's criticality tier and remaining RTO budget, dispatches it to a cryptographic, structural, or functional validation model, and gates promotion on the persisted verdict.

Model selection is implemented as a state-machine-driven dispatcher rather than a fixed pipeline. Every artifact carries metadata tags — criticality tier, engine type, storage tier — and the dispatcher reads those tags together with the remaining recovery budget to instantiate exactly one validation model per run. The models form an ordered escalation of fidelity and cost: cryptographic verification is cheap and universal, structural validation requires an ephemeral database instance, and functional validation requires a fully materialized application stack. The selection logic is deterministic and side-effect free; all provisioning, execution, and teardown happen behind the chosen model so the dispatcher itself stays trivially testable. The phases below decompose the models and the selection pipeline into the concerns a production implementation must get right independently.

Validation Depth Models

The three models are not alternatives to be chosen once and frozen — they are an escalation ladder. A higher tier presupposes the lower tiers have passed, because there is no value in smoke-testing an application whose underlying pages are corrupt. In practice the dispatcher runs the cheapest gating model first and only escalates when the artifact’s criticality justifies the additional spend.

Tier 1 — Cryptographic and Block-Level Integrity Verification

The foundational model executes deterministic, lightweight integrity verification: cryptographic hashing, block-level checksums, and manifest reconciliation that confirm a backup artifact is uncorrupted in transit and at rest. For object-storage backups and immutable volume snapshots this completes in seconds against negligible compute, which makes it the default gate for every artifact and the only economically viable model for high-frequency, pre-drill screening.

A Python implementation computes hashes with the standard-library hashlib module (or a cloud SDK’s server-side checksum) and reconciles the result against a signed manifest generated at backup time. The output is a deterministic pass/fail signal: on mismatch the dispatcher halts before any expensive provisioning, routes the artifact to quarantine, and escalates, because a corrupt artifact cannot yield a meaningful structural or functional result. This is the same integrity contract enforced by the upstream checksum validation pipeline; at the selection layer it is simply the mandatory floor beneath every other model.
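The hashing step described above can be sketched as follows. This is a minimal illustration, assuming the manifest has already been signature-checked and parsed into a plain mapping of file name to expected SHA-256 hex digest; the function names and the manifest shape are hypothetical, not part of any existing pipeline.

```python
import hashlib
import hmac
from pathlib import Path

# Read in 1 MiB blocks so large artifacts are never loaded fully into memory.
CHUNK_BYTES = 1024 * 1024


def sha256_of_file(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(CHUNK_BYTES), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_against_manifest(path: Path, manifest: dict) -> bool:
    """Return True only if the artifact's digest matches its manifest entry.

    `manifest` is a hypothetical mapping of file name -> expected hex digest;
    verifying the manifest's own signature is out of scope for this sketch.
    """
    expected = manifest.get(path.name)
    if expected is None:
        return False  # an artifact absent from the manifest is never a pass
    return hmac.compare_digest(sha256_of_file(path), expected.lower())
```

Treating an artifact that is absent from the manifest as a failure, rather than skipping it, keeps the gate from passing files the backup job never recorded.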

Tier 2 — Structural and Logical Consistency Validation

The middle model proves logical consistency without materializing an application layer. It provisions an ephemeral instance, attaches the restored volume, and runs schema validation, index and page-integrity checks, and statistical row-count sampling that verify primary-key uniqueness, foreign-key constraints, and partition alignment. For relational and document stores this is the model that catches the failure class checksums cannot see: an artifact that is byte-perfect yet logically inconsistent because it captured a torn write or an in-flight schema migration.

Selection of this model is gated by storage physics. The underlying backup taxonomy and storage tiers determine whether the artifact can be validated in place or must be rehydrated first — cold-archive retrieval introduces latency that can force structural validation onto an asynchronous schedule, while hot-tier snapshots support near-real-time checks inside the drill window. The checks themselves are encapsulated in parameterized test suites (for example pytest fixtures wrapping DB-API cursors), with connection strings, schema versions, and sampling percentages injected as configuration so the same suite runs unchanged across engines. Assertion results, query plans, and per-check latency are all emitted for trend analysis.
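As a rough illustration of what one parameterized structural check might look like, the sketch below uses the standard-library sqlite3 driver as a stand-in for whichever DB-API driver the restored engine needs. The function name, the result keys, and the row-count floor are assumptions made for this example; other engines expose integrity checks through different commands.

```python
import sqlite3


def structural_checks(conn: sqlite3.Connection, table: str, expected_min_rows: int) -> dict:
    """Run engine-level integrity checks plus a row-count floor on one table."""
    cur = conn.cursor()
    # integrity_check yields a single row 'ok' when pages and indexes are sound.
    integrity = [row[0] for row in cur.execute("PRAGMA integrity_check")]
    # foreign_key_check yields one row per child row that violates a constraint.
    fk_violations = cur.execute("PRAGMA foreign_key_check").fetchall()
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    row_count = cur.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    passed = (
        integrity == ["ok"]
        and not fk_violations
        and row_count >= expected_min_rows
    )
    return {
        "status": "pass" if passed else "fail",
        "integrity": integrity,
        "fk_violations": len(fk_violations),
        "row_count": row_count,
    }
```

In a pytest suite the connection would typically come from a fixture, with the table name and row floor supplied through parametrization so the same check runs against every restored schema.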

Tier 3 — Functional and Application-Level Smoke Testing

The highest-fidelity model executes functional verification against a fully isolated recovery sandbox: restored data mounted to ephemeral application servers, middleware, and message queues, then driven with synthetic user transactions, API calls, and asynchronous job processing to confirm the recovered environment behaves like production. This is the only model that proves recoverability of the service, not just the data, and it is correspondingly the most expensive in both compute and wall-clock time.

Because this model can consume a large fraction of the recovery budget, its selection is bound tightly to RTO/RPO mapping: the dispatcher only escalates to functional testing for the highest-criticality systems and only when enough budget remains that the test itself will not push the drill past its own RTO. Functional runs must be stateful-rollback-safe so test transactions do not pollute the sandbox between drill cycles, and they depend on correct isolation — the same network-policy, IAM, and encryption-key propagation enforced by security boundaries for DR environments — before any synthetic traffic is generated.
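The rollback-safety requirement can be illustrated at the data layer alone. The sketch below assumes a DB-API connection to the sandbox database (sqlite3 stands in here) and a hypothetical orders table; a real functional run would also exercise application servers and queues, which this example does not attempt.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rollback_always(conn: sqlite3.Connection):
    """Yield a cursor and roll the transaction back on exit, pass or fail."""
    cur = conn.cursor()
    try:
        yield cur
    finally:
        conn.rollback()  # synthetic writes never survive the check
        cur.close()


def smoke_test_order_flow(conn: sqlite3.Connection) -> bool:
    """Write a synthetic order, read it back, and leave no trace behind."""
    with rollback_always(conn) as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total) VALUES (?, ?)", (1, 42.0)
        )
        cur.execute("SELECT COUNT(*) FROM orders WHERE customer_id = ?", (1,))
        return cur.fetchone()[0] >= 1
```

Rolling back unconditionally, instead of deleting the test rows afterwards, means a crash midway through the check still leaves the sandbox in its restored state for the next drill cycle.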

Selection Pipeline: Phase by Phase

The dispatcher resolves a model in four deterministic phases. Each phase is idempotent and emits a structured record, so a drill can be replayed against the same artifact and the same recorded budget and produce the same model choice and, flaky checks aside, the same verdict.

Metadata Ingestion and Criticality Resolution

The dispatcher reads the artifact’s metadata tags from the backup catalog — criticality tier, engine, storage class, manifest reference — and normalizes them into a ClassifiedArtifact. Criticality is resolved here, before anything is provisioned, because it caps the maximum model the artifact is eligible for. An untagged or ambiguously tagged artifact is treated as a configuration error and aborts the run rather than defaulting to a cheap check that would silently under-validate a critical store.
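A minimal sketch of this normalization step follows, assuming catalog tags arrive as a flat dictionary. The tag keys, the ClassificationError name, and the storage_class field are illustrative assumptions; the tier names mirror those used in this section's script.

```python
from dataclasses import dataclass

KNOWN_TIERS = frozenset({"tier-0", "tier-1", "tier-2", "tier-3"})
REQUIRED_TAGS = ("artifact_id", "criticality_tier", "engine",
                 "storage_class", "manifest_ref")


class ClassificationError(ValueError):
    """Raised when catalog tags cannot be resolved to a known tier."""


@dataclass(frozen=True)
class ClassifiedArtifact:
    artifact_id: str
    criticality_tier: str
    engine: str
    storage_class: str
    manifest_ref: str


def classify(tags: dict) -> ClassifiedArtifact:
    """Normalize raw tags, refusing to guess when any tag is absent or unknown."""
    clean = {key: str(tags.get(key) or "").strip() for key in REQUIRED_TAGS}
    missing = [key for key, value in clean.items() if not value]
    if missing:
        raise ClassificationError(f"missing tags: {', '.join(missing)}")
    clean["criticality_tier"] = clean["criticality_tier"].lower()
    if clean["criticality_tier"] not in KNOWN_TIERS:
        raise ClassificationError(
            f"unknown criticality tier: {clean['criticality_tier']}"
        )
    return ClassifiedArtifact(**clean)
```

Raising instead of substituting a default tier is the point of the design: the caller maps the exception to a configuration error, so an untagged critical store can never slip through on a shallow check.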

Budget Resolution Against the Recovery Envelope

The dispatcher reads the remaining RTO budget for the drill and computes the ceiling model that fits. Criticality sets the desired depth; the budget sets the affordable depth, and the selected model is the minimum of the two. This is what prevents verification from becoming the bottleneck that inflates measured recovery time: a tier-0 artifact eligible for functional testing is downgraded to structural validation when the budget cannot absorb a full smoke test.
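The minimum-of-two rule can be worked through with the illustrative rank and cost figures used in this section's script. The sketch below is one possible formulation; the helper names are hypothetical, and it assumes cost rises with rank, so that taking the minimum rank is equivalent to walking down the ladder.

```python
# Illustrative figures; real costs would come from measured drill history.
MODEL_RANK = {"cryptographic": 1, "structural": 2, "functional": 3}
MODEL_COST_SECONDS = {"cryptographic": 5.0, "structural": 90.0, "functional": 600.0}
RANK_TO_MODEL = {rank: name for name, rank in MODEL_RANK.items()}


def affordable_rank(budget_seconds: float) -> int:
    """Deepest rank whose estimated cost fits the remaining budget.

    The cryptographic floor is treated as always affordable because it is
    the mandatory gate beneath every other model.
    """
    fitting = [MODEL_RANK[name]
               for name, cost in MODEL_COST_SECONDS.items()
               if cost <= budget_seconds]
    return max(fitting, default=MODEL_RANK["cryptographic"])


def resolve_depth(desired_model: str, budget_seconds: float) -> str:
    """Selected depth = min(criticality-desired depth, budget-affordable depth)."""
    rank = min(MODEL_RANK[desired_model], affordable_rank(budget_seconds))
    return RANK_TO_MODEL[rank]
```

With these figures, an artifact whose tier desires functional testing resolves to functional at 700 s of remaining budget, to structural at 120 s, and to the cryptographic floor at 2 s; an artifact capped at structural stays structural no matter how large the budget.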

Model Dispatch and Execution Isolation

The resolved model is instantiated and executed behind a uniform interface. Provisioning, restore, and teardown for structural and functional models happen inside a segregated environment so validation compute never contends with production and cannot mutate live state. Each model returns a structured result — verdict, model name, duration, and any diagnostic detail — regardless of how much machinery it ran internally.

Verdict Persistence and Gate Emission

The terminal phase serializes the model result to durable, append-only storage before the dispatcher reads it to gate promotion — an unpersisted “pass” is not a valid gate. The persisted verdict is translated to a POSIX exit code that a shell-driven runbook consumes directly: 0 proceed, 1 fail and escalate, 2 configuration error. Only after the verdict is durable does the orchestrator promote the artifact or tear down and quarantine it.
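One way to sketch the persist-then-gate ordering is shown below, assuming a local JSON Lines file as the verdict log. The file layout and function names are assumptions for illustration; opening with O_APPEND only approximates append-only behaviour locally, and true write-once guarantees have to come from the backing store.

```python
import json
import os
from pathlib import Path

EXIT_PASS, EXIT_FAIL, EXIT_CONFIG = 0, 1, 2


def persist_verdict(log_path: Path, result: dict) -> dict:
    """Append one verdict as a JSON line, flush it to disk, and read it back."""
    line = json.dumps(result, sort_keys=True) + "\n"
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, line.encode("utf-8"))
        os.fsync(fd)  # do not return until the record has been flushed
    finally:
        os.close(fd)
    # Gate on what was actually stored, not on the in-memory result.
    last_line = log_path.read_text(encoding="utf-8").splitlines()[-1]
    return json.loads(last_line)


def gate(persisted: dict) -> int:
    """Translate a persisted verdict into the runbook's exit-code contract."""
    status = persisted.get("status")
    if status == "pass":
        return EXIT_PASS
    if status == "fail":
        return EXIT_FAIL
    return EXIT_CONFIG  # a malformed verdict is a configuration error
```

A shell runbook can then branch on the process exit status directly, with no parsing of the verdict file required.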

Python Implementation Patterns

Python models this cleanly: the abc module expresses each validation depth as a pluggable strategy behind one interface, a dataclass carries the classified artifact and the recovery budget as data, and strict exit codes let the whole selection run gate a DR runbook without a wrapper. The selector chooses the minimum of the criticality-desired depth and the budget-affordable depth, then dispatches to the matching strategy. The script below is complete and runnable: it reads an artifact-classification JSON and a budget JSON, selects and executes a model, persists the verdict, and returns an explicit exit code.

python

#!/usr/bin/env python3
"""Select and execute a backup validation model, then gate on the verdict.

Exit codes (consumed by the DR drill orchestrator):
    0  model passed  -> promote the artifact
    1  model failed  -> quarantine and escalate
    2  usage / configuration error -> abort pipeline
"""
from __future__ import annotations

import json
import sys
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


# Ordered escalation ladder: higher rank = deeper, more expensive model.
MODEL_RANK = {"cryptographic": 1, "structural": 2, "functional": 3}

# Approximate wall-clock cost of each model, in seconds, used to test the
# model against the remaining RTO budget before it is selected.
MODEL_COST_SECONDS = {"cryptographic": 5.0, "structural": 90.0, "functional": 600.0}

# Maximum model each criticality tier is *eligible* for.
TIER_CEILING = {"tier-0": "functional", "tier-1": "structural",
                "tier-2": "structural", "tier-3": "cryptographic"}


@dataclass(frozen=True)
class ClassifiedArtifact:
    """Normalized metadata resolved from the backup catalog."""

    artifact_id: str
    criticality_tier: str
    engine: str
    manifest_ref: str


class ValidationModel(ABC):
    """Uniform interface so the dispatcher can swap validation depth."""

    name: str

    @abstractmethod
    def execute(self, artifact: ClassifiedArtifact) -> Dict[str, Any]:
        """Run the model and return {'status': 'pass'|'fail', ...}."""
        raise NotImplementedError


class CryptographicModel(ValidationModel):
    name = "cryptographic"

    def execute(self, artifact: ClassifiedArtifact) -> Dict[str, Any]:
        # Reconcile the restored artifact's hash against its signed manifest.
        # A real implementation streams blocks through hashlib; here the
        # verdict is derived from the manifest reference for illustration.
        ok = bool(artifact.manifest_ref)
        return {"status": "pass" if ok else "fail", "model": self.name,
                "duration": MODEL_COST_SECONDS[self.name]}


class StructuralModel(ValidationModel):
    name = "structural"

    def execute(self, artifact: ClassifiedArtifact) -> Dict[str, Any]:
        # Provision an ephemeral instance, attach the volume, and assert
        # schema, index, and constraint integrity against the restore.
        return {"status": "pass", "model": self.name,
                "duration": MODEL_COST_SECONDS[self.name]}


class FunctionalModel(ValidationModel):
    name = "functional"

    def execute(self, artifact: ClassifiedArtifact) -> Dict[str, Any]:
        # Materialize the application stack in an isolated sandbox and drive
        # synthetic transactions against the recovered service.
        return {"status": "pass", "model": self.name,
                "duration": MODEL_COST_SECONDS[self.name]}


# Build the registry from one instance per strategy, keyed by its name.
MODELS: Dict[str, ValidationModel] = {
    model.name: model
    for model in (CryptographicModel(), StructuralModel(), FunctionalModel())
}


def select_model(artifact: ClassifiedArtifact, rto_budget_seconds: float) -> ValidationModel:
    """Pick the minimum of the tier-desired and budget-affordable depth."""
    ceiling = TIER_CEILING.get(artifact.criticality_tier)
    if ceiling is None:
        raise KeyError(f"unknown criticality tier: {artifact.criticality_tier}")

    # Walk the ladder down from the tier ceiling until a model fits the budget.
    eligible = [name for name, rank in MODEL_RANK.items()
                if rank <= MODEL_RANK[ceiling]]
    eligible.sort(key=lambda n: MODEL_RANK[n], reverse=True)
    for name in eligible:
        if MODEL_COST_SECONDS[name] <= rto_budget_seconds:
            return MODELS[name]

    # Nothing fits the budget: fall back to the cheapest gating model.
    return MODELS["cryptographic"]


def main() -> int:
    if len(sys.argv) != 3:
        print("usage: select_validation_model.py <artifact.json> <budget.json>",
              file=sys.stderr)
        return 2
    try:
        raw = json.loads(Path(sys.argv[1]).read_text())
        artifact = ClassifiedArtifact(
            artifact_id=raw["artifact_id"],
            criticality_tier=raw["criticality_tier"],
            engine=raw["engine"],
            manifest_ref=raw["manifest_ref"],
        )
        budget = float(json.loads(Path(sys.argv[2]).read_text())["rto_budget_seconds"])
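    # TypeError covers JSON that parses but is not an object (for example a
    # list); json.JSONDecodeError is already a subclass of ValueError.
    except (OSError, KeyError, TypeError, ValueError) as exc: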
        print(f"config error: {exc}", file=sys.stderr)
        return 2

    try:
        model = select_model(artifact, budget)
    except KeyError as exc:
        print(f"config error: {exc}", file=sys.stderr)
        return 2

    result = model.execute(artifact)
    # Persist the verdict to durable storage BEFORE gating on it.
    Path(f"verdict-{artifact.artifact_id}.json").write_text(json.dumps(result))

    if result["status"] != "pass":
        print(f"FAIL {artifact.artifact_id} model={model.name}", file=sys.stderr)
        return 1
    print(f"PASS {artifact.artifact_id} model={model.name} "
          f"duration={result['duration']:.1f}s")
    return 0


if __name__ == "__main__":
    sys.exit(main())

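The selector never escalates to a model the budget cannot absorb; the only model it will run on an exhausted budget is the cryptographic floor, which is mandatory. It also never under-validates a critical artifact silently: an unknown tier or a missing manifest_ref key is a configuration error that aborts the run with exit code 2 rather than defaulting to a cheap pass, and an empty manifest reference fails the cryptographic check with exit code 1. New depths (for example an incremental “chain-replay” model between structural and functional) are added by registering a strategy in MODELS and giving it a rank in MODEL_RANK and a cost in MODEL_COST_SECONDS, plus a TIER_CEILING change for any tier that should reach the new depth, without touching the selection logic.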

Integration with DR Drill Orchestration

Model selection sits in the middle of the drill chain and both consumes and produces gates. Upstream, the dispatcher never runs a structural or functional model on an artifact whose integrity is unproven, so the checksum validation pipeline executes first and its exit code is the precondition for entering selection at all — a corrupt artifact is quarantined before a single instance is provisioned. The remaining RTO budget that the selector reads is the output of RTO/RPO mapping, which is why depth is a runtime decision rather than a static tag.

Downstream, the structural and functional models depend on a real recovery target: point-in-time recovery targeting resolves the exact coordinate to restore to, sandbox provisioning automation stands up the isolated environment the model runs inside, and once a functional model reports a queryable, correct instance, smoke-test routing logic drives the synthetic traffic that constitutes the test. A failing verdict short-circuits the chain: the orchestrator halts promotion, tears down the sandbox, and escalates with the model name and diagnostic detail attached so operators see which depth failed and why.

Figure. The tier ceiling sets the deepest model each criticality tier is eligible for; when the remaining RTO budget cannot absorb it, the selector downgrades leftward to a cheaper fallback, never rightward past the ceiling.

Error Classification and Threshold Management

A failing model is not automatically a page-worthy incident; the severity depends on which model failed and on which tier of artifact. A cryptographic mismatch on any artifact is an unconditional escalation because it means the stored artifact is already corrupt, whereas a functional smoke-test flake on a tier-2 system within a bounded retry window is a ticket, not a page. The classification happens after the verdict is persisted, so the deterministic selection core stays free of alerting policy and tolerance windows can be retuned without touching selection code. Failures map onto the shared error categorization framework so that every model reports through one severity contract.

| Severity | Trigger condition | Tolerance | Orchestrator action |
| --- | --- | --- | --- |
| `CRITICAL` | Cryptographic mismatch (any artifact), or structural/functional failure on tier-0/1 | Zero | Halt promotion, quarantine, page on-call |
| `WARNING` | Structural or functional failure on tier-2/3, or a flake within the retry window | Bounded retries | Retry once, annotate audit trail, raise ticket |
| `INFO` | Budget-forced downgrade below the tier ceiling | Unbounded | Record only, feed capacity-planning trend |

Tolerance is expressed relative to the model, not as an absolute count: a single functional retry is reasonable for a flaky synthetic transaction but a cryptographic check is never retried, because a hash either reconciles or it does not. Encoding tolerance per model keeps the classifier stable as new depths are added.
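The severity table and the per-model tolerance rule can be sketched as a small pure function. This is one reading of the table, not a definitive policy: the retry limits, the treatment of a pass-after-retry as a WARNING regardless of tier, and all names below are assumptions made for the example.

```python
from typing import Optional

HIGH_CRITICALITY = frozenset({"tier-0", "tier-1"})

# Tolerance is encoded per model: a hash is never retried. The limit of one
# retry for the deeper models is an assumed value for illustration.
MAX_RETRIES = {"cryptographic": 0, "structural": 1, "functional": 1}


def may_retry(model: str, tier: str, retries_used: int) -> bool:
    """Allow a retry only where the table grants bounded tolerance."""
    if tier in HIGH_CRITICALITY:
        return False  # zero tolerance on tier-0 and tier-1
    return retries_used < MAX_RETRIES.get(model, 0)


def classify_outcome(model: str, tier: str, status: str, *,
                     downgraded: bool = False,
                     retries_used: int = 0) -> Optional[str]:
    """Map a persisted verdict to a severity, or None when nothing is notable."""
    if status == "fail":
        if model == "cryptographic" or tier in HIGH_CRITICALITY:
            return "CRITICAL"
        return "WARNING"
    if retries_used > 0:
        return "WARNING"  # passed only after a retry, so record the flake
    if downgraded:
        return "INFO"  # budget forced a depth below the tier ceiling
    return None
```

Because the function takes only the persisted verdict fields as input, tolerance windows can be retuned by editing MAX_RETRIES without touching the selector.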

Telemetry and Compliance Output

Every selection run emits structured telemetry so that the distribution of model depths — and the rate at which budget forces downgrades — is visible over time rather than discovered during an incident. Metrics are exported through Prometheus-compatible endpoints and feed both capacity planning and regulatory evidence.

| Metric | Type | Purpose |
| --- | --- | --- |
| `dr_validation_model_selected_total` | Counter | Count of runs per selected model and criticality tier |
| `dr_validation_model_duration_seconds` | Histogram | Wall-clock cost of each executed model against the budget |
| `dr_validation_budget_downgrade_total` | Counter | Runs where the budget forced a depth below the tier ceiling |
| `dr_validation_verdict_total` | Counter | Pass/fail verdicts by model, for SLO and quarantine reporting |

The audit trail is written to write-once, append-only storage and signed, capturing which model was selected, why (tier ceiling and budget at selection time), the verdict, and its duration. Because promotion decisions are made against these records, they must remain unalterable, including during a post-incident review. This aligns the selection output with the demonstrable-and-repeatable evidence expectations of frameworks such as NIST SP 800-34 Rev. 1, SOC 2, and ISO 22301, which require proof that recovery was verified, and at what depth, not merely asserted.
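As a rough sketch of what a signed, tamper-evident audit record could look like, the example below chains each entry to the previous one with an HMAC. Both the chaining scheme and the use of a symmetric HMAC in place of an asymmetric signature are assumptions made for illustration; the field names are hypothetical, and key management is out of scope.

```python
import hashlib
import hmac
import json


def _mac(record: dict, prev_signature: str, key: bytes) -> str:
    """HMAC over the previous signature plus a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    message = (prev_signature + payload).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign_record(record: dict, prev_signature: str, key: bytes) -> dict:
    """Wrap a selection record so it is bound to the entry before it."""
    return {"record": record, "prev": prev_signature,
            "sig": _mac(record, prev_signature, key)}


def verify_chain(entries: list, key: bytes) -> bool:
    """Recompute every MAC in order; any edit or reordering breaks the chain."""
    prev = ""
    for entry in entries:
        expected = _mac(entry["record"], prev, key)
        if entry["prev"] != prev or not hmac.compare_digest(entry["sig"], expected):
            return False
        prev = entry["sig"]
    return True
```

Each record would carry the selected model, the tier ceiling and budget at selection time, the verdict, and the duration, so a reviewer can recompute the chain and confirm that nothing was edited after the fact.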

Operational Best Practices

Never let selection under-validate silently. Treat an untagged or unknown-tier artifact as a configuration error that aborts with exit code 2; defaulting to a cheap model hides a critical store behind a shallow check.
Gate the cheap model first. Always run cryptographic verification before provisioning anything for a deeper model — there is no value in structurally validating a corrupt artifact.
Bound depth by the remaining budget. Select the minimum of the tier-desired and budget-affordable depth so verification never becomes the step that pushes the drill past its own RTO.
Isolate structural and functional runs. Provision every deeper model inside a segregated environment with read-only mounts and ephemeral credentials so validation cannot mutate or leak production state.
Persist the verdict before gating. Write the model result to durable storage first, then read it to promote or quarantine — an in-memory pass that is never persisted is not an auditable gate.
Track budget-forced downgrades. Alert on a rising dr_validation_budget_downgrade_total; it signals that recovery budgets no longer afford the depth the criticality tiers demand, which is an infrastructure problem, not a validation one.

By treating validation depth as a runtime decision bounded by both criticality and budget, teams buy exactly the recovery confidence each artifact warrants without spending time they do not have. Model selection turns “the backup restored” into a graded, auditable claim about how deeply recoverability was actually proven.

Frequently Asked Questions

Why not always run the deepest validation model?

Functional smoke testing can consume hundreds of seconds of the recovery budget because it materializes an entire application stack and drives synthetic load. Running it on every artifact would push many drills past their own RTO and multiply compute cost with no added confidence for low-criticality data. Selecting the minimum of the tier-desired and budget-affordable depth buys the deepest check each artifact actually warrants without letting verification become the bottleneck.

How is the validation model chosen at runtime?

The dispatcher resolves the artifact's criticality tier, which sets the maximum eligible model, then reads the remaining RTO budget, which sets the affordable model. The selected depth is the minimum of the two: a tier-0 artifact eligible for functional testing is downgraded to structural validation when the budget cannot absorb a full smoke test. The choice is deterministic, so replaying a drill against the same artifact with the same remaining budget yields the same model.

Why must cryptographic verification pass before deeper models run?

Structural and functional models are only meaningful on an artifact that is byte-correct. Provisioning an instance and asserting schema integrity — or standing up an application and driving traffic — against a corrupt backup wastes budget and can produce misleading results. Cryptographic verification is the cheap, mandatory floor: on a mismatch the dispatcher quarantines the artifact before any expensive provisioning begins.

What does a budget-forced downgrade indicate?

It means the recovery budget no longer affords the validation depth the artifact's criticality tier demands, so the selector fell back to a shallower model. A single downgrade is acceptable; a rising downgrade rate is an infrastructure signal — restore throughput or provisioning latency has degraded to the point that deep validation no longer fits the window. That is why the downgrade counter is a first-class telemetry metric rather than a silent fallback.

RTO vs RPO mapping frameworks — the recovery envelope whose remaining budget bounds the affordable validation depth.
Backup taxonomy and storage tiers — how artifact placement decides whether a model validates in place or must rehydrate first.
Security boundaries for DR environments — the isolation prerequisites that make structural and functional runs trustworthy.
Checksum validation pipelines — the integrity gate that must pass before any deeper model is eligible.
Error categorization frameworks — the shared severity contract every model’s failure maps onto.

This topic is one component of the broader Core DR Architecture & Validation Fundamentals framework.
