Point-in-Time Targeting for MongoDB Backups

This page implements one concrete task inside point-in-time recovery targeting: a headless Python resolver that translates an arbitrary ISO-8601 recovery request into the nearest committed MongoDB oplog boundary, then emits a deterministic mongorestore --oplogLimit invocation the disaster recovery orchestrator can execute unattended. Automated recovery validation for MongoDB routinely fractures at the timestamp-resolution layer: application teams request recovery to an exact transaction boundary, yet the backup infrastructure operates on discrete WiredTiger checkpoint intervals and continuous incremental oplog captures. The operational gap widens on sharded clusters, where clock skew, oplog truncation, and replica-set election delays can silently corrupt referential integrity. The resolver here closes that gap by refusing to guess — it maps a wall-clock epoch to an oplog record that actually exists, classifies an unreachable target as a hard failure rather than a silent fallback to the last full snapshot, and only ever runs against an isolated environment produced by sandbox provisioning. Whether a resolved boundary is acceptable is a separate question answered by your RTO and RPO mapping: a boundary that lands outside the recovery-point budget is a failed drill even when the restore itself succeeds.

MongoDB’s continuous backup model pairs periodic mongodump snapshots with continuous oplog tailing. The mongorestore utility accepts an --oplogLimit parameter (valid only together with --oplogReplay), but it expects a BSON timestamp in the <seconds>:<ordinal> form, not an ISO-8601 string. Automation must therefore query the backup metadata catalog, locate the closest preceding oplog entry, and convert the epoch to its exact BSON representation. The mongorestore documentation describes the limit as exclusive: entries whose timestamp is equal to or newer than the value are not applied, so a pipeline that needs the boundary entry itself replayed has to pass a limit one ordinal past it. Failing to align the target with a real oplog record forces the restore engine to either truncate mid-transaction or roll back to the last full snapshot, and either outcome invalidates compliance audits.

Architecture and Execution Model

Figure. The resolver maps an ISO-8601 target to the nearest committed oplog boundary, failing fast (exit 1) when no entry lands at or before the request and emitting a deterministic mongorestore --oplogLimit invocation (exit 0) when one does.

MongoDB oplog entries store time in a 64-bit BSON timestamp: the high-order 32 bits are seconds since the Unix epoch (ts.t), and the low-order 32 bits are an ordinal counter (ts.i) that disambiguates operations landing within the same second. This guarantees strict ordering within a replica set but introduces friction when mapping human-readable timestamps to restore boundaries. When a request specifies 2024-11-15T14:30:00Z, the pipeline cannot assume an oplog entry exists at that exact second; it must select the latest entry where ts.t <= target_epoch. That boundary rule prevents partial transaction application and lets WiredTiger replay operations atomically up to the exact commit point. In sharded deployments each shard keeps an independent oplog window, so targeting resolves a boundary per shard and the orchestrator synchronizes execution across the sharded cluster to preserve cross-shard consistency.
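The bit layout described above can be illustrated with a few lines of Python. This is a minimal sketch: the helper names `pack_bson_timestamp`, `unpack_bson_timestamp`, and `to_oplog_limit` are hypothetical and not part of any MongoDB driver. It shows that comparing packed 64-bit values gives the same ordering as comparing `(ts.t, ts.i)` tuples.

```python
from typing import Tuple


def pack_bson_timestamp(seconds: int, ordinal: int) -> int:
    """Combine ts.t (high 32 bits) and ts.i (low 32 bits) into one integer."""
    if not (0 <= seconds < 2**32 and 0 <= ordinal < 2**32):
        raise ValueError("both fields must fit in an unsigned 32-bit integer")
    return (seconds << 32) | ordinal


def unpack_bson_timestamp(value: int) -> Tuple[int, int]:
    """Split a packed timestamp back into (ts.t, ts.i)."""
    return value >> 32, value & 0xFFFFFFFF


def to_oplog_limit(seconds: int, ordinal: int) -> str:
    """Render the <seconds>:<ordinal> form that --oplogLimit expects."""
    return f"{seconds}:{ordinal}"


# Two operations in the same second, then one in the next second.
a = pack_bson_timestamp(1731681000, 14)
b = pack_bson_timestamp(1731681000, 15)
c = pack_bson_timestamp(1731681001, 0)
ordered = a < b < c  # packed order matches (seconds, ordinal) tuple order
```

Because the seconds occupy the high-order bits, the ordinal only ever breaks ties inside a single second.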

The resolver operates in three deterministic phases:

Normalization — convert the ISO-8601 target to a Unix epoch integer.
Catalog query — fetch the oplog window for the target shard with a buffer (typically 300 seconds) to absorb network latency and catalog indexing delay.
Boundary selection — filter to entries where ts.t <= target_epoch, take the maximum ts.t, and extract its ordinal ts.i.

If no valid entry exists within the retention window, the pipeline fails fast rather than defaulting to an arbitrary snapshot — preserving audit integrity over convenience.
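To make the normalization and boundary-selection phases concrete without a catalog call, here is a small worked example over in-memory data. The `select_boundary` helper and the sample entries are illustrative assumptions; only the entry shape (`ts.t`, `ts.i`) comes from the catalog contract described on this page.

```python
import datetime
from typing import Dict, List, Optional, Tuple


def select_boundary(entries: List[Dict], target_epoch: int) -> Optional[Tuple[int, int]]:
    """Return the greatest (ts.t, ts.i) with ts.t <= target_epoch, or None."""
    candidates = [
        (e["ts"]["t"], e["ts"]["i"])
        for e in entries
        if e["ts"]["t"] <= target_epoch
    ]
    # Tuple comparison orders by seconds first, then by ordinal.
    return max(candidates) if candidates else None


# Phase 1: normalize the ISO-8601 target to an integer epoch.
target_epoch = int(
    datetime.datetime(2024, 11, 15, 14, 30, tzinfo=datetime.timezone.utc).timestamp()
)

# Phase 2 stand-in: entries a catalog might return (made-up values).
sample_entries = [
    {"ts": {"t": target_epoch - 2, "i": 1}},
    {"ts": {"t": target_epoch - 1, "i": 1}},
    {"ts": {"t": target_epoch - 1, "i": 7}},  # last commit before the target
    {"ts": {"t": target_epoch + 3, "i": 1}},  # past the target: must be excluded
]

# Phase 3: pick the boundary, or fail fast when nothing qualifies.
boundary = select_boundary(sample_entries, target_epoch)
if boundary is None:
    raise SystemExit(1)
```

Here no entry exists at the target second itself, so the selection falls back to the last commit of the preceding second rather than reaching forward.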

Prerequisites

Python 3.8+ (the type hints and datetime.datetime.fromisoformat offset handling are stable from 3.8 onward).
The requests library and a retry-capable adapter, installed into the automation environment:
```bash
pip install "requests>=2.31" "urllib3>=2.0"
```
The MongoDB Database Tools (mongorestore) reachable on the automation host, matching the server’s major version. The resolver only emits the command; a downstream step executes it inside the sandbox.
A backup catalog API that exposes structured oplog metadata per shard — the resolver expects a GET /api/v1/oplog endpoint returning {"oplog_entries": [{"ts": {"t": <int>, "i": <int>}}, ...]}. Ops Manager, Percona Backup for MongoDB, or an in-house catalog over the oplog collection all satisfy this contract.
A read-only catalog credential injected from a secret store. The resolver never writes; it reads oplog metadata and produces a command string, so the credential can be scoped to read on the catalog service.
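Confirmation that the target shard’s oplog retention covers the requested epoch. Run db.getReplicationInfo() on each shard and compare the target against tFirst, the oldest retained entry (timeDiff reports the window length in seconds), so that a drill whose target predates the oldest retained oplog entry is never scheduled.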
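The retention prerequisite can be expressed as a scheduling-time guard. The sketch below assumes the oldest retained oplog time for each shard has already been collected and converted to epoch seconds; the function name `shards_outside_retention` and the sample values are hypothetical.

```python
from typing import Dict, List


def shards_outside_retention(target_epoch: int, oldest_by_shard: Dict[str, int]) -> List[str]:
    """Return the shards whose oldest retained oplog entry is newer than the target."""
    return sorted(
        shard
        for shard, oldest_epoch in oldest_by_shard.items()
        if target_epoch < oldest_epoch
    )


# Illustrative values only: shard name -> epoch of the oldest retained entry.
oldest_by_shard = {"shard-rs0": 1731600000, "shard-rs1": 1731690000}

blocked = shards_outside_retention(1731681000, oldest_by_shard)
if blocked:
    print(f"reject drill: target predates retention on {blocked}")
```

Rejecting at scheduling time keeps an unrecoverable target from ever reaching the resolver.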

Production Implementation

The resolver parses the ISO target, queries the catalog for a bounded window, selects the greatest committed boundary at or before the target, and prints a fully-formed mongorestore command. It returns strict POSIX exit codes so the orchestrator can branch without parsing stdout, and it treats an unreachable target as a distinct failure class from a malformed invocation.

```python
#!/usr/bin/env python3
"""MongoDB point-in-time recovery boundary resolver.

Maps an ISO-8601 wall-clock recovery target to a committed WiredTiger oplog
boundary and emits a deterministic mongorestore --oplogLimit invocation.

Exit codes (consumed by the DR drill orchestrator):
    0  a committed oplog boundary was resolved       -> proceed with restore
    1  target unreachable: no entry <= target, or
       the epoch falls outside the retention window  -> halt drill, escalate
    2  usage / configuration / catalog-API error     -> abort pipeline
"""
import argparse
import datetime
import logging
import shlex
import sys
from typing import Dict, List, Tuple

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("mongo_pitr_resolver")

BUFFER_SECONDS = 300  # lookback window to absorb latency and catalog indexing lag


class TargetUnreachable(Exception):
    """Requested epoch cannot map to a committed oplog entry in retention."""


def build_session(retries: int = 3) -> requests.Session:
    """Return a session that retries idempotent GETs on transient 5xx errors."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def parse_target_epoch(target_iso: str) -> int:
    """Convert an ISO-8601 target (trailing Z allowed) to an integer epoch."""
    # fromisoformat() rejects a literal "Z" before Python 3.11, so normalize it.
    dt = datetime.datetime.fromisoformat(target_iso.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Treat naive timestamps as UTC rather than host-local time.
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())


def resolve_oplog_boundary(
    target_epoch: int, catalog_url: str, shard: str, session: requests.Session
) -> Tuple[int, int]:
    """Return (ts_seconds, ts_ordinal) of the latest committed entry <= target.

    Raises TargetUnreachable when no such entry exists in the retention window.
    """
    params = {"shard": shard, "after": target_epoch - BUFFER_SECONDS, "limit": 500}
    resp = session.get(f"{catalog_url}/api/v1/oplog", params=params, timeout=15)
    resp.raise_for_status()

    entries: List[Dict] = resp.json().get("oplog_entries", [])
    valid = [e for e in entries if e["ts"]["t"] <= target_epoch]
    if not valid:
        raise TargetUnreachable(
            f"No oplog entry at or before {target_epoch} for shard '{shard}'. "
            "Verify retention window and catalog indexing."
        )

    # Sort on (seconds, ordinal) so ties inside one second resolve to the last commit.
    closest = max(valid, key=lambda e: (e["ts"]["t"], e["ts"]["i"]))
    return closest["ts"]["t"], closest["ts"]["i"]


def build_restore_command(shard: str, seconds: int, ordinal: int, archive: str) -> str:
    # NOTE: mongorestore skips entries whose timestamp is >= --oplogLimit, so the
    # entry at exactly seconds:ordinal is excluded unless the caller bumps the ordinal.
    oplog_limit = f"{seconds}:{ordinal}"
    # shlex.quote leaves ordinary hostnames and paths untouched but guards odd input.
    return (
        f"mongorestore --host {shlex.quote(shard)} --port 27017 "
        f"--oplogReplay --oplogLimit {oplog_limit} "
        f"--archive={shlex.quote(archive)} "
        f"--numParallelCollections 4 --writeConcern 'majority'"
    )


def main() -> int:
    parser = argparse.ArgumentParser(description="Resolve a MongoDB PITR boundary.")
    parser.add_argument("target_iso", help="ISO-8601 recovery target, e.g. 2024-11-15T14:30:00Z")
    parser.add_argument("catalog_url", help="Base URL of the backup catalog API")
    parser.add_argument("shard", help="Shard / replica-set host to target")
    parser.add_argument("--archive", default="/backup/latest/{shard}.archive",
                        help="Archive path template ({shard} is substituted)")
    args = parser.parse_args()

    try:
        target_epoch = parse_target_epoch(args.target_iso)
    except ValueError as exc:
        logger.error("Invalid ISO-8601 timestamp %r: %s", args.target_iso, exc)
        return 2

    session = build_session()
    try:
        seconds, ordinal = resolve_oplog_boundary(
            target_epoch, args.catalog_url, args.shard, session
        )
    except TargetUnreachable as exc:
        logger.critical("Target unreachable: %s", exc)
        return 1
    except (requests.exceptions.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
        # JSONDecodeError subclasses RequestException, so this handler must come
        # first for a non-JSON body to be reported as a malformed response.
        logger.error("Malformed catalog response: %s", exc)
        return 2
    except requests.RequestException as exc:
        logger.error("Catalog API request failed: %s", exc)
        return 2

    archive = args.archive.format(shard=args.shard)
    command = build_restore_command(args.shard, seconds, ordinal, archive)
    logger.info("Resolved boundary: oplogLimit=%s:%s", seconds, ordinal)
    print(f"OPLOG_LIMIT: {seconds}:{ordinal}")
    print(f"EXECUTE: {command}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Splitting parsing, resolution, and command assembly into separate functions keeps each unit-testable and lets the orchestrator import resolve_oplog_boundary directly when it prefers structured returns over stdout scraping. The (ts.t, ts.i) tuple sort is deliberate: when the target lands inside a busy second, the ordinal breaks the tie so the boundary is the last committed operation of that second rather than an arbitrary one.

Step-by-Step Execution Walkthrough

Confirm retention covers the target. Before invoking the resolver, verify the requested epoch is inside each shard’s oplog window; a target older than the oldest retained entry (tFirst in db.getReplicationInfo(), with timeDiff giving the window length in seconds) can never resolve and should be rejected at scheduling time.
Render the catalog credential from your secret store into the environment, never committing it alongside the drill definition.

Run the resolver for the shard, capturing its exit code:

```bash
python3 mongo_pitr_resolver.py "2024-11-15T14:30:00Z" \
    "https://catalog.internal" "shard-rs0-primary.internal"; echo "exit=$?"
```

Branch on the exit code. 0 hands the emitted EXECUTE: line to the sandbox restore step; 1 halts the drill and escalates because the point is not recoverable; 2 aborts on a malformed invocation or catalog outage.
Dry-run before the full restore. Run the emitted command with --dryRun first (adding --verbose for a more detailed summary) to confirm that the arguments parse and the archive can be read without importing any data, then execute the real restore inside the isolated environment.
Repeat per shard and synchronize the resolved boundaries so every shard replays to a mutually consistent instant before validation begins.
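One possible shape for the per-shard synchronization step is sketched below. It is an assumption about orchestrator policy rather than the page's own implementation: `reconcile_boundaries` and `DrillRejected` are hypothetical names, and the policy shown rejects the drill if any shard failed to resolve or resolved past the shared target.

```python
from typing import Dict, Optional, Tuple

Boundary = Tuple[int, int]  # (ts.t seconds, ts.i ordinal)


class DrillRejected(Exception):
    """At least one shard cannot be restored consistently to the shared target."""


def reconcile_boundaries(
    target_epoch: int, per_shard: Dict[str, Optional[Boundary]]
) -> Dict[str, str]:
    """Validate every shard's boundary and return its --oplogLimit string."""
    missing = sorted(s for s, b in per_shard.items() if b is None)
    if missing:
        raise DrillRejected(f"no boundary resolved for: {missing}")

    overshoot = sorted(s for s, b in per_shard.items() if b[0] > target_epoch)
    if overshoot:
        raise DrillRejected(f"boundary is past the shared target on: {overshoot}")

    return {s: f"{b[0]}:{b[1]}" for s, b in sorted(per_shard.items())}


# Illustrative resolver results for two shards sharing one target.
resolved = {"shard-rs0": (1731681000, 14), "shard-rs1": (1731680997, 3)}
limits = reconcile_boundaries(1731681000, resolved)
```

Each shard keeps its own boundary at or before the shared target; a shard with no writes in the final seconds simply resolves slightly earlier.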

Verification and Expected Output

A successful resolution logs the boundary, prints two machine-parseable lines, and exits 0:

```text
2026-07-05 04:12:01 | INFO | mongo_pitr_resolver | Resolved boundary: oplogLimit=1731681000:14
OPLOG_LIMIT: 1731681000:14
EXECUTE: mongorestore --host shard-rs0-primary.internal --port 27017 --oplogReplay --oplogLimit 1731681000:14 --archive=/backup/latest/shard-rs0-primary.internal.archive --numParallelCollections 4 --writeConcern 'majority'
```

An unreachable target emits a CRITICAL line and exits 1 — the drill must not proceed:

```text
2026-07-05 04:12:01 | CRITICAL | mongo_pitr_resolver | Target unreachable: No oplog entry at or before 1731681000 for shard 'shard-rs0-primary.internal'. Verify retention window and catalog indexing.
```

The exit code is the contract the orchestrator reads:

0 — a committed boundary was resolved. Proceed to the sandbox restore.
1 — the target is unreachable (no entry at or before it, or outside retention). Halt the drill and escalate.
2 — malformed timestamp, catalog outage, or bad response. Abort the pipeline and fix the invocation.
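An orchestrator written in Python might encode this contract explicitly instead of comparing bare integers. The sketch below is illustrative: `ResolverExit` and `next_action` are hypothetical names, and treating any unknown code as an abort is an assumption that mirrors the wrapper's catch-all behavior.

```python
import enum


class ResolverExit(enum.IntEnum):
    RESOLVED = 0        # committed boundary found
    UNREACHABLE = 1     # recovery verdict: the point is not recoverable
    PIPELINE_ERROR = 2  # invocation, catalog, or schema fault


_ACTIONS = {
    ResolverExit.RESOLVED: "restore",
    ResolverExit.UNREACHABLE: "escalate",
    ResolverExit.PIPELINE_ERROR: "abort",
}


def next_action(returncode: int) -> str:
    """Map a resolver exit code to the orchestrator's next step."""
    try:
        code = ResolverExit(returncode)
    except ValueError:
        # Unknown codes (signals, interpreter crashes) are treated as pipeline faults.
        return "abort"
    return _ACTIONS[code]
```

For example, `next_action(1)` yields `"escalate"`, while an unexpected code such as 137 falls through to `"abort"`.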

Failure Modes and Troubleshooting

| Symptom | Cause | Remediation |
| --- | --- | --- |
| Exit `1`, `Target unreachable` | Requested epoch predates the oldest retained oplog entry | Size `oplogSizeMB` for peak write throughput; alert when `getReplicationInfo().timeDiff` drops below the drill’s lookback requirement |
| Boundary lands seconds before the target | Replica-set members drift and the ordinal reflects a stale second | Enforce `chrony` with `maxpoll 4` across all nodes; validate clock-skew metrics before the drill |
| `--oplogLimit` mismatch mid-restore | Primary failover during resolution moved the boundary | Pin resolution to a single secondary for the drill window (for example, capture with `mongodump --readPreference=secondary`) and re-resolve the boundary if an election occurs before the restore starts |
| Exit `2`, `Malformed catalog response` | Catalog returned entries without `ts.t`/`ts.i`, or a non-JSON body | Validate the catalog schema in CI; assert the `oplog_entries` contract before deploy |
| Exit `2`, `Catalog API request failed` after retries | Catalog service unreachable or 5xx beyond the retry budget | Check catalog health; the adapter already retries 500/502/503/504 with backoff, so a persistent failure is an infra incident |
| Restore truncates a transaction | Boundary aligned to a point after the last WiredTiger checkpoint | Align `--oplogLimit` to the latest checkpoint timestamp; verify `db.currentOp()` shows no active writes at the boundary |

Post-restore, confirm the applied boundary matches the request before validation trusts the dataset. Cross-reference the catalog audit record and the replica’s own oplog view:

```bash
# Applied oplog window on the restored sandbox member
mongosh --host "$VALIDATION_HOST" --eval "db.getReplicationInfo()"
# Catalog's record of the boundary this drill resolved
curl -s "${CATALOG_URL}/api/v1/audit/drill/${DRILL_ID}" | jq '.resolved_boundary'
```

Any deviation between requested and applied boundaries must trigger an alert and halt the pipeline — a drill that silently restored to the wrong instant is worse than one that failed loudly.

Integration Notes

The resolver is built for headless orchestration: it takes positional arguments, prints parseable lines, and returns strict POSIX codes. A thin wrapper turns the exit code into a restore decision or an escalation:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Capture the exit code explicitly: under `set -e` a bare failing assignment
# would terminate the script before the case statement runs.
rc=0
OUT=$(python3 mongo_pitr_resolver.py "$TARGET_ISO" "$CATALOG_URL" "$SHARD") || rc=$?
case $rc in
  0) echo "$OUT" | awk -F': ' '/^EXECUTE/{print $2}' > /run/drill/restore_cmd ;;
  1) curl -s -X POST "$PAGERDUTY_WEBHOOK" -d '{"event":"pitr_target_unreachable"}'; exit 1 ;;
  *) echo "[$(date -u)] resolver misconfigured; aborting"; exit 2 ;;
esac
```

Wire that wrapper into whichever scheduler owns the drill:

Airflow: invoke it from a BashOperator, or a PythonOperator that imports resolve_oplog_boundary directly and pushes the (seconds, ordinal) tuple to XCom so the downstream restore task consumes a typed value rather than scraping stdout. A non-zero exit fails the task and short-circuits the restore.
Celery: wrap the call in a task that raises on exit 1 so event-driven drills (fired when a fresh oplog batch lands) record the unreachable target and dispatch escalation with low latency.
cron or systemd timers: schedule the wrapper directly. When it runs as a systemd timer unit, the strict POSIX codes let OnFailure= handlers route alerts without extra glue; plain cron needs its own check on the exit status.
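When a scheduler task consumes the resolver's stdout rather than importing it, the two emitted lines can be parsed into a typed value. This is a minimal sketch: `ResolverResult` and `parse_resolver_stdout` are hypothetical names, and the cross-check between the two lines is an added safeguard, not something the resolver itself requires.

```python
import re
from typing import NamedTuple


class ResolverResult(NamedTuple):
    seconds: int
    ordinal: int
    command: str


_LIMIT_RE = re.compile(r"^OPLOG_LIMIT: (\d+):(\d+)$", re.MULTILINE)
_EXEC_RE = re.compile(r"^EXECUTE: (.+)$", re.MULTILINE)


def parse_resolver_stdout(stdout: str) -> ResolverResult:
    """Extract the boundary and restore command from the resolver's stdout."""
    limit = _LIMIT_RE.search(stdout)
    execute = _EXEC_RE.search(stdout)
    if limit is None or execute is None:
        raise ValueError("resolver stdout is missing OPLOG_LIMIT or EXECUTE")

    seconds, ordinal = int(limit.group(1)), int(limit.group(2))
    command = execute.group(1)
    # Cross-check: the command must carry the same limit the resolver reported.
    if f"--oplogLimit {seconds}:{ordinal}" not in command:
        raise ValueError("EXECUTE line does not match OPLOG_LIMIT")
    return ResolverResult(seconds, ordinal, command)


sample_stdout = (
    "OPLOG_LIMIT: 1731681000:14\n"
    "EXECUTE: mongorestore --host shard-rs0-primary.internal --port 27017 "
    "--oplogReplay --oplogLimit 1731681000:14 "
    "--archive=/backup/latest/shard-rs0-primary.internal.archive "
    "--numParallelCollections 4 --writeConcern 'majority'\n"
)
result = parse_resolver_stdout(sample_stdout)
```

The resulting named tuple can be returned from a task (for instance, pushed to XCom as a plain tuple) so the restore step never has to re-parse text.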

The resolved boundary becomes a deterministic input for the isolated restore that sandbox provisioning stands up, and its verified consistency feeds the downstream validation and smoke-test stages of restore drill orchestration and environment isolation. Feed the ISO target, resolved BSON timestamp, catalog-response hash, and executed command into an immutable audit sink so every recovery decision carries evidence of which instant was restored, when, and against which catalog version.

Frequently Asked Questions

Why select the entry at or before the target instead of the closest one?

Replaying past the requested instant would apply operations the recovery point is meant to exclude — for example the erroneous write a drill is trying to recover to just before. Selecting the greatest ts.t that is still <= target guarantees WiredTiger replays a contiguous, committed prefix of the oplog and stops exactly at or before the requested boundary, which is the only choice that preserves transactional consistency and audit meaning.

What does the ordinal (ts.i) actually protect against?

A busy primary can commit many operations within a single wall-clock second. The BSON timestamp disambiguates them with a low-order ordinal counter. Sorting on (ts.t, ts.i) rather than seconds alone ensures the boundary is the last committed operation of the target second, not an arbitrary earlier one, so --oplogLimit stops on a real commit rather than mid-second.

How do I keep a sharded restore consistent across shards?

Run the resolver per shard because each shard keeps an independent oplog window, then have the orchestrator reconcile the per-shard boundaries to a common instant before any restore begins. Pin resolution and restore to secondaries for the drill window so a mid-drill election cannot move a boundary underneath you, and reject the drill if any shard cannot reach the shared target.

When does the resolver exit 1 versus exit 2?

Exit 1 is a recovery verdict: the catalog responded correctly but the requested instant is not recoverable — no oplog entry at or before it, or the epoch is outside retention. Exit 2 is an invocation or infrastructure fault: a malformed timestamp, a catalog outage that survives the retry budget, or a response that violates the expected schema. Orchestrators should escalate 1 as "this point is gone" and treat 2 as "fix the pipeline and rerun."

Point-in-Time Recovery Targeting — the parent workflow this resolver plugs into as its MongoDB-native boundary stage.
Automating sandbox provisioning with Terraform — stands up the isolated cluster the emitted mongorestore command runs against.
Fallback chain design for Kubernetes clusters — how the drill routes when a boundary or restore path fails.
Smoke-test routing for microservice DR drills — the validation stage that consumes the restored, boundary-consistent dataset.
How to map RTO and RPO for PostgreSQL clusters — the recovery envelope that decides whether a resolved boundary is acceptable.

This script is one component of the broader Point-in-Time Recovery Targeting workflow.

For authoritative behavior of the underlying tooling, consult the MongoDB Manual: mongorestore; for timestamp parsing details, see the Python documentation for datetime.