Model-based resilience

Model-based resilience from service-topology models

Bering discovers service topology and endpoint contracts, then publishes stable model artifacts. Sheaft consumes those artifacts to simulate failures, evaluate policy gates, and track posture over time.

This is a model-based signal: fast, repeatable, and designed to complement live experiments.

How it works

The workflow is intentionally split: discovery upstream, evaluation downstream.

  1. Bering discovers topology and endpoint contracts from traces or OTLP input.

    Batch discovery works from trace files or topology inputs. Runtime mode accepts OTLP/HTTP and, optionally, OTLP/gRPC.

  2. Bering publishes stable model and snapshot artifacts.

    Artifacts become explicit handoff contracts for downstream tools and CI workflows.

  3. Sheaft consumes artifacts, simulates scenarios, evaluates gate policy, and can monitor posture continuously.

    Use batch mode for pull requests and release checks, then switch to watch/serve mode for posture tracking.
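The simulation step can be pictured as a graph walk over the model artifact. The following is a minimal sketch, not Sheaft's actual engine: the artifact layout (`services`, `edges` with `from`/`to` keys) and the function names `impacted_by` and `simulate_all` are assumptions made for illustration. It fails each service in turn and lists the services that transitively depend on it.

```python
from collections import defaultdict

# Hypothetical model artifact: each edge means "caller depends on callee".
# The field names are illustrative, not Bering's actual schema.
model = {
    "services": ["gateway", "orders", "payments", "inventory", "search"],
    "edges": [
        {"from": "gateway", "to": "orders"},
        {"from": "gateway", "to": "search"},
        {"from": "orders", "to": "payments"},
        {"from": "orders", "to": "inventory"},
    ],
}


def impacted_by(model, failed):
    """Return the services that transitively depend on the failed service."""
    callers = defaultdict(set)
    for edge in model["edges"]:
        callers[edge["to"]].add(edge["from"])

    impacted, stack = set(), [failed]
    while stack:
        current = stack.pop()
        for caller in callers[current]:
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    # In a cyclic graph the walk can reach the failed service itself; drop it.
    impacted.discard(failed)
    return impacted


def simulate_all(model):
    """Fail each service in turn and record which services are impacted."""
    return {name: sorted(impacted_by(model, name)) for name in model["services"]}


results = simulate_all(model)
for failed, impacted in results.items():
    print(f"{failed} down -> impacted: {impacted}")
```

Because the walk reads only the artifact, the same input always gives the same result, which is what makes this kind of check cheap to repeat on every pull request.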

Why not chaos-first

Live chaos engineering remains valuable. The point is sequencing, not replacement.

  • Live experiments are expensive to run frequently across many services and release branches.
  • A model-based gate is cheaper and can run on every pull request and release check.
  • Use model-based results to narrow and prioritize where live experiments will matter most.
  • This adds a pre-release layer; it does not claim to replace production validation.

Two products, two roles

Bering: discovery and publishing layer

Bering is responsible for producing topology and contract artifacts and publishing them as stable inputs.

  • Discovers service topology from traces, OTLP streams, or explicit topology input.
  • Builds endpoint contract and dependency artifacts for downstream consumers.
  • Supports deterministic batch mode and long-running runtime mode.
  • Does not own simulation, gating policy, or chaos execution.
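To make the discovery role concrete, here is a minimal sketch of how service edges could be derived from trace spans. It is not Bering's implementation: the simplified span records, the artifact layout, and the helper names `discover_edges` and `to_artifact` are assumptions for illustration. Sorting the output is one simple way to get the determinism that batch mode calls for.

```python
import json

# Hypothetical, simplified span records; real OTLP spans carry many more fields.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "orders"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "service": "payments"},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "b", "service": "orders"},
    {"trace_id": "t2", "span_id": "a", "parent_span_id": None, "service": "gateway"},
    {"trace_id": "t2", "span_id": "b", "parent_span_id": "a", "service": "search"},
]


def discover_edges(spans):
    """Derive caller -> callee service edges from parent/child span links."""
    service_of = {(s["trace_id"], s["span_id"]): s["service"] for s in spans}
    edges = set()
    for s in spans:
        parent = service_of.get((s["trace_id"], s["parent_span_id"]))
        # Skip root spans, orphaned spans, and same-service (in-process) children.
        if parent is not None and parent != s["service"]:
            edges.add((parent, s["service"]))
    return sorted(edges)


def to_artifact(spans):
    """Serialize with sorted content so identical input yields identical output."""
    services = sorted({s["service"] for s in spans})
    edges = [{"from": a, "to": b} for a, b in discover_edges(spans)]
    return json.dumps({"services": services, "edges": edges}, sort_keys=True, indent=2)


artifact = to_artifact(spans)
print(artifact)
```

Feeding the same spans in a different order produces a byte-identical artifact, so downstream consumers can diff or cache it safely.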

Sheaft: simulation, gating, and posture monitoring

Sheaft stays downstream of discovery and evaluates resilience posture from upstream model artifacts.

  • Consumes published model/snapshot artifacts from Bering or compatible producers.
  • Runs simulation-based resilience analysis across CI/CD and adjacent engineering workflows.
  • Evaluates gate policy for release decisions.
  • Can run as a long-lived service for posture history, diffs, and metrics.
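Gate evaluation can be thought of as a pure function from simulation results and a policy to a pass/fail decision. The sketch below assumes a hypothetical policy with a `max_blast_radius` threshold and an `exempt` list; neither the policy keys nor the `evaluate_gate` function come from Sheaft's documentation.

```python
# Hypothetical simulation output: failed service -> services impacted by it.
simulation = {
    "gateway": [],
    "orders": ["gateway"],
    "payments": ["gateway", "orders"],
    "search": ["gateway"],
}

# Hypothetical policy; the keys and threshold are illustrative only.
policy = {
    "max_blast_radius": 1,  # no single failure may impact more than this many services
    "exempt": [],           # services excluded from the check
}


def evaluate_gate(simulation, policy):
    """Return a decision plus the list of violations that caused a failure."""
    limit = policy["max_blast_radius"]
    exempt = set(policy.get("exempt", []))
    violations = []
    for failed, impacted in sorted(simulation.items()):
        if failed in exempt:
            continue
        if len(impacted) > limit:
            violations.append(f"{failed}: impacts {len(impacted)} services (limit {limit})")
    return {"passed": not violations, "violations": violations}


decision = evaluate_gate(simulation, policy)
print("PASS" if decision["passed"] else "FAIL")
for violation in decision["violations"]:
    print(" -", violation)
```

In a CI job, a wrapper script would typically exit non-zero when `decision["passed"]` is false so the pipeline blocks the release.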

Who it is for

Teams that need a practical resilience signal before release, not a once-a-quarter exercise.

Core fit

  • Platform, SRE, and DevOps teams responsible for delivery risk.
  • Microservice teams already collecting traces/OTLP or topology artifacts.
  • Organizations that want a low-cost pre-release resilience checkpoint.

SMB / lean teams

  • A cheaper entry path into resilience practice.
  • Run small, frequent checks in CI before spending on heavy live tests.

Larger organizations

  • Prioritize where costly live chaos experiments should run first.
  • Standardize artifact handoff between discovery and gate evaluation teams.

Quick start

  1. Install Bering and generate a first model artifact

    Use the install guide, then run discovery on sample traces or your own input. See: Bering install guide.

  2. Publish or export the resulting model/snapshot artifact

    Use the sample folders to understand the expected artifact shape and handoff format. See: Bering examples.

  3. Run Sheaft simulation and CI gate on that artifact

    Start with the CI gate doc and the example CI assets. See: Sheaft CI examples.

  4. Move to continuous posture monitoring when needed

    When batch checks are stable, enable long-running posture tracking. See: Sheaft observability mode.

Resilience posture in observability mode

After the CI gate is in place, you can run Sheaft continuously to track posture drift instead of checking only at release time.

  1. Bering keeps publishing rolling snapshots

    In runtime mode, Bering emits model/snapshot artifacts per time window from OTLP and trace input.

  2. Sheaft in watch/serve mode consumes each new artifact

    Every new artifact is re-evaluated against analysis rules and gate policy without waiting for the next release.

  3. Posture history and diffs show what changed

    Teams can inspect when posture moved, what dependencies shifted, and which policies flipped state.

  4. Use the signal for prioritization

    Prioritize incidents, hardening work, and expensive live chaos runs using the latest posture trend.
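The history-and-diff step above amounts to comparing consecutive snapshots. Here is a minimal sketch, assuming a hypothetical snapshot layout with `edges` and a `policies` map of policy name to `"pass"` or `"fail"`; the layout and the `diff_snapshots` function are illustrative and not taken from Sheaft.

```python
# Hypothetical snapshots for two consecutive time windows.
previous = {
    "edges": [
        {"from": "gateway", "to": "orders"},
        {"from": "gateway", "to": "search"},
        {"from": "orders", "to": "payments"},
    ],
    "policies": {"max_blast_radius": "pass", "no_orphan_services": "pass"},
}
current = {
    "edges": [
        {"from": "gateway", "to": "orders"},
        {"from": "orders", "to": "payments"},
        {"from": "orders", "to": "inventory"},
    ],
    "policies": {"max_blast_radius": "fail", "no_orphan_services": "pass"},
}


def edge_set(snapshot):
    """Convert the edge list into a set of (caller, callee) tuples."""
    return {(e["from"], e["to"]) for e in snapshot["edges"]}


def diff_snapshots(previous, current):
    """Report dependency changes and policies whose state flipped."""
    prev_edges, curr_edges = edge_set(previous), edge_set(current)
    flipped = {
        name: (previous["policies"][name], status)
        for name, status in current["policies"].items()
        if name in previous["policies"] and previous["policies"][name] != status
    }
    return {
        "added_edges": sorted(curr_edges - prev_edges),
        "removed_edges": sorted(prev_edges - curr_edges),
        "flipped_policies": flipped,
    }


drift = diff_snapshots(previous, current)
print(drift)
```

A diff like this points at the dependency that changed alongside the policy that flipped, which is the context a team needs when deciding where a live experiment is worth running.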

What teams get in practice

  • Current posture status and when it last changed.
  • Diff/history context for topology and policy outcomes.
  • A cheaper continuous resilience signal between live experiments.

Demo report

Roadmap

Planned product milestones and research-to-product transition phases.