Why Crash Recovery Matters

Hardware reboots. Kernel panics happen. Power supplies fail. OOM killers terminate processes. In a 24/7 streaming environment, the question is not if your software will restart, but when, and what happens to your streams when it does.

Without automatic crash recovery, every restart means manual intervention: someone has to notice the problem, log in, and re-create the routes. At 3 AM on a Sunday, that could mean minutes or hours of downtime.

Vajra Cast was designed for unattended operation. When it starts (whether from a fresh boot, a process restart, or a container reschedule), it automatically restores all routes to their previous state and reconnects every stream.

How Crash Recovery Works

Vajra Cast’s crash recovery is built on a simple principle: the database is the source of truth, not the running process.

Every route, input, output, and configuration parameter is persisted to a PostgreSQL database the moment it is created or modified. The in-memory routing engine is a projection of the database state. It can be reconstructed at any time from the stored configuration.

The Recovery Sequence

When Vajra Cast starts up, the following sequence executes:

  1. Database connection. The application connects to PostgreSQL and verifies schema integrity.
  2. State load. All route configurations are read from the database: inputs, outputs, failover chains, transcoding profiles, audio mappings.
  3. Route reconstruction. The routing engine creates each route in memory based on the stored configuration.
  4. Listener binding. SRT and RTMP listeners bind to their configured ports and begin accepting connections.
  5. Caller initiation. SRT callers and RTMP push outputs begin connecting to their remote endpoints.
  6. Health monitoring start. The failover engine begins monitoring all inputs.

This entire sequence completes in seconds. From the moment the process starts to the moment streams are flowing again, the recovery time is typically under 5 seconds, often under 2 seconds for small to medium deployments.

What Gets Persisted

Everything that defines a route’s behavior is stored in the database:

Component              Persisted Data
Ingests                Protocol, port, latency, encryption, stream ID
Outputs                Protocol, target URL, stream key, bitrate settings
Failover chains        Priority order, health thresholds, recovery settings
Transcoding profiles   Codec, resolution, bitrate, hardware acceleration
Audio matrix           Channel mappings, gain settings
Route state            Enabled/disabled status

What is not persisted: transient runtime state like current bitrate measurements, packet counters, and active connection handles. These are reconstructed from live data as streams reconnect.

PostgreSQL: The Persistence Layer

Vajra Cast uses PostgreSQL as its state store. This is a deliberate choice over lighter alternatives like SQLite or embedded key-value stores:

Durability. PostgreSQL uses write-ahead logging (WAL) to ensure that committed transactions survive crashes. If Vajra Cast writes a route configuration and PostgreSQL acknowledges it, that data is safe, even if the server loses power in the next millisecond.
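
This durability guarantee relies on PostgreSQL's default settings; if fsync or synchronous_commit has been disabled for write speed, acknowledged transactions can be lost in a crash. You can confirm the defaults from psql:

-- Both settings default to 'on'. Turning either off trades durability
-- for write speed and undermines crash recovery.
SHOW fsync;
SHOW synchronous_commit;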

Concurrency. Multiple Vajra Cast components (the routing engine, the REST API, the web UI) can read and write the database simultaneously without corruption. MVCC (Multi-Version Concurrency Control) handles this transparently.

Operational maturity. PostgreSQL has decades of production use. Your operations team already knows how to back it up, replicate it, and monitor it. There is no proprietary database format to worry about.

External access. Because the state is in a standard PostgreSQL database, you can query it directly for reporting, auditing, or integration with external systems. The schema is documented and stable.
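
For example, a reporting query can list configured routes straight from the database. The table and column names below are purely illustrative, not the actual Vajra Cast schema; consult the schema documentation for the real names:

-- Illustrative only: table and column names are hypothetical.
SELECT name, enabled, updated_at
FROM routes
ORDER BY updated_at DESC;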

Database Deployment Patterns

For a single-server deployment, PostgreSQL runs alongside Vajra Cast on the same host. The Docker Compose configuration includes both services:

services:
  vajracast:
    image: vajracast/vajracast:latest
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgres://vajracast:secret@postgres:5432/vajracast

  postgres:
    image: postgres:16
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: vajracast
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: vajracast

volumes:
  pgdata:
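
With this layout, a routine backup is a standard pg_dump against the postgres service, for example:

# Dump the configuration database to a file on the host.
docker compose exec postgres pg_dump -U vajracast vajracast > vajracast-backup.sql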

For high-availability deployments, use an external PostgreSQL cluster (managed services like AWS RDS, Google Cloud SQL, or a self-managed Patroni cluster). This way, the database survives even if the entire Vajra Cast host fails.
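
In that case, drop the bundled postgres service and point DATABASE_URL at the external instance. The hostname and credentials below are placeholders:

services:
  vajracast:
    image: vajracast/vajracast:latest
    environment:
      # Placeholder endpoint: substitute your managed or clustered instance.
      DATABASE_URL: postgres://vajracast:secret@db.example.internal:5432/vajracast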

Recovery Time

How fast does Vajra Cast recover? It depends on the scale of your deployment and the nature of the restart.

Scenario                         Typical Recovery Time
Process restart (same host)      1-3 seconds
Container restart (Docker)       2-5 seconds
Kubernetes pod reschedule        5-15 seconds
Full host reboot                 30-60 seconds (OS boot + app start)
New host with external DB        5-10 seconds

The application startup itself is fast, under 2 seconds for most configurations. The dominant factor in recovery time is how long it takes for the environment (OS, container runtime, network) to become available.

Stream Reconnection

After Vajra Cast recovers, the streams need to reconnect:

  • SRT Listeners: Available immediately. Remote callers reconnect automatically; SRT’s connection management handles the retries.
  • SRT Callers: Vajra Cast initiates the outbound connection immediately on startup. The remote listener sees a new connection within seconds.
  • RTMP Listeners: Available immediately. Encoders pushing RTMP typically retry on disconnect, so they reconnect within their retry interval (usually 2-5 seconds).
  • RTMP Push Outputs: Vajra Cast reconnects to the remote RTMP server immediately on startup.

The end-to-end recovery (from crash to streams flowing again) is the sum of the application recovery time plus the stream reconnection time. For a well-configured system, this is typically under 10 seconds.

Crash Detection and Process Management

Vajra Cast does not manage its own restarts. Instead, it relies on the process manager or container orchestrator to detect crashes and restart the application:

  • systemd: Configure Restart=always and RestartSec=1 in the unit file. systemd detects the process exit and restarts it within 1 second (a unit-file sketch appears below this list).
  • Docker: Set restart: unless-stopped or restart: always in your Docker Compose file. The Docker daemon restarts the container on exit.
  • Kubernetes: The pod’s restartPolicy: Always (the default) ensures the kubelet restarts the container. Liveness probes can detect hung processes that haven’t exited but are no longer functional.

This separation of concerns (Vajra Cast handles state recovery, the orchestrator handles process lifecycle) follows the Unix philosophy and avoids the complexity of self-healing daemons.
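
For the systemd option, a minimal unit file might look like the following sketch; the binary path and service account are assumptions to adapt to your installation:

[Unit]
Description=Vajra Cast stream router
After=network-online.target
Wants=network-online.target

[Service]
# ExecStart path and User are assumptions; adjust to your installation.
ExecStart=/usr/local/bin/vajracast
User=vajracast
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target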

Health Checks

Vajra Cast exposes health check endpoints that process managers and load balancers can use to verify the application is running and functional:

GET /api/health

This endpoint returns HTTP 200 when the application is running and the database connection is healthy. Use it as a liveness probe in Kubernetes or a health check in Docker.
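
In Docker Compose, that might look like the following sketch; the API port (8080) is an assumption, and the check requires curl to be present in the image:

services:
  vajracast:
    healthcheck:
      # Port 8080 is an assumption; use the port your API listens on.
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 10s
      timeout: 3s
      retries: 3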

For deeper verification, the readiness endpoint confirms that the routing engine has finished loading:

GET /api/ready

This returns HTTP 200 only after all routes have been reconstructed and listeners are bound. Use it as a readiness probe in Kubernetes to prevent traffic from being routed to a still-initializing instance.
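
In a Kubernetes container spec, the two endpoints map naturally onto liveness and readiness probes; the container port (8080) is an assumption:

# Port 8080 is an assumption; match the port your API listens on.
livenessProbe:
  httpGet:
    path: /api/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3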

Designing for Recovery

To minimize the impact of crashes on your viewers:

  1. Use SRT for critical paths. SRT’s built-in reconnection and latency buffer absorbs brief outages gracefully.
  2. Configure output buffers. A 2-5 second output buffer on your CDN or downstream processor can hide brief interruptions during recovery.
  3. Use external PostgreSQL. If the database is on the same host and the host fails, you lose state. An external database eliminates this risk.
  4. Monitor and alert. Prometheus metrics and health check endpoints let you detect crashes instantly and verify recovery.
  5. Test recovery regularly. Kill the process during a test stream. Measure the time to full recovery. Make it part of your operational runbook.
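
A minimal recovery drill, assuming a systemd deployment with Restart=always, a process name of vajracast, and the API on port 8080 (all three are assumptions):

#!/bin/sh
# Kill the process, then measure how long until /api/ready answers again.
start=$(date +%s)
sudo pkill -9 -x vajracast
until curl -sf http://localhost:8080/api/ready > /dev/null; do
  sleep 0.5
done
echo "Recovered in $(( $(date +%s) - start )) seconds"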

Next Steps