Crash Recovery: Automatic Stream Restoration After Failures
How Vajra Cast automatically restarts routes after crashes with state persistence, PostgreSQL backing, and fast recovery times.
Why Crash Recovery Matters
Hardware reboots. Kernel panics happen. Power supplies fail. The OOM killer terminates processes. In a 24/7 streaming environment, the question is not if your software will restart, but when, and what happens to your streams when it does.
Without automatic crash recovery, every restart means manual intervention: someone has to notice the problem, log in, and re-create the routes. At 3 AM on a Sunday, that could mean minutes or hours of downtime.
Vajra Cast was designed for unattended operation. When it starts (whether from a fresh boot, a process restart, or a container reschedule), it automatically restores all routes to their previous state and reconnects every stream.
How Crash Recovery Works
Vajra Cast’s crash recovery is built on a simple principle: the database is the source of truth, not the running process.
Every route, input, output, and configuration parameter is persisted to a PostgreSQL database the moment it is created or modified. The in-memory routing engine is a projection of the database state. It can be reconstructed at any time from the stored configuration.
The Recovery Sequence
When Vajra Cast starts up, the following sequence executes:
- Database connection. The application connects to PostgreSQL and verifies schema integrity.
- State load. All route configurations are read from the database: inputs, outputs, failover chains, transcoding profiles, audio mappings.
- Route reconstruction. The routing engine creates each route in memory based on the stored configuration.
- Listener binding. SRT and RTMP listeners bind to their configured ports and begin accepting connections.
- Caller initiation. SRT callers and RTMP push outputs begin connecting to their remote endpoints.
- Health monitoring start. The failover engine begins monitoring all inputs.
This entire sequence completes in seconds. From the moment the process starts to the moment streams are flowing again, the recovery time is typically under 5 seconds, often under 2 seconds for small to medium deployments.
What Gets Persisted
Everything that defines a route’s behavior is stored in the database:
| Component | Persisted Data |
|---|---|
| Ingests | Protocol, port, latency, encryption, stream ID |
| Outputs | Protocol, target URL, stream key, bitrate settings |
| Failover chains | Priority order, health thresholds, recovery settings |
| Transcoding profiles | Codec, resolution, bitrate, hardware acceleration |
| Audio matrix | Channel mappings, gain settings |
| Route state | Enabled/disabled status |
What is not persisted: transient runtime state like current bitrate measurements, packet counters, and active connection handles. These are reconstructed from live data as streams reconnect.
PostgreSQL: The Persistence Layer
Vajra Cast uses PostgreSQL as its state store. This is a deliberate choice over lighter alternatives like SQLite or embedded key-value stores:
Durability. PostgreSQL uses write-ahead logging (WAL) to ensure that committed transactions survive crashes. If Vajra Cast writes a route configuration and PostgreSQL acknowledges it, that data is safe, even if the server loses power in the next millisecond.
Concurrency. Multiple Vajra Cast components (the routing engine, the REST API, the web UI) can read and write the database simultaneously without corruption. MVCC (Multi-Version Concurrency Control) handles this transparently.
Operational maturity. PostgreSQL has decades of production use. Your operations team already knows how to back it up, replicate it, and monitor it. There is no proprietary database format to worry about.
External access. Because the state is in a standard PostgreSQL database, you can query it directly for reporting, auditing, or integration with external systems. The schema is documented and stable.
Database Deployment Patterns
For a single-server deployment, PostgreSQL runs alongside Vajra Cast on the same host. The Docker Compose configuration includes both services:
```yaml
services:
  vajracast:
    image: vajracast/vajracast:latest
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgres://vajracast:secret@postgres:5432/vajracast
  postgres:
    image: postgres:16
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: vajracast
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: vajracast
volumes:
  pgdata:
```
For high-availability deployments, use an external PostgreSQL cluster (managed services like AWS RDS, Google Cloud SQL, or a self-managed Patroni cluster). This way, the database survives even if the entire Vajra Cast host fails.
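As a sketch, pointing Vajra Cast at an external database only requires changing the connection string, assuming the same `DATABASE_URL` environment variable shown above; the hostname and password below are placeholders:

```yaml
services:
  vajracast:
    image: vajracast/vajracast:latest
    environment:
      # Placeholder endpoint: replace with your RDS / Cloud SQL / Patroni address.
      DATABASE_URL: postgres://vajracast:CHANGE_ME@db.example.internal:5432/vajracast
```

With no local `postgres` service in the Compose file, the state store lives entirely outside the streaming host.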
Recovery Time
How fast does Vajra Cast recover? It depends on the scale of your deployment and the nature of the restart.
| Scenario | Typical Recovery Time |
|---|---|
| Process restart (same host) | 1-3 seconds |
| Container restart (Docker) | 2-5 seconds |
| Kubernetes pod reschedule | 5-15 seconds |
| Full host reboot | 30-60 seconds (OS boot + app start) |
| New host with external DB | 5-10 seconds |
The application startup itself is fast: under 2 seconds for most configurations. The dominant factor in recovery time is how long it takes for the environment (OS, container runtime, network) to become available.
Stream Reconnection
After Vajra Cast recovers, the streams need to reconnect:
- SRT Listeners: Available immediately. Remote callers reconnect automatically through SRT’s built-in connection management.
- SRT Callers: Vajra Cast initiates the outbound connection immediately on startup. The remote listener sees a new connection within seconds.
- RTMP Listeners: Available immediately. Encoders pushing RTMP typically retry on disconnect, so they reconnect within their retry interval (usually 2-5 seconds).
- RTMP Push Outputs: Vajra Cast reconnects to the remote RTMP server immediately on startup.
The end-to-end recovery (from crash to streams flowing again) is the sum of the application recovery time plus the stream reconnection time. For a well-configured system, this is typically under 10 seconds.
Crash Detection and Process Management
Vajra Cast does not manage its own restarts. Instead, it relies on the process manager or container orchestrator to detect crashes and restart the application:
- systemd: Configure `Restart=always` and `RestartSec=1` in the unit file. systemd detects the process exit and restarts it within 1 second.
- Docker: Set `restart: unless-stopped` or `restart: always` in your Docker Compose file (see the Compose snippet below). The Docker daemon restarts the container on exit.
- Kubernetes: The pod’s `restartPolicy: Always` (the default) ensures the kubelet restarts the container. Liveness probes can detect hung processes that haven’t exited but are no longer functional.
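For the Docker case, a minimal Compose override might look like this (the image tag matches the single-server example above):

```yaml
services:
  vajracast:
    image: vajracast/vajracast:latest
    # Restart automatically on crash, but not after an explicit `docker compose stop`.
    restart: unless-stopped
```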
This separation of concerns (Vajra Cast handles state recovery, the orchestrator handles process lifecycle) follows the Unix philosophy and avoids the complexity of self-healing daemons.
Health Checks
Vajra Cast exposes health check endpoints that process managers and load balancers can use to verify the application is running and functional:
```
GET /api/health
```
This endpoint returns HTTP 200 when the application is running and the database connection is healthy. Use it as a liveness probe in Kubernetes or a health check in Docker.
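As a sketch for Docker, a Compose healthcheck could poll this endpoint; the block below assumes the REST API listens on port 8080 inside the container and that curl is available in the image:

```yaml
services:
  vajracast:
    image: vajracast/vajracast:latest
    healthcheck:
      # Assumes the API is served on port 8080 and curl exists in the image.
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 10s
      timeout: 3s
      retries: 3
```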
For deeper verification, the readiness endpoint confirms that the routing engine has finished loading:
```
GET /api/ready
```
This returns HTTP 200 only after all routes have been reconstructed and listeners are bound. Use it as a readiness probe in Kubernetes to prevent traffic from being routed to a still-initializing instance.
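In Kubernetes, the two endpoints map naturally onto liveness and readiness probes. A container-spec sketch, again assuming port 8080 for the API:

```yaml
containers:
  - name: vajracast
    image: vajracast/vajracast:latest
    livenessProbe:
      httpGet:
        path: /api/health
        port: 8080        # assumed API port
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /api/ready
        port: 8080
      periodSeconds: 5
```

The readiness probe keeps traffic away from the pod until all routes are reconstructed, while the liveness probe restarts a pod whose process is hung.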
Designing for Recovery
To minimize the impact of crashes on your viewers:
- Use SRT for critical paths. SRT’s built-in reconnection and latency buffer absorb brief outages gracefully.
- Configure output buffers. A 2-5 second output buffer on your CDN or downstream processor can hide brief interruptions during recovery.
- Use external PostgreSQL. If the database is on the same host and the host fails, you lose state. An external database eliminates this risk.
- Monitor and alert. Prometheus metrics and health check endpoints let you detect crashes instantly and verify recovery.
- Test recovery regularly. Kill the process during a test stream. Measure the time to full recovery. Make it part of your operational runbook.
Next Steps
- Return to the Video Stream Failover Guide for the complete reliability architecture
- Learn about Multi-Input Failover for automatic input switching
- Explore Docker and Kubernetes Deployment for production container orchestration