Video Stream Failover: Complete Guide to Zero-Downtime Streaming
Learn how video stream failover works, why it matters for live broadcasts, and how to implement automatic failover with Vajra Cast for zero-downtime streaming.
What Is Video Stream Failover?
Video stream failover is the automatic process of switching from a failed or degraded video source to a backup source without interrupting the output stream. When a primary input drops (whether due to encoder failure, network outage, or signal degradation), the failover system detects the problem and routes a backup source to the output in its place.
For viewers, the goal is invisibility. A properly implemented failover switch should be imperceptible: no black frames, no buffering spinner, no interruption. The stream simply continues as though nothing happened.
Failover is not optional for professional broadcasting. Every live production that matters (sports coverage, news broadcasts, corporate events, 24/7 channels) relies on some form of failover protection. The question is not whether you need it, but how to implement it correctly.
Why Failover Matters More Than Ever
The economics of live streaming have changed. A decade ago, a dropped stream was an inconvenience. Today, it is a direct financial loss:
- Advertising revenue evaporates the moment viewers leave a broken stream
- Platform algorithms penalize channels with reliability issues, reducing future discoverability
- Contractual SLAs in enterprise and sports broadcasting carry financial penalties for downtime
- Brand reputation takes a hit that no post-mortem can fully repair
The shift to IP-based transport (away from dedicated SDI circuits) has increased both the opportunity and the risk. IP networks are cheaper and more flexible, but they introduce failure modes that dedicated circuits never had: packet loss, route changes, congestion, and endpoint crashes. Failover is the mechanism that makes IP transport trustworthy enough for mission-critical broadcasting.
Types of Failover: Hot, Warm, and Cold Standby
Not all failover is created equal. The three standard approaches differ in readiness, cost, and switching speed.
Hot Standby
In a hot standby configuration, the backup source is fully active and synchronized with the primary. Both sources are receiving, decoding, and buffering simultaneously. When the primary fails, the switch is near-instantaneous because the backup is already running.
Characteristics:
- Switching time: sub-50ms (total failover including detection: under 200ms)
- Resource cost: 2x the ingest bandwidth and processing
- Reliability: highest, because the backup is proven live before it is needed
- Use case: mission-critical broadcasts where any interruption is unacceptable
Hot standby is what Vajra Cast implements by default. Every input in a failover chain is actively monitored and pre-buffered, so the switch happens in the time it takes to redirect an internal pointer, not the time it takes to establish a new connection.
Warm Standby
In warm standby, the backup source is connected but not fully active. The connection is established and periodically validated, but the system is not continuously decoding the full stream. On failover, there is a brief initialization period.
Characteristics:
- Switching time: 500ms to 2 seconds
- Resource cost: lower than hot standby (connection overhead only)
- Reliability: good, but there is a visible transition
- Use case: secondary feeds, non-critical streams, cost-sensitive deployments
Cold Standby
Cold standby means the backup source is configured but not connected. On primary failure, the system initiates a new connection from scratch: DNS resolution, TCP/UDP handshake, stream negotiation, and buffering.
Characteristics:
- Switching time: 2 to 10+ seconds
- Resource cost: minimal until failover triggers
- Reliability: lowest. The backup path is untested until it is needed
- Use case: disaster recovery, where some downtime is acceptable
For professional broadcasting, hot standby is the only option that meets audience expectations. Cold standby is better suited for background infrastructure (e.g., failing over a recording server) where a few seconds of gap is tolerable.
How Vajra Cast Implements Failover
Vajra Cast was designed with failover as a core architectural component, not an afterthought bolted onto a routing engine. Here is how it works under the hood.
Priority Chains
Every route in Vajra Cast can have multiple inputs arranged in a priority chain. The input with the highest priority is the preferred source. If it fails, the system automatically switches to the next input in the chain.
Priority 1: SRT Listener (main encoder) ← active
Priority 2: SRT Caller (backup encoder) ← hot standby
Priority 3: RTMP (cloud encoder) ← hot standby
Priority 4: HTTP/TS (slate/fallback) ← hot standby
There is no limit to the number of inputs in a chain. Each input is independently monitored, and the system always selects the highest-priority healthy input.
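The selection rule can be sketched in a few lines of Python. This is only an illustration of "highest-priority healthy input wins", not Vajra Cast's actual code: the `Input` class and `select_active` function are hypothetical names, and the `healthy` flag is assumed to come from the health monitoring described in the next section.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Input:
    """Hypothetical model of one entry in a priority chain."""
    name: str
    priority: int   # 1 is the most preferred source
    healthy: bool   # assumed to be computed by health monitoring elsewhere


def select_active(chain: Sequence[Input]) -> Optional[Input]:
    """Return the healthy input with the best (lowest-numbered) priority."""
    healthy = [i for i in chain if i.healthy]
    return min(healthy, key=lambda i: i.priority) if healthy else None


chain = [
    Input("SRT Listener (main encoder)", 1, healthy=False),
    Input("SRT Caller (backup encoder)", 2, healthy=True),
    Input("RTMP (cloud encoder)", 3, healthy=True),
    Input("HTTP/TS (slate/fallback)", 4, healthy=True),
]
active = select_active(chain)
print(active.name if active else "no healthy input")
```

With the main encoder marked unhealthy, the sketch picks the priority 2 input. Re-running the same selection whenever a health flag changes covers both failover and switch-back.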
Health Monitoring
Vajra Cast continuously evaluates the health of every input using multiple signals:
- Connection state: is the source connected and delivering data?
- Bitrate analysis: is the bitrate within expected range, or has it dropped below a configurable threshold?
- Packet loss rate: for SRT inputs, is loss exceeding the recovery capacity?
- Continuity counters: are MPEG-TS continuity counters incrementing correctly, or are there gaps?
- Timeout detection: has data stopped arriving entirely?
Each health signal has a configurable threshold and hysteresis window. This prevents false failovers caused by momentary network glitches. For example, you might configure: “fail over if packet loss exceeds 15% for more than 300ms continuously.”
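A rule of the form "exceeds X for more than Y continuously" can be modeled as a small state machine. The sketch below is a minimal illustration under assumed names (`SustainedThreshold`, `update`); it is not taken from Vajra Cast, and it covers only the sustained-duration part of the rule, with timestamps supplied by the caller.

```python
from typing import Optional


class SustainedThreshold:
    """Trips only when a metric stays above a limit for a minimum duration."""

    def __init__(self, limit: float, min_duration_s: float) -> None:
        self.limit = limit
        self.min_duration_s = min_duration_s
        self._breach_started: Optional[float] = None

    def update(self, value: float, now_s: float) -> bool:
        """Feed one sample; return True when failover should trigger."""
        if value <= self.limit:
            self._breach_started = None   # any healthy sample resets the window
            return False
        if self._breach_started is None:
            self._breach_started = now_s  # first sample of a new breach
        return now_s - self._breach_started > self.min_duration_s


# "Fail over if packet loss exceeds 15% for more than 300ms continuously."
detector = SustainedThreshold(limit=0.15, min_duration_s=0.300)
samples = [(0.0, 0.20), (0.1, 0.25), (0.2, 0.05), (0.3, 0.30), (0.5, 0.30), (0.7, 0.30)]
for t, loss in samples:
    print(f"t={t:.1f}s loss={loss:.0%} trigger={detector.update(loss, t)}")
```

In this run the dip back to 5% loss at 0.2s resets the window, so the first burst does not trigger a failover. Only the second burst, once it has lasted longer than 300ms, does.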
Sub-200ms Switching
When a failover condition is detected, the switch happens in three phases:
- Detection (configurable, typically 50-100ms): health metrics cross the threshold for the configured duration
- Decision (under 1ms): the routing engine selects the next healthy input from the priority chain
- Switching (under 1ms): the internal stream pointer redirects to the backup input’s pre-buffered data
Because backup inputs are already ingested, decoded, and buffered in hot standby, the actual switch is a pointer operation. There is no connection negotiation, no buffering delay, no codec initialization. The output continues with data from the backup source on the very next packet.
Total failover time: under 200ms in the worst case with the typical detection window above, typically under 100ms. At 30fps, that is 3-6 frames, imperceptible to viewers.
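The frame counts follow from simple arithmetic, shown here as a short Python sketch. The function names are illustrative, and the 1ms values for decision and switching are upper bounds taken from the phase list above. Detection dominates the total, so a longer configured window (such as the 300ms packet-loss example in the health-monitoring section) pushes the total past 200ms.

```python
def total_failover_ms(detection_ms: float,
                      decision_ms: float = 1.0,
                      switching_ms: float = 1.0) -> float:
    """Sum the three phases: detection, decision, switching."""
    return detection_ms + decision_ms + switching_ms


def frames_affected(duration_ms: float, fps: float = 30.0) -> float:
    """Convert a duration in milliseconds to a frame count at the given rate."""
    return duration_ms * fps / 1000.0


for detection_ms in (50, 100, 300):
    total = total_failover_ms(detection_ms)
    print(f"detection {detection_ms}ms -> total {total:.0f}ms, "
          f"{frames_affected(total):.1f} frames at 30fps")
```

At 30fps, 100ms corresponds to 3 frames and 200ms to 6 frames, which is where the 3-6 frame range comes from.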
Automatic Recovery
When the primary input recovers (reconnects and delivers healthy data), Vajra Cast can automatically switch back. This behavior is configurable:
- Auto-recover ON: switch back to the higher-priority input after a configurable hold-off period (e.g., 10 seconds of stable health)
- Auto-recover OFF: stay on the backup until an operator manually switches back
- Hold-off timer: prevents flapping when a source is intermittently failing
The hold-off timer is critical. Without it, a source that is bouncing between connected and disconnected will cause rapid switching (flapping) that is worse than staying on the backup.
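A hold-off timer can be sketched as follows. This is an assumed, simplified model (the `HoldOffRecovery` class and its method are hypothetical names, not Vajra Cast's API): the primary must stay healthy for the full hold-off period before a switch-back is allowed, and any relapse restarts the clock.

```python
from typing import Optional


class HoldOffRecovery:
    """Allows switch-back only after the primary has been healthy for hold_off_s."""

    def __init__(self, hold_off_s: float) -> None:
        self.hold_off_s = hold_off_s
        self._healthy_since: Optional[float] = None

    def should_switch_back(self, primary_healthy: bool, now_s: float) -> bool:
        if not primary_healthy:
            self._healthy_since = None    # any relapse restarts the timer
            return False
        if self._healthy_since is None:
            self._healthy_since = now_s   # start of a new healthy streak
        return now_s - self._healthy_since >= self.hold_off_s


recovery = HoldOffRecovery(hold_off_s=10)
# A flapping primary: healthy at 0s, drops at 4s, returns at 5s and stays up.
for t, healthy in [(0, True), (4, False), (5, True), (12, True), (15, True)]:
    print(f"t={t}s healthy={healthy} switch_back={recovery.should_switch_back(healthy, t)}")
```

In this timeline the switch-back is allowed only at 15s, ten seconds after the primary's last reconnect, so the brief drop at 4s never causes an extra switch.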
Protocol-Agnostic Failover
One of Vajra Cast’s architectural advantages is that failover works across protocols. The priority chain can mix any combination of supported input protocols:
| Priority | Protocol | Source | Notes |
|---|---|---|---|
| 1 | SRT (listener) | Main encoder on-site | Lowest latency, AES-256 encrypted |
| 2 | SRT (caller) | Backup encoder on-site | Independent network path |
| 3 | SRTLA | Mobile encoder via cellular | Bonded 4G/5G connection |
| 4 | RTMP | Cloud encoder | Legacy compatibility |
| 5 | HTTP/TS | Static slate file | “We’ll be right back” card |
This flexibility is essential for real-world deployments where not every source uses the same protocol. A remote contributor might send RTMP because their encoder does not support SRT. A mobile unit uses SRTLA for cellular bonding. The on-site encoder uses SRT for optimal performance. Vajra Cast treats them all equally in the failover chain.
For a deeper comparison of SRT and RTMP and when to use each, see SRT vs RTMP: Which Streaming Protocol Should You Use?
Real-World Failover Use Cases
Live Sports Broadcasting
Sports broadcasting is the most demanding failover scenario. A dropped feed during a goal, a touchdown, or a race finish is unrecoverable. The moment is gone, and no replay can substitute for the live experience.
Typical configuration:
- Primary: SRT from on-site production truck
- Backup 1: SRT from a second encoder on an independent network path (separate ISP or dedicated circuit)
- Backup 2: SRTLA from a bonded cellular unit as a last resort
- Backup 3: Static slate with “Technical difficulties” overlay
Vajra Cast’s priority chain handles this natively. The system runs all four inputs in hot standby, monitoring each one continuously. If the primary encoder crashes, the switch to Backup 1 happens in under 100ms. If the entire venue loses internet, the SRTLA cellular backup takes over. If even cellular fails, viewers see the slate rather than a broken player.
We have been running 40+ routes in this configuration for live sports production, 24/7. The system has been tested in real conditions, not just lab environments. For a deeper look at failover architectures for sports production, see our live sports broadcasting guide.
24/7 Linear Channels
Channels that broadcast around the clock (news networks, music channels, religious programming) cannot afford any downtime. Unlike event-based production where there is a defined start and end, 24/7 channels must survive every possible failure scenario across weeks and months.
Typical configuration:
- Primary: SRT from the playout server
- Backup 1: SRT from a redundant playout server
- Backup 2: HTTP/TS pull from a pre-programmed playlist server
- Crash recovery: failover is combined with crash recovery, so if the Vajra Cast process itself restarts, it rebuilds all routes automatically in under 5 seconds
The crash recovery feature is especially important here. In a 24/7 environment, the gateway must survive not just input failures but its own restarts (OS updates, process crashes, hardware maintenance). Vajra Cast’s process adoption system detects running FFmpeg processes after a restart and reconnects to them without interrupting the output streams.
Remote Production (REMI)
Remote production moves the production control room away from the venue. Camera feeds are sent over IP to a central facility where switching, graphics, and distribution happen. This model relies entirely on reliable transport, and failover is the safety net.
Typical configuration:
- Primary: SRT from each camera encoder at the venue
- Backup: SRTLA bonded cellular as a secondary path per camera
- Return feed: SRT back to the venue for IFB (interruptible foldback) and confidence monitoring
In REMI workflows, every camera is an independent failover chain. Vajra Cast handles this by creating separate routes for each camera, each with its own priority chain and health monitoring. The diagram view in the UI makes it straightforward to visualize and manage dozens of routes simultaneously. For real-world REMI deployment strategies, including Starlink connectivity, see our remote production with SRT guide.
Monitoring and Alerting for Failover Events
Failover that you cannot observe is failover you cannot trust. Effective monitoring has three layers:
Real-Time Dashboard
Vajra Cast’s web interface shows the status of every input in every route:
- Green: healthy, active
- Yellow: connected but degraded (high loss, low bitrate)
- Red: disconnected or failed
- Active indicator showing which input in the priority chain is currently feeding the output
The diagram view provides a visual map of all routes, with real-time status overlays on every connection.
Prometheus Metrics
Vajra Cast exposes 50+ metrics via a /metrics endpoint compatible with Prometheus. Failover-related metrics include:
vajracast_input_status{route="sports_main", input="primary"} 1
vajracast_input_status{route="sports_main", input="backup1"} 1
vajracast_failover_events_total{route="sports_main"} 3
vajracast_failover_last_timestamp{route="sports_main"} 1707523200
vajracast_input_bitrate_bps{route="sports_main", input="primary"} 8500000
vajracast_input_packet_loss{route="sports_main", input="primary"} 0.002
These metrics can be graphed in Grafana (pre-built dashboards are included) and used to trigger alerts via Alertmanager. For example: “Alert if any route has executed more than 2 failover events in the past hour.”
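To make the example alert concrete, here is a rough Python sketch that parses metric lines like the ones above and compares two scrapes taken an hour apart. The helper names are hypothetical, the regex handles only simple labelled lines of this shape, and counter resets are ignored. In a Prometheus deployment the same check would normally be written as an alerting rule using `increase()` over a one-hour range rather than as custom code.

```python
import re
from typing import Dict, List, Tuple

SAMPLE = '''\
vajracast_input_status{route="sports_main", input="primary"} 1
vajracast_failover_events_total{route="sports_main"} 3
vajracast_input_packet_loss{route="sports_main", input="primary"} 0.002
'''

LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')   # name{labels} value
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')              # key="value" pairs

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse simple labelled metric lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(labels)), float(value)))
    return samples


def failover_counts(samples: List[Sample]) -> Dict[str, float]:
    """Map each route to its failover counter value."""
    return {labels["route"]: value for name, labels, value in samples
            if name == "vajracast_failover_events_total"}


def routes_to_alert(hour_ago: Dict[str, float], now: Dict[str, float],
                    limit: float = 2) -> List[str]:
    """Routes whose counter grew by more than `limit` between the two scrapes."""
    return [route for route, count in now.items()
            if count - hour_ago.get(route, 0.0) > limit]


current = failover_counts(parse_metrics(SAMPLE))
print(routes_to_alert({"sports_main": 0.0}, current))
```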
Event Logging and Webhooks
Every failover event is logged with:
- Timestamp
- Route name
- Source input (which failed)
- Target input (which took over)
- Reason (timeout, packet loss threshold, bitrate drop, manual switch)
- Duration on backup before recovery
This log is invaluable for post-event analysis. If failover triggered during a broadcast, you can trace exactly what happened, when, and why.
Best Practices for Configuring Failover
1. Use Independent Network Paths
If your primary and backup inputs share the same network switch, ISP, or cable run, a single network failure takes out both. True redundancy requires independent paths:
- Different ISPs for primary and backup
- Different physical network interfaces
- Different cable runs (separate conduit)
- For cellular backup, different carriers
2. Test Your Failover Regularly
A failover system that has never been tested is not a failover system. It is a hope. Schedule regular failover drills:
- Pull the primary encoder’s network cable during a test stream
- Kill the encoder process and measure switch time
- Inject packet loss using network simulation tools (tc netem on Linux) to test threshold detection
- Verify that auto-recovery works when the primary comes back
Test under load. Failover behavior can differ when the system is handling 50 routes versus 2.
3. Tune Your Thresholds
Default thresholds are a starting point. Tune them based on your specific environment:
- Timeout too aggressive (e.g., 50ms): causes false failovers on momentary network jitter
- Timeout too conservative (e.g., 5 seconds): viewers see 5 seconds of broken video before the switch
- Recommended starting point: 200-500ms timeout, 10% packet loss threshold, 50% bitrate floor
Monitor your failover event log. If you see frequent failovers followed by immediate recovery, your thresholds are too aggressive.
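One way to spot that pattern is to measure how many failovers were followed by a quick recovery. The sketch below assumes the event log has been exported into records with the fields listed earlier. The `FailoverEvent` class, the 5-second cutoff, and the sample data are illustrative assumptions, not product defaults.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailoverEvent:
    """Hypothetical record mirroring the logged failover fields."""
    route: str
    reason: str
    seconds_on_backup: float   # "duration on backup before recovery"


def short_lived_ratio(events: Iterable[FailoverEvent], short_s: float = 5.0) -> float:
    """Fraction of failovers that recovered in under short_s seconds."""
    events = list(events)
    if not events:
        return 0.0
    short = sum(1 for e in events if e.seconds_on_backup < short_s)
    return short / len(events)


log = [
    FailoverEvent("sports_main", "timeout", 1.2),
    FailoverEvent("sports_main", "timeout", 0.8),
    FailoverEvent("sports_main", "packet loss threshold", 2.5),
    FailoverEvent("sports_main", "timeout", 340.0),
]
print(f"{short_lived_ratio(log):.0%} of failovers recovered within 5s")
```

A high ratio suggests the source was never really down and that the timeout or loss threshold deserves a wider margin.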
4. Always Have a Static Fallback
The last input in your priority chain should be something that cannot fail: a static slate image, a pre-recorded loop, or a “we’ll be right back” card served from local storage. This guarantees that even in a catastrophic scenario where all live sources fail, viewers see something intentional rather than a broken player.
5. Monitor Your Backup Sources
A backup source that is offline when you need it is worthless. Hot standby monitoring is not just about readiness. It is about continuously validating that the backup is healthy. Vajra Cast monitors all inputs in a priority chain equally, whether they are active or on standby. If your backup goes down, you know immediately, not when the primary fails and the backup fails to take over.
6. Plan for Gateway-Level Redundancy
Failover protects against input failure. But what about gateway failure? For the highest reliability, run two Vajra Cast instances:
- Primary gateway handles all production routes
- Secondary gateway mirrors the configuration and can take over via DNS failover or load balancer health checks
- Both instances can use the same Docker/Kubernetes deployment infrastructure
How Vajra Cast Compares to Other Failover Solutions
| Feature | Vajra Cast | Hardware Switcher | Cloud Failover (AWS) | Manual Switching |
|---|---|---|---|---|
| Switching speed | <200ms | <50ms (frame-accurate) | 2-10s | 5-30s (human reaction) |
| Protocol support | SRT, RTMP, SRTLA, UDP, HTTP | SDI/HDMI only | RTMP, HLS | Any |
| Inputs per chain | Unlimited | 2-4 (hardware dependent) | Varies | N/A |
| Monitoring | Built-in + Prometheus | Typically minimal | CloudWatch | None |
| Cost | Software license | $5,000-$50,000+ | Per-minute compute | Labor cost |
| Remote management | Full web UI + REST API | Limited or none | AWS Console/API | Physical presence |
| Scalability | 50+ routes per instance | 1 route per device | Elastic but expensive | Not scalable |
Hardware switchers excel at frame-accurate switching for SDI workflows but cannot handle IP-based multi-protocol environments. Cloud solutions introduce latency and per-minute costs that add up fast. Manual switching is inherently unreliable because it depends on a human being awake, alert, and fast.
Vajra Cast occupies the middle ground: software-defined, IP-native, multi-protocol, and automated, at a fraction of the cost of hardware or cloud alternatives.
Putting It All Together
A complete failover setup in Vajra Cast follows this structure:
- Define your route: one output destination (e.g., SRT push to CDN)
- Add primary input: your main encoder, highest priority
- Add backup inputs: in priority order, each on an independent path
- Add a static fallback: lowest priority, guaranteed availability
- Configure health thresholds: timeout, packet loss, bitrate floor
- Set recovery behavior: auto-recover with hold-off timer, or manual
- Connect monitoring: Prometheus scraping, Grafana dashboards, alerting
- Test everything: simulate failures before going live
With this configuration, your stream is protected against encoder failure, network outage, protocol issues, and even complete venue connectivity loss. The system handles it all automatically, silently, and reliably.
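The checklist can also be expressed as a pre-flight check on a route definition. The schema below is purely hypothetical (Vajra Cast's real configuration format is not shown in this guide). The defaults reuse the starting points suggested earlier: a timeout in the 200-500ms range, a 10% packet loss threshold, a 50% bitrate floor, and a 10-second hold-off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InputSpec:
    name: str
    protocol: str
    priority: int
    static_fallback: bool = False


@dataclass
class RouteSpec:
    """Hypothetical route definition; field names are illustrative only."""
    output: str
    inputs: List[InputSpec]
    timeout_ms: int = 300
    max_packet_loss: float = 0.10
    bitrate_floor: float = 0.50
    auto_recover: bool = True
    hold_off_s: float = 10.0


def check_route(route: RouteSpec) -> List[str]:
    """Return a list of checklist violations (empty means the route looks sane)."""
    problems: List[str] = []
    ordered = sorted(route.inputs, key=lambda i: i.priority)
    if len(ordered) < 2:
        problems.append("no backup input configured")
    if len({i.priority for i in ordered}) != len(ordered):
        problems.append("duplicate priorities")
    if not ordered or not ordered[-1].static_fallback:
        problems.append("lowest-priority input is not a static fallback")
    if route.auto_recover and route.hold_off_s <= 0:
        problems.append("auto-recover enabled without a hold-off timer")
    return problems


route = RouteSpec(
    output="SRT push to CDN",
    inputs=[
        InputSpec("main encoder", "SRT", 1),
        InputSpec("backup encoder", "SRT", 2),
        InputSpec("slate", "HTTP/TS", 3, static_fallback=True),
    ],
)
print(check_route(route) or "route passes the checklist")
```

A check like this catches configuration mistakes only. It does not replace the failure drills described above.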
For a step-by-step setup guide, see SRT Streaming Setup: From Zero to Production. For the broader architecture of stream routing and distribution, continue to Live Stream Routing: The Complete Guide.
Next Steps
- SRT Streaming Gateway: the complete guide to SRT-based video infrastructure
- Video Failover Best Practices: shorter, tactical guide to failover configuration
- SRT vs RTMP: understand the protocol trade-offs that affect failover performance
- Live Stream Routing: how to route, split, and manage video signals across your infrastructure
Frequently Asked Questions
What is video stream failover?
Video stream failover is an automatic mechanism that switches to a backup video source when the primary source fails, ensuring continuous streaming without interruption.
How fast should failover switching be?
Professional broadcast failover should switch in under 500ms. Vajra Cast achieves sub-50ms switchover by pre-buffering backup sources in hot standby, with total end-to-end failover (including detection) under 200ms.
Can I have multiple backup sources?
Yes. Vajra Cast supports an unlimited number of backup sources in a priority chain. Each source is independently monitored with configurable health thresholds.
Does failover work with different protocols?
Absolutely. Vajra Cast can fail over between any combination of SRT, RTMP, SRTLA, UDP, and HTTP sources. Protocol-agnostic failover means maximum flexibility.