Video Stream Failover: Best Practices for Zero-Downtime Broadcasting

Why Failover Matters

In live broadcasting, a dropped stream isn’t just a technical issue. It’s lost audience, lost revenue, and damaged reputation. From a sports event with 50,000 viewers to a corporate town hall with 500 employees, the expectation is the same: it must not go down.

Video stream failover is the safety net that catches your broadcast when the primary feed fails.

What is Video Failover?

Failover is the automatic switching from a primary video input to a backup when the system detects a failure. A good failover system:

  • Detects failure fast: milliseconds, not seconds
  • Switches cleanly: minimal visual disruption for viewers
  • Recovers automatically: returns to the primary when it’s healthy again
  • Requires no manual intervention: the whole point is automation

Architecture: Redundant Inputs

The foundation of any failover setup is redundant inputs. You need at least two independent paths:

Active/Standby

The simplest model. One input is active; the other is a hot standby:

Primary SRT → [Gateway] → Output
Backup RTMP → [Gateway] ↗ (on failure)
  • Primary carries the stream
  • Backup is connected and ready but not used
  • On primary failure, gateway switches to backup
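
The switching rule above can be sketched in a few lines. This is only an illustration of the active/standby decision, not any particular gateway's implementation; the class name ActiveStandbySwitch and the two boolean health flags are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ActiveStandbySwitch:
    """Tracks which of two inputs feeds the output (hypothetical model)."""

    active: str = "primary"

    def update(self, primary_ok: bool, backup_ok: bool) -> str:
        if primary_ok:
            # Prefer the primary whenever it is healthy.
            self.active = "primary"
        elif backup_ok:
            # Fall over only if the backup is itself usable.
            self.active = "backup"
        # If both inputs are down, keep the current selection.
        return self.active


if __name__ == "__main__":
    switch = ActiveStandbySwitch()
    for primary_ok, backup_ok in [(True, True), (False, True), (True, True)]:
        print(primary_ok, backup_ok, "->", switch.update(primary_ok, backup_ok))
```

Note that this sketch returns to the primary the moment it looks healthy again. A production setup would normally wait for the primary to stay stable for a while before switching back.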

Active/Active

Both inputs carry the stream simultaneously. The gateway selects the best one:

Input A (SRT) → [Gateway: compare] → Best signal → Output
Input B (SRT) → [Gateway: compare] ↗
  • Both paths are monitored in real-time
  • Gateway can switch based on quality, not just connectivity
  • More bandwidth cost, but higher reliability
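
As a rough sketch of quality-based selection, the snippet below scores each input and only leaves the current one when another input is clearly better. The scoring weights, the margin, and the names InputStats, quality_score and pick_best are all assumptions for illustration; a real gateway may weigh its metrics quite differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InputStats:
    name: str
    connected: bool
    packet_loss_pct: float  # percent of packets lost in the last window
    bitrate_kbps: float     # measured bitrate in the last window


def quality_score(stats: InputStats, target_kbps: float) -> float:
    """Higher is better. The weights here are illustrative assumptions."""
    if not stats.connected:
        return float("-inf")
    bitrate_ratio = min(stats.bitrate_kbps / target_kbps, 1.0)
    return bitrate_ratio * 100.0 - stats.packet_loss_pct * 10.0


def pick_best(inputs: Iterable[InputStats], current: Optional[str],
              target_kbps: float, margin: float = 5.0) -> Optional[str]:
    """Pick the best input, leaving `current` only if beaten by `margin`."""
    scores = {s.name: quality_score(s, target_kbps) for s in inputs}
    best = max(scores, key=scores.get)
    if scores[best] == float("-inf"):
        return None  # no usable input at all
    if current in scores and scores[best] - scores[current] < margin:
        return current  # not enough improvement to justify a switch
    return best
```

The margin acts as hysteresis: without it, two inputs of nearly equal quality could cause the output to flip back and forth between them.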

Detection: How Fast Can You React?

The speed of failover depends on how quickly you detect the problem. Common detection methods:

Stream Health Monitoring

Monitor the incoming stream for:

  • Packet loss: SRT reports this in real-time
  • Bitrate drops: sudden bitrate decrease often precedes a full failure
  • Black/frozen frames: content-aware detection (advanced)
  • Audio silence: loss of audio signal
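
The four checks above can be combined into a single health evaluation. The following is a minimal sketch: the field names and the default limits are assumptions chosen for the example, and how the raw measurements are obtained (SRT statistics, a frame comparator, an audio level meter) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthSample:
    packet_loss_pct: float
    bitrate_kbps: float
    frozen_ms: int  # how long the picture has been unchanged
    silent_ms: int  # how long audio has been below the silence floor


@dataclass(frozen=True)
class Thresholds:
    # Illustrative defaults only; tune these per stream.
    max_loss_pct: float = 5.0
    min_bitrate_kbps: float = 1000.0
    max_frozen_ms: int = 1000
    max_silent_ms: int = 2000


def failed_checks(sample: HealthSample,
                  limits: Thresholds = Thresholds()) -> List[str]:
    """Return the names of all checks the sample fails (empty if healthy)."""
    problems = []
    if sample.packet_loss_pct > limits.max_loss_pct:
        problems.append("packet_loss")
    if sample.bitrate_kbps < limits.min_bitrate_kbps:
        problems.append("bitrate_drop")
    if sample.frozen_ms > limits.max_frozen_ms:
        problems.append("frozen_video")
    if sample.silent_ms > limits.max_silent_ms:
        problems.append("audio_silence")
    return problems
```

Returning the list of failed checks, rather than a single boolean, also gives the event log a reason to record for each switch.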

Timeouts

Set aggressive but realistic timeouts:

Detection Method   | Typical Timeout | Notes
SRT packet loss    | <50ms           | SRT statistics report instantly
TCP disconnect     | 1-5s            | TCP timeout dependent
Bitrate threshold  | 200-500ms       | Configurable window
Content analysis   | 500ms-2s        | Compute intensive
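
To make the "configurable window" for the bitrate threshold concrete, here is one possible sliding-window monitor. It is a sketch under assumptions: the class name BitrateWindow, the 300ms window and the 1000 kbps floor are example values, and timestamps are passed in explicitly so the logic stays easy to test.

```python
from collections import deque


class BitrateWindow:
    """Sliding-window bitrate estimate over the last `window_ms` milliseconds."""

    def __init__(self, window_ms: int = 300, floor_kbps: float = 1000.0):
        self.window_ms = window_ms
        self.floor_kbps = floor_kbps
        self._packets = deque()  # (timestamp_ms, size_bytes)
        self._bytes = 0

    def add(self, now_ms: int, size_bytes: int) -> None:
        """Record a received packet."""
        self._packets.append((now_ms, size_bytes))
        self._bytes += size_bytes
        self._evict(now_ms)

    def _evict(self, now_ms: int) -> None:
        # Drop packets that have aged out of the window.
        while self._packets and self._packets[0][0] <= now_ms - self.window_ms:
            _, size = self._packets.popleft()
            self._bytes -= size

    def bitrate_kbps(self, now_ms: int) -> float:
        self._evict(now_ms)
        # bits per millisecond is numerically equal to kilobits per second
        return self._bytes * 8 / self.window_ms

    def below_floor(self, now_ms: int) -> bool:
        return self.bitrate_kbps(now_ms) < self.floor_kbps
```

A shorter window reacts faster but is more easily fooled by bursty encoders; a longer one is steadier but adds to detection time. During the first window after start-up the estimate is low by construction, so a monitor like this should only be trusted once a full window has elapsed.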

The 50ms Target

Professional broadcast equipment targets sub-50ms failover. This means:

  1. Failure detected within 20ms
  2. Switch command issued within 10ms
  3. Output buffer absorbs the transition within 20ms

At 50ms, the switch spans only 1-2 video frames at 25-30 fps (about 3 frames at 60 fps), which viewers rarely notice.
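
The arithmetic behind the budget is simple enough to check directly. The short script below adds up the three stages listed above and converts the total into frames at a few common frame rates; the frame rates chosen are just examples.

```python
# Latency budget from the three stages above, in milliseconds.
BUDGET_MS = {"detect": 20, "switch": 10, "buffer": 20}


def frames_spanned(duration_ms: float, fps: float) -> float:
    """Number of video frames that elapse during `duration_ms`."""
    return duration_ms * fps / 1000.0


total_ms = sum(BUDGET_MS.values())

if __name__ == "__main__":
    print(f"total budget: {total_ms} ms")
    for fps in (25, 30, 50, 60):
        print(f"{fps} fps: {frames_spanned(total_ms, fps):.2f} frames")
```

At 25 and 30 fps the 50ms budget covers 1.25 and 1.5 frames; at 50 and 60 fps it covers 2.5 and 3 frames, so high-frame-rate productions have less slack per frame.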

Implementation Patterns

Pattern 1: Gateway-Level Failover

The gateway itself handles failover logic. This is the simplest and most reliable approach.

Vajra Cast implements this natively:

  • Configure primary and backup inputs
  • Set detection thresholds (packet loss %, bitrate floor, timeout)
  • The gateway switches automatically and logs every event
  • When primary recovers, it switches back (configurable)
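
To illustrate what a configurable switch-back might involve, here is a generic sketch of failback with a hold-off period. It is not Vajra Cast's implementation or configuration format; FailbackController, hold_off_ms and auto_failback are hypothetical names used only for this example.

```python
from typing import Optional


class FailbackController:
    """Fail over immediately, but return to primary only after it has been
    continuously healthy for `hold_off_ms` (hypothetical model)."""

    def __init__(self, hold_off_ms: int = 5000, auto_failback: bool = True):
        self.hold_off_ms = hold_off_ms
        self.auto_failback = auto_failback
        self.active = "primary"
        self._primary_ok_since: Optional[int] = None

    def update(self, now_ms: int, primary_ok: bool, backup_ok: bool = True) -> str:
        if not primary_ok:
            # Any failure restarts the stability clock.
            self._primary_ok_since = None
            if self.active == "primary" and backup_ok:
                self.active = "backup"
            return self.active

        if self._primary_ok_since is None:
            self._primary_ok_since = now_ms

        if self.active == "backup":
            stable_for = now_ms - self._primary_ok_since
            held_long_enough = self.auto_failback and stable_for >= self.hold_off_ms
            # Also return early if the backup itself has failed.
            if held_long_enough or not backup_ok:
                self.active = "primary"
        return self.active
```

The hold-off prevents a flapping primary from dragging the output back and forth. Setting auto_failback to False in this sketch models the "stay on backup until an operator decides" policy.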

Pattern 2: Encoder-Level Redundancy

Run two encoders independently, each sending to the gateway:

Camera → Encoder A → SRT → Gateway
Camera → Encoder B → SRT → Gateway (backup)

This protects against encoder failure, not just network failure.

Pattern 3: Geographic Redundancy

For mission-critical broadcasts, distribute across locations:

Venue Encoder → SRT → Gateway (Region A)
Venue Encoder → SRT → Gateway (Region B) [failover]

Both gateways output to the CDN. CDN-level origin failover then provides the final layer of protection.

Monitoring and Alerts

Failover without monitoring is flying blind. Set up:

  1. Real-time dashboards: visualize all input health metrics simultaneously
  2. Automated alerts: get notified when failover activates (Slack, email, webhook)
  3. Event logging: timestamp every switch event for post-mortem analysis
  4. Recovery notifications: know when the primary is back and stable
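
For event logging and alerts, a structured record per switch makes post-mortems much easier. The sketch below builds such a record and serializes it as one JSON line; the field names are assumptions, and actually delivering the payload (to Slack, email, or a webhook) is deliberately left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_event(kind: str, from_input: str, to_input: str, reason: str,
               when: Optional[datetime] = None) -> dict:
    """Build a failover or recovery event with a UTC timestamp.

    The schema is illustrative, not a standard.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "event": kind,  # e.g. "failover" or "recovery"
        "from": from_input,
        "to": to_input,
        "reason": reason,
        "timestamp": when.isoformat(timespec="milliseconds"),
    }


def to_log_line(event: dict) -> str:
    """Serialize one event as a single JSON line for an append-only log."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(to_log_line(make_event("failover", "primary", "backup", "packet_loss")))
```

The same dictionary can serve as the body of a webhook request, so the log and the alert never disagree about what happened.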

Testing Your Failover

Never trust a failover system you haven’t tested. Test regularly:

  • Scheduled drills: pull the primary cable during a test stream
  • Network simulation: inject packet loss with tools like tc (Linux traffic control, using its netem qdisc) to find the point where SRT retransmission stops coping and the failover threshold takes over
  • Encoder failure: kill the encoder process and measure switch time
  • Recovery testing: verify the system returns to primary after a failure
  • Load testing: confirm failover works under peak output conditions
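
For the network simulation drill, the helper below builds tc netem command lines for a sweep of loss levels. It only constructs the commands and does not run them; the interface name eth0 and the loss percentages are assumptions, and running the resulting commands requires Linux and root privileges.

```python
from typing import List


def netem_loss_cmd(iface: str, loss_pct: float) -> List[str]:
    """Command that adds random packet loss on `iface` using tc netem."""
    if not 0 <= loss_pct <= 100:
        raise ValueError("loss_pct must be between 0 and 100")
    return ["tc", "qdisc", "add", "dev", iface, "root",
            "netem", "loss", f"{loss_pct:g}%"]


def netem_clear_cmd(iface: str) -> List[str]:
    """Command that removes the root qdisc, restoring normal behaviour."""
    return ["tc", "qdisc", "del", "dev", iface, "root"]


if __name__ == "__main__":
    iface = "eth0"  # assumed interface carrying the primary feed
    for loss in (1, 5, 10, 20):
        print(" ".join(netem_loss_cmd(iface, loss)))
        print(" ".join(netem_clear_cmd(iface)))
```

Stepping through increasing loss levels shows where SRT recovery alone keeps the picture clean and where the gateway decides to switch, which is exactly the boundary the thresholds should be tuned around.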

Common Mistakes

  1. Single point of failure in the switch itself: if your failover device fails, everything fails. Use a proven, hardened gateway.
  2. Backup feed not monitored: your backup might be dead when you need it. Monitor both inputs at all times.
  3. Too-aggressive timeouts: switching on momentary packet loss creates unnecessary disruption. Tune your thresholds.
  4. No automatic recovery: manual “switch back to primary” means someone has to be awake at 3 AM.
  5. Not testing: the first time your failover fires shouldn’t be during a live event.
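
One common way to avoid the too-aggressive-timeout mistake is to require several consecutive bad samples before declaring a failure. The sketch below shows the idea; the class name Debouncer and the default of three samples are assumptions, and the right count depends on how often health is sampled.

```python
class Debouncer:
    """Report failure only after `trip_after` consecutive bad samples."""

    def __init__(self, trip_after: int = 3):
        self.trip_after = trip_after
        self._bad_streak = 0

    def observe(self, sample_ok: bool) -> bool:
        """Feed one health sample; return True once the input counts as failed."""
        # A single good sample resets the streak.
        self._bad_streak = 0 if sample_ok else self._bad_streak + 1
        return self._bad_streak >= self.trip_after
```

With samples every 10ms and trip_after set to 3, a single lost burst is ignored while a sustained outage is still flagged within roughly 30ms.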

The Vajra Cast Advantage

Vajra Cast was designed with failover as a core feature, not an afterthought:

  • Multi-input failover with configurable priority chains
  • Sub-50ms switching on SRT inputs
  • Real-time health monitoring with per-input metrics
  • Automatic recovery with configurable hold-off timers
  • Full event logging for every failover event
  • Protocol-agnostic: failover works across SRT, RTMP, and HLS inputs