Real-Time Metrics: Prometheus and Grafana Monitoring

Why Monitoring Matters for Live Streaming

A live stream either works or it does not. There is no “deploy and forget.” Network conditions change, encoders misbehave, CDN ingest points go down, and hardware degrades over time. Without real-time monitoring, you find out about problems when your viewers complain, or worse, when they leave.

Vajra Cast includes a built-in metrics system that exposes stream-level, SRT protocol-level, and system-level measurements for your streaming infrastructure. These metrics integrate with the industry-standard Prometheus and Grafana stack, which gives you dashboards, alerting, and historical analysis.

Built-In Metrics

Vajra Cast exposes metrics at the /metrics endpoint in Prometheus exposition format:

http://your-server:8080/metrics

By default, the metrics endpoint requires no authentication (this is configurable). Prometheus scrapes the endpoint at a fixed interval (15 seconds is a common choice; Prometheus itself defaults to 1 minute) and stores the results as time-series data.
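
To make the exposition format concrete, here is a minimal Python sketch that parses a few lines of the kind the /metrics endpoint returns. The sample text and its label names (`route`, `input`) are assumptions for illustration, and the parser is deliberately simplified; it is not a full implementation of the format. For real use, the `prometheus_client` package ships a complete parser.

```python
import re

# Hypothetical sample of /metrics output; the label names are assumptions.
SAMPLE = '''\
# HELP vajracast_input_bitrate_bps Current input bitrate in bits per second
# TYPE vajracast_input_bitrate_bps gauge
vajracast_input_bitrate_bps{route="main",input="1"} 5000000
vajracast_input_connected{route="main",input="1"} 1
vajracast_routes_active 3
'''

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(raw_value)))
    return samples


for name, labels, value in parse_exposition(SAMPLE):
    print(name, labels, value)
```

Each non-comment line is one sample: a metric name, an optional set of labels in braces, and a numeric value.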

Stream Metrics

Per-route, per-input, and per-output metrics:

Metric	Type	Description
`vajracast_input_bitrate_bps`	Gauge	Current input bitrate in bits per second
`vajracast_output_bitrate_bps`	Gauge	Current output bitrate in bits per second
`vajracast_input_packets_total`	Counter	Total packets received per input
`vajracast_output_packets_total`	Counter	Total packets sent per output
`vajracast_input_connected`	Gauge	1 if input is connected, 0 if not
`vajracast_output_connected`	Gauge	1 if output is connected, 0 if not
`vajracast_failover_switches_total`	Counter	Total failover switch events per route
`vajracast_active_input_priority`	Gauge	Priority number of the currently active input
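
The Counter metrics above only ever increase, so they are usually charted as a per-second rate rather than as a raw total. The sketch below shows the idea behind that conversion in Python, using hypothetical scrape values. It is a simplified model: the PromQL `rate()` function also extrapolates to the edges of the query window, which this version does not attempt.

```python
def per_second_rate(samples):
    """Approximate a counter's per-second rate from (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example after a
    restart), in which case the new value counts as the increase.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes of vajracast_input_packets_total, taken 10 s apart.
scrapes = [(0, 1000), (10, 6000), (20, 11000)]
print(per_second_rate(scrapes))  # 500.0
```

Gauges such as `vajracast_input_bitrate_bps` need no such conversion, because each sample is already a current value.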

SRT-Specific Metrics

For SRT inputs and outputs, additional protocol-level metrics:

Metric	Type	Description
`vajracast_srt_rtt_ms`	Gauge	Round-trip time in milliseconds
`vajracast_srt_packet_loss_percent`	Gauge	Current packet loss percentage
`vajracast_srt_retransmit_rate`	Gauge	Retransmission rate
`vajracast_srt_recv_buffer_ms`	Gauge	Receive buffer level in milliseconds
`vajracast_srt_send_buffer_ms`	Gauge	Send buffer level in milliseconds
`vajracast_srt_bandwidth_mbps`	Gauge	Estimated available bandwidth in megabits per second

System Metrics

Infrastructure-level measurements:

Metric	Type	Description
`vajracast_cpu_usage_percent`	Gauge	Application CPU usage
`vajracast_memory_usage_bytes`	Gauge	Application memory usage
`vajracast_gpu_encoder_usage_percent`	Gauge	Hardware encoder utilization
`vajracast_routes_active`	Gauge	Number of active routes
`vajracast_uptime_seconds`	Gauge	Application uptime

All metrics include labels for route name, input/output ID, and protocol, allowing you to filter and aggregate across your entire deployment.

Prometheus Configuration

Add Vajra Cast to your Prometheus scrape configuration:

# prometheus.yml
scrape_configs:
  - job_name: 'vajracast'
    scrape_interval: 10s
    static_configs:
      - targets: ['vajracast-host:8080']
        labels:
          environment: 'production'
          location: 'studio-a'

For multiple Vajra Cast instances, add each as a target:

    static_configs:
      - targets:
          - 'vajracast-studio-a:8080'
          - 'vajracast-studio-b:8080'
          - 'vajracast-remote:8080'

If you are running Vajra Cast in Docker or Kubernetes, use service discovery instead of static targets. Prometheus supports Docker and Kubernetes service discovery natively.
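
If neither Docker nor Kubernetes discovery fits your setup, Prometheus also offers file-based discovery (`file_sd_configs`), which reads target groups from JSON or YAML files and picks up changes without a restart. The Python sketch below generates such a JSON document from a host list. The host names are taken from the example above; the function name, the label, and the idea of generating the file from a script are assumptions for illustration.

```python
import json


def build_file_sd(hosts, port=8080, labels=None):
    """Build a file_sd target group list for the given Vajra Cast hosts."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


hosts = ["vajracast-studio-a", "vajracast-studio-b", "vajracast-remote"]
document = build_file_sd(hosts, labels={"environment": "production"})

# Write this output to a file and list that file under file_sd_configs.
print(json.dumps(document, indent=2))
```

Prometheus then reads the file named under `file_sd_configs` in the scrape job, in place of `static_configs`.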

Grafana Dashboards

Grafana transforms raw Prometheus metrics into visual dashboards. A well-designed streaming dashboard shows you the health of your entire infrastructure at a glance.

Recommended Dashboard Panels

Overview row:

Total active routes (single stat)
Total connected inputs / total inputs (ratio)
Total connected outputs / total outputs (ratio)
System CPU and memory usage (gauges)

Per-route row (repeated for each route):

Input bitrate over time (graph)
Output bitrate over time (graph)
Active input indicator (showing which priority level is active)
Failover event markers (annotations on the graph)

SRT health row:

RTT over time per SRT connection (graph)
Packet loss percentage per SRT connection (graph)
Receive buffer level (graph, should stay below 80% of configured latency)
Retransmission rate (graph)
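
The 80% guideline for the receive buffer is simple arithmetic: with a configured SRT latency of 120 ms, the buffer level should stay below 96 ms. A minimal Python sketch of that check follows. The function and parameter names are hypothetical, and the configured latency is assumed to come from your own SRT settings, since it is not one of the metrics listed earlier.

```python
def recv_buffer_within_budget(buffer_ms, configured_latency_ms, max_fraction=0.8):
    """Return True if the receive buffer level is below the allowed share
    of the configured SRT latency (80% by default)."""
    if configured_latency_ms <= 0:
        raise ValueError("configured_latency_ms must be positive")
    return buffer_ms < max_fraction * configured_latency_ms


# Hypothetical readings of vajracast_srt_recv_buffer_ms with 120 ms latency.
print(recv_buffer_within_budget(90, 120))   # True
print(recv_buffer_within_budget(100, 120))  # False
```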

Transcoding row (if using hardware transcoding):

GPU encoder utilization (graph)
Encoding FPS vs. target FPS (should always be at or above target)
Output quality metrics (if VMAF is enabled)

Example PromQL Queries

Average input bitrate across all routes, in Mbps (the metric itself is in bits per second):

avg(vajracast_input_bitrate_bps) / 1e6

Maximum packet loss across all SRT connections:

max(vajracast_srt_packet_loss_percent)

Failover events per route in the last hour:

increase(vajracast_failover_switches_total[1h])

Routes where the active input is not priority 1 (failover is active):

vajracast_active_input_priority > 1
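
The same queries can be run outside Grafana through the Prometheus HTTP API, whose `/api/v1/query` endpoint evaluates an instant query and returns JSON. The Python sketch below builds the request URL and reads a response without making a network call. The Prometheus address and the sample response are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def vector_results(response_text):
    """Turn an instant-query response body into (labels, value) pairs."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample value is [unix_timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical response for: vajracast_active_input_priority > 1
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "vajracast_active_input_priority",
                   "route": "main-production"},
        "value": [1700000000, "2"]
      }
    ]
  }
}
"""

print(instant_query_url("http://prometheus:9090", "vajracast_active_input_priority > 1"))
for labels, value in vector_results(SAMPLE_RESPONSE):
    print(labels.get("route"), value)
```

An empty `result` list means no route is currently running on a backup input.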

Alerting

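Both Prometheus and Grafana support alerting rules. With Prometheus, the server evaluates the rules and Alertmanager handles grouping and delivery of the resulting notifications. For live streaming, set up alerts for conditions that require immediate attention.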

Critical Alerts (Page Someone)

# prometheus-rules.yml
groups:
  - name: vajracast-critical
    rules:
      - alert: StreamDisconnected
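        # Match inputs to outputs by route only; their ID labels differ, so a bare "and" would never match
        expr: vajracast_input_connected == 0 and on (route) vajracast_output_connected == 1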
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Input disconnected on route {{ $labels.route }}"

      - alert: AllInputsDown
        expr: sum by (route) (vajracast_input_connected) == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "All inputs down on route {{ $labels.route }}"

Warning Alerts (Notify the Team)

# prometheus-rules.yml (continued, under the same "groups:" key)
  - name: vajracast-warning
    rules:
      - alert: FailoverActive
        expr: vajracast_active_input_priority > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Failover active on route {{ $labels.route }}, using priority {{ $value }}"

      - alert: HighPacketLoss
        expr: vajracast_srt_packet_loss_percent > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "SRT packet loss above 5% on {{ $labels.route }}"

      - alert: LowBitrate
        expr: vajracast_input_bitrate_bps < 500000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Input bitrate below 500kbps on route {{ $labels.route }}"

Route these alerts to Slack, PagerDuty, email, or any other notification channel through Alertmanager.
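
The `for:` durations in the rules above decide how long a condition must hold before anyone is notified. An alert is pending while its expression is true, starts firing once it has been true for the full duration, and resets as soon as the expression turns false. The Python sketch below is a simplified model of that behavior, not Prometheus code; the evaluation timestamps are hypothetical.

```python
def first_firing_time(evaluations, for_seconds):
    """Return the timestamp at which an alert would start firing, or None.

    `evaluations` is a time-ordered list of (timestamp, condition_is_true)
    pairs, one per rule evaluation.
    """
    pending_since = None
    for timestamp, condition in evaluations:
        if not condition:
            pending_since = None  # condition cleared, so the timer resets
            continue
        if pending_since is None:
            pending_since = timestamp  # alert enters the pending state
        if timestamp - pending_since >= for_seconds:
            return timestamp  # held long enough: pending becomes firing
    return None


# Hypothetical HighPacketLoss evaluations every 30 s with "for: 2m".
# Loss exceeds 5% for 60 s, recovers at t=90, then stays high from t=120.
evaluations = [(0, True), (30, True), (60, True), (90, False),
               (120, True), (150, True), (180, True), (210, True), (240, True)]
print(first_firing_time(evaluations, for_seconds=120))  # 240
```

The brief recovery at 90 seconds restarts the timer, which is why a longer `for:` value filters out short blips at the cost of slower notification.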

VMAF Quality Analysis

For transcoded streams, bitrate alone does not tell you whether the output looks good. VMAF (Video Multimethod Assessment Fusion) is Netflix’s open-source video quality metric that correlates closely with human perception.

Vajra Cast can compute VMAF scores in real time by comparing the transcoded output against the original input:

VMAF score range: 0-100 (higher is better)
93+: Excellent. Visually indistinguishable from the source
80-93: Good. Minor differences visible only in side-by-side comparison
70-80: Fair. Noticeable artifacts under scrutiny
Below 70: Poor. Visible quality loss
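
The bands above can be expressed as a small lookup, which is handy when you label dashboard panels or reports. In this Python sketch the function name is hypothetical, and a score that falls exactly on a shared boundary (93 or 80) is assumed to belong to the higher band, since the list does not say which band owns it.

```python
def vmaf_band(score):
    """Map a VMAF score (0-100) to the quality band described above."""
    if not 0 <= score <= 100:
        raise ValueError("VMAF scores range from 0 to 100")
    if score >= 93:
        return "excellent"
    if score >= 80:
        return "good"
    if score >= 70:
        return "fair"
    return "poor"


print(vmaf_band(94.2))  # excellent
print(vmaf_band(89.7))  # good
```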

VMAF scores are exposed as Prometheus metrics:

vajracast_vmaf_score{route="main-production", profile="1080p"} 94.2
vajracast_vmaf_score{route="main-production", profile="720p"} 89.7

Alert on VMAF dropping below your quality threshold to catch transcoding issues before they affect viewer experience.

Note: Real-time VMAF computation requires additional CPU resources. Enable it selectively on routes where quality verification is critical.

Next Steps

Return to the Broadcast Streaming Software Guide for the complete feature overview
Learn about Hardware Transcoding to optimize the streams you are monitoring
Explore the REST API for programmatic access to metrics and status
