
Monitoring

Dubby exposes Prometheus metrics, structured logs, and audit trails out of the box. This page covers how to collect and use them. For the built-in admin dashboard UI, see Dashboard.

The server exposes a /metrics endpoint in Prometheus text format. Metrics are enabled by default and can be disabled with the DUBBY_METRICS_ENABLED=false environment variable.
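For example, to turn metrics off on Kubernetes, set the variable in the server container spec (standard Kubernetes `env` syntax; the surrounding Deployment fields are omitted here):

```yaml
# Container env fragment — disables the /metrics endpoint
env:
  - name: DUBBY_METRICS_ENABLED
    value: "false"
```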

Point your Prometheus instance at the server pod on port 3000:

prometheus.yml
scrape_configs:
  - job_name: dubby
    scrape_interval: 15s
    static_configs:
      - targets: ['dubby-server:3000']

On Kubernetes with the Prometheus Operator, add pod annotations instead:

server:
  podAnnotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '3000'
    prometheus.io/path: '/metrics'

Or create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dubby
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dubby
      app.kubernetes.io/component: server
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Streaming

| Metric | Type | Description |
| --- | --- | --- |
| dubby_streaming_active_sessions | Gauge | Current active playback sessions |
| dubby_streaming_concurrent_users | Gauge | Distinct users with active sessions |
| dubby_streaming_sessions_created_total | Counter | Total sessions by type (direct_play / transcode) |
| dubby_streaming_session_duration_seconds | Histogram | Session duration |
| dubby_streaming_transcode_startup_seconds | Histogram | Time to first segment |
| dubby_streaming_seek_duration_seconds | Histogram | Seek operation latency |
| dubby_streaming_ffmpeg_processes | Gauge | Active FFmpeg processes |
| dubby_streaming_playback_tier_total | Counter | Sessions by playback mode tier |
| dubby_streaming_errors_total | Counter | Errors by type (ffmpeg_crash, segment_timeout, etc.) |
| dubby_streaming_bytes_served_total | Counter | Total bytes delivered to clients |

Workflows

| Metric | Type | Description |
| --- | --- | --- |
| dubby_workflow_active | Gauge | Currently running workflows by type |
| dubby_workflow_queue_depth | Gauge | Job queue depth by queue name |
| dubby_workflow_completed_total | Counter | Completed workflows by type and status |
| dubby_workflow_duration_seconds | Histogram | Workflow total duration |
| dubby_workflow_step_duration_seconds | Histogram | Individual step duration |

API

| Metric | Type | Description |
| --- | --- | --- |
| dubby_api_requests_total | Counter | Total requests by router/procedure/status |
| dubby_api_request_duration_seconds | Histogram | Request duration |
| dubby_api_errors_total | Counter | Errors by router and error code |

System

| Metric | Type | Description |
| --- | --- | --- |
| dubby_system_memory_bytes | Gauge | Process memory by type (rss, heap_used, heap_total) |
| dubby_system_uptime_seconds | Gauge | Process uptime |
| dubby_system_cpu_seconds_total | Gauge | CPU time by type (user, system) |
| dubby_system_transcode_cache_bytes | Gauge | Transcode cache disk usage |

Library and metadata

| Metric | Type | Description |
| --- | --- | --- |
| dubby_library_items_total | Gauge | Library items by content type |
| dubby_library_storage_bytes | Gauge | Total library storage |
| dubby_metadata_requests_total | Counter | External metadata API requests by provider and status |

Build Grafana dashboards on top of the metrics above. Some useful panels:

  • Active streams: dubby_streaming_active_sessions
  • Transcode startup P95: histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m]))
  • API error rate: rate(dubby_api_errors_total[5m])
  • Workflow queue depth: dubby_workflow_queue_depth — a growing queue means the worker is falling behind
  • Cache disk usage: dubby_system_transcode_cache_bytes — alert if approaching your disk limit
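A useful derived panel is the API error ratio, which combines two of the counters listed above into a single fraction (a sketch; adjust the rate window to your scrape interval):

```promql
sum(rate(dubby_api_errors_total[5m])) / sum(rate(dubby_api_requests_total[5m]))
```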
# Prometheus alerting rules
groups:
  - name: dubby
    rules:
      - alert: DubbyDown
        expr: up{job="dubby"} == 0
        for: 1m
        annotations:
          summary: Dubby server is down
      - alert: DubbyHighTranscodeStartup
        expr: histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: Transcode startup P95 exceeds 10 seconds
      - alert: DubbyWorkflowQueueBacklog
        expr: dubby_workflow_queue_depth > 50
        for: 10m
        annotations:
          summary: Workflow queue depth has exceeded 50 for 10 minutes
      - alert: DubbyCacheDiskHigh
        expr: dubby_system_transcode_cache_bytes > 50e9
        for: 5m
        annotations:
          summary: Transcode cache exceeds 50 GB

Dubby uses structured logging (pino) with configurable level.

| Environment variable | Default | Options |
| --- | --- | --- |
| LOG_LEVEL | info | debug, info, warn, error |

Log format is determined by NODE_ENV: JSON in production, pretty-printed with color in development. There is no separate format toggle.

In production (NODE_ENV=production), each log line is a single JSON object suitable for ingestion by Loki, Elasticsearch, Datadog, or any structured log collector:

{
  "level": "info",
  "time": 1710000000000,
  "service": "streaming",
  "sessionId": "abc123",
  "msg": "session created"
}
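Any JSON-aware tool can consume these lines directly. As a quick illustration (Python here purely as an example consumer), the sample line above parses like this — note that pino's `time` field is Unix epoch milliseconds, not seconds:

```python
import json
from datetime import datetime, timezone

# One production log line, as in the sample above
line = '{"level":"info","time":1710000000000,"service":"streaming","sessionId":"abc123","msg":"session created"}'

entry = json.loads(line)
# pino timestamps are epoch milliseconds, so divide by 1000
ts = datetime.fromtimestamp(entry["time"] / 1000, tz=timezone.utc)
print(f'{ts.isoformat()} [{entry["level"]}] {entry["service"]}: {entry["msg"]}')
# → 2024-03-09T16:00:00+00:00 [info] streaming: session created
```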

In development, logs are human-readable with color (via pino-pretty).

Dubby logs to stdout, so any Kubernetes log collector works:

  • Loki + Promtail/Alloy — scrape pod logs by label
  • Fluentd / Fluent Bit — parse JSON logs natively
  • Datadog Agent — auto-discovers pod logs

No sidecar is needed. Example Promtail scrape config:

scrape_configs:
  - job_name: dubby
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: dubby
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
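Once ingested into Loki, log lines can be filtered at query time with LogQL — a sketch, assuming a namespace label is mapped from pod metadata (the Promtail scrape config above would need a corresponding relabel rule to attach it):

```logql
{namespace="dubby"} | json | level="error"
```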
| Level | What it includes |
| --- | --- |
| error | Unrecoverable failures (crashed FFmpeg, database errors) |
| warn | Recoverable issues (retry, timeout, missing metadata) |
| info | Key lifecycle events (session created, scan started, migration applied) |
| debug | Verbose internals (FFmpeg args, SQL queries, segment timing) |
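To capture the verbose internals listed under debug, raise LOG_LEVEL on the server container — for example with standard Kubernetes `env` syntax (the surrounding Deployment fields are omitted):

```yaml
# Container env fragment — enables debug-level logging
env:
  - name: LOG_LEVEL
    value: debug
```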

Dubby records security-relevant events to an audit log stored in the database. Audit logs are accessible to admins via the API and the admin UI.

| Action | When it’s recorded |
| --- | --- |
| login_attempt | Every login (success or failure), with email and IP |
| user_create | New user account created |
| user_delete | User account deleted |
| config_change | Server configuration modified |
| privacy_change | Privacy settings updated |
| data_access | User data accessed |
| data_export | User data exported (GDPR) |
| data_delete | User data deleted (GDPR) |
| external_request | Outbound request to external service |

REST API:

# List recent audit logs (admin only)
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?limit=50"

# Filter by action
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?action=login_attempt"

# Aggregate stats
curl -H "Authorization: Bearer <token>" \
  https://dubby.example.com/api/v1/audit-logs/stats

Audit log retention is configurable in the privacy settings:

config:
  privacy:
    auditRetentionDays: 90 # delete entries older than 90 days

Set to null (the default) to retain indefinitely.

The server exposes health check endpoints used by Kubernetes probes and external monitoring:

| Endpoint | Purpose | Checks |
| --- | --- | --- |
| GET /health | Basic health | Returns { status: "ok" } |
| GET /health/live | Liveness probe | Always returns { status: "alive" } unless the process is hanging |
| GET /health/ready | Readiness probe | Verifies database connectivity with SELECT 1 |

The readiness endpoint returns HTTP 503 during startup (before migrations complete) and during graceful shutdown, preventing traffic from reaching a pod that isn’t ready to serve requests.

These endpoints do not require authentication.
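Wiring these into Kubernetes probes uses the standard probe fields — a sketch, assuming the server listens on port 3000 as in the metrics examples above (adjust timings to your environment):

```yaml
# Pod container fragment — probe wiring for the health endpoints
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
```

Because /health/ready returns 503 until migrations finish, the readiness probe keeps the pod out of Service endpoints during startup without any extra configuration.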