Architecture: Observability¶

Operational control is part of the product.

What we observe¶

Security signals (invalid payloads, rate limit violations, impossible actions)
Match health (server perf, player ping distribution, disconnect reasons)
Economy integrity (duplicate grants, receipt mismatches)
UX health (client FPS tiers, input device mix)

Correlation model¶

Every log line and event should carry these correlation fields:

serverId
placeId
jobId
matchId (if applicable)
playerUserId (when relevant)
requestId (for remote calls)
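
As a rough illustration of the field list above, the correlation context could be modelled as a typed record that is merged into every event. This is a minimal sketch in plain TypeScript: `CorrelationContext`, `EventFields`, and `withCorrelation` are hypothetical names, not part of @rbx/observability, and the field types (string vs. number) are assumptions.

```typescript
// Hypothetical shape for the correlation fields; the types are assumptions.
interface CorrelationContext {
  serverId: string;
  placeId: number;
  jobId: string;
  matchId?: string; // only when the event belongs to a match
  playerUserId?: number; // only when a player is involved
  requestId?: string; // only for remote calls
}

type EventFields = Record<string, string | number | boolean>;

// Merge event-specific fields with the shared context.
// Context values are written last, so an event cannot overwrite them.
function withCorrelation(ctx: CorrelationContext, fields: EventFields): EventFields {
  const out: EventFields = { ...fields };
  out["serverId"] = ctx.serverId;
  out["placeId"] = ctx.placeId;
  out["jobId"] = ctx.jobId;
  // Optional fields are attached only when they are set.
  if (ctx.matchId !== undefined) out["matchId"] = ctx.matchId;
  if (ctx.playerUserId !== undefined) out["playerUserId"] = ctx.playerUserId;
  if (ctx.requestId !== undefined) out["requestId"] = ctx.requestId;
  return out;
}
```

Writing the context last is a deliberate choice in this sketch: a payload that happens to contain a `serverId` key cannot spoof the real one.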

Event taxonomy¶

Implemented in @rbx/observability and @rbx/analytics:

security.* (violations, detectors)
match.* (start/end, team composition)
economy.* (grants, receipts)
ops.* (publishes, promotions, config changes)
client.* (device tier, fps bucket)
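
One way to keep event names inside the taxonomy above is to validate the category prefix before emitting. The sketch below is plain TypeScript and assumes dotted names such as `security.invalid_payload`; `CATEGORIES` and `parseEventName` are hypothetical helpers, not the actual @rbx/observability API.

```typescript
// The five category prefixes from the taxonomy.
const CATEGORIES = ["security", "match", "economy", "ops", "client"] as const;
type Category = (typeof CATEGORIES)[number];

interface ParsedEventName {
  category: Category;
  action: string;
}

// Split "category.action" at the first dot.
// Returns undefined for unknown categories or malformed names.
function parseEventName(name: string): ParsedEventName | undefined {
  const dot = name.indexOf(".");
  if (dot <= 0 || dot === name.length - 1) return undefined;
  const prefix = name.slice(0, dot);
  const category = CATEGORIES.find((c) => c === prefix);
  if (category === undefined) return undefined;
  return { category, action: name.slice(dot + 1) };
}
```

Rejecting unknown prefixes at emit time keeps ad-hoc categories from silently entering dashboards.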

Actionability rules¶

If an event can’t trigger a decision (alert, rollback, ban review), it’s noise.
Sampling is allowed for high-volume client events.
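
Sampling can be made deterministic so that a given player is consistently in or out of the sample. The following is a minimal sketch in plain TypeScript, assuming a percentage-based keep rate bucketed on the user ID; `inSample`, `shouldEmit`, and `estimateTotal` are hypothetical names, and only `client` events are sampled, as stated above.

```typescript
// Deterministic bucket: the same player always lands in the same bucket (0..99).
function inSample(playerUserId: number, keepPercent: number): boolean {
  return playerUserId % 100 < keepPercent;
}

// Only client events are sampled; every other category is always emitted.
function shouldEmit(category: string, playerUserId: number, keepPercent: number): boolean {
  if (category !== "client") return true;
  return inSample(playerUserId, keepPercent);
}

// Scale a sampled count back up to an estimate of the true count.
function estimateTotal(sampledCount: number, keepPercent: number): number {
  return keepPercent <= 0 ? 0 : (sampledCount * 100) / keepPercent;
}
```

Recording the keep rate alongside each sampled event lets a dashboard apply `estimateTotal` later, even if the rate changes between releases.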

Event output (phased)¶

Phase 1: Structured console logs¶

Events are logged to the console as structured JSON using emit from @rbx/observability:

import { emit } from "@rbx/observability";

// Emit a security event; field names follow the correlation model above
emit({ category: "security", action: "invalid_payload", remote: "Intent_DoAction", playerUserId: 123 });

Review process: Manual review via the Studio Output window or the Roblox Developer Console.

Phase 2: Aggregation service¶

Events are batched and sent to an external endpoint (via HttpService)
Dashboard ingests and indexes events
Basic dashboards for security signals and match health
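
The batching step could look roughly like the sketch below. It is plain TypeScript with the transport injected as a callback, so the HttpService call itself is out of scope here; `EventBatcher`, `maxBatchSize`, and `send` are hypothetical names rather than an existing API.

```typescript
type ObsEvent = Record<string, string | number | boolean>;

// Buffers events and hands them to `send` in groups of at most maxBatchSize.
class EventBatcher {
  private buffer: ObsEvent[] = [];
  private readonly maxBatchSize: number;
  private readonly send: (batch: ObsEvent[]) => void;

  constructor(maxBatchSize: number, send: (batch: ObsEvent[]) => void) {
    this.maxBatchSize = maxBatchSize;
    this.send = send;
  }

  // Queue one event; flush automatically once the batch is full.
  add(event: ObsEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  // Send whatever is buffered. Does nothing when the buffer is empty.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }
}
```

A real deployment would presumably also call `flush()` on a timer and on server shutdown so that a quiet server does not hold events indefinitely.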

Phase 3+: Full observability stack¶

Structured log sink (e.g., Axiom, Datadog, self-hosted)
Real-time dashboards
Alerting integration (PagerDuty, Slack, Discord)

Alerting strategy¶

P0 alerts (immediate response required)¶

Alert	Trigger	Response
Economy anomaly	Duplicate grant detected	Disable grants, investigate
Security spike	>100 invalid payloads/min from a single player	Auto-kick, flag for review
Server crash rate	>5% of servers crash in 10 min	Rollback release
Config failure	Config fetch fails for >50% of servers	Revert to defaults, investigate
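
The "Security spike" trigger in the table above can be evaluated with a per-player sliding window. This is a minimal sketch in plain TypeScript: `SpikeDetector` is a hypothetical class, and timestamps are passed in by the caller (in seconds) so the logic does not depend on any particular clock API.

```typescript
// Counts events per player inside a sliding time window.
class SpikeDetector {
  private readonly hits = new Map<number, number[]>();
  private readonly limit: number;
  private readonly windowSeconds: number;

  constructor(limit: number, windowSeconds: number) {
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  // Record one invalid payload at nowSeconds.
  // Returns true when the player has exceeded `limit` events within the window.
  record(playerUserId: number, nowSeconds: number): boolean {
    const cutoff = nowSeconds - this.windowSeconds;
    const previous = this.hits.get(playerUserId) ?? [];
    // Keep only timestamps that are still inside the window.
    const recent = previous.filter((t) => t > cutoff);
    recent.push(nowSeconds);
    this.hits.set(playerUserId, recent);
    return recent.length > this.limit;
  }
}
```

With `new SpikeDetector(100, 60)`, the 101st invalid payload from one player within 60 seconds returns `true`, which is the point at which the table's auto-kick and review flag would apply.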

P1 alerts (investigate within 1 hour)¶

Alert	Trigger	Response
Elevated error rate	>1% of remotes returning errors	Investigate, consider rollback
Matchmaking delays	Queue times >5 min for 10+ players	Check matchmaker health
Rate limit spikes	>10% of requests rate-limited	Check for abuse or misconfigured client

P2 alerts (review daily)¶

Alert	Trigger	Response
Unusual device mix	Device distribution shifts >20%	Verify client builds
Performance degradation	Server tick rate drops below 50 Hz	Profile and optimize
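
The simple numeric thresholds from the P1 and P2 tables above can be expressed as data-driven rules. The sketch below is plain TypeScript and covers only three of the rows; `MetricsSnapshot`, `AlertRule`, and `firingAlerts` are hypothetical names, and using total remote calls as the denominator for both ratios is an assumption.

```typescript
// Hypothetical per-interval metrics; field names are assumptions.
interface MetricsSnapshot {
  remoteCalls: number;
  remoteErrors: number;
  rateLimited: number;
  tickRateHz: number;
}

interface AlertRule {
  name: string;
  severity: "P1" | "P2";
  triggered: (m: MetricsSnapshot) => boolean;
}

// Avoid dividing by zero on idle servers.
function ratio(part: number, total: number): number {
  return total === 0 ? 0 : part / total;
}

const RULES: AlertRule[] = [
  { name: "Elevated error rate", severity: "P1", triggered: (m) => ratio(m.remoteErrors, m.remoteCalls) > 0.01 },
  { name: "Rate limit spikes", severity: "P1", triggered: (m) => ratio(m.rateLimited, m.remoteCalls) > 0.1 },
  { name: "Performance degradation", severity: "P2", triggered: (m) => m.tickRateHz < 50 },
];

// Return a label for every rule whose threshold is crossed.
function firingAlerts(m: MetricsSnapshot): string[] {
  return RULES.filter((r) => r.triggered(m)).map((r) => `${r.severity}: ${r.name}`);
}
```

Keeping thresholds in one table of rules makes it easier to review them against the alert tables when either changes.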

Phase 1 alerting (manual)¶

No automated alerting yet
Daily review of security event logs
Check for patterns: repeated offenders, unusual error spikes

Phase 2+ alerting¶

Integrate with a Discord webhook for P0 alerts
Dashboard shows alert history and acknowledgment
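
A P0 alert could be turned into a Discord webhook message body along these lines. This is a sketch in plain TypeScript that only builds the JSON-ready payload: `P0Alert` and `toDiscordPayload` are hypothetical names, the message layout is an assumption, and the actual HTTP call is left out.

```typescript
// Hypothetical alert shape, mirroring the columns of the P0 table.
interface P0Alert {
  name: string;
  trigger: string;
  response: string;
  serverId: string;
}

// Discord limits message content to 2000 characters.
const DISCORD_CONTENT_LIMIT = 2000;

// Build the body for a webhook message using the `content` field.
function toDiscordPayload(alert: P0Alert): { content: string } {
  const text =
    `[P0] ${alert.name} on ${alert.serverId}\n` +
    `Trigger: ${alert.trigger}\n` +
    `Response: ${alert.response}`;
  // Truncate overly long messages instead of letting the request fail.
  const content =
    text.length > DISCORD_CONTENT_LIMIT
      ? text.slice(0, DISCORD_CONTENT_LIMIT - 3) + "..."
      : text;
  return { content };
}
```

Putting the expected response in the message means whoever sees the alert first also sees the documented first action.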

Metrics to track (when available)¶

Server-side¶

Requests per second (by remote)
Error rate (by remote, by error code)
P50/P95/P99 latency (by remote)
Rate limit hit rate
Server memory and tick rate
Moderation sync propagation health (e.g. received count, decode errors, message age)
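
For the latency percentiles listed above, a nearest-rank computation over raw samples is one simple option. The sketch below is plain TypeScript and assumes latencies are collected as an in-memory array per remote; `percentile` is a hypothetical helper, and a production pipeline might prefer histograms to bound memory.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. `p` is in the range (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  // Sort a copy so the caller's array is left untouched.
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("p must be between 0 and 100");
  return value;
}
```

With ten samples `[10, 20, ..., 100]`, this returns 50 for P50 and 100 for both P95 and P99, which illustrates why high percentiles need a reasonably large sample count to be meaningful.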

Client-side (sampled)¶

FPS distribution
Ping distribution
Device class distribution
Client error rate

Business metrics¶

Concurrent players
Match completions
Economy transactions

Dashboard integration¶

Security events are aggregated into player “cases”.
Admin actions produce immutable audit records.
Rollouts and kill-switch toggles are logged.
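
Aggregating security events into player cases could be sketched as a simple group-by on the user ID. This is plain TypeScript with hypothetical types: `SecurityEvent`, `PlayerCase`, and `buildCases` are illustrative names, and the case fields shown are assumptions about what a reviewer would want to see.

```typescript
interface SecurityEvent {
  playerUserId: number;
  action: string;
  timestamp: number; // seconds; the unit is an assumption
}

interface PlayerCase {
  playerUserId: number;
  eventCount: number;
  actions: string[]; // distinct actions, in first-seen order
  lastSeen: number;
}

// Group events by player and order cases by event count, busiest first.
function buildCases(events: SecurityEvent[]): PlayerCase[] {
  const byPlayer = new Map<number, PlayerCase>();
  for (const e of events) {
    let c = byPlayer.get(e.playerUserId);
    if (c === undefined) {
      c = { playerUserId: e.playerUserId, eventCount: 0, actions: [], lastSeen: e.timestamp };
      byPlayer.set(e.playerUserId, c);
    }
    c.eventCount += 1;
    if (c.actions.indexOf(e.action) === -1) c.actions.push(e.action);
    if (e.timestamp > c.lastSeen) c.lastSeen = e.timestamp;
  }
  return Array.from(byPlayer.values()).sort((a, b) => b.eventCount - a.eventCount);
}
```

Sorting by event count puts repeated offenders at the top of the review queue.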