Answers to all your questions, quickly and clearly
Skyportal is production observability for AI workloads. When a workload regresses in production, its agent SARA finds what changed across code, config, models, runtime, and infrastructure on one timeline, ships the fix as a pull request you approve, and proves it held by re-running the workload on staging — before you promote it to production.
Skyportal is for teams running serious AI and ML on real compute: LLM serving, model training, and classical ML on self-hosted or GPU infrastructure. If you run production AI workloads and need to know what changed when something breaks, and need to fix it safely, Skyportal is built for you.
SARA is Skyportal's diagnosis agent. She builds one causal timeline across code, configuration, model versions, runtime, and infrastructure, identifies the root cause of a regression, and proposes the fix as a pull request you review and approve. SARA is read-only first and never changes production on her own.
SARA pulls the before-and-after of an incident onto a single timeline — deploys, config changes, model versions, run history, and GPU and host telemetry — and ranks the likely causes from most to least probable, with the evidence for each. Instead of stitching the story together across tabs, you get one timeline and a ranked answer.
No. Skyportal never changes production on its own; it is read-only first. Every fix is an approval-gated pull request, verified by re-running your workload on staging before anything ships. Nothing reaches production until you promote it.
SARA opens the fix as a pull request in your GitHub repo, after checking its blast radius. Your team reviews and merges it, and your existing GitOps — push-based (GitHub Actions) or pull-based (Argo) — ships it. It's a real code change in your workflow, not a suggestion in a chat window.
Skyportal re-runs your actual workload on staging and checks that the regression is gone (for example, p95 latency back under SLO). If the fix doesn't hold, Skyportal reverts it and works down to the next likely cause until the workload passes. Fixes are proven on your workload, not on a synthetic benchmark.
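To make the verify-and-fall-back loop concrete, here is a minimal sketch in Python. It is not Skyportal's implementation: the function names (`p95`, `first_fix_that_holds`), the callbacks, the fix labels, and the 250 ms SLO are all hypothetical and chosen only for illustration. It assumes the staging re-run returns a list of request latencies in milliseconds and that "passing" means the nearest-rank p95 is below the SLO.

```python
from typing import Callable, List, Optional, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def first_fix_that_holds(
    ranked_fixes: Sequence[str],
    rerun_on_staging: Callable[[str], Sequence[float]],  # hypothetical hook
    revert: Callable[[str], None],                       # hypothetical hook
    slo_ms: float,
) -> Optional[str]:
    """Try candidate fixes from most to least likely cause.

    Returns the first fix whose staging re-run brings p95 latency
    under the SLO, or None if no candidate passes.
    """
    for fix in ranked_fixes:
        latencies = rerun_on_staging(fix)
        if p95(latencies) < slo_ms:
            return fix
        revert(fix)  # the fix did not hold; undo it and try the next cause
    return None


# Illustrative data only: canned latencies stand in for real staging runs.
fake_runs = {
    "revert-batch-size": [120.0] * 90 + [900.0] * 10,  # tail still slow
    "pin-model-version": [110.0] * 100,                # regression gone
}
reverted: List[str] = []
winner = first_fix_that_holds(
    list(fake_runs), fake_runs.__getitem__, reverted.append, slo_ms=250.0
)
print(winner, reverted)
```

In this toy run the first candidate fails the p95 check and is reverted, and the second passes. A real check would likely compare several metrics rather than a single latency percentile.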
APM and infra AIOps tools watch your infrastructure and can auto-remediate with runbooks, but they don't touch your code, config, or model lineage — and they can't verify a fix on your workload. Skyportal connects the change that broke production to the workload it broke and proves the fix on staging.
LLM-observability tools trace prompts, evaluations, and quality drift at the application layer, then stop — they don't fix anything. Skyportal works across code, config, models, runtime, and infra, and ships and verifies the fix.
Skyportal hooks into Kubernetes, Slurm, MLflow, Weights & Biases, GitHub (reads code and deploy history, opens PRs), and your existing GitOps (Argo or GitHub Actions). GPU and host telemetry comes from NVIDIA DCGM, Prometheus, and OpenTelemetry.
Skyportal is framework-agnostic — vLLM, TensorRT-LLM, SGLang, PyTorch, XGBoost, and others — with no per-framework integration and no SDK in your serving path. It reads from the systems your stack already emits to and operates at the run, config, and infra layer.
A workload is one monitored service or pipeline — an inference endpoint, a serving cluster, or a recurring training job. Skyportal is priced per workload, not per seat: Free ($0), Pro ($99/month), Teams ($599/month), and Enterprise (from $24,000/year). Pro includes 1 seat (add up to 3 at $100/seat/month); Teams includes 5.
On Teams and up, inference runs on Azure-hosted OpenAI and Claude, isolated in an enterprise boundary and never used to train a model. On Free and Pro, it runs through the OpenAI and Anthropic APIs under their standard commercial terms. Enterprise can run a dedicated backend or a fully self-hosted model in your own environment.
Yes, on Enterprise. You can run a dedicated backend or a fully self-hosted model entirely in your own environment, with SSO, SCIM, custom roles, and an SLA.
Contact us if you have any other questions.