Frequently Asked Questions

Answers to all your questions, quickly and clearly

What is Skyportal?

Skyportal is production observability for AI workloads. When a workload regresses in production, its agent SARA finds what changed across code, config, models, runtime, and infrastructure on one timeline, ships the fix as a pull request you approve, and proves it held by re-running the workload on staging — before you promote it to production.

Who is Skyportal for?

Teams running serious AI and ML on real compute: LLM serving, model training, and classical ML on self-hosted or GPU infrastructure. If you run production AI workloads and need to know what changed when something breaks — and fix it safely — Skyportal is built for you.

What is SARA?

SARA is Skyportal's diagnosis agent. She builds one causal timeline across code, configuration, model versions, runtime, and infrastructure, identifies the root cause of a regression, and proposes the fix as a pull request you review and approve. SARA is read-only first and never changes production on her own.

How does Skyportal diagnose a regression?

SARA pulls the before-and-after of an incident onto a single timeline — deploys, config changes, model versions, run history, and GPU and host telemetry — and ranks the likely causes from most to least probable, with the evidence for each. Instead of stitching the story together across tabs, you get one timeline and a ranked answer.

Does Skyportal change my production directly?

No. Skyportal is read-only first. Every fix is an approval-gated pull request, and it's verified by re-running your workload on staging before anything ships. Nothing reaches production until you promote it.

How does Skyportal ship a fix?

SARA checks the fix's blast radius, then opens it as a pull request in your GitHub repo. Your team reviews and merges it, and your existing GitOps pipeline, whether push-based (GitHub Actions) or pull-based (Argo), ships it. It's a real code change in your workflow, not a suggestion in a chat window.

How does Skyportal verify a fix?

It re-runs your actual workload on staging and checks that the regression is gone (for example, p95 latency back under SLO). If the fix doesn't hold, Skyportal reverts it and works down to the next likely cause until the workload passes. Fixes are proven on your workload, not on a synthetic benchmark.
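
For a rough picture of that loop, here is a minimal Python sketch. It is illustrative only and is not Skyportal's API: the names `verify_ranked_fixes`, `run_on_staging`, and `revert` are hypothetical, and the p95-under-SLO check is just the example criterion mentioned above.

```python
from typing import Callable, Optional, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def verify_ranked_fixes(
    candidates: Sequence[str],
    run_on_staging: Callable[[str], Sequence[float]],
    revert: Callable[[str], None],
    slo_ms: float,
) -> Optional[str]:
    """Try candidate fixes from most to least likely cause.

    run_on_staging (hypothetical) applies a fix on staging, re-runs the
    workload, and returns the observed latencies in milliseconds.
    revert (hypothetical) undoes a fix that did not hold.
    Returns the first fix whose p95 is back under the SLO, or None.
    """
    for fix in candidates:
        latencies = run_on_staging(fix)
        if p95(latencies) <= slo_ms:
            return fix  # regression is gone; ready for human promotion
        revert(fix)  # fix did not hold; move to the next likely cause
    return None
```

In this sketch nothing is promoted automatically: the function only reports which candidate passed on staging, which matches the approval-gated flow described above.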

How is Skyportal different from APM tools like Datadog or Dynatrace?

APM and infra AIOps tools watch your infrastructure and can auto-remediate with runbooks, but they don't touch your code, config, or model lineage — and they can't verify a fix on your workload. Skyportal connects the change that broke production to the workload it broke and proves the fix on staging.

How is Skyportal different from LLM observability like Langfuse or Arize?

LLM-observability tools trace prompts, evaluations, and quality drift at the application layer, then stop — they don't fix anything. Skyportal works across code, config, models, runtime, and infra, and ships and verifies the fix.

What does Skyportal integrate with?

Skyportal hooks into Kubernetes, Slurm, MLflow, Weights & Biases, GitHub (reads code and deploy history, opens PRs), and your existing GitOps (Argo or GitHub Actions). GPU and host telemetry comes from NVIDIA DCGM, Prometheus, and OpenTelemetry.

Does it work with my serving framework? Do I need an SDK?

Skyportal is framework-agnostic — vLLM, TensorRT-LLM, SGLang, PyTorch, XGBoost, and others — with no per-framework integration and no SDK in your serving path. It reads from the systems your stack already emits to and operates at the run, config, and infra layer.

What is a "workload" and how does pricing work?

A workload is one monitored service or pipeline — an inference endpoint, a serving cluster, or a recurring training job. Skyportal is priced per workload, not per seat: Free ($0), Pro ($99/month), Teams ($599/month), and Enterprise (from $24,000/year). Pro includes 1 seat (add up to 3 at $100/seat/month); Teams includes 5.

Is my data used to train models?

On Teams and up, inference runs on Azure-hosted OpenAI and Claude models, isolated in an enterprise boundary, and your data is never used to train a model. On Free and Pro, it runs through the OpenAI and Anthropic APIs under their standard commercial terms. Enterprise can run a dedicated backend or a fully self-hosted model in your own environment.

Can I self-host Skyportal?

Yes — on Enterprise. You can run a dedicated backend or a fully self-hosted model entirely in your own environment, with SSO, SCIM, custom roles, and an SLA.

Still have a question in mind?

Contact us if you have any other questions.
