Question 1

What is Skyportal?

Accepted Answer

Skyportal is production observability for AI workloads. When a workload regresses in production, its agent SARA finds what changed across code, config, models, runtime, and infrastructure on one timeline, ships the fix as a pull request you approve, and proves it held by re-running the workload on staging — before you promote it to production.

Question 2

Who is Skyportal for?

Accepted Answer

Teams running serious AI and ML on real compute: LLM serving, model training, and classical ML on self-hosted or GPU infrastructure. If you run production AI workloads and need to know what changed when something breaks — and fix it safely — Skyportal is built for you.

Question 3

What is SARA?

Accepted Answer

SARA is Skyportal's diagnosis agent. She builds one causal timeline across code, configuration, model versions, runtime, and infrastructure, identifies the root cause of a regression, and proposes the fix as a pull request you review and approve. SARA starts with read-only access and never changes production on her own.

Question 4

How does Skyportal diagnose a regression?

Accepted Answer

SARA pulls the before-and-after of an incident onto a single timeline — deploys, config changes, model versions, run history, and GPU and host telemetry — and ranks the likely causes from most to least probable, with the evidence for each. Instead of stitching the story together across tabs, you get one timeline and a ranked answer.
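To make the idea of "one timeline and a ranked answer" concrete, here is a minimal sketch in Python. It is not Skyportal's implementation: the `ChangeEvent` type, the `build_timeline` and `rank_causes` functions, the six-hour window, and the recency-based ranking are all hypothetical stand-ins chosen for illustration. The answer above does not say how SARA actually scores causes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One change pulled from a source system (hypothetical shape)."""
    at: datetime
    source: str   # e.g. "deploy", "config", "model", "gpu"
    summary: str


def build_timeline(*streams):
    """Merge per-source event lists into a single list ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.at)


def rank_causes(timeline, incident_at, window=timedelta(hours=6)):
    """Rank changes inside the window before the incident, most recent first.

    Recency is only a placeholder heuristic (an assumption for this sketch);
    a real ranking would also weigh the evidence attached to each change.
    """
    candidates = [
        e for e in timeline
        if incident_at - window <= e.at <= incident_at
    ]
    return sorted(candidates, key=lambda e: incident_at - e.at)
```

With a deploy three hours before an incident and a config change thirty minutes before it, this sketch would list the config change first and the deploy second, and ignore anything older than the window.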

Question 5

Does Skyportal change my production directly?

Accepted Answer

No. Skyportal is read-only first. Every fix is an approval-gated pull request, and it's verified by re-running your workload on staging before anything ships. Nothing reaches production until you promote it.

Question 6

How does Skyportal ship a fix?

Accepted Answer

SARA opens the fix as a pull request in your GitHub repo, after checking its blast radius. Your team reviews and merges it, and your existing GitOps — push-based (GitHub Actions) or pull-based (Argo) — ships it. It's a real code change in your workflow, not a suggestion in a chat window.

Question 7

How does Skyportal verify a fix?

Accepted Answer

It re-runs your actual workload on staging and checks that the regression is gone (for example, p95 latency back under SLO). If the fix doesn't hold, Skyportal reverts it and works down to the next most likely cause until the workload passes. Fixes are proven on your workload, not on a synthetic benchmark.
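The apply, re-run, check, and revert loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Skyportal's code: `verify_fixes` and its callback parameters (`apply_fix`, `revert_fix`, `run_on_staging`) are hypothetical names, the p95 uses the nearest-rank method, and "under SLO" is treated as less than or equal to the threshold.

```python
from typing import Callable, Iterable, Optional, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a set of latency samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def verify_fixes(
    candidates: Iterable[str],
    apply_fix: Callable[[str], None],
    revert_fix: Callable[[str], None],
    run_on_staging: Callable[[], Sequence[float]],
    slo_ms: float,
) -> Optional[str]:
    """Try candidate fixes in ranked order; return the first one that holds.

    All callbacks are hypothetical hooks supplied by the caller.
    """
    for candidate in candidates:
        apply_fix(candidate)
        latencies = run_on_staging()      # re-run the real workload on staging
        if p95(latencies) <= slo_ms:      # assumption: "under SLO" means <=
            return candidate              # fix held; ready for promotion
        revert_fix(candidate)             # fix did not hold; undo and move on
    return None                           # no candidate brought p95 under SLO
```

Returning `None` when every candidate fails keeps the "nothing ships unless it passes" behavior explicit for the caller.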

Question 8

How is Skyportal different from APM tools like Datadog or Dynatrace?

Accepted Answer

APM and infra AIOps tools watch your infrastructure and can auto-remediate with runbooks, but they don't touch your code, config, or model lineage — and they can't verify a fix on your workload. Skyportal connects the change that broke production to the workload it broke and proves the fix on staging.

Question 9

How is Skyportal different from LLM observability tools like Langfuse or Arize?

Accepted Answer

LLM-observability tools trace prompts, evaluations, and quality drift at the application layer, then stop — they don't fix anything. Skyportal works across code, config, models, runtime, and infra, and ships and verifies the fix.

Question 10

What does Skyportal integrate with?

Accepted Answer

Skyportal hooks into Kubernetes, Slurm, MLflow, Weights & Biases, GitHub (reads code and deploy history, opens PRs), and your existing GitOps (Argo or GitHub Actions). GPU and host telemetry comes from NVIDIA DCGM, Prometheus, and OpenTelemetry.

Question 11

Does it work with my serving framework? Do I need an SDK?

Accepted Answer

Skyportal is framework-agnostic — vLLM, TensorRT-LLM, SGLang, PyTorch, XGBoost, and others — with no per-framework integration and no SDK in your serving path. It reads from the systems your stack already emits to and operates at the run, config, and infra layer.

Question 12

What is a "workload" and how does pricing work?

Accepted Answer

A workload is one monitored service or pipeline — an inference endpoint, a serving cluster, or a recurring training job. Skyportal is priced per workload, not per seat: Free ($0), Pro ($99/month), Teams ($599/month), and Enterprise (from $24,000/year). Pro includes 1 seat (add up to 3 at $100/seat/month); Teams includes 5.

Question 13

Is my data used to train models?

Accepted Answer

On Teams and up, inference runs on Azure-hosted OpenAI and Claude, isolated in an enterprise boundary and never used to train a model. On Free and Pro, it runs through the OpenAI and Anthropic APIs under their standard commercial terms. Enterprise can run a dedicated backend or a fully self-hosted model in your own environment.

Question 14

Can I self-host Skyportal?

Accepted Answer

Yes — on Enterprise. You can run a dedicated backend or a fully self-hosted model entirely in your own environment, with SSO, SCIM, custom roles, and an SLA.
