Frequently Asked Questions

SkyPortal is an ML operations command center with an AI agent, SARA, that helps teams build, ship, and operate models across cloud and on-prem GPU fleets.

  • One workspace for fleet, environments, runs, and monitoring
  • Agent-driven triage: ask "why" when something regresses and get evidence-backed next steps
  • Workflow-first: define repeatable workflows instead of stitching together tools and scripts
  • Cross-cloud: keep workflows consistent even when compute is spread across providers

SkyPortal is for ML engineers, platform teams, and research groups that need one control plane to build, ship, and run models across mixed environments.

  • ML engineers: faster setup, reproducible runs, quicker triage
  • Platform teams: standardize environments and workflows across teams
  • Research groups: track experiments, compare runs, and preserve lineage and reproducibility

SARA is SkyPortal's AI agent that helps you diagnose issues, speed up setup, and keep ML systems reliable.

  • Answers operational questions using your workspace context
  • Proposes actions with approval gates before making changes
  • Helps with common tasks like environment setup, job orchestration, and troubleshooting

Having this context means SARA can reason across the full workflow rather than a single silo.

SARA can use context from:

  • Fleet and infrastructure telemetry (GPU, CPU, memory, I/O)
  • Environments and runtime details (drivers, CUDA, libraries, versions)
  • Code and configs (repo-linked workflows and parameters)
  • Run history (what changed between runs, comparisons, regressions)
  • Monitoring signals (performance shifts, anomalies, drift indicators)

SkyPortal reduces time lost to setup, context switching, and multi-tool debugging across the ML lifecycle.

Build:

  • Connect hosts and validate GPU runtimes
  • Set up environments that match your workload
  • Reproduce runs reliably across machines

Ship:

  • Promote environments (dev, staging, prod) to keep them consistent
  • Turn successful runs into repeatable workflows
  • Reduce "works on my machine" failures

Operate:

  • Monitor infra and run behavior in one place
  • Diagnose regressions faster with evidence across layers
  • Reduce GPU waste and improve reliability

You can start in minutes by connecting a host, linking a repo, and running your first workflow.

  • Create an account and workspace
  • Connect a host or fleet via SSH
  • Link a repo and choose an environment (dev, staging, prod)
  • Run a workflow and inspect run history and metrics
  • Ask SARA "why" when something looks off, then approve actions as needed

SkyPortal connects to your existing hosts and inventories runtime details so you can orchestrate and monitor from one place.

  • Connects cloud or on-prem hosts to a SkyPortal workspace
  • Detects GPU type, drivers, CUDA, runtime libraries, and health
  • Organizes hosts with tags, environments, and workspaces
  • Centralizes access and orchestration without forcing a full migration
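
As a rough illustration of what inventorying a host can involve, the sketch below parses the CSV output of `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`. It is not SkyPortal's detection mechanism. The `GpuInfo` type and `parse_gpu_inventory` helper are hypothetical names, and the sample line is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuInfo:
    name: str
    driver_version: str
    memory_total: str


def parse_gpu_inventory(csv_text: str) -> list[GpuInfo]:
    """Parse the output of:
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
    """
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each row has exactly three comma-separated fields.
        name, driver, memory = (field.strip() for field in line.split(","))
        gpus.append(GpuInfo(name, driver, memory))
    return gpus


# Illustrative sample, not captured from a real host.
sample = "NVIDIA A100-SXM4-40GB, 535.104.05, 40960 MiB\n"
inventory = parse_gpu_inventory(sample)
print(inventory[0].name, inventory[0].driver_version)
```

On a real host the same text could be collected over SSH and passed to the parser unchanged.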

SkyPortal supports AWS, GCP, Azure, NeoClouds, and on-prem GPU fleets.

  • Works with mixed environments in the same workspace
  • Keeps workflows consistent even when compute spans multiple providers
  • Supports teams that use both managed cloud GPUs and private clusters

Environments help teams keep dev, staging, and production consistent.

  • Separate settings for dev, staging, and prod
  • Promote and reuse configurations across environments
  • Reduce drift between training and serving runtimes
  • Make runs reproducible by capturing environment details

Experiment tracking and observability are included in SkyPortal so you can see run behavior and system health together.

SkyPortal captures:

  • Run metadata: parameters, configs, environment details
  • Model metrics: common training and evaluation curves
  • System metrics: GPU and CPU utilization, memory, I/O, job status
  • Run history: compare runs, identify what changed, track regressions
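
To make "identify what changed" concrete, here is a minimal sketch of comparing the captured metadata of two runs. It assumes each run's parameters and environment details are available as a flat dictionary. The `diff_runs` helper and the example keys are hypothetical and do not reflect SkyPortal's API.

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Return {key: (baseline_value, candidate_value)} for every key that differs.

    A key present in only one run is reported with None on the other side.
    """
    missing = object()  # sentinel so a stored None is not confused with "absent"
    changes = {}
    for key in sorted(baseline.keys() | candidate.keys()):
        old = baseline.get(key, missing)
        new = candidate.get(key, missing)
        if old != new:
            changes[key] = (
                None if old is missing else old,
                None if new is missing else new,
            )
    return changes


# Hypothetical run metadata for illustration only.
run_a = {"lr": 3e-4, "batch_size": 64, "cuda": "12.1"}
run_b = {"lr": 3e-4, "batch_size": 128, "cuda": "12.4", "seed": 7}
changed = diff_runs(run_a, run_b)
print(changed)
```

A diff like this narrows a regression to the handful of parameters or runtime versions that actually moved between two runs.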

SkyPortal works with existing experiment trackers and is built to preserve history and reduce switching costs.

  • Weights & Biases: import and view existing run history, then consolidate tracking in SkyPortal if you want
  • MLflow: connect MLflow logs so runs remain queryable while SARA can answer questions using that history (coming soon)
  • Neptune: guided migration path to SkyPortal (coming soon)

You do not need to replace your existing tools. SkyPortal is designed to fit into your current stack.

  • Keep your repo and version control workflows
  • Keep your cloud accounts and host provisioning approach
  • Import or connect existing experiment history instead of starting from zero
  • Consolidate what makes sense over time rather than forcing a rip-and-replace

SARA starts in read-only mode and requires explicit approval before write operations or access to sensitive files.

Read-only capabilities:

  • Explain anomalies using fleet telemetry and run history
  • Identify likely causes of regressions and underutilization
  • Summarize what changed between runs or environments

Actions that require approval:

  • Environment and runtime fixes (drivers, CUDA, dependencies)
  • Starting, stopping, and rerunning jobs
  • Applying configuration changes scoped to a workspace and environment
  • Read access to sensitive files
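
The approval model described above can be sketched as a simple gate: actions that need no approval run immediately, and everything else is refused until a person approves it. This is an illustrative pattern under assumed names (`ProposedAction`, `ApprovalRequired`, `execute`), not SARA's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    description: str
    needs_approval: bool  # True for writes and for reads of sensitive files


class ApprovalRequired(Exception):
    """Raised when a gated action is attempted without approval."""


def execute(action: ProposedAction, run: Callable[[], str], approved: bool = False) -> str:
    # Gated actions are blocked unless the caller passes explicit approval.
    if action.needs_approval and not approved:
        raise ApprovalRequired(action.description)
    return run()


summary = ProposedAction("Summarize changes between runs", needs_approval=False)
rerun = ProposedAction("Rerun job", needs_approval=True)

result = execute(summary, lambda: "summary ready")
try:
    execute(rerun, lambda: "job restarted")
    blocked = False
except ApprovalRequired:
    blocked = True
approved_result = execute(rerun, lambda: "job restarted", approved=True)
```

Defaulting `approved` to `False` means a caller has to opt in to every gated action, which mirrors the read-only starting point.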

Plans scale by fleet size, tracked usage, and team governance needs.

Free:

  • Up to 3 hosts
  • 10 GB agent storage
  • 50 tracked hours per month
  • 5 GB observability storage
  • Best for: individuals and small experiments

Pro ($40/month):

  • Up to 20 hosts
  • 50 GB agent storage
  • 500 tracked hours per month (additional hours at $1/hour)
  • 50 GB observability storage (additional storage at $1/GB)
  • Best for: small teams and early-stage startups

Teams ($120/user/month):

  • Up to 100 hosts
  • 1 TB agent storage
  • 5,000 tracked hours per month (additional hours at $1/hour)
  • 1 TB observability storage (additional storage at $1/GB)
  • Team collaboration features and stronger controls
  • Best for: teams running production workloads

Enterprise (custom):

  • Custom host limits, storage, and tracked usage
  • Private deployment options and custom security requirements
  • Advanced controls and custom integrations
  • Best for: mission-critical deployments
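
For a rough sense of how the listed prices combine, the sketch below estimates a monthly bill for the Pro and Teams plans. It rests on assumptions not stated above: overage applies only to usage beyond the included amounts, partial units are billed proportionally, and 1 TB is treated as 1000 GB. The `Plan` type and `estimate_monthly_cost` function are hypothetical, not an official calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    base_usd: float           # per month, or per user per month if per_user
    per_user: bool
    included_hours: float     # tracked hours included each month
    included_obs_gb: float    # observability storage included, in GB
    overage_hour_usd: float = 1.0
    overage_gb_usd: float = 1.0


PRO = Plan(base_usd=40, per_user=False, included_hours=500, included_obs_gb=50)
# Assumption: 1 TB is treated as 1000 GB.
TEAMS = Plan(base_usd=120, per_user=True, included_hours=5000, included_obs_gb=1000)


def estimate_monthly_cost(plan: Plan, tracked_hours: float, obs_gb: float, users: int = 1) -> float:
    base = plan.base_usd * (users if plan.per_user else 1)
    # Only usage above the included amounts is charged as overage.
    extra_hours = max(0.0, tracked_hours - plan.included_hours)
    extra_gb = max(0.0, obs_gb - plan.included_obs_gb)
    return base + extra_hours * plan.overage_hour_usd + extra_gb * plan.overage_gb_usd


pro_cost = estimate_monthly_cost(PRO, tracked_hours=620, obs_gb=55)
teams_cost = estimate_monthly_cost(TEAMS, tracked_hours=4000, obs_gb=800, users=5)
print(pro_cost, teams_cost)
```

Under these assumptions, a Pro workspace with 620 tracked hours and 55 GB of observability data comes to $40 + $120 + $5 = $165, and a five-user Teams workspace that stays within its included usage comes to $600.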

SkyPortal is not only for ML engineers. It is built for them, but it also supports the teams around them.

  • Data scientists: experiment comparison and reproducibility
  • Platform and ops teams: fleet visibility, standard environments, governance
  • Engineering leadership: cost visibility, operational reliability, auditability