Your Questions. Answered.

Answers to all your questions, quickly and clearly

What is SkyPortal?

SkyPortal is an AI-native platform that simplifies how teams build, train, and manage machine learning models. Instead of making you juggle multiple platforms such as Weights & Biases, GitHub, cloud consoles, and observability tools, SkyPortal unifies everything in one environment. At its core, SkyPortal enables:

  • Multi-cloud management: connect to every host you use, on any cloud provider, and track server health, job runtime health, and experiment or use-case progress on each, with tagging and a single terminal interface
  • Job orchestration: easily launch and manage training jobs on cloud or on-prem GPUs
  • Observability: real-time metrics on accuracy, loss, epochs, GPU usage, and budget spend
  • Collaboration: centralized logging, experiment tracking, and reproducibility
  • AI agents: smart copilots that debug, optimize, and automate workflows for ML engineers and their teams

How do I use SkyPortal?

SkyPortal is designed to be simple to get started with, even if you've never set up MLOps tooling before.

  • Sign up and choose a plan — Free, Pro, Premium, or Enterprise.
  • Connect your environment — AWS, GCP, Azure, or Tier 3 GPU providers like RunPod and Vast.ai.
  • Upload your model/data or link from GitHub/MinIO/S3.
  • Launch training jobs directly from the SkyPortal dashboard or CLI.
  • Monitor results through real-time dashboards with accuracy, loss curves, and budget tracking.
  • Iterate quickly with AI agents that provide recommendations on hyperparameters, code fixes, and resource usage.

For enterprise teams, the platform supports private cloud deployment, role-based access control, and integration with internal data registries.

What are the common tasks this software helps with?

SkyPortal eliminates the friction in managing the ML lifecycle. Common tasks include:

  • Experiment Tracking: Automatically logs hyperparameters, code versions, datasets, and metrics.
  • Usage Monitoring: Tracks GPU hours, storage consumption, and job status.
  • Early Stopping & Budget Control: Stops jobs when they cross cost or performance thresholds.
  • Collaboration: Enables teams to share experiment histories, compare runs, and reproduce results.
  • Multi-Cloud Training: Supports training jobs across AWS, GCP, Azure, and independent GPU providers.
  • Observability at Scale: Provides dashboards with metrics like MAE, MSE, accuracy, throughput, and infrastructure usage.
  • Model & Data Versioning: Ensures every training job is reproducible and auditable.

In short, it helps MLEs, data scientists, and product teams focus on building better models instead of wrestling with infrastructure.

Is SkyPortal only for ML engineers?

No. While SkyPortal is built with machine learning engineers in mind, it's also useful for:

  • Non-ML Engineers: who manage multiple cloud hosts and want to do so in one place
  • Data Scientists: who want seamless training and easy experiment comparison
  • Product Managers: who need visibility into training progress, costs, and performance metrics
  • Engineering Leaders: who want cost transparency, reproducibility, and compliance
  • Enterprise IT/Ops Teams: who need a secure, scalable way to support AI initiatives

SkyPortal is not limited to hardcore MLEs — it empowers anyone involved in the AI product lifecycle to get value from observability, collaboration, and efficiency.

What's included in the subscription plans?

SkyPortal has four pricing tiers, based on usage and features:

Free Tier

  • Up to 3 hosts accessible
  • 10 GB of agent storage
  • 50 hours of observability tracking per month
  • 5 GB observability storage
  • Perfect for individuals and small experiments

Pro Tier – $40/month

  • Up to 20 hosts accessible
  • 50 GB of agent storage
  • 500 tracked hours/month (additional hours billed at $1/hour)
  • 50 GB observability storage (additional storage billed at $1/GB)
  • Great for small teams or early-stage startups

Premium Tier – $120/month

  • Up to 100 hosts accessible
  • 1 TB agent storage
  • 5,000 tracked hours/month (additional hours billed at $1/hour)
  • 1 TB observability storage (additional storage billed at $1/GB)
  • Designed for growing companies with intensive training needs

Enterprise Tier – Custom

  • Custom host limits, storage, and observability hours
  • Flexible billing and support for private cloud/on-prem deployments
  • Role-based access control, enhanced security, and custom integrations
  • Ideal for enterprises running mission-critical AI systems
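
To compare the paid tiers for a given workload, the quotas and overage rates listed above can be turned into a quick estimate. The Python sketch below is illustrative only and is not part of SkyPortal. The `Plan` class and `estimate_monthly_cost` function are hypothetical names, and the sketch assumes overage is billed linearly on top of the base price and that 1 TB counts as 1,000 GB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Quotas and prices copied from the tier list above."""
    name: str
    base_usd: float          # monthly subscription price
    tracked_hours: int       # included observability hours per month
    obs_storage_gb: int      # included observability storage in GB
    overage_per_hour: float  # price of each extra tracked hour
    overage_per_gb: float    # price of each extra GB of observability storage


PRO = Plan("Pro", 40.0, 500, 50, 1.0, 1.0)
# Assumption: 1 TB is treated as 1000 GB.
PREMIUM = Plan("Premium", 120.0, 5000, 1000, 1.0, 1.0)


def estimate_monthly_cost(plan: Plan, hours_used: float, storage_gb_used: float) -> float:
    """Base price plus linear overage on hours and storage (assumed billing model)."""
    extra_hours = max(0.0, hours_used - plan.tracked_hours)
    extra_gb = max(0.0, storage_gb_used - plan.obs_storage_gb)
    return (
        plan.base_usd
        + extra_hours * plan.overage_per_hour
        + extra_gb * plan.overage_per_gb
    )


print(estimate_monthly_cost(PRO, hours_used=620, storage_gb_used=50))      # 160.0
print(estimate_monthly_cost(PREMIUM, hours_used=620, storage_gb_used=50))  # 120.0
```

Under these assumptions, Pro and Premium cost the same at 580 tracked hours per month (with storage inside the Pro quota). Above that, Premium is the cheaper option.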

How does SkyPortal help with observability?

Observability is at the heart of SkyPortal. Every training job generates real-time metrics:

  • Performance metrics: loss, accuracy, MAE, MSE
  • System metrics: GPU utilization, CPU load, memory, I/O
  • Financial metrics: budget used, cost per epoch, overage warnings

SkyPortal automatically alerts users when thresholds are crossed, making it easy to stop wasteful jobs or troubleshoot failures. This prevents runaway costs and accelerates experimentation cycles.
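
The idea of a threshold alert can be sketched in a few lines of Python. This is a generic illustration, not SkyPortal's alerting API. The `Threshold` class, the `crossed` function, and the metric names (`budget_used_usd`, `gpu_utilization`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    direction: str  # "above" alerts when value > limit, "below" when value < limit


def crossed(thresholds: List[Threshold], snapshot: Dict[str, float]) -> List[str]:
    """Return one message per threshold that the metric snapshot has crossed."""
    alerts = []
    for t in thresholds:
        value = snapshot.get(t.metric)
        if value is None:
            continue  # metric not reported in this snapshot
        too_high = t.direction == "above" and value > t.limit
        too_low = t.direction == "below" and value < t.limit
        if too_high or too_low:
            alerts.append(f"{t.metric}={value} is {t.direction} {t.limit}")
    return alerts


# Hypothetical thresholds: a spend cap and a floor on GPU utilization.
THRESHOLDS = [
    Threshold("budget_used_usd", 100.0, "above"),
    Threshold("gpu_utilization", 0.30, "below"),
]
snapshot = {"budget_used_usd": 112.5, "gpu_utilization": 0.82, "loss": 0.41}

print(crossed(THRESHOLDS, snapshot))
```

Here only the budget threshold fires, because spend is over the cap while GPU utilization is still above its floor.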

How does SkyPortal save money?

SkyPortal is designed for cost-aware training:

  • Automatic Early Stopping: stops jobs when model performance plateaus
  • Overage Controls: alerts users before they exceed hours or storage limits
  • Multi-GPU Optimization: recommends efficient kernel launches and memory padding
  • Resource Allocation: ensures the right size GPU is used for the job, minimizing waste

Customers consistently save money by preventing overuse and making smarter choices about training resources.
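
For readers unfamiliar with early stopping, here is a minimal Python sketch of a common patience-based rule. It is not SkyPortal's actual implementation. The function name `should_stop` and the defaults for `patience` and `min_delta` are assumptions chosen for illustration.

```python
from typing import Sequence


def should_stop(val_losses: Sequence[float], patience: int = 3, min_delta: float = 1e-3) -> bool:
    """Return True when the last `patience` epochs did not beat the best
    earlier validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge a plateau
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta


plateaued = [0.90, 0.70, 0.60, 0.5995, 0.6000, 0.5998]
improving = [0.90, 0.70, 0.60, 0.50, 0.40, 0.30]

print(should_stop(plateaued))  # True
print(should_stop(improving))  # False
```

A larger `patience` tolerates noisier loss curves at the cost of a few extra epochs of compute, so the right value depends on the job.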

Does SkyPortal work with my existing tools?

Yes. SkyPortal integrates with:

  • ML frameworks: PyTorch, TensorFlow, Hugging Face
  • Data storage: S3, MinIO
  • Version control: GitHub
  • Experiment tracking tools: Weights & Biases (W&B)
  • Infrastructure: Kubernetes, AWS, GCP, Azure, RunPod, Vast.ai

This ensures you don't have to abandon existing workflows — you can extend them.

Is SkyPortal scalable?

Yes. SkyPortal supports both small teams and global enterprises. It handles:

  • Multi-user access with granular permissions
  • Large-scale training jobs with checkpointing and distributed GPU support
  • On-prem and private cloud deployment for security-conscious industries

How do I get started?

Getting started with SkyPortal is simple:

  • Create an account
  • Choose a plan that fits your needs
  • Connect your data and GPU resources
  • Launch your first training job
  • Monitor progress through the dashboard

Within minutes, you'll have an AI-native workflow for building, training, and scaling models.

Still have a question in mind?

Contact us if you have any other questions.
