AI / ML · Production Build

Tensora AI

Managed GPU platform for model training

Multi-region spot GPU scheduling with FSx for Lustre and MLflow — 45% cheaper compute and 99.2% pipeline reliability.

45%
GPU cost reduction
99.2%
Pipeline reliability
15 min
Job provisioning, down from days

The problem

Tensora's research team was burning $130K a month on GPU compute with effective utilisation under 55%, because nobody dared shut clusters down between runs. Multi-day waits for H100 capacity stalled experiment timelines, distributed PyTorch and DeepSpeed runs were failing 28% of the time, and there was no shared experiment-tracking story — reproducing a previous configuration meant spelunking through env vars and shell scripts.

What we shipped

A managed training platform that schedules jobs across spot P5 and P4d capacity in four regions, falling back through tiers based on real-time pricing and interruption data, and tearing clusters down within minutes of job completion. FSx for Lustre serves training data at 100+ GB/s with transparent S3 backing, EFA networking is tuned for NCCL on the P5 topology, and an MLflow registry plus a single CLI/API standardise every job from submission through SageMaker deployment.
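The tier-fallback idea can be sketched in a few lines: prefer the cheapest capacity pool whose recent interruption rate is tolerable, and fall back to the most stable pool when nothing qualifies. This is an illustrative sketch only; the tier names, prices, and threshold below are hypothetical, not Tensora's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class CapacityTier:
    name: str
    price_per_gpu_hour: float   # current spot price, USD
    interruption_rate: float    # fraction of recent jobs interrupted

def pick_tier(tiers, max_interruption=0.15):
    """Cheapest tier with an acceptable interruption rate;
    otherwise fall back to the most stable tier available."""
    viable = [t for t in tiers if t.interruption_rate <= max_interruption]
    if viable:
        return min(viable, key=lambda t: t.price_per_gpu_hour)
    return min(tiers, key=lambda t: t.interruption_rate)

# Hypothetical pools across regions and instance families.
tiers = [
    CapacityTier("us-east-1/p5-spot", 1.95, 0.22),
    CapacityTier("us-west-2/p5-spot", 2.10, 0.08),
    CapacityTier("eu-west-1/p4d-spot", 1.60, 0.12),
]

print(pick_tier(tiers).name)  # → eu-west-1/p4d-spot (cheapest viable pool)
```

In practice the same comparison would run on live spot pricing and interruption telemetry rather than static numbers, re-evaluated each time a job is queued or a cluster is preempted.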

The outcome

Monthly GPU spend dropped 45% from $130K to $72K and per-run cost on the flagship LLM fell from $4,200 to $2,300. Pipeline reliability climbed from 72% to 99.2% over a three-month measurement window — 487 of 491 runs completed with no human in the loop. Researchers now go from idea to running cluster in 15 minutes and iterate roughly 3× faster.

Under the hood

Amazon EC2 (P5, P4d) · Amazon FSx for Lustre · Amazon S3 · AWS Elastic Fabric Adapter · Amazon SageMaker · AWS Auto Scaling · Amazon CloudWatch

GPU infrastructure was consuming our best engineers' time and our runway. Remāngu gave us enterprise-grade training infrastructure that actually costs less than the fragile setup we were managing ourselves. Our researchers went from waiting days for compute to running experiments within minutes.

Priya Raghavan, Head of AI Infrastructure, Tensora AI
