The problem
Tensora's research team was burning $130K a month on GPU compute at under 55% effective utilisation, because nobody dared shut clusters down between runs. Multi-day waits for H100 capacity stalled experiment timelines, distributed PyTorch and DeepSpeed runs were failing 28% of the time, and there was no shared experiment tracking: reproducing a previous configuration meant spelunking through env vars and shell scripts.
What we shipped
A managed training platform that schedules jobs across spot P5 and P4d capacity in four regions, falls back through tiers based on real-time pricing and interruption data, and tears clusters down within minutes of job completion. FSx for Lustre serves training data at 100+ GB/s with transparent S3 backing, EFA networking is tuned for NCCL on the P5 topology, and an MLflow registry plus a single CLI/API standardise every job from submission through SageMaker deployment.
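The tier fallback is the heart of the cost story. The sketch below shows one way that selection could work, assuming the scheduler already holds a snapshot of current spot prices and recent interruption rates per instance type and region; the names (SpotPool, pick_pool, the 5% interruption threshold) are illustrative, not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical snapshot of one spot pool (instance type + region).
# The two signal fields mirror what the text names: real-time pricing
# and recent interruption data.
@dataclass
class SpotPool:
    instance_type: str        # "p5.48xlarge" or "p4d.24xlarge"
    region: str               # one of the four regions the platform spans
    price_per_hour: float     # current spot price, USD
    interruption_rate: float  # recent interruption frequency, 0.0-1.0

# Fallback order: prefer P5 pools, drop to P4d when no P5 pool is healthy.
TIER_ORDER = ["p5.48xlarge", "p4d.24xlarge"]

def pick_pool(pools: list[SpotPool],
              max_interruption: float = 0.05) -> Optional[SpotPool]:
    """Return the cheapest acceptable pool, walking the tiers in order.

    A pool qualifies when its recent interruption rate sits under the
    threshold; within a tier the cheapest region wins. None means no
    pool qualifies right now, which the caller treats as queue-and-retry.
    """
    for instance_type in TIER_ORDER:
        candidates = [
            p for p in pools
            if p.instance_type == instance_type
            and p.interruption_rate <= max_interruption
        ]
        if candidates:
            return min(candidates, key=lambda p: p.price_per_hour)
    return None

if __name__ == "__main__":
    # Example snapshot: one churn-prone P5 pool, one healthy P5 pool,
    # one healthy P4d pool. The healthy P5 pool should win its tier.
    snapshot = [
        SpotPool("p5.48xlarge", "us-east-1", 38.0, 0.12),
        SpotPool("p5.48xlarge", "us-west-2", 41.5, 0.02),
        SpotPool("p4d.24xlarge", "eu-west-1", 11.2, 0.01),
    ]
    print(pick_pool(snapshot))  # -> the us-west-2 P5 pool
```

In practice the same walk would repeat whenever a pool is interrupted mid-run, which is also what makes the aggressive teardown-after-completion policy safe: capacity is re-acquired by policy rather than by hand.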
The outcome
Monthly GPU spend dropped 45%, from $130K to $72K, and per-run cost on the flagship LLM fell from $4,200 to $2,300. Pipeline reliability climbed from 72% to 99.2% over a three-month measurement window, with 487 of 491 runs completing with no human in the loop. Researchers now go from idea to running cluster in 15 minutes and iterate roughly 3× faster.