I wanted a simple, repeatable way to answer a practical question:
“If we had to run a private model on AWS, what’s the operational reality on Inf2 / Trn3?”
This post documents one small hands-on experiment (Inf2) plus the exact follow-up checks I’d run on Trn3.
## Experiment setup (Inf2)
This was a quick “can we get it working end-to-end?” test, not a full benchmark suite.
- Date of the run: 2026-02-01
- Instance tested: `inf2.xlarge`
- Region: `us-east-2`
- Model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct`
- Serving layer: vLLM on Neuron
- API surface: OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`)
- Port: `8080`
## What worked (and why it matters)
The core outcome: vLLM on Neuron works, and the OpenAI-compatible API makes it easy to plug into tooling without writing custom clients.
That said, there were a few very “real world” operational lessons that matter more than raw tokens/sec.
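To make "plug into tooling" concrete, here is a minimal sketch of the kind of client call I used, with the standard `openai` Python package pointed at the local vLLM endpoint. The base URL and the dummy API key reflect this particular setup (port 8080, no auth configured), so treat them as assumptions rather than defaults.

```python
# Minimal sketch: talking to vLLM's OpenAI-compatible endpoint with the
# standard `openai` client (assumes the server is reachable on localhost:8080).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # vLLM's OpenAI-compatible API
    api_key="not-needed",                 # placeholder; no auth was configured in this experiment
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Anything that already speaks the OpenAI API (agents, eval harnesses, proxies) plugs in the same way: swap the base URL and you are done.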
### 1) First-time Neuron compilation is expensive
The first time you run a specific model configuration, Neuron compilation can take ~15–30 minutes.
This is not inherently a problem, but it changes the UX of:
- rolling restarts
- autoscaling
- spot interruptions
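In practice that means anything that health-checks the server (load balancers, orchestrators, restart scripts) needs a startup grace period long enough to cover the compile. A minimal wait-loop sketch, assuming the server is on localhost:8080, looks like this:

```python
# Rough sketch: block until the (possibly still compiling) server answers /v1/models.
# The generous timeout exists purely because first-time Neuron compilation can take
# ~15-30 minutes; a warmed cache brings startup down to roughly a minute.
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8080",
                     timeout_s: int = 45 * 60,
                     poll_s: int = 30) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/models", timeout=5)
            if r.status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet (still compiling / loading weights)
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "gave up waiting")
```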
### 2) Persist caches, or restarts will hurt
The practical fix is straightforward: persist the HuggingFace cache and Neuron compiled artifacts on host storage, so that restarts become fast (often < 1 minute once warmed).
If you treat the instance as disposable without persistent caches, you’ll pay the compile cost repeatedly.
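For illustration, this is roughly what "host-persisted caches" looks like when starting the container programmatically. The sketch uses the `docker` Python SDK; the image tag, device path, and in-container cache locations are assumptions for this sketch, so check them against the Neuron/vLLM image you actually run.

```python
# Sketch only: start a vLLM-on-Neuron container with the HuggingFace cache and the
# Neuron compile cache bind-mounted from the host, so restarts reuse compiled artifacts.
# Image tag and in-container paths are illustrative assumptions, not gospel.
import docker

client = docker.from_env()

container = client.containers.run(
    image="vllm-neuron:latest",                      # placeholder image name
    detach=True,
    ports={"8080/tcp": 8080},
    devices=["/dev/neuron0:/dev/neuron0:rwm"],       # expose the Neuron device(s)
    volumes={
        "/opt/cache/huggingface": {"bind": "/root/.cache/huggingface", "mode": "rw"},
        "/opt/cache/neuron": {"bind": "/var/tmp/neuron-compile-cache", "mode": "rw"},
    },
    environment={"HF_HOME": "/root/.cache/huggingface"},
)
print(container.id)
```

The part that matters is the `volumes` mapping: as long as the HuggingFace and Neuron compile caches live on the host EBS volume, a container restart skips the expensive recompile.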
### 3) Disk is the first failure mode
For this kind of experiment, disk fills up before CPU does.
Rule of thumb from this run:
- 200GB minimum
- 500GB recommended for “real testing” (model weights + compiled artifacts + Docker layers + logs)
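A tiny preflight check avoids discovering this mid-download. A sketch, assuming the caches live under a hypothetical `/opt/cache` mount:

```python
# Sketch: fail fast if the cache volume is too small for weights + compiled
# artifacts + Docker layers + logs. Thresholds mirror the rule of thumb above.
import shutil
import sys

CACHE_PATH = "/opt/cache"          # assumed mount point for model/compile caches
MIN_FREE_GB = 200                  # bare minimum from this run
RECOMMENDED_FREE_GB = 500          # comfortable for "real testing"

free_gb = shutil.disk_usage(CACHE_PATH).free / 1e9
if free_gb < MIN_FREE_GB:
    sys.exit(f"Only {free_gb:.0f} GB free on {CACHE_PATH}; need at least {MIN_FREE_GB} GB.")
if free_gb < RECOMMENDED_FREE_GB:
    print(f"Warning: {free_gb:.0f} GB free; {RECOMMENDED_FREE_GB} GB is more comfortable.")
```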
## “Private model on EC2” test flow (repeatable)
Here’s the minimal flow I used and would use again:
- Launch EC2 (Inf2 or Trn3) with a large gp3 EBS volume and sane IAM/SSM access (a launch sketch follows this list).
- Connect via SSM (preferred) instead of opening SSH to the world.
- Start the vLLM (Neuron) container with host-persisted caches.
- Smoke test the OpenAI-compatible API (`models`, `chat`, `completion`).
- Observe logs and hardware metrics while running a small request loop.
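For step 1, a minimal launch sketch with `boto3` is below; the AMI ID, instance profile name, and tags are placeholders for whatever your account uses, and the 500GB gp3 root volume matches the disk rule of thumb above.

```python
# Sketch of step 1: launch an Inf2 instance with a large gp3 root volume and an
# instance profile that allows SSM. AMI ID, subnet defaults and profile name are
# placeholders you would substitute for your own account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",        # placeholder: a Neuron-ready AMI
    InstanceType="inf2.xlarge",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "ssm-enabled-profile"},  # placeholder profile
    BlockDeviceMappings=[{
        "DeviceName": "/dev/xvda",          # root device name depends on the AMI
        "Ebs": {"VolumeSize": 500, "VolumeType": "gp3", "DeleteOnTermination": True},
    }],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "inf2-vllm-experiment"}],
    }],
)
print(resp["Instances"][0]["InstanceId"])
```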
## Spot instances: make it affordable (with one rule)
For benchmarks and experiments, Spot makes this dramatically cheaper than On‑Demand.
The one rule: design the setup for interruptions.
- Use Spot for the run.
- Assume it will be interrupted (a watcher sketch follows this list).
- Keep caches on EBS (and snapshot/backup if you want faster recovery).
- Prefer capacity-optimized allocation when you can.
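"Assume it will be interrupted" only works if something actually watches for the interruption notice. The sketch below polls the instance metadata service (IMDSv2) for the Spot `instance-action` notice; what you do when it fires (stop accepting requests, flush logs, lean on the EBS-backed caches) depends on your setup.

```python
# Sketch: poll IMDSv2 for the Spot interruption notice and react before termination.
# EC2 posts /latest/meta-data/spot/instance-action roughly two minutes ahead of time;
# until then the endpoint returns 404.
import time
import requests

IMDS = "http://169.254.169.254"

def imds_token() -> str:
    return requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    ).text

def watch_for_interruption(poll_s: int = 5) -> None:
    while True:
        r = requests.get(
            f"{IMDS}/latest/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": imds_token()},
            timeout=2,
        )
        if r.status_code == 200:
            print(f"Interruption notice: {r.text}")
            # e.g. stop accepting requests, flush logs, let EBS-backed caches persist
            break
        time.sleep(poll_s)

if __name__ == "__main__":
    watch_for_interruption()
```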
## What about Trn3/Trn4?
We only tested Inf2; Trn3/Trn4 is currently available by request only.
Inf2 can get us into the same speed range as the public model providers, and as models evolve and become more efficient they should need fewer and fewer resources over time.
But the key point is independence from Nvidia and AMD chips: supply is limited and demand is high.
## Conclusion
For my current (bursty) workloads, I don’t need private model hosting right now: public APIs are cheaper and require significantly less operational effort.
But if you do have hard requirements (private/proprietary models or strict data boundaries), then Inf2/Trn3 + vLLM (Neuron) is a pragmatic AWS path — just plan for compilation time, persistent caches, and big disks.