I wanted a simple, repeatable way to answer a practical question:
“If we had to run a private model on AWS, what’s the operational reality on Inf2 / Trn3?”
This post documents one small hands-on experiment (Inf2) plus the exact follow-up checks I’d run on Trn3.
## Experiment setup (Inf2)
This was a quick “can we get it working end-to-end?” test, not a full benchmark suite.
- Date of the run: 2026-02-01
- Instance tested: `inf2.xlarge`
- Region: `us-east-2`
- Model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct`
- Serving layer: vLLM on Neuron
- API surface: OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`)
- Port: `8080`
## What worked (and why it matters)
The core outcome: vLLM on Neuron works, and the OpenAI-compatible API makes it easy to plug into tooling without writing custom clients.
That said, there were a few very “real world” operational lessons that matter more than raw tokens/sec.
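To make "plug into tooling" concrete, here is a minimal sketch of the kind of client call I used, with the standard `openai` Python package pointed at the local vLLM endpoint. The base URL and the dummy API key reflect this particular setup (port 8080, no auth configured), so treat them as assumptions rather than defaults.

```python
# Minimal sketch: talking to vLLM's OpenAI-compatible endpoint with the
# standard `openai` client (assumes the server is reachable on localhost:8080).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # vLLM's OpenAI-compatible API
    api_key="not-needed",                 # placeholder; no auth was configured in this experiment
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Anything that already speaks the OpenAI API (agents, eval harnesses, proxies) plugs in the same way: swap the base URL and you are done.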
### 1) First-time Neuron compilation is expensive
The first time you run a specific model configuration, Neuron compilation can take ~15–30 minutes.
This is not inherently a problem, but it changes the UX of:
- rolling restarts
- autoscaling
- spot interruptions
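In practice that means anything that health-checks the server (load balancers, orchestrators, restart scripts) needs a startup grace period long enough to cover the compile. A minimal wait-loop sketch, assuming the server is on localhost:8080, looks like this:

```python
# Rough sketch: block until the (possibly still compiling) server answers /v1/models.
# The generous timeout exists purely because first-time Neuron compilation can take
# ~15-30 minutes; a warmed cache brings startup down to roughly a minute.
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8080",
                     timeout_s: int = 45 * 60,
                     poll_s: int = 30) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/models", timeout=5)
            if r.status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet (still compiling / loading weights)
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "gave up waiting")
```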
### 2) Persist caches, or restarts will hurt
The practical fix is straightforward: persist the HuggingFace cache and Neuron compiled artifacts on host storage, so that restarts become fast (often < 1 minute once warmed).
If you treat the instance as disposable without persistent caches, you’ll pay the compile cost repeatedly.
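For illustration, this is roughly what "host-persisted caches" looks like when starting the container programmatically. The sketch uses the `docker` Python SDK; the image tag, device path, and in-container cache locations are assumptions for this sketch, so check them against the Neuron/vLLM image you actually run.

```python
# Sketch only: start a vLLM-on-Neuron container with the HuggingFace cache and the
# Neuron compile cache bind-mounted from the host, so restarts reuse compiled artifacts.
# Image tag and in-container paths are illustrative assumptions, not gospel.
import docker

client = docker.from_env()

container = client.containers.run(
    image="vllm-neuron:latest",                      # placeholder image name
    detach=True,
    ports={"8080/tcp": 8080},
    devices=["/dev/neuron0:/dev/neuron0:rwm"],       # expose the Neuron device(s)
    volumes={
        "/opt/cache/huggingface": {"bind": "/root/.cache/huggingface", "mode": "rw"},
        "/opt/cache/neuron": {"bind": "/var/tmp/neuron-compile-cache", "mode": "rw"},
    },
    environment={"HF_HOME": "/root/.cache/huggingface"},
)
print(container.id)
```

The part that matters is the `volumes` mapping: as long as the HuggingFace and Neuron compile caches live on the host EBS volume, a container restart skips the expensive recompile.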
### 3) Disk is the first failure mode
For this kind of experiment, disk fills up before CPU does.
Rule of thumb from this run:
- 200GB minimum
- 500GB recommended for “real testing” (model weights + compiled artifacts + Docker layers + logs)
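A tiny preflight check avoids discovering this mid-download. A sketch, assuming the caches live under a hypothetical `/opt/cache` mount:

```python
# Sketch: fail fast if the cache volume is too small for weights + compiled
# artifacts + Docker layers + logs. Thresholds mirror the rule of thumb above.
import shutil
import sys

CACHE_PATH = "/opt/cache"          # assumed mount point for model/compile caches
MIN_FREE_GB = 200                  # bare minimum from this run
RECOMMENDED_FREE_GB = 500          # comfortable for "real testing"

free_gb = shutil.disk_usage(CACHE_PATH).free / 1e9
if free_gb < MIN_FREE_GB:
    sys.exit(f"Only {free_gb:.0f} GB free on {CACHE_PATH}; need at least {MIN_FREE_GB} GB.")
if free_gb < RECOMMENDED_FREE_GB:
    print(f"Warning: {free_gb:.0f} GB free; {RECOMMENDED_FREE_GB} GB is more comfortable.")
```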
## “Private model on EC2” test flow (repeatable)
Here’s the minimal flow I used and would use again:
- Launch EC2 (Inf2 or Trn3) with a large gp3 EBS volume and sane IAM/SSM access (a launch sketch follows this list).
- Connect via SSM (preferred) instead of opening SSH to the world.
- Start the vLLM (Neuron) container with host-persisted caches.
- Smoke test the OpenAI-compatible API (`models`, `chat`, `completion`).
- Observe logs and hardware metrics while running a small request loop.
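For step 1, a minimal launch sketch with `boto3` is below; the AMI ID, instance profile name, and tags are placeholders for whatever your account uses, and the 500GB gp3 root volume matches the disk rule of thumb above.

```python
# Sketch of step 1: launch an Inf2 instance with a large gp3 root volume and an
# instance profile that allows SSM. AMI ID, subnet defaults and profile name are
# placeholders you would substitute for your own account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",        # placeholder: a Neuron-ready AMI
    InstanceType="inf2.xlarge",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "ssm-enabled-profile"},  # placeholder profile
    BlockDeviceMappings=[{
        "DeviceName": "/dev/xvda",          # root device name depends on the AMI
        "Ebs": {"VolumeSize": 500, "VolumeType": "gp3", "DeleteOnTermination": True},
    }],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "inf2-vllm-experiment"}],
    }],
)
print(resp["Instances"][0]["InstanceId"])
```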
## Spot instances: make it affordable (with one rule)
For benchmarks and experiments, Spot makes this dramatically cheaper than On‑Demand.
The one rule: design the setup for interruptions.
- Use Spot for the run.
- Assume it will be interrupted (a watcher sketch follows this list).
- Keep caches on EBS (and snapshot/backup if you want faster recovery).
- Prefer capacity-optimized allocation when you can.
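"Assume it will be interrupted" only works if something actually watches for the interruption notice. The sketch below polls the instance metadata service (IMDSv2) for the Spot `instance-action` notice; what you do when it fires (stop accepting requests, flush logs, lean on the EBS-backed caches) depends on your setup.

```python
# Sketch: poll IMDSv2 for the Spot interruption notice and react before termination.
# EC2 posts /latest/meta-data/spot/instance-action roughly two minutes ahead of time;
# until then the endpoint returns 404.
import time
import requests

IMDS = "http://169.254.169.254"

def imds_token() -> str:
    return requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    ).text

def watch_for_interruption(poll_s: int = 5) -> None:
    while True:
        r = requests.get(
            f"{IMDS}/latest/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": imds_token()},
            timeout=2,
        )
        if r.status_code == 200:
            print(f"Interruption notice: {r.text}")
            # e.g. stop accepting requests, flush logs, let EBS-backed caches persist
            break
        time.sleep(poll_s)

if __name__ == "__main__":
    watch_for_interruption()
```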
## What about Trn3/Trn4?
We only tested Inf2; Trn3/Trn4 is currently available by request only.
Inf2 can get us into the same speed range as the public model providers, and as models evolve and become more efficient they should need fewer and fewer resources over time.
But the key point is independence from Nvidia and AMD chips: supply is limited and demand is high.
## Conclusion
For my current (bursty) workloads, I don’t need private model hosting right now: public APIs are cheaper and require significantly less operational effort.
But if you do have hard requirements (private/proprietary models or strict data boundaries), then Inf2/Trn3 + vLLM (Neuron) is a pragmatic AWS path — just plan for compilation time, persistent caches, and big disks.