---
categories: [aws]
date: 2026-02-06 00:00:00 +0000 UTC
lastmod: 2026-02-06 00:00:00 +0000 UTC
publishdate: 2026-02-06 00:00:00 +0000 UTC
series: [AWS LLM]
slug: inf2-trn3-vllm-neuron-experiment
tags: [aws ec2 llm inferentia trainium neuron vllm self-hosting]
title: Experiment: Self-hosting an LLM on AWS Inf2/Trn3
---

I wanted a simple, repeatable way to answer a practical question:

> “If we had to run a private model on AWS, what’s the operational reality on Inf2 / Trn3?”

This post documents one small **hands-on** experiment (Inf2) plus the exact follow-up checks I’d run on **Trn3**.

## Experiment setup (Inf2)

This was a quick “can we get it working end-to-end?” test, not a full benchmark suite.

- **Date of the run:** 2026-02-01
- **Instance tested:** `inf2.xlarge`
- **Region:** `us-east-2`
- **Model:** `meta-llama/Llama-4-Maverick-17B-128E-Instruct`
- **Serving layer:** vLLM on Neuron
- **API surface:** OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`)
- **Port:** `8080`

## What worked (and why it matters)

The core outcome: **vLLM on Neuron works**, and the OpenAI-compatible API makes it easy to plug into tooling without writing custom clients.

That said, there were a few very “real world” operational lessons that matter more than raw tokens/sec.

### 1) First-time Neuron compilation is expensive

The first time you run a specific model configuration, Neuron compilation can take **~15–30 minutes**.

This is not inherently a problem, but it changes the UX of:
- rolling restarts
- autoscaling
- spot interruptions

### 2) Persist caches or restarts will hurt

The practical fix is straightforward: **persist the HuggingFace cache and Neuron compiled artifacts** on host storage, so that restarts become fast (often **< 1 minute** once warmed).

If you treat the instance as disposable without persistent caches, you’ll pay the compile cost repeatedly.

### 3) Disk is the first failure mode

For this kind of experiment, disk fills up before CPU does.

Rule of thumb from this run:
- **200GB minimum**
- **500GB recommended** for “real testing” (model weights + compiled artifacts + Docker layers + logs)

## “Private model on EC2” test flow (repeatable)

Here’s the minimal flow I used and would use again:

1. Launch EC2 (Inf2 or Trn3) with **large gp3 EBS** and sane IAM/SSM access.
2. Connect via **SSM** (preferred) instead of opening SSH to the world.
3. Start the vLLM (Neuron) container with **host-persisted caches**.
4. Smoke test the OpenAI-compatible API (`models`, `chat`, `completion`).
5. Observe logs and hardware metrics while running a small request loop.

## Spot instances: make it affordable (with one rule)

For benchmarks and experiments, **Spot** makes this dramatically cheaper than On‑Demand.

The one rule: design the setup for interruptions.

- Use Spot for the run.
- Assume it will be interrupted.
- Keep caches on **EBS** (and snapshot/backup if you want faster recovery).
- Prefer capacity-optimized allocation when you can.

## What about Trn3/Trn4?

We tested the power of the Inf2, but Trn3/Trn4 is available only per request at the moment.

Inf2 could give us the public model providers speed and together with the models evolution and grow it will require less and less resources in the future.

But the key point is independence from the Nvidia and AMD chips — they are limited and demand is high.

## Conclusion

For my current (bursty) workloads, I **don’t need** private model hosting right now: public APIs are cheaper and require significantly less operations.

But if you do have hard requirements (private/proprietary models or strict data boundaries), then **Inf2/Trn3 + vLLM (Neuron)** is a pragmatic AWS path — just plan for compilation time, persistent caches, and big disks.