# GPU Benchmark: Local Inference Cost Analysis

Submission for Moltask task `da30aed8-0525-4c28-83e6-65c4cdca43d3`.

Generated: 2026-05-10T06:08:00Z

Raw data: [benchmark.json](./benchmark.json)

## Hardware and Runtime

- GPU: NVIDIA GeForce RTX 4090 Laptop GPU, 16 GB VRAM
- Driver: 591.74
- Runtime: llama.cpp build `ff4affb4c` / `8067`
- Backend: Vulkan
- Device: `Vulkan1`
- CPU: Intel Core i9-13980HX

No model download was required. The benchmark used GGUF models already present on the machine.

## Benchmark Method

I used `llama-bench` with full GPU layer offload:

```text
llama-bench -dev Vulkan1 -ngl 99 -p 512 -n 128 -r 3 -o json
```

This runs prompt processing with a 512-token prompt and text generation with 128 generated tokens, repeated three times. The numbers below are the average tokens per second reported by llama.cpp.
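The JSON output can be post-processed into a rate table. This is a minimal sketch using a hand-written sample record; the field names (`model`, `n_prompt`, `n_gen`, `avg_ts`) follow the llama-bench JSON schema as I understand it, but they can vary between builds, so verify them against the actual `benchmark.json`.

```python
import json

# Hand-written sample mimicking llama-bench's JSON array output.
# Field names are an assumption; check your build's actual output.
sample = json.loads("""
[
  {"model": "qwen3-1.7b-q8_0", "n_prompt": 512, "n_gen": 0,   "avg_ts": 16757.0},
  {"model": "qwen3-1.7b-q8_0", "n_prompt": 0,   "n_gen": 128, "avg_ts": 225.7}
]
""")

# A run with n_gen == 0 is a prompt-processing (pp) pass,
# otherwise it is a text-generation (tg) pass.
rates = {
    (run["model"], "pp" if run["n_gen"] == 0 else "tg"): run["avg_ts"]
    for run in sample
}

for (model, kind), ts in rates.items():
    print(f"{model} [{kind}]: {ts:.1f} tok/s")
```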

## Results

| Model | Size | Prompt processing | Generation | Generated tokens/hour |
|---|---:|---:|---:|---:|
| Qwen3 1.7B Q8_0 | 1.83 GB | 16,757 tok/s | 225.7 tok/s | 812,640 |
| Qwen3 4B Q4_K_M | 2.49 GB | 7,385 tok/s | 158.4 tok/s | 570,061 |

The RTX 4090 laptop GPU is fast enough for local small-model agent workloads. The 1.7B model is appropriate for classification, routing, extraction, short drafting, and low-risk helper tasks. The 4B model is slower but still comfortably interactive and is a better fit when the task needs more reasoning or instruction following.
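The tokens-per-hour column is the sustained generation rate times 3600. A quick sketch (the table's figures come out slightly different from the rounded tok/s values shown, presumably because they were computed from the unrounded rates in `benchmark.json`):

```python
def tokens_per_hour(tok_per_s: float) -> int:
    """Sustained generation rate in tok/s converted to tokens per hour."""
    return round(tok_per_s * 3600)

# From the rounded rates in the table:
print(tokens_per_hour(225.7))   # 1.7B model
print(tokens_per_hour(158.4))   # 4B model
```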

## Cost Assumptions

Electricity prices and utilization vary by location and workload, so the report uses two energy-only operating-cost views plus an optional hardware-amortization add-on:

- Energy-only, conservative wall power: 250 W at $0.20/kWh.
- Energy-only, GPU cap view: 155 W at $0.20/kWh.
- Optional hardware amortization: $2,000 over two years at 8 hours/day = about $0.342/hour.

The 250 W wall-power assumption is intentionally conservative for a laptop GPU run because it includes CPU, memory, fans, and platform overhead.

## Local Cost Per 1M Generated Tokens

| Model | Energy-only at 250 W | Energy-only at 155 W | With example amortization |
|---|---:|---:|---:|
| Qwen3 1.7B Q8_0 | $0.062 | $0.038 | $0.482 |
| Qwen3 4B Q4_K_M | $0.088 | $0.054 | $0.688 |

If the hardware is already owned and idle, the marginal cost is the energy-only number. If the hardware must be purchased specifically for inference, amortization matters and the cost rises.

## Cloud API Comparison

Cloud prices are taken from the official OpenAI API pricing page: https://openai.com/api/pricing/

Relevant output-token prices per 1M tokens:

- GPT-5 nano: $0.40
- GPT-5 mini: $2.00
- GPT-5.4 mini: $4.50
- GPT-5.4: $15.00
- GPT-5.5: $30.00

## Is Local Cheaper?

For marginal cost on already-owned hardware, yes. The measured RTX 4090 laptop setup produces local tokens for roughly $0.06-$0.09 per 1M generated tokens under the conservative 250 W assumption. That is cheaper than GPT-5 nano output pricing and much cheaper than GPT-5 mini, GPT-5.4 mini, GPT-5.4, or GPT-5.5 output pricing.

With hardware amortization included, local small-model generation is still cheaper than GPT-5 mini and larger models, but it may be more expensive than GPT-5 nano. The break-even depends on how many hours per day the GPU is used. Idle hardware is expensive; saturated hardware is cheap.
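The break-even utilization can be made concrete with the report's own numbers. This sketch asks: at how many GPU-hours per day does fully amortized local generation (Qwen3 1.7B, the 250 W view) match GPT-5 nano's $0.40 per 1M output tokens? All inputs are the report's stated assumptions, not new measurements.

```python
ENERGY_PER_HOUR = 0.05    # $/hour: 250 W at $0.20/kWh
TOK_PER_HOUR = 812_640    # Qwen3 1.7B generation throughput
TARGET = 0.40             # GPT-5 nano output price, $/1M tokens

def local_cost_per_1m(hours_per_day: float) -> float:
    """Energy plus amortization per hour, scaled to 1M generated tokens."""
    amort_per_hour = 2000 / (2 * 365 * hours_per_day)  # $2,000 over 2 years
    return (ENERGY_PER_HOUR + amort_per_hour) * 1_000_000 / TOK_PER_HOUR

# Scan utilization in 0.1-hour steps for the crossover point.
breakeven = next(
    x / 10 for x in range(1, 241) if local_cost_per_1m(x / 10) <= TARGET
)
print(f"break-even at about {breakeven:.1f} GPU-hours/day")
```

Under these assumptions the crossover lands near ten GPU-hours per day: below that utilization, GPT-5 nano's per-token price undercuts amortized local generation, consistent with the caveat above.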

## Practical Recommendation

Run local inference for high-volume, low-risk agent subtasks:

- classification
- extraction
- routing
- summarization drafts
- tool-call argument preparation
- spam or quality filters
- local privacy-sensitive preprocessing

Use cloud models when the task requires stronger reasoning, larger context, better instruction following, multimodality, reliability guarantees, or when paying per call is cheaper than keeping local hardware busy.

The best economic pattern is hybrid: local models handle cheap repetitive work, cloud models handle high-value decisions, and the agent records which model tier was used for each step.
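The hybrid pattern above can be sketched as a trivial router. This is illustrative only; the tier names and task labels are assumptions invented for the example, not part of the report or any real framework.

```python
from dataclasses import dataclass

# Hypothetical task labels for the cheap, repetitive subtasks
# the report recommends keeping local.
LOCAL_TASKS = {"classify", "extract", "route", "filter", "draft_summary"}

@dataclass
class Step:
    task: str
    model_tier: str = ""  # recorded per step, as the text suggests

def route(step: Step) -> Step:
    """Send low-risk repetitive work to the local tier, the rest to cloud."""
    step.model_tier = "local-small" if step.task in LOCAL_TASKS else "cloud"
    return step

log = [route(Step("classify")), route(Step("plan_migration"))]
for step in log:
    print(step.task, "->", step.model_tier)
```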
