# Cost Control
LLM pipelines can get expensive fast. Stroma tracks three resource dimensions and lets you set hard limits on any of them.
## The three dimensions
| Dimension | What it tracks | Unit |
|---|---|---|
| Tokens | Total LLM tokens consumed | integer |
| Cost | Total monetary spend | USD (float) |
| Latency | Total wall-clock time | milliseconds |
## Setting a budget
```python
from stroma import ExecutionBudget, StromaRunner

runner = StromaRunner.quick(
    budget=ExecutionBudget(
        max_tokens_total=10_000,  # (1)!
        max_cost_usd=0.50,  # (2)!
        max_latency_ms=30_000,  # (3)!
    )
)
```

1. Stop if total tokens across all nodes exceed 10,000.
2. Stop if total cost exceeds $0.50.
3. Stop if total wall-clock time exceeds 30 seconds.
Set any combination; `None` means no limit on that dimension. `ExecutionBudget.unlimited()` (the default) sets all three to `None`.
## Reporting token usage
Nodes report token usage by returning a tuple instead of a plain dict. Four return shapes are supported:
```python
# No usage tracking — tokens=0, cost=$0.00
return {"summary": text}

# Report total tokens only
return ({"summary": text}, 500)

# Report input tokens + model name for automatic cost calculation
return ({"summary": text}, 500, "gpt-4o")

# Full usage: input tokens, output tokens, model
return ({"summary": text}, 400, 100, "gpt-4o")
```
## Model-based cost estimation
When you include a model name in the return tuple, Stroma looks it up in `KNOWN_MODELS` and computes the USD cost automatically:
```python
import asyncio

from pydantic import BaseModel

from stroma import ExecutionBudget, StromaRunner


class Input(BaseModel):
    text: str


class Output(BaseModel):
    summary: str


runner = StromaRunner.quick(
    budget=ExecutionBudget(max_cost_usd=0.10)
)


@runner.node("summarize", input=Input, output=Output)
async def summarize(state: Input) -> tuple:
    # In real code, this would call an LLM
    summary = state.text[:100]
    input_tokens = len(state.text.split())
    output_tokens = len(summary.split())
    return ({"summary": summary}, input_tokens, output_tokens, "gpt-4o")  # (1)!


async def main():
    result = await runner.run(
        [summarize],
        Input(text="A short document to summarize."),
    )
    print(f"Tokens used: {result.total_tokens}")
    print(f"Cost: ${result.total_cost_usd:.6f}")


asyncio.run(main())
```
1. The model string `"gpt-4o"` is looked up in `KNOWN_MODELS` to compute per-token pricing. Unknown models default to $0.00.
Built-in pricing is included for:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `gpt-4o` | $2.50 | $10.00 |
| `gpt-4o-mini` | $0.15 | $0.60 |
| `gpt-4-turbo` | $10.00 | $30.00 |
| `claude-opus-4-6` | $15.00 | $75.00 |
| `claude-sonnet-4-6` | $3.00 | $15.00 |
| `claude-haiku-4-5` | $0.80 | $4.00 |
| `gemini-1.5-pro` | $3.50 | $10.50 |
| `gemini-1.5-flash` | $0.35 | $1.05 |
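The arithmetic behind the lookup is straightforward. For the earlier `(400, 100, "gpt-4o")` return value, the cost works out as follows (plain Python, shown here outside of Stroma using the `gpt-4o` rates from the table):

```python
# Per-1M-token prices for gpt-4o, from the table above.
INPUT_PER_M = 2.50
OUTPUT_PER_M = 10.00

input_tokens, output_tokens = 400, 100

# Scale each count down to millions of tokens, then apply the rate.
cost = (input_tokens / 1_000_000) * INPUT_PER_M \
     + (output_tokens / 1_000_000) * OUTPUT_PER_M
print(f"${cost:.6f}")  # $0.002000
```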
## What happens when a budget is exceeded
When cumulative usage crosses a budget limit, `BudgetExceeded` is raised. By default this is classified as `RECOVERABLE`, so the runner will retry (useful if the budget was close and a cheaper retry succeeds).
```python
async def main():
    result = await runner.run([expensive_node], Input(text="..." * 10_000))
    if result.status != "COMPLETED":
        print(f"Budget exceeded! Used {result.total_tokens} tokens")
```
!!! tip
    If you want budget violations to be terminal (no retries), add a custom classifier:

    ```python
    from stroma import FailureClass
    from stroma.cost import BudgetExceeded


    def budget_is_terminal(exc, ctx):
        if isinstance(exc, BudgetExceeded):
            return FailureClass.TERMINAL
        return None


    runner = StromaRunner.quick(classifiers=[budget_is_terminal])
    ```
## Model fallback at budget threshold
Instead of failing when costs get high, you can automatically downgrade to a cheaper model. The `with_model_fallback` builder sets a threshold: when cumulative spend reaches a percentage of the budget, subsequent nodes receive a fallback model signal via `ctx["_stroma_model"]`.
```python
runner = (
    StromaRunner.quick()
    .with_budget(cost_usd=1.00)
    .with_model_fallback("gpt-4o", to="gpt-4o-mini", at_budget_pct=0.80)
)
```
When the pipeline has spent 80% of its $1.00 budget, any node that would use `gpt-4o` gets `gpt-4o-mini` instead. Nodes read the signal from the context dict:
```python
@runner.node("summarize", input=Input, output=Output)
async def summarize(state: Input, ctx: dict) -> tuple:
    model = ctx.get("_stroma_model", "gpt-4o")  # (1)!
    # Pass `model` to your LLM client
    return ({"summary": "..."}, input_tokens, output_tokens, model)
```

1. Falls back to `"gpt-4o"` when below threshold (key absent from context).
You can chain multiple fallback rules for different models or thresholds:
```python
runner = (
    StromaRunner.quick()
    .with_budget(cost_usd=5.00)
    .with_model_fallback("gpt-4-turbo", to="gpt-4o", at_budget_pct=0.60)
    .with_model_fallback("gpt-4o", to="gpt-4o-mini", at_budget_pct=0.80)
)
```
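One way to picture the resolution is to apply each `(from, to, threshold)` rule in registration order once its threshold is reached. The sketch below is illustrative only, under that assumed ordering semantics; it is not Stroma's internal implementation, and in particular the rule-chaining behavior at high spend is an assumption:

```python
def resolve_model(requested: str, spend_fraction: float,
                  rules: list[tuple[str, str, float]]) -> str:
    """Apply (from_model, to_model, at_budget_pct) rules in registration order.

    Illustrative sketch of one plausible semantics; not Stroma internals.
    """
    model = requested
    for from_model, to_model, at_pct in rules:
        if model == from_model and spend_fraction >= at_pct:
            model = to_model
    return model


rules = [
    ("gpt-4-turbo", "gpt-4o", 0.60),
    ("gpt-4o", "gpt-4o-mini", 0.80),
]
print(resolve_model("gpt-4-turbo", 0.50, rules))  # gpt-4-turbo
print(resolve_model("gpt-4-turbo", 0.70, rules))  # gpt-4o
print(resolve_model("gpt-4-turbo", 0.90, rules))  # gpt-4o-mini (rules chain)
```

Under these semantics, a `gpt-4-turbo` request past the 80% mark falls through both rules and lands on `gpt-4o-mini`.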
## Inspecting cost after a run
The `ExecutionResult` includes total token and cost counters:
```python
result = await runner.run(nodes, initial_state)

print(result.total_tokens)    # aggregate tokens across all nodes
print(result.total_cost_usd)  # aggregate cost in USD across all nodes
```
## Recap

- `ExecutionBudget` sets limits on tokens, cost, and latency
- Nodes report usage by returning tuples: `(dict, tokens)`, `(dict, input_tokens, model)`, or `(dict, input_tokens, output_tokens, model)`
- Model-based pricing computes `cost_usd` automatically from `KNOWN_MODELS`
- `BudgetExceeded` is raised when a limit is crossed (classified as recoverable by default)
- `result.total_tokens` and `result.total_cost_usd` give you post-run totals
- Use `ExecutionBudget.unlimited()` (the default) to disable all limits
Next: Tracing & Debugging — inspect every execution attempt in detail.