It turns out AI trading bots behave just like humans – they lose (mostly)
Here's an interesting exercise. Nof1 instructed several AI engines with exactly the same prompts. The results are shown below.
It seems AI engines trade just like humans – they lose (mostly).
Unlike traditional AI evaluations using static data or simulations, Alpha Arena (Nof1) deploys AI models into live cryptocurrency markets with real capital, measuring their ability to execute trades, manage risk, and generate returns autonomously. It's structured as a competitive "arena" where models "battle" under identical conditions, providing insights into AI behaviour, biases, and limitations in financial decision-making.
The trading period ran from 10 October to 3 November 2025. They all started with $10,000 and were allowed to trade a selection of crypto assets. As you can see, some of the early results were impressive, particularly from the Chinese AI models Qwen and DeepSeek. Though this was only an experiment, the results with better prompting could prove quite exciting.
Nof1 AI trading results.

The winning AI engine was Qwen3 Max, which grew its starting capital of $10,000 (in stablecoins) to $12,287 – growth of nearly 23%.
This was followed by DeepSeek Chat V3.1, which ended up at $10,467 (growth of about 4.7%).
The rest did terribly, with ChatGPT losing two-thirds of its starting capital. Gemini and Grok did slightly better, each losing about half their starting capital, and Claude Sonnet lost about a third.
The dotted line above shows what would have happened with a Bitcoin buy-and-hold strategy: it would have left you pretty much at the same level as when you started.
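If you want to check those percentages, the arithmetic is simply (final balance minus initial capital) divided by initial capital. Here is a quick sketch using only the balances reported above:

```python
# Simple return arithmetic for the reported final balances.
# The dollar figures are the published results; the maths is just
# (final - initial) / initial, expressed as a percentage.

STARTING_CAPITAL = 10_000  # each model started with $10,000 in stablecoins

final_balances = {
    "Qwen3 Max": 12_287,
    "DeepSeek Chat V3.1": 10_467,
}

for model, final in final_balances.items():
    pct_return = (final - STARTING_CAPITAL) / STARTING_CAPITAL * 100
    print(f"{model}: {pct_return:+.1f}%")

# Output:
# Qwen3 Max: +22.9%
# DeepSeek Chat V3.1: +4.7%
```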
The rules
There are a few things to note about this AI trading competition.
Models make fully autonomous decisions: when to enter and exit a trade, which assets to trade, position sizing, leverage (up to platform limits, often 5-20x), and direction (long/short). Trades are executed via Hyperliquid's API on the decentralised exchange, with no human overrides.
Risk Management Constraints
- Built-in guardrails in prompts: limit maximum position size (e.g., 10-20% of capital per trade), enforce stop-losses/take-profits, and avoid over-leveraging to prevent total liquidation (a rough sketch of what such a guardrail might look like follows below).
- Models must log reasoning traces explaining why trade decisions are made or altered (such as "Entering long BTC due to bullish MACD crossover") for post-analysis.
- No short-selling bans or circuit breakers beyond Hyperliquid's rules.
Models scan markets periodically and can hold positions indefinitely or flip rapidly based on signals.
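Nof1 hasn't published its actual prompts or harness code, but a prompt-level guardrail of the kind described above might look something like this minimal sketch. All names, thresholds, and the reasoning-trace field are illustrative assumptions, not Nof1's implementation:

```python
from dataclasses import dataclass

# Illustrative guardrail check of the kind the prompts are said to enforce:
# cap position size as a fraction of equity, cap leverage, and require a
# stop-loss plus a logged reasoning trace before a trade is accepted.
# All names and thresholds here are assumptions for the sketch.

MAX_POSITION_FRACTION = 0.20   # e.g. no more than 20% of capital per trade
MAX_LEVERAGE = 5               # well under typical platform limits

@dataclass
class TradeDecision:
    asset: str           # e.g. "BTC"
    direction: str       # "long" or "short"
    notional_usd: float  # position size in dollars
    leverage: float
    stop_loss: float     # price at which the position is cut
    reasoning: str       # the model's logged explanation

def passes_guardrails(trade: TradeDecision, equity_usd: float) -> bool:
    """Reject trades that breach the prompt-level risk constraints."""
    if trade.notional_usd > MAX_POSITION_FRACTION * equity_usd:
        return False
    if trade.leverage > MAX_LEVERAGE:
        return False
    if trade.stop_loss <= 0 or not trade.reasoning:
        return False
    return True

# Example: a decision carrying the kind of reasoning trace described above.
decision = TradeDecision(
    asset="BTC", direction="long", notional_usd=1_500, leverage=3,
    stop_loss=58_000.0,
    reasoning="Entering long BTC due to bullish MACD crossover",
)
print(passes_guardrails(decision, equity_usd=10_000))  # True
```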
Not that we should be discouraged by the results.
Nof1.ai's recap noted "exciting insights" into biases, with 80% of trades having a long bias. The next competition will add multi-asset support and longer tests.
Large Language Models (LLMs) have embedded biases. For example, the Chinese models Qwen and DeepSeek excelled at high-frequency trading, while GPT-5 and Gemini tended towards more conservative biases, trying to avoid risk and therefore missing out on rallies.
What we learned from this is that AIs rely on historical patterns, but these can be thrown off by crypto noise, such as flash crashes and whale dumps.
High-frequency bots racked up fees, while some models struggled with dynamic position sizing.
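To see why fee drag bites high-turnover styles, here is a back-of-the-envelope sketch. The per-side fee, trade count, and position sizes are assumed round numbers, not Hyperliquid's actual fee schedule or any model's real activity:

```python
# Back-of-the-envelope fee drag for a high-turnover strategy.
# ASSUMPTION: a flat 0.05% fee per side (0.10% per round trip); real
# exchange fees differ and usually depend on volume tiers.

fee_per_side = 0.0005
trades_per_day = 40
avg_notional = 2_000        # dollars per trade
days = 24                   # roughly the length of the competition window

round_trip_cost = 2 * fee_per_side * avg_notional
total_fees = round_trip_cost * trades_per_day * days
print(f"Estimated fees: ${total_fees:,.0f} on $10,000 starting capital")
# Estimated fees: $1,920 on $10,000 starting capital
```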
In summary, Alpha Arena proves AIs can trade, but giving them the same instructions yields different (often mediocre) results due to baked-in biases and market chaos. It's a wake-up call for AI finance: focus on robustness over raw intelligence.
Chinese LLMs still crushed it relatively, but no one beat a simple momentum ETF.
“Soon, we'll launch Season 1.5 of Alpha Arena, which will introduce key improvements and additions to the benchmark. This includes multiple simultaneous competitions, a broader asset universe, long-term memory, and the addition of qualitative data,” says Alpha Arena. Expect these models to improve with time, because human biases (such as selling winning trades too early or exiting losing trades too late) can be overcome by machine learning.
