We were experimenting with a real-time chess use case: take a screenshot of a chessboard and convert the visible position into FEN (Forsyth-Edwards Notation).
The model had to detect whether white or black was at the bottom, read the board row by row, map every piece to the correct square, and output the piece-placement part of the FEN.
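As a rough illustration of the target format (not the prompt or code used in the experiment), the sketch below turns an 8x8 grid, read top to bottom as it appears on screen, into the piece-placement field. The function name `grid_to_placement`, the grid representation, and the `white_at_bottom` flag are all assumptions made for this example.

```python
from typing import List


def grid_to_placement(grid: List[List[str]], white_at_bottom: bool = True) -> str:
    """Convert an on-screen 8x8 grid (top row first) to FEN piece placement.

    Each cell holds a FEN piece letter (uppercase = white, lowercase = black)
    or "" for an empty square. Hypothetical helper for illustration only.
    """
    if len(grid) != 8 or any(len(row) != 8 for row in grid):
        raise ValueError("expected an 8x8 grid")

    # With black at the bottom, the screen shows rank 1 at the top and file h
    # on the left. Rotating by 180 degrees restores rank 8 first, file a first.
    if not white_at_bottom:
        grid = [row[::-1] for row in grid[::-1]]

    ranks = []
    for row in grid:
        out, empties = "", 0
        for cell in row:
            if cell == "":
                empties += 1          # count a run of empty squares
            else:
                if empties:
                    out += str(empties)
                    empties = 0
                out += cell
        if empties:                   # flush a trailing run of empties
            out += str(empties)
        ranks.append(out)
    return "/".join(ranks)
```

The orientation check matters because the same screenshot read without the rotation yields a mirrored, upside-down position that is still syntactically valid FEN.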
We tested this on 72 chessboard screenshots and compared each output against ground truth generated from the PGN.
At first, the results were noisy because we used a 1024-token output limit. That was too low for high-reasoning runs, so some models ran out of tokens before producing the final structured answer.
| Run | Accuracy | Exact / Eval | Notes |
|---|---|---|---|
| GPT-5.4 low summary | 100.00% | 72 / 72 | best overall |
| GPT-5.4 default | 93.06% | 67 / 72 | no reasoning summary |
| Claude xhigh summary | 76.39% | 55 / 72 | 6 parse errors |
| Claude high summary | 69.44% | 50 / 72 | 4 parse errors |
| Claude default | 66.67% | 48 / 72 | 1 parse error |
| GPT-5.4 xhigh summary | 4.17% | 3 / 72 | 69 parse errors |
| Claude max summary | 2.78% | 2 / 72 | 70 parse errors |
The low scores for GPT xhigh and Claude max were mostly truncation failures, not actual vision failures. The models spent most of the 1024-token budget on reasoning and often never reached the final answer, which shows up in the table as parse errors.
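A scorer along the following lines keeps parse errors separate from genuinely wrong answers. It is a sketch, assuming each model response has already been reduced to a candidate piece-placement string (or `None` when no final answer was produced); the names `is_valid_placement`, `classify`, and `tally` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

PIECES = set("prnbqkPRNBQK")


def is_valid_placement(text: str) -> bool:
    """Check that text has 8 ranks and every rank covers exactly 8 squares."""
    ranks = text.strip().split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)    # digit = run of empty squares
            elif ch in PIECES:
                squares += 1
            else:
                return False          # unexpected character
        if squares != 8:
            return False
    return True


def classify(output: Optional[str], expected: str) -> str:
    """Label one result as 'parse_error', 'exact', or 'wrong'."""
    if output is None or not is_valid_placement(output):
        return "parse_error"
    return "exact" if output.strip() == expected.strip() else "wrong"


def tally(results: Iterable[Tuple[Optional[str], str]]) -> Counter:
    """Count labels over (output, expected) pairs."""
    return Counter(classify(out, exp) for out, exp in results)
```

Accuracy is then `exact / total`; for example, 55 exact matches out of 72 gives `round(100 * 55 / 72, 2) == 76.39`, matching the Claude xhigh row above. A truncated response counts as a parse error rather than a vision mistake.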
So we increased the output limit to 4096 tokens.
| Run | Accuracy | Notes |
|---|---|---|
| GPT-5.4 low summary | 100.00% | best overall |
| GPT-5.4 default | 97.22% | strong, no summaries |
| GPT-5.4 xhigh summary | 94.44% | strong but expensive, still some parse errors |
| Claude max summary | 84.72% | best Claude run |
| Claude high summary | 80.56% | no parse errors, still mapping mistakes |
| Claude xhigh summary | 79.17% | similar to high |
| Claude default | 63.89% | weaker without thinking |
| Claude medium summary | 58.33% | worst Claude config |
Claude's thinking runs improved after we increased the token budget, most of all at max thinking (2.78% to 84.72%). But even the best Claude run stayed below the weakest GPT run (94.44%).
That made us look more closely at the actual thinking traces and outputs instead of only looking at aggregate accuracy.
GPT's traces usually stayed procedural and row-specific:
> For row 8, I see: a8 has a black rook, f8 has a black rook, h8 has a black king...
>
> Moving to rank 2, I see the white pawns at a2, b2, c2, with gaps where pieces have moved...
Claude's traces were more often global descriptions of the position:
> I'm looking at a chess board layout with pieces positioned across the rows, showing what appears to be a mid-game or puzzle position...
>
> Black has pawns scattered across the board with the king on f6, white has a rook on d1, a king on e3, and a few pawns positioned strategically.
This helped explain the remaining errors. Claude usually understood the board as a chess position, but it was less reliable at preserving the exact square-by-square layout.
When we manually checked the failed cases, the mistakes were mostly local: pieces shifted by one file, one extra or missing empty square in a row, moved-pawn gaps missed, correct piece family but wrong square, or correct piece but wrong color or case.
For example:
| Expected | Claude Output | What Went Wrong |
|---|---|---|
| p1b1pk2 | p2b1pk1 | empty-square counts shifted |
| PP3P2 | PP4P1 | pawn moved one file over |
| 1BNP1N1P | 1BNB1N1P | wrong piece at a specific square |
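Mismatches like these are easier to see once each rank is expanded to one character per square. The sketch below does that and lists the files that differ; `expand_rank` and `diff_rank` are hypothetical helper names, and `.` is an arbitrary marker chosen here for an empty square.

```python
from typing import List, Tuple

FILES = "abcdefgh"


def expand_rank(rank: str) -> str:
    """Expand one FEN rank to 8 characters, using '.' for empty squares."""
    squares = "".join("." * int(ch) if ch.isdigit() else ch for ch in rank)
    if len(squares) != 8:
        raise ValueError(f"rank {rank!r} covers {len(squares)} squares, not 8")
    return squares


def diff_rank(expected: str, actual: str) -> List[Tuple[str, str, str]]:
    """Return (file, expected_square, actual_square) for each mismatching file."""
    exp, act = expand_rank(expected), expand_rank(actual)
    return [(FILES[i], exp[i], act[i]) for i in range(8) if exp[i] != act[i]]


if __name__ == "__main__":
    for expected, actual in [("p1b1pk2", "p2b1pk1"),
                             ("PP3P2", "PP4P1"),
                             ("1BNP1N1P", "1BNB1N1P")]:
        print(expected, actual, diff_rank(expected, actual))
```

For the three rows above, this reports five mismatching files (c through g) for the first pair, two (f and g) for the second, and a single one (d, `P` read as `B`) for the third. One miscounted gap early in a rank shifts every piece after it, which is why the first case looks worse square by square than it is.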
That is when the pattern became clear: Claude was not failing at general chess understanding. It was failing at precise spatial localization.
This matched Anthropic's own Claude vision documentation:
> Spatial reasoning: Claude's spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.
Source: https://docs.anthropic.com/en/docs/build-with-claude/vision

That line maps almost exactly to our experiment. FEN generation requires exact localization across 64 squares. If a model recognizes the board but shifts one piece by one file, the FEN is still wrong.
So the short version is: we started by testing vision-language models (VLMs) for chessboard-to-FEN extraction, fixed a token-budget issue in the evaluation, then manually inspected the traces and failed outputs. That inspection showed that Claude's main weakness was not formatting or global understanding. It was precise spatial localization, which Anthropic already documents as a limitation of Claude vision.