field note / VLM Evaluation

Claude, chessboards, and spatial reasoning

A small chessboard-to-FEN experiment showed where Claude vision struggles: not global understanding, but exact square-level localization.

We were experimenting with a real-time chess use case: take a screenshot of a chessboard and convert the visible position into FEN.

The model had to detect whether white or black was at the bottom, read the board row by row, map every piece to the correct square, and output the piece-placement part of the FEN.
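
To make the target format concrete, here is a minimal sketch of the last two steps (orientation handling and rank encoding), assuming the board has already been read into an 8x8 grid of characters. The grid representation, the "." marker for empty squares, and the function names are illustrative assumptions, not part of our pipeline.

```python
from typing import List

EMPTY = "."  # assumed marker for an empty square


def rank_to_fen(squares: List[str]) -> str:
    """Encode one row of 8 squares, collapsing runs of empties into digits."""
    out = []
    empty_run = 0
    for sq in squares:
        if sq == EMPTY:
            empty_run += 1
            continue
        if empty_run:
            out.append(str(empty_run))
            empty_run = 0
        out.append(sq)
    if empty_run:
        out.append(str(empty_run))
    return "".join(out)


def grid_to_fen_placement(grid: List[str], white_at_bottom: bool = True) -> str:
    """grid holds 8 rows as seen on screen, top to bottom, left to right.

    Uppercase letters are white pieces, lowercase are black, as in FEN.
    """
    if len(grid) != 8 or any(len(row) != 8 for row in grid):
        raise ValueError("expected an 8x8 grid")
    if not white_at_bottom:
        # Rotate 180 degrees so the first row is rank 8 and starts at file a.
        grid = [row[::-1] for row in grid[::-1]]
    return "/".join(rank_to_fen(list(row)) for row in grid)
```

The encoding itself is mechanical. The hard part for a vision model is producing the correct grid in the first place, since every square has to be right before the run-length encoding comes out right.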

We tested this on 72 chessboard screenshots and compared each output against ground truth generated from the PGN.

At first, the results were misleading because we capped output at 1024 tokens. That was too low for high-reasoning runs, so some models ran out of tokens before producing the final structured answer.

| Run | Accuracy | Exact / Evaluated | Notes |
|---|---|---|---|
| GPT-5.4 low summary | 100.00% | 72 / 72 | best overall |
| GPT-5.4 default | 93.06% | 67 / 72 | no reasoning summary |
| Claude xhigh summary | 76.39% | 55 / 72 | 6 parse errors |
| Claude high summary | 69.44% | 50 / 72 | 4 parse errors |
| Claude default | 66.67% | 48 / 72 | 1 parse error |
| GPT-5.4 xhigh summary | 4.17% | 3 / 72 | 69 parse errors |
| Claude max summary | 2.78% | 2 / 72 | 70 parse errors |

The low scores for GPT xhigh and Claude max were mostly truncation failures, not actual vision failures. The models were using too many tokens and often did not reach the final answer.
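
Separating parse errors from genuine mapping mistakes is what made the truncation problem visible. The sketch below shows one way to do that split. The validity rule (8 ranks, each covering exactly 8 squares) and the three bucket names are assumptions for illustration; our harness's exact definition of a parse error may differ.

```python
from collections import Counter
from typing import Iterable, Optional

PIECES = set("prnbqkPRNBQK")


def is_valid_placement(fen: str) -> bool:
    """True if fen looks like a FEN piece-placement field."""
    ranks = fen.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        width = 0
        for ch in rank:
            if ch in PIECES:
                width += 1
            elif ch in "12345678":
                width += int(ch)
            else:
                return False
        if width != 8:
            return False
    return True


def score(outputs: Iterable[Optional[str]], truths: Iterable[str]) -> Counter:
    """Bucket each output as exact, wrong, or parse_error.

    None stands for a run that never produced a final answer
    (for example, because it hit the output token limit).
    """
    counts = Counter()
    for out, truth in zip(outputs, truths):
        if out is None or not is_valid_placement(out.strip()):
            counts["parse_error"] += 1
        elif out.strip() == truth:
            counts["exact"] += 1
        else:
            counts["wrong"] += 1
    return counts


def accuracy_pct(exact: int, total: int) -> float:
    """Exact-match accuracy as a percentage, rounded to two decimals."""
    return round(100 * exact / total, 2)
```

With this split, a run like Claude max at 1024 tokens (2 exact, 70 parse errors out of 72) reads as an evaluation problem rather than a vision problem: accuracy_pct(2, 72) gives 2.78, but almost none of the failures are wrong boards.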

So we increased the output limit to 4096 tokens.

| Run | Accuracy | Notes |
|---|---|---|
| GPT-5.4 low summary | 100.00% | best overall |
| GPT-5.4 default | 97.22% | strong, no summaries |
| GPT-5.4 xhigh summary | 94.44% | strong but expensive, still some parse errors |
| Claude max summary | 84.72% | best Claude run |
| Claude high summary | 80.56% | no parse errors, still mapping mistakes |
| Claude xhigh summary | 79.17% | similar to high |
| Claude default | 63.89% | weaker without thinking |
| Claude medium summary | 58.33% | worst Claude config |

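Claude's thinking runs improved once the token budget was raised, most dramatically at max thinking (2.78% to 84.72%). The default run, which does not use thinking, did not improve (66.67% to 63.89%). Even the best Claude run still trailed every GPT-5.4 configuration, which scored between 94.44% and 100.00%.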

That made us look more closely at the actual thinking traces and outputs instead of only looking at aggregate accuracy.

GPT's traces usually stayed procedural and row-specific:

For row 8, I see: a8 has a black rook, f8 has a black rook, h8 has a black king...

Moving to rank 2, I see the white pawns at a2, b2, c2, with gaps where pieces have moved...

Claude's traces were more often global descriptions of the position:

I'm looking at a chess board layout with pieces positioned across the rows, showing what appears to be a mid-game or puzzle position...

Black has pawns scattered across the board with the king on f6, white has a rook on d1, a king on e3, and a few pawns positioned strategically.

This helped explain the remaining errors. Claude usually understood the board as a chess position, but it was less reliable at preserving the exact square-by-square layout.

When we manually checked the failed cases, the mistakes were mostly local: pieces shifted by one file, one extra or missing empty square in a row, gaps left by moved pawns missed, the correct piece family on the wrong square, or the correct piece with the wrong color (that is, the wrong letter case in FEN).

For example:

| Expected | Claude Output | What Went Wrong |
|---|---|---|
| p1b1pk2 | p2b1pk1 | empty-square counts shifted |
| PP3P2 | PP4P1 | pawn moved one file over |
| 1BNP1N1P | 1BNB1N1P | wrong piece at a specific square |
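
Errors like these are easier to see once the run-length digits are expanded back into squares. The following is a small sketch of that kind of per-rank diff, assuming ranks are written from file a to file h as in standard FEN. The helper names are ours for illustration.

```python
from typing import List, Tuple

FILES = "abcdefgh"


def expand_rank(rank: str) -> str:
    """Expand FEN digits into '.' so every square is one character."""
    return "".join("." * int(ch) if ch.isdigit() else ch for ch in rank)


def diff_rank(expected: str, actual: str) -> List[Tuple[str, str, str]]:
    """Return (file, expected_square, actual_square) for each mismatch."""
    exp, act = expand_rank(expected), expand_rank(actual)
    if len(exp) != 8 or len(act) != 8:
        raise ValueError("each rank must cover exactly 8 squares")
    return [
        (FILES[i], e, a)
        for i, (e, a) in enumerate(zip(exp, act))
        if e != a
    ]


# The three example rows from the table above.
print(diff_rank("p1b1pk2", "p2b1pk1"))
print(diff_rank("PP3P2", "PP4P1"))
print(diff_rank("1BNP1N1P", "1BNB1N1P"))
```

Expanded this way, the first row turns out to be a single miscounted gap that pushes the bishop, pawn, and king each one file to the right, so one counting slip produces five mismatched squares. The second row differs on two squares (f and g), and the third on one (d).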

That is when the pattern became clear: Claude was not failing at general chess understanding. It was failing at precise spatial localization.

This matched Anthropic's own Claude vision documentation:

Spatial reasoning: Claude's spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

Source: https://docs.anthropic.com/en/docs/build-with-claude/vision

[Screenshot: Claude vision documentation showing the spatial reasoning limitation]

That line maps almost exactly to our experiment. FEN generation requires exact localization across 64 squares. If a model recognizes the board but shifts one piece by one file, the FEN is still wrong.
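
To illustrate how unforgiving exact match is, the sketch below counts matching squares across the full board. We did not report a per-square metric in this experiment, and the position used here is a made-up example built around one of the failed rows, not one of the 72 screenshots.

```python
def expand_placement(fen: str) -> str:
    """Flatten a FEN piece-placement field into 64 characters, '.' for empty."""
    return "".join(
        "." * int(ch) if ch.isdigit() else ch
        for ch in fen
        if ch != "/"
    )


def square_matches(expected: str, actual: str) -> int:
    """Number of squares (out of 64) on which the two placements agree."""
    exp, act = expand_placement(expected), expand_placement(actual)
    if len(exp) != 64 or len(act) != 64:
        raise ValueError("both placements must cover exactly 64 squares")
    return sum(e == a for e, a in zip(exp, act))


# Illustrative only: one pawn shifted from f2 to g2 on an otherwise empty board.
expected = "8/8/8/8/8/8/PP3P2/8"
actual = "8/8/8/8/8/8/PP4P1/8"
print(square_matches(expected, actual), "of 64 squares match")
print("exact match:", expected == actual)
```

Here 62 of 64 squares agree, yet the output still counts as a failure under exact match, which is the metric that matters for a usable FEN.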

So the short version is: we started by testing VLMs for chessboard-to-FEN extraction, fixed a token-budget issue in the evaluation, then manually inspected the traces and failed outputs. That inspection showed that Claude's main weakness was not formatting or global understanding. It was precise spatial reasoning, which Anthropic already documents as a limitation of Claude vision.