← Field Notes / VLM Evaluation

Strong general purpose VLMs can still fail on downstream vision tasks

A small chessboard-to-FEN experiment showed Claude Opus 4.7 struggling with exact square-level localization even when it understood the board broadly.

Sankalp Nagaonkarfield noteMay 19, 2026 · 6 min

We were building Chess Lens, an educational chess assistant that could look at a board position and help guide a player through it.

The idea was to use VideoDB to ingest the chessboard visually, follow the game state over time, and turn the current position into structured data for teaching, analysis, and next-move guidance. Before any of that could work, the system needed to read the board accurately.

To validate that board-reading step, we evaluated different thinking configurations for GPT-5.4 and Claude Opus 4.7 on the same board-reading task. Claude Opus 4.7 consistently trailed the strongest GPT runs, so we looked more closely at where its outputs were failing.

Understanding is not equal to precision in chessboard-to-FEN extraction

FEN and what we measured

Chess Lens needed the current board position in FEN, a standard notation for one exact chess position. FEN records the board as rows of piece letters and empty-square counts. For example, a row like PP3P2 means two white pawns, three empty squares, one white pawn, and two empty squares.

For this evaluation, we only measured the piece-placement field of FEN. The model received a board image and had to return which piece was on which square. We compared that output against the expected FEN for the same point in the game.

That makes the evaluation strict in the way Chess Lens needs it to be strict. If a pawn is shifted from one file to the next, the position may still look plausible in prose, but the FEN is wrong.

Visual explanation of how chessboard-to-FEN extraction works

Setup

We evaluated a chess game across 72 board positions. Each evaluated model had to determine board orientation, identify every visible piece, assign each piece to a square, and emit the piece-placement field of FEN.

The score is exact-match accuracy against the expected FEN sequence for the game.

Token budget changed the result

The first run used a 1024-token output limit. That was too low for high-reasoning configurations, so some models ran out of tokens before producing the final structured answer.

Run	Accuracy	Exact / Eval	Notes
GPT-5.4 low summary	100.00%	72 / 72	best overall
GPT-5.4 default	93.06%	67 / 72	no reasoning summary
Claude Opus 4.7 xhigh summary	76.39%	55 / 72	6 parse errors
Claude Opus 4.7 high summary	69.44%	50 / 72	4 parse errors
Claude Opus 4.7 default	66.67%	48 / 72	1 parse error
GPT-5.4 xhigh summary	4.17%	3 / 72	69 parse errors
Claude Opus 4.7 max summary	2.78%	2 / 72	70 parse errors

The low scores for GPT xhigh and Claude Opus 4.7 max were mostly truncation failures, not vision failures. The models used too many tokens and often did not reach the final answer.

We increased the output limit to 4096 tokens and reran the evaluation.

Run	Accuracy	Notes
GPT-5.4 low summary	100.00%	best overall
GPT-5.4 default	97.22%	strong, no summaries
GPT-5.4 xhigh summary	94.44%	strong but expensive, still some parse errors
Claude Opus 4.7 max summary	84.72%	best Claude Opus 4.7 run
Claude Opus 4.7 high summary	80.56%	no parse errors, still mapping mistakes
Claude Opus 4.7 xhigh summary	79.17%	similar to high
Claude Opus 4.7 default	63.89%	weaker without thinking
Claude Opus 4.7 medium summary	58.33%	worst Claude Opus 4.7 config

Aggregate scores were not enough

Claude Opus 4.7 improved after increasing the token budget, especially at max thinking. But the accuracy still did not catch up to GPT.

That pushed us to inspect the intermediate reasoning summaries and failed outputs instead of relying only on aggregate accuracy.

GPT's reasoning summaries usually stayed procedural and row-specific:

For row 8, I see: a8 has a black rook, f8 has a black rook, h8 has a black king...

Moving to rank 2, I see the white pawns at a2, b2, c2, with gaps where pieces have moved...

Claude Opus 4.7's reasoning summaries were more often global descriptions of the position:

I'm looking at a chess board layout with pieces positioned across the rows, showing what appears to be a mid-game or puzzle position...

Black has pawns scattered across the board with the king on f6, white has a rook on d1, a king on e3, and a few pawns positioned strategically.

This helped explain the remaining errors. Claude Opus 4.7 usually understood the board as a chess position, but it was less reliable at preserving the exact square-by-square layout.

The failure pattern

Manual inspection showed mostly local errors: pieces shifted by one file, one extra or missing empty square in a row, moved-pawn gaps missed, correct piece family but wrong square, or correct piece but wrong color or case.

For example:

Expected	Claude Opus 4.7 Output	What Went Wrong
`p1b1pk2`	`p2b1pk1`	empty-square counts shifted
`PP3P2`	`PP4P1`	pawn moved one file over
`1BNP1N1P`	`1BNB1N1P`	wrong piece at a specific square

A single square mismatch invalidates the FEN row

That pattern matters because FEN compresses each row using piece letters and empty-square counts. For example, PP3P2 means two pawns, three empty squares, one pawn, and two empty squares. If Claude Opus 4.7 outputs PP4P1, the row still looks plausible, but the pawn has shifted by one file and the FEN is wrong.

Claude Opus 4.7 was not failing at general chess understanding. It was failing at precise spatial localization.

The documented limitation

This matches the Claude vision documentation:

Spatial reasoning: Claude's spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

Source: https://docs.anthropic.com/en/docs/build-with-claude/vision

That line maps closely to this experiment. FEN generation requires exact localization across 64 squares. If a model recognizes the board but shifts one piece by one file, the FEN is still wrong.

Takeaway

Chess Lens exposed a narrow but important failure mode: understanding the board is not enough when the output format requires exact coordinates.

Claude Opus 4.7 often produced outputs that looked reasonable at the position level, but FEN is evaluated at the square level. One shifted piece changes the board state. One wrong empty-square count changes the row. The model can be directionally right and still fail the task.

The takeaway is that downstream tasks need their own evaluations. The best general-performing model is not automatically the best model for the specific task you are trying to solve.

Citation

Please cite this work as:

S Nagaonkar, "Strong general purpose VLMs can still fail on downstream vision tasks",
VideoDB Labs, May 2026.

Or use the BibTeX citation:

@article{nagaonkar2026strongGeneralPurpose,
  author = {S Nagaonkar},
  title = {Strong general purpose VLMs can still fail on downstream vision tasks},
  journal = {VideoDB Labs},
  year = {2026},
  note = {https://labs.videodb.io/engineering/field-notes/claude-chessboard-spatial-reasoning},
}