Improving Product Discovery of Tabletop RPG Maps (Preliminary Investigation)

As catalogs for digital marketplaces grow, customers have greater difficulty finding products that meet their needs. Marketplace owners might improve product discovery by adding new categories, tags, or other searchable and filterable metadata to product descriptions. Because it is often difficult to get product owners to update old product descriptions, a marketplace owner may use automated methods to “backfill” product data. We investigate using multi-modal models to extract data from digital products, specifically digital maps for tabletop role-playing games. Across five tasks, we find that model performance tracks each model’s ranking on the Open VLM Leaderboard. State-of-the-art commercial models perform well on zero-shot feature extraction and image association/captioning tasks, but all models perform poorly on reasoning and quantitative tasks.

Introduction

Tabletop role-playing game maps primarily serve one of two roles. A battlemap lets players and the game master track the positions of characters and creatures with miniatures or tokens (see battlemap below). A dungeon map depicts an area for the game master’s own use while running an adventure, marking buildings, catacombs, traps, and locations of creatures (see dungeon map below). The former is either printed and placed in the center of the table during play or displayed using virtual tabletop (VTT) software.

[Image: view of a building with an interior garden and rooms with chairs, tables, and benches]

Example Battlemap/VTT-style Map (source: Owlette Publications LLC)

[Image: a dungeon with three rooms and two caverns]

Example Game Master Dungeon Map

Game masters need to marry a map with their intended adventure and campaign setting, so the choice of a map is a multi-factor decision. Although game masters often make their own maps for an adventure, high-quality maps require skill and creativity and can be time-consuming to make, so game masters may instead purchase maps. Multiple marketplaces exist to sell maps created by independent publishers. The marketplaces support product discovery via categorization filters, free-form text search, and limited recommendation systems.

Since maps are digital goods and marketplaces have access to the product (marketplaces store the product within their databases), a marketplace may use computer vision technology to extract product data and augment their product discovery tools and database. As a means to maintain accuracy, publishers could approve any additional data being added to the product description or database. A possible activity flow for this process could be:

  1. Publisher creates product description and uploads product to Marketplace
  2. Model (mediated by the Marketplace) processes the product and computes additional metadata
  3. Publisher examines and approves additional metadata
  4. Customers use tools provided by the Marketplace to discover products
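The four steps above can be sketched as a small data model in which model-generated metadata stays hidden from search until the publisher approves it. This is only an illustrative sketch: the `Product` and `Status` names, the example product title, and the example tags are hypothetical and do not describe any marketplace’s actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"   # computed by the model, not yet reviewed
    APPROVED = "approved"   # accepted by the publisher
    REJECTED = "rejected"   # declined by the publisher


@dataclass
class Product:
    title: str
    description: str
    # Maps each tag to its review status; only APPROVED tags are searchable.
    tags: dict = field(default_factory=dict)

    def propose(self, model_tags):
        """Step 2: record model output without exposing it to customers."""
        for tag in model_tags:
            self.tags.setdefault(tag, Status.PROPOSED)

    def review(self, tag, accept):
        """Step 3: the publisher approves or rejects one proposed tag."""
        self.tags[tag] = Status.APPROVED if accept else Status.REJECTED

    def searchable_tags(self):
        """Step 4: discovery tools only see approved metadata."""
        return sorted(t for t, s in self.tags.items() if s is Status.APPROVED)


# Step 1: the publisher uploads a product (hypothetical example data).
product = Product("Skull Cliff", "A cliff-face battlemap")
product.propose(["cave entrance", "stone bridges", "lifeboats"])
product.review("cave entrance", accept=True)
product.review("lifeboats", accept=False)
print(product.searchable_tags())  # ['cave entrance']
```

Tags that are never reviewed ("stone bridges" here) remain proposed and therefore unsearchable, which matches the idea that publishers gate what is added to the product description.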

Computer vision technology has improved greatly in the past few years, particularly in its ability to identify and extract objects and other features from an image. The combination of disparate neural network architectures allows models to interpret image data alongside semantically laden text, greatly improving performance on tasks such as image captioning and visual question answering.

Methodology

As a preliminary study, our objective is to gain a sense of the current quality of existing models for this domain, gain insight on what engineering efforts will be required for a large scale project, and characterize some of the likely challenges such a project would encounter.

We evaluated zero-shot performance of five existing multi-modal neural networks using five prompts or tasks. The prompts were:

  1. List locations or spaces featured within the map. Format as a bullet list.
  2. List items featured within the map. Format as a bullet list.
  3. In units of the listed map scale or map grids, if they exist, what are the dimensions of this map?
  4. Is movement within this map mostly linear or mostly branching?
  5. List types of adventures or scenarios this map is suitable for. Format as a bullet list.

The first two prompts focus on identifying and extracting features: the former on locations where characters may be placed, the latter on items the characters might pick up or interact with. The third prompt involves interpreting, quantitatively, a map grid or scale. The fourth prompt involves semi-subjective interpretation of the map features along with the external concept of movement. The fifth prompt tests themed open-ended captioning.

We selected five multi-modal LLMs for the evaluation (see table below). Since we would like to determine a necessary “minimum strength,” we selected models across a range of benchmark levels. The Avg Score column in the table acts as a proxy for the model’s general strength (source: Open VLM Leaderboard).

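| Short Name | Model | Avg Score | Type (API or OS = open source) |
|---|---|---|---|
| Gemini | Gemini-2.5 Pro Preview 05-06 | 80.1 | API |
| GPT | GPT-4.1 20250414 | 75.9 | API |
| Pixtral | Pixtral-12B | 61 | OS |
| Gemma3 | Gemma3-4B | 55.4 | OS |
| LLaVA | LLaVA-v1.5-7B | 36.9 | OS |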

All models were evaluated using the same core prompts. We adjusted the prompts to fit model requirements (e.g. adding ‘USER’ and ‘ASSISTANT’ breaks), but the core prompt was constant.
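A minimal sketch of how one shared core prompt can be wrapped per model is shown below. The five prompt strings come from the list above. The template names and the exact `USER: <image> ... ASSISTANT:` layout are assumptions modeled on the chat convention commonly used with LLaVA-1.5 checkpoints; the study only states that ‘USER’ and ‘ASSISTANT’ breaks were added where a model required them.

```python
CORE_PROMPTS = [
    "List locations or spaces featured within the map. Format as a bullet list.",
    "List items featured within the map. Format as a bullet list.",
    "In units of the listed map scale or map grids, if they exist, "
    "what are the dimensions of this map?",
    "Is movement within this map mostly linear or mostly branching?",
    "List types of adventures or scenarios this map is suitable for. "
    "Format as a bullet list.",
]

# Hypothetical per-model wrappers; "{core}" marks where the shared prompt goes.
TEMPLATES = {
    "chat_breaks": "USER: <image>\n{core} ASSISTANT:",  # LLaVA-1.5 style (assumed)
    "plain": "{core}",                                  # models that take raw text
}


def build_prompt(core, template="plain"):
    """Wrap a core prompt in a model-specific template without altering it."""
    return TEMPLATES[template].format(core=core)


print(build_prompt(CORE_PROMPTS[3], "chat_breaks"))
```

Keeping the core text in one list and the wrappers in another makes it easy to confirm that every model saw the same question.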

We did not supply contextual data to any of the models. Each prompt was presented as a stand-alone invocation rather than a conversation to avoid interacting effects. Some models required converting images to RGB mode first, but we otherwise did not scale, crop, or convert the images for a specific model.

For our dataset, we randomly selected products from a digital marketplace. All products were created between 2022 and 2024, inclusive. We dropped products from our selection if they were city or world maps or were not maps (e.g. collections of images meant to build new maps). Since most products include multiple maps and map variants (e.g. day and night), we selected a single map from each product, preferring ones meant for VTT use and variants with labeled content.

| Processed File | Size (MB) | Width x Height (pixels) | Type |
|---|---|---|---|
| M001.jpg | 1.2 | 1800 x 3648 | VTT |
| M002.jpg | 10.0 | 6000 x 4800 | VTT |
| M003.jpg | 1.5 | 2380 x 2380 | VTT |
| M004.jpg | 0.9 | 2160 x 2880 | VTT |
| M005.jpg | 8.1 | 8400 x 8400 | VTT |
| M006.jpg | 0.9 | 2523 x 3220 | GM |
| M007.jpg | 2.1 | 2523 x 3220 | GM |
| M008.jpg | 0.6 | 1400 x 1400 | VTT |

We ran the three open-source models on a 16-core, 64 GB virtual machine without an attached GPU. Our code used the Hugging Face Transformers library. We called the two API models using their commercial endpoints.

Results and Evaluation

For each of the five prompts, we evaluated the quality of output with prompt-specific measures and human judgement. Although we selected the sample images randomly, this study does not include sufficient data points for a significance test. Instead, as a preliminary study, we want a rough measure of baseline capability in order to identify potential research directions and likely necessary data engineering efforts.

For transparency, we report the output of the models for each task within a table. However, the output has been edited for uniformity and space. The models often output filler sentences, repeat themselves, and change formatting. Any large-scale application of these models will require an engineering effort to standardize the outputs.
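As a rough illustration of the standardization effort, the sketch below keeps only bullet-style lines from a raw response, strips markdown emphasis, and removes repeated items. It is not the cleanup used for the tables in this study (that editing was manual); the function name and the bullet pattern are assumptions.

```python
import re

# Matches "- item", "* item", "• item", "1. item", or "1) item".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)\s*$")


def normalize_response(text):
    """Return unique bullet items in first-seen order, dropping filler lines."""
    seen = set()
    items = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if not match:
            continue  # not a bullet: treated as filler
        item = match.group(1).strip("*_ ")  # drop markdown emphasis markers
        key = item.casefold()               # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            items.append(item)
    return items


raw = """Here are the locations I can see:
- A dungeon with a staircase
- A dungeon with a staircase
* Bridge
1. **Chasm**
"""
print(normalize_response(raw))  # ['A dungeon with a staircase', 'Bridge', 'Chasm']
```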

Inference Time and Money

Inference times in seconds, not including any file upload times, were:

| Model | Min. Inf Time (s) | Med. Inf Time (s) | Max. Inf Time (s) |
|---|---|---|---|
| Gemma3 | 118 | 122 | 189 |
| LLaVA | 235 | 261 | 326 |
| Pixtral | 905 | 1193 | 1425 |
| Gemini | 31 | 49 | 290 |
| GPT | 33 | 39 | 50 |

Our intent here is not to rigorously benchmark inference performance but to give a sense of the scale of latencies. The APIs, which are presumably tuned and run on high-end machines, typically needed thirty to sixty seconds to evaluate five prompts on an image. Run locally on a high-end machine that lacks a GPU, inferences took multiple minutes to complete.

The marginal cost of calling the API services was between $0.002 and $0.003 per image. The APIs use a variety of factors to calculate a cost in terms of tokens, but this calculation is dominated by the size of the image in pixels.

To compute the marginal cost of local inference, we use the hourly cost of running a virtual machine and the processing throughput. The virtual machine used for this study costs $0.54/hour. This translates to costs from $0.018 to $0.18 per image processed, depending on the model.
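The local cost range can be reproduced with a short calculation. The sketch assumes the per-image cost is based on the median inference times from the table above, which is consistent with the reported range but is not stated explicitly.

```python
VM_COST_PER_HOUR = 0.54  # USD per hour, as reported for the study's virtual machine

# Median seconds to run all five prompts on one image (from the table above).
MEDIAN_SECONDS = {"Gemma3": 122, "LLaVA": 261, "Pixtral": 1193}


def cost_per_image(seconds, hourly_rate=VM_COST_PER_HOUR):
    """Marginal cost of one image: the share of an hour it uses times the rate."""
    return hourly_rate * seconds / 3600


for model, seconds in MEDIAN_SECONDS.items():
    print(f"{model}: ${cost_per_image(seconds):.3f} per image")
# Gemma3: $0.018 per image
# LLaVA: $0.039 per image
# Pixtral: $0.179 per image
```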

We expect an operator could process 10,000 images for less than $100.

Prompt 1: Locations

List locations or spaces featured within the map. Format as a bullet list.

This task tests the model’s performance at finding map features that are conceptually a “room” or defined space. Characters should be able to occupy a location. For each response, we apply a subjective criterion: the list must describe rooms that are accurate and meaningful in a product query.

We’ve summarized the results in the table below. Both Gemini and GPT consistently extracted useful details from the images, although tuning or other filters may be required to make the outputs useful. The three open-source models typically produced poor-quality output. When labels were present, all models preferred to extract those labels rather than interpret details in the image.

| Processed File | Gemma3 | LLaVA | Pixtral | Gemini | GPT |
|---|---|---|---|---|---|
| M001.jpg | Fail | Pass | Fail | Fail | Pass |
| M002.jpg | Fail | Fail | Fail | Pass | Pass |
| M003.jpg | Fail | Pass | Fail | Pass | Fail |
| M004.jpg | Fail | Fail | Fail | Pass | Pass |
| M005.jpg | Fail | Fail | Fail | Pass | Pass |
| M006.jpg | Fail | Fail | Fail | Pass | Pass |
| M007.jpg | Fail | Fail | Pass | Pass | Fail |
| M008.jpg | Fail | Pass | Fail | Pass | Pass |

Details

M001 shows the deck of a ship. There is one room at the bow with a hammock, table, barrels, and a crate. There is also a small room at the bow holding an anchor. The middle area hosts four cannon, two to a side, along with more barrels, crates, a ladder, and a grate. The aft is separated into two rooms, each with a hammock and a table with chair. All the models pick up on the naval theme. Pixtral picks up on the contents of rooms, but none of the rooms themselves. Gemma3 reports a “Captain’s Quarters” and a “Navigation Station,” but it is unclear which room corresponds to either description. Similarly, Gemini uses general terms to refer to sections of a boat, without reference to specific rooms on the map, and reports “lifeboats/rafts”. GPT fares better by referring to “Crew quarters” and “Private cabins” without suggesting specific roles.

M002 is a cliff face, dominated by a large skull whose mouth opens into a cave. The cliff face is embedded with single-room buildings connected by stairs. The ground features a small camp and some streams crossed with bridges. None of the models pick up on the skull imagery, although all models except Pixtral report an entrance to a cave. Gemma3 fails to identify the buildings and the camp, while LLaVA combines these features into a generic “small village” label and inflates the small rocks in the streams into islands. GPT provides the most complete and detailed list, followed by Gemini.

M003 is a large tree with a small opening that leads to an interior, hollowed-out room that is likely a lair for a creature. LLaVA directly extracts the hollowed-out nature of the room (although it falls into a loop and keeps repeating the fact). Gemini reports the fact somewhat indirectly, referring to it as a “central clearing/nest”. The other three models report the existence of the tree and note other elements of the image, but miss its most interesting part: the hollow.

M004 is a complex map, featuring a bridge leading to a fortress of rooms, with internal paths and stairways connecting detailed rooms. LLaVA fails and simply repeats “A dungeon with a staircase”. The models other than Gemini associate rooms with functions in a non-apparent manner. Is a room with a table and three seating places the “dining hall”? Which room is a “jail” or “prison”?

M005 is a difficult map for a human to interpret since the color range is low contrast and the buildings are largely collapsed with few distinguishing marks. Of the open-source models, Pixtral does best by listing “structures”. None of the open-source models identify the stream or river in the picture, but both API models do. Gemini’s descriptions are matter-of-fact, while GPT’s are more detailed and add color (not necessarily supported by the image).

Both M006 and M007 use traditional RPG cartography as found in TSR publications in the 80s and 90s (similar to the GM map above). Rooms are labeled and, while doors, traps, and secret passages are marked, rooms are not populated with images of their contents. Gemini and GPT extracted all the room labels and other markers from the map; however, GPT also added descriptions to each room number that fail to follow the actual room layouts. Of the three open-source models, Pixtral performed the best, extracting about one-third of the rooms in M006’s case and 80% in M007’s case. From a product discovery standpoint, extracting room labels is not useful by itself, as customers are not searching for a map that includes “room 33”. However, customers may be interested in maps with a certain number or range of rooms, so the output may be useful as part of an analysis pipeline.

Map M008 is an octagonal bar or tavern. Gemma3 and Pixtral extract elements of the image, but fail to extract the meaning of the elements. LLaVA, Gemini, and GPT identify the bar and the other dining/drinking elements.

Responses

Processed File Ground Truth Gemma3 LLaVA Pixtral Gemini GPT
M001.jpg (5 locations)
Three living areas/berths
Midship/Cannon firing
Anchor storage
Deck
Captain’s Quarters
Navigation Station
Cabin
Storage area
Wooden planks
Sail
Anchor
Rope
Sword
Ship
Cannon
Barrel
Chest
Sail
Barrel
Anchor
Rope
Cannon
Ladder
Grating
Wheel
Oar
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Barrel
Deck
Cannon emplacements
Central grate/hatch
Bow (forecastle area with anchor, ropes, and supplies)
Stern (aftcastle area with cargo)
Lifeboats/rafts
Forecastle deck
Storage areas
Weapon/armory area
Cannon stations
Main deck
Stairwell/ladder access
Crew quarters
Private cabins or enclosed spaces
Miscellaneous deck equipment
M002.jpg House/Bedroom (5)
Bridge (3)
Entry to cave
Camp
Rivers
Large Mountain
Cave Entrance
River (Two connected streams)
A river flowing through the middle of the map
A waterfall in the middle of the map
A cave with a waterfall inside
A bridge over the river
A small village with buildings and a water source
A large mountain with a waterfall on it
A small island in the middle of the river
A large island in the middle of the river
A small island in the middle
House with multiple windows
Open area with a well
House with a large door
House with a small entrance
House with a large entrance
House with a small window
House with a large entrance and a small window
House with a small entrance and a small window
House with a large entrance and a small window
House with a small entrance and a large window
House with a large
Cave entrance
River
Stone bridges
Buildings/Huts
Staircases
Tents/Makeshift shelters
Grassy areas/Vegetation
Rocky cliffs/Cave walls
Small pond/Pool of water
Central plaza/Open area
Cave entrance in the cliff face
Rocky cliffside and surrounding rock formations
Multiple interior rooms built into the cliffside
Room with beds and tables
Room with chests and barrels
Room with beds
Room with beds and a table
Room with crates and supplies
Water streams and river systems running through the map
Multiple stone bridges crossing the water
A central tent camp area with several tents
Grassland and forested areas surrounding the water
Scattered trees, shrubs, and stones throughout the landscape
M003.jpg Room inside tree hollow Central massive tree with a large root system
Small patches of grass
A large tree with a hole in the center
A small tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the center
A tree with a hole in the
Central Tree Trunk
Upper Branches
Lower Branches
Surrounding Bushes
Ground Cover
Small Rocks
Fallen Leaves
Small Plants
Grass Patches
The Great Tree
Central Clearing/Nest
Surrounding Forest Floor
Tree Stumps
Bone Pile
Debris/Wreckage
Thickets/Bushes
Smaller Trees
Fallen Logs/Branches
Large central tree stump
Exposed and sprawling tree roots area
Central pool or depression with muddy water
Broken wooden debris in the central depression
Small clusters of bushes and shrubs around the roots
Scattered tree stumps throughout the map
Various fallen logs and branches
Surrounding dense forest/wooded edge
Patches of colorful foliage
Open grassy clearings between roots and trees
M004.jpg Bridge
Gate
Magic defense rings
10 Rooms
Main Entrance
Grand Hall
Library
Armory
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
A dungeon with a staircase
Entrance/Exit
Hallway
Room with Red Light
Control Room
Storage Room
Laboratory
Prison Cells
Torture Chamber
Dungeon
Bridge
Chasm
Chasm/Abyss
Fortified walkways/ledges along the chasm
Balconies with glowing teal orbs and spikes
Rooms (possibly barracks, studies, or chambers)
Stairways
Areas with barrels and crates (storage)
Circular platforms with magical symbols
Wooden bridges or walkways connecting upper levels
Corridors/Passages
Small alcoves or guard posts along the walls
Entry hallway/corridor
Main hall or crossroads
Stairs
Small bedrooms or chambers
Armory or storage rooms
Dining area or mess hall
Office or study
Ritual or summoning room
Jail or holding cells
Guard posts or sentry rooms
Bridge
Balcony/ledge overlooking a chasm
Arcane or crystal chambers
Tentacle-filled chasm/abyss
Storage alcoves with barrels and sacks
M005.jpg Rooms
Stream
Central Chamber
Collapsed Tunnel 1
Collapsed Tunnel 2
A large, dark room
A small, dark room
A large, dark room with a clock
A small, dark room with a clock
A large, dark room with a clock and a small, dark room with a clock
A large, dark room with a clock and a small, dark room with a clock
A large, dark room with a clock and a small, dark room with a clock
Top-left circular structure
Top-right circular structure
Top-center rectangular structure
Middle-left square structure
Middle-right square structure
Bottom-left square structure
Bottom-right square structure
Central irregularly shaped area
Glowing Purple River
Star-Patterned Circular Chamber
Staired Circular Dais/Platform
Northern Ruins
Southern Ruins
Rocky Island
Circular ritual or summoning chamber
Large ruined rooms or halls
Central underground river or stream
Rocky outcroppings and cave formations along the river
Isolated rocky island or promontory (center of river)
Small ruined structure or shrine
Passageways and corridors connecting various rooms and spaces
Arched or circular entrance/exit
Crumbled and partially-collapsed areas throughout map
M006.jpg Rooms (60) 11
12
13
14
10..27 1..19a 1..58 (all room labels) Room 1 .. Room 60
Multiple staircases
Multiple well/circle rooms
Several small antechambers and corridors
Spaces marked "b" and "c" (possibly doors or barriers)
Decorative/statue symbols in some rooms
M007.jpg Rooms (24)
Forest area
24
23
22
21
23
24
Lists 25..40 (none exist)
1..20 1..24 (all room labels) 24 entries described like so. Room descriptions do not match the map
1. Cave entrance (northernmost entry)
2. Small cavern
3. Room with barrier/door
4. Chamber with central structure
5. Large winding corridor
M008.jpg Outside table (4)
Inside bar
Octagonal Table
Stone Floor
Walls
Chairs
A large wooden structure
A dining area with tables and chairs
A bar with a variety of drinks
A large wooden table
A large wooden table
A large wooden table
A large wooden table
A large wooden table
Central Table
Barrels
Benches
Tables
Floor
Central Hexagonal Building/Bar
Bar Counter
Service Area (behind the bar)
Spiral Staircase
Seating Area (inside)
Awnings
Outdoor Paved Area
Outdoor Tables
Central main building/structure
Central bar or serving counter
Multiple colored seating areas (tables and chairs) around the main room
Four exterior market stalls or vendor stands
Four exterior round tables with benches
Surrounding open plaza or paved area

Prompt 2: Items

List items featured within the map. Format as a bullet list.

This task is intended to evaluate sub-image identification within a larger image, focusing on images that are “items”: things characters could interact with, versus rooms the characters could inhabit. This is a feature associated with “battle maps” or maps that will be seen by players, versus maps only seen by the game master. Thus, we remove M005, M006, and M007 from the evaluation of this task. M005 is excluded because the map lacks details of items, and M006 and M007 are not battle maps.

This task led the LLaVA model to repeat itself. For clarity, we have removed duplicate entries.

For this task, we use Jaccard similarity, also called intersection over union, to quantify how effectively the models are finding all the items. The score ranges from zero to one, where one would indicate the model replied with all of the items in the ground truth and without any extras. Mathematically:

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

For our evaluation, we ignore plurality and counts. For example, “cannon” is treated as a correct answer for M001, as well as “cannons” and “cannon (6)”, even though there are only four.
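The scoring rule can be sketched in a few lines. The normalization below (lowercasing, dropping a trailing count such as “(6)”, and stripping a plural “s”) is a simplified stand-in for the matching described above, which was presumably done by hand; the function names are hypothetical, and returning 0.0 when both lists are empty is a choice made for this sketch.

```python
import re


def normalize(item):
    """Lowercase, drop a trailing count such as '(6)', and strip a plural 's'."""
    item = re.sub(r"\s*\(\d+\)\s*$", "", item.strip().lower())
    if item.endswith("s") and not item.endswith("ss"):
        item = item[:-1]
    return item


def jaccard(truth, response):
    """Intersection over union of the two normalized item sets."""
    a = {normalize(x) for x in truth}
    b = {normalize(x) for x in response}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


truth = ["Cannon (4)", "Barrels", "Anchor"]
response = ["Cannons (6)", "Anchor", "Clock"]
# Shared: cannon, anchor. Union: cannon, barrel, anchor, clock.
print(jaccard(truth, response))  # 0.5
```

The example shows both failure modes at once: the missing “Barrels” (an omission) and the extra “Clock” (a non-existent item) each enlarge the union and lower the score.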

The scores for the models on M001, M002, M003, M004, and M008:

| Processed File | Gemma3 | LLaVA | Pixtral | Gemini | GPT |
|---|---|---|---|---|---|
| M001.jpg | 2 / 14 (0.14) | 5 / 15 (0.33) | 6 / 16 (0.38) | 10 / 14 (0.71) | 12 / 14 (0.86) |
| M002.jpg | 0 / 7 (0) | 0 / 11 (0) | 1 / 10 (0.10) | 4 / 16 (0.25) | 4 / 13 (0.31) |
| M003.jpg | 0 / 4 (0) | 0 / 4 (0) | 1 / 11 (0.09) | 3 / 11 (0.27) | 2 / 11 (0.18) |
| M004.jpg | 0 / 20 (0) | 0 / 20 (0) | 2 / 20 (0.10) | 20 / 20 (1.00) | 20 / 20 (1.00) |
| M008.jpg | 2 / 7 (0.29) | 2 / 5 (0.4) | 3 / 7 (0.43) | 5 / 9 (0.56) | 4 / 11 (0.36) |

The open-source models’ errors were often ones of omission: they did not report elements of an image. In contrast, the API model scores were damaged by reporting non-existent elements or reporting elements that did not fit within the definition of an item. This is a case where further prompt engineering might significantly improve the quality of the answers.

Responses

Processed File Ground Truth Gemma3 LLaVA Pixtral Gemini GPT
M001.jpg Hammock
Barrels
Crates
Sword
Table
Bottles
Anchor
Rope coil
Cannon (4)
Ladder
Grate
Mast bases
Wooden Deck
Anchor
Two Cannons
Cannon Balls
Wooden barrels
Sword
Anchor
Book
Rope
Clock
Chest
Barrels
Barrels
Anchor
Ladder
Grating
Ropes
Cannon
Oars
Chests
Wooden Planks
Wheel
Cannons
Wooden barrels
Wooden crates
Lifeboats (or folded sails)
Ladder
Grate
Mast bases (circular objects)
Anchor
Rope coil
Bottles
Cutlass/Sword
Planks/Decking (as part of the ship structure)
Cannons (6)
Barrels (several)
Wooden crates (several)
Hammocks (4)
Anchor
Rope coil
Table with a sword and bottles
Ladders (2)
Grate/hatch
Round hatches/openings (3)
Wooden planks (flooring)
Small chest/box
Small round table or stool
M002.jpg Tent
Bed
Table
Crates
A large cave
A river
Stone buildings
Waterfall
River
Water source
Cave
Mountain
Buildings
Trees
Several buildings, some with windows and doors
Multiple bridges crossing over waterways
A winding river or stream
Various patches of green vegetation
Rocky terrain with large boulders and outcrops
A few small structures resembling huts or shelters
Cave entrance
River/Stream
Waterfalls
Stone bridges (two)
Houses/Buildings
Stairs
Trees/Bushes
Grassy areas
Rocks/Boulders
Tents/Tarpaulins
Pond/Small body of water
Crates and Barrels
Beds
Tables and Chairs
Fireplaces
Pathways/Dirt tracks
Mountain/cliff face with a large cave entrance
Flowing river/stream starting from the cave and branching out
Stone bridges crossing the river/stream
Rocky terrain with patches of grass and greenery
Multiple tents set up near the river
Various stone buildings/rooms carved into the mountain/cliff sides
Furnishings inside the buildings such as beds, tables, barrels, crates, and sacks
Stairs leading up to the stone buildings/rooms
Ruined or broken bridge
Small pond or pool
Freestanding boulders and rocks scattered throughout
Trees and bushes dotting the landscape
Pathways and trails winding between features
M003.jpg Shrubs
Tree stumps
Pile of bones
A massive, ancient tree with a wide, gnarled root system A large tree with a hole in the center Large tree with extensive roots
Small plants
Bushes
Rocks
Fallen leaves
Branches
Moss
Flowers
Grass patches
Giant tree roots
Campfire
Pile of bones
Trees
Bushes
Tree stumps
Fallen logs
Mushrooms
Red berries or fruit
Grassy ground
Dirt/muddy patch
Giant tree stump with extensive roots
Central hollow area within the stump, partially filled with water or mud
Broken wooden cart or debris in the central hollow
Various bushes and shrubs scattered around the roots
Tree stumps from felled trees
Fallen logs and branches
Rocks and boulders
Diverse trees with autumn-colored foliage surrounding the area
Small patches of flowers or undergrowth
Leaf litter and scattered debris
M004.jpg Special A large, damp, stone room, reminiscent of a cavern A dungeon with a staircase Multiple rooms with various symbols and colors
Red symbol in a room at the top left
Yellow symbol in a room at the top center
Blue symbol in a room at the top right
Central corridor with two prominent rooms
Left room with a yellow symbol
Right room with a red symbol
Mechanical elements
Tentacle-like structures with green gems
Chains and
Stone walls and floors
Wooden bridges
Chains
Large, tentacle-like appendages
Barrels
Crates
Tables
Chairs
Rugs/Carpets
Beds or cots
Stairs
Torches or lanterns providing light
Books, scrolls, or papers
A pentagram or similar magical symbol on the floor
An abyss or chasm below the structures
Wooden planking or scaffolding
Doors or archways
Chests or coffers
Tapestries or cloth hangings
Pillars or columns
Balconies or ledges
Large, spiked, metallic objects
A circular stone platform or area
Wooden stairs and ladders
Stone staircases
Various tables and chairs
Rugs and carpets
Chests and crates
Beds
Bookcases
Barrels
Wall torches
Stone archways
Mounted shields and wall decorations
Desk with books and papers
Magical circles or glowing symbols
Large chains
Stone bridge over a chasm
Giant, glowing crystals embedded in fixtures
Massive tentacles emerging from the darkness below
Closed chests and cabinets
Armories or weapon racks
Sectioned and barred rooms/cells or vaults
M008.jpg Bar table
Chairs
Barrels
Bottle rack
Tables (round)
Octagonal Table
Wooden Chairs
Stone Flooring
Lantern
Tables
Chairs
Tables
Barrels
Mugs
Bench
Wooden Bar
Central hexagonal bar/kiosk structure
Wooden flooring
Cobblestone paving
Awnings
Barrels
Spiral staircase
Shelves/racks
Round wooden tables
Chairs/stools
Large central octagonal structure (building/tavern layout)
Hexagonal bar counter in the center
Barrels and mugs on the bar counter
Bar stools in red, blue, yellow, and green arranged around the bar counter
Booths or low couches/benches around the interior edge of the building
Semi-circular stairwell behind the bar
Four large market stalls/tents surrounding the octagonal building
Four round wooden tables with four chairs each located outside the building
Wooden plank floor inside the building
Cobblestone pavement outside the building and under the stalls/tables

Prompt 3: Grid Dimensions

In units of the listed map scale or map grids, if they exist, what are the dimensions of this map?

This task involves identifying a map feature (a grid) and then counting squares. If a model performs well at this task, a marketplace might be able to offer geographical information system-like filters.

| Processed File | Ground Truth | Gemma3 | LLaVA | Pixtral | Gemini | GPT |
|---|---|---|---|---|---|---|
| M001.jpg | 4x11 | 20x20 | non answer | 10x20 | 5x10 | 6x16 |
| M002.jpg | 40x30 | 800 squares | non answer | 40x30 | 30x20 | 40x22 |
| M003.jpg | 34x34 | non answer | 10x10 | 10x10 | 20x20 | 28x28 |
| M004.jpg | no grid | non answer | 1000x1000 | 10x15 | no grid | no grid |
| M005.jpg | 60x60 | non answer | 1000x1000 | 40x40 | 30x30 | 28x20 |
| M006.jpg | 34x44 | 60x60 | non answer | 40x50 | 32x30 | 34x43 |
| M007.jpg | 34x44 | non answer | non answer | 24x32 | 20x25 | 30x25 |
| M008.jpg | no grid | non answer | non answer | 10x10 | no grid | no grid |

All of the models failed this task. The two most advanced models, Gemini and GPT, successfully gave a “cannot be determined” answer when the maps lacked a grid or any scale marker. (Contrast this with “non answer”, which denotes a response made of filler phrases.) However, they were otherwise highly unreliable with their grid counts, even though some counts were numerically close.

This task will be difficult for LLMs. First, since LLMs work on a tiling of the source image, the image will usually be distorted in order to be standardized. Second, this task involves counting, for which neural networks are not well suited. If this were an important element of product discovery, we expect that a specialized model using a convolutional neural network (CNN) to extract the grids, or perhaps a more direct technique (e.g. a Hough transform), would be efficient and reliable.
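To give a sense of what a “more direct technique” could look like, the sketch below estimates grid dimensions without any neural network. It does not use a Hough transform. Instead, it uses a simpler swapped-in idea: sum pixel darkness along each column and row (a projection profile), then find the repeating spacing with autocorrelation. It assumes an axis-aligned grid of dark lines on a lighter background, supplied as rows of 0 to 255 grayscale values. All function names are hypothetical, and real maps would need preprocessing that is not shown.

```python
def line_profile(image, axis):
    """Total darkness of each column (axis=0) or each row (axis=1)."""
    if axis == 0:
        return [sum(255 - row[x] for row in image) for x in range(len(image[0]))]
    return [sum(255 - value for value in row) for row in image]


def estimate_period(profile, min_period=4, tolerance=0.9):
    """Smallest lag whose autocorrelation is close to the best one found."""
    mean = sum(profile) / len(profile)
    centered = [value - mean for value in profile]
    scores = {}
    for lag in range(min_period, len(centered) // 2 + 1):
        overlap = len(centered) - lag
        scores[lag] = sum(centered[i] * centered[i + lag]
                          for i in range(overlap)) / overlap
    best = max(scores.values(), default=0.0)
    if best <= 0:
        raise ValueError("no periodic grid lines found")
    # Multiples of the true spacing score equally well, so take the smallest.
    return min(lag for lag, score in scores.items() if score >= tolerance * best)


def grid_size(image):
    """Return (columns, rows) of grid cells for a grayscale image."""
    columns = round(len(image[0]) / estimate_period(line_profile(image, 0)))
    rows = round(len(image) / estimate_period(line_profile(image, 1)))
    return columns, rows


def synthetic_map(columns, rows, cell=10):
    """Test image: white cells separated by one-pixel black grid lines."""
    return [[0 if x % cell == 0 or y % cell == 0 else 255
             for x in range(columns * cell)]
            for y in range(rows * cell)]


print(grid_size(synthetic_map(4, 3)))  # (4, 3)
```

On a blank image the function raises `ValueError`, which corresponds to the “no grid” answer in the table above.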

Prompt 4: Linear or Branching Movement

Is movement within this map mostly linear or mostly branching?

This task involves a higher-order understanding of the graph or connected nature of rooms, along with an understanding of how characters move along the network. Practically, this distinction is an important element of adventure design.

| Processed File | Ground Truth | Gemma3 | LLaVA | Pixtral | Gemini | GPT |
|---|---|---|---|---|---|---|
| M001.jpg | Linear | Mostly linear | Mostly branching | Mostly linear | Mostly linear | Mostly linear |
| M002.jpg | Branching | Non answer | Mostly linear | Mostly branching | Mostly branching | Mostly branching |
| M003.jpg | Linear | Non answer | Mostly linear | Mostly branching | Mostly branching | Mostly branching |
| M004.jpg | Branching | Mix | Mostly linear | Mostly linear | Mostly branching | Mostly branching |
| M005.jpg | Branching | Mostly branching | Mostly linear | Mostly branching | Mostly branching | Mostly linear |
| M006.jpg | Branching | Non answer | Mostly linear | Mostly branching | Mostly branching | Mostly branching |
| M007.jpg | Linear (Complex) | Non answer | Mostly linear | Mostly branching | Mostly branching | Mostly branching |
| M008.jpg | Linear | Primarily linear | Mostly linear | Mostly linear | Mostly branching | Mostly branching |

If we treat this task as a boolean classifier, Pixtral and Gemini scored best with 5 correct out of 8. GPT scored next with 4 correct out of 8, and Gemma3 and LLaVA each scored 3 out of 8. (We graded Gemma3’s non answers as incorrect answers, but they could be graded more severely.) Overall, the results suggest performance no better than a coin flip. However, this is a subjective task, and we expect that if more humans were involved in making a labeled dataset, inter-rater reliability would be low.
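The tallies can be reproduced from the table. In this sketch, each response is reduced to `"linear"`, `"branching"`, or `None`; non answers and Gemma3’s “Mix” are encoded as `None` and therefore count as incorrect, and the “Linear (Complex)” ground truth for M007 is treated as linear. These encodings are assumptions about how the grading was applied.

```python
# Ground truth for M001..M008, in order.
GROUND_TRUTH = ["linear", "branching", "linear", "branching",
                "branching", "branching", "linear", "linear"]

L, B = "linear", "branching"
RESPONSES = {
    "Gemma3":  [L, None, None, None, B, None, None, L],
    "LLaVA":   [B, L, L, L, L, L, L, L],
    "Pixtral": [L, B, B, L, B, B, B, L],
    "Gemini":  [L, B, B, B, B, B, B, B],
    "GPT":     [L, B, B, B, L, B, B, B],
}


def correct_counts(truth, responses):
    """Number of maps on which each model's label matches the ground truth."""
    return {model: sum(answer == label for answer, label in zip(answers, truth))
            for model, answers in responses.items()}


print(correct_counts(GROUND_TRUTH, RESPONSES))
# {'Gemma3': 3, 'LLaVA': 3, 'Pixtral': 5, 'Gemini': 5, 'GPT': 4}
```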

The reasoning text provided by Gemini and GPT suggests the models are recognizing the graph-like nature of the maps and how movement flows through passages and doors. With further experimentation and a more developed dataset, higher performance may be unlocked.

Prompt 5: Scenarios

List types of adventures or scenarios this map is suitable for. Format as a bullet list.

This task is an open-ended generation of adventure ideas or classifications. This is similar to image caption generation. Alternatively, a marketplace owner could turn this into a classification problem by restricting the list of scenarios to a defined set. Publishers, of course, could use this technology to augment their own product descriptions with lists of creative ways the maps could be used in games.

Because this is an open-ended generation task, we use a metric modeled after signal-to-noise. (A method like CIDEr would be more rigorous, but more expensive to implement.) Each adventure or scenario line is assigned to either the “signal” or “noise” category. The score is the ratio of the count of “signal” responses to the count of “noise” responses plus one (to avoid division by zero). Mathematically:

$$ h = \frac{|s|}{|n| + 1} $$
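The formula can be sketched in a few lines. The function names and the label strings "signal" and "noise" are illustrative assumptions; the grading itself was done by hand.

```python
def signal_to_noise(signal_count: int, noise_count: int) -> float:
    """Compute h = |s| / (|n| + 1) from the two category counts."""
    if signal_count < 0 or noise_count < 0:
        raise ValueError("counts must be non-negative")
    return signal_count / (noise_count + 1)


def score_labels(labels) -> float:
    """Score one response from per-line labels, each either 'signal' or 'noise'."""
    labels = list(labels)
    signal = sum(1 for label in labels if label == "signal")
    noise = sum(1 for label in labels if label == "noise")
    return signal_to_noise(signal, noise)


# One relevant line among thirteen irrelevant ones scores 1/14.
print(round(signal_to_noise(1, 13), 2))
# Four relevant lines and no noise score 4/1.
print(signal_to_noise(4, 0))
```

Because the denominator only grows with noise, a model that returns a long list of on-topic lines scores higher than one that returns a single on-topic line, so the metric rewards volume as well as relevance.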

Processed File Gemma3 LLaVA Pixtral Gemini GPT
M001.jpg 1/1 (1) 2/1 (2) 4/1 (4) 11/1 (11) 15/1 (15)
M002.jpg 1/1 (1) 1/14 (0.07) 2/3 (0.7) 12/2 (6) 12/1 (12)
M003.jpg 1/1 (1) 6/5 (1.2) 3/2 (1.5) 8/1 (8) 15/1 (15)
M004.jpg 1/1 (1) 10/1 (10) 5/1 (5) 8/1 (8) 15/1 (15)
M005.jpg 1/1 (1) 1/14 (0.07) 4/1 (4) 11/1 (11) 13/1 (13)
M006.jpg 1/1 (1) 1/14 (0.07) 3/1 (3) 12/1 (12) 20/1 (20)
M007.jpg 0/1 (0) 6/1 (6) 4/1 (4) 11/1 (11) 20/1 (20)
M008.jpg 1/1 (1) 4/9 (0.4) 0/12 (0) 10/1 (10) 12/1 (12)

Gemini and GPT both excel at this task. Gemini’s responses tend to be more “classification”-oriented, while GPT’s tend to read as story prompts, although neither is fully consistent. Gemma3 scores low because the model outputs few responses. Pixtral’s signal is more consistently high than LLaVA’s.
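A per-model summary makes the comparison easier to see. The sketch below copies the numerator and denominator of each cell in the score table (M001 through M008, in order) and reports the median score per model. The choice of the median as the summary statistic is an assumption made here, picked because a few very high or very low rows can dominate a mean.

```python
from statistics import median

# (numerator, denominator) pairs copied from the score table, rows M001..M008.
TABLE = {
    "Gemma3": [(1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (0, 1), (1, 1)],
    "LLaVA": [(2, 1), (1, 14), (6, 5), (10, 1), (1, 14), (1, 14), (6, 1), (4, 9)],
    "Pixtral": [(4, 1), (2, 3), (3, 2), (5, 1), (4, 1), (3, 1), (4, 1), (0, 12)],
    "Gemini": [(11, 1), (12, 2), (8, 1), (8, 1), (11, 1), (12, 1), (11, 1), (10, 1)],
    "GPT": [(15, 1), (12, 1), (15, 1), (15, 1), (13, 1), (20, 1), (20, 1), (12, 1)],
}


def summarize(table):
    """Return the median score per model, where each score is numerator / denominator."""
    return {
        model: median(num / den for num, den in cells)
        for model, cells in table.items()
    }


medians = summarize(TABLE)
for model, value in sorted(medians.items(), key=lambda item: item[1]):
    print(f"{model}: median h = {value:.2f}")
```

On these figures the median is 3.5 for Pixtral against roughly 0.8 for LLaVA, even though LLaVA has the single highest score of the two (10 on M004).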

Processed File Portrayed Area Gemma3 LLaVA Pixtral Gemini GPT
M001.jpg (Portrayed Area: Ship)
Gemma3:
  Pirate Ship Heist
LLaVA:
  Pirate Adventure
  Treasure hunt
Pixtral:
  Pirate Adventures
  Historical Seafaring Missions
  Underwater Exploration
  Mutiny and Intrigue
Gemini:
  Pirate Attack/Boarding Action
  Naval Combat Encounter
  Smuggling Operation
  Mutiny
  Escaping a Sinking Ship
  Treasure Hunt/Retrieval
  Assassination/Kidnapping Target on Board
  Investigation/Mystery
  Defending Against a Sea Monster Attack
  Transport Mission
  Stealth/Infiltration Mission
GPT:
  Pirate ship battles at sea
  Naval skirmishes between rival crews
  Boarding and raiding encounters
  Ship-to-ship chases and cannon duels
  Exploring a derelict or haunted vessel
  Smuggling or contraband hiding missions
  Mutiny or shipboard betrayal scenarios
  Prisoner escape from the brig
  Rescue mission for captured allies
  Treasure transport and theft
  Monster or sea creature attacks on the ship
  Diplomatic or parley meetings on neutral waters
  Undercover infiltration of an enemy ship
  Investigating sabotage or sabotage attempts
  Storm survival or shipwreck scenarios

M002.jpg (Portrayed Area: Rooms on cliff face)
Gemma3:
  Bandit Ambush
LLaVA:
  Dungeon crawl
  Fantasy adventure
  Exploration game
  Role-playing game
  Stealth game
  Survival game
  Puzzle-solving game
  Strategy game
  Horror game
  Science fiction game
  Post-apocalyptic game
  Historical reenactment game
  Cultural exploration game
  Educational game
Pixtral:
  Fantasy Exploration
  River Trade
  Dungeon Crawl
  Survival
Gemini:
  Bandit Hideout/Raider Camp
  Cultist Stronghold
  Lost Tribe/Hidden Settlement
  Monster Lair
  Siege/Defense Scenario
  Ambush Site
  Exploration of Ancient Ruins
  Rescue Mission
  Source of a Problem
  Negotiation/Diplomacy Hub
  Secret Passage/Escape Route
  Guardian of a Pass
  Resource Scramble
GPT:
  Exploring a bandit or outlaw hideout nestled in the mountains
  Assaulting a fortified goblin or orc camp in search of stolen goods or hostages
  Investigating ancient ruins or a dwarven outpost hidden within a cliff face
  Defending a mountain pass against an invading force or monster attack
  Navigating a river and its cave system to reach a lost treasure or artifact
  Mounting a rescue mission for prisoners held in one of the outbuildings or caves
  Tracing the source of a corrupted river affecting nearby villages
  Delving into a cave system rumored to house a dangerous creature or valuable resource
  Delivering supplies or messages between isolated outposts during a time of war
  Negotiating peace or trade with a faction that controls the mountain stronghold
  Escorting a group of refugees or travelers through treacherous mountainous terrain
  Uncovering a secret cult or forbidden ritual taking place within the caves

M003.jpg (Portrayed Area: Room inside huge tree)
Gemma3:
  Monster Lair
LLaVA:
  Fantasy adventure
  Exploration of a magical forest
  Quest for hidden treasure
  Encounter with mythical creatures
  Survival in a dangerous environment
  Escape from a cursed tree
  Battle between good and evil forces
  Puzzle-solving in a mysterious setting
  Cooperation between characters to overcome challenges
  Role-playing game
Pixtral:
  Fantasy Exploration
  Ambush Scenario
  Treasure Hunt
  Nature-based Quests
Gemini:
  Monster Lair
  Ritual Site
  Sacred Grove/Ancient Tree Guardian
  Ambush Point
  Hidden Entrance
  Clue to a Mystery
  Survival Challenge/Rest Stop Gone Wrong
  Negotiation with Fey or Nature Spirits
GPT:
  Ancient tree ritual site: Place for druidic ceremonies or summoning rituals.
  Feywild crossing: Portal or gateway to the Feywild or another mystical plane.
  Forest guardian’s lair: Home of a treant, dryad, or nature spirit the party must parley with or defeat.
  Bandit or cultist camp: Group using the tree’s shelter for nefarious purposes.
  Quest objective: Recover a magical item hidden among the roots or in the tree’s hollow.
  Ambush encounter: Enemies attack as the party navigates the tangled roots.
  Puzzle or trap location: Roots conceal clues or dangerous mechanisms.
  Gathering place: Neutral meeting ground for woodland creatures, druids, or tribes.
  Corrupted grove: The tree is dying or cursed, and restoring it is the mission.
  Nature’s trial: Test for rangers, druids, or nature-themed characters to prove themselves.
  Fey or nature festival: The site hosts celebrations or strange happenings.
  Hidden entrance: Secret trapdoor or passage beneath the roots leading to an underground dungeon.
  Monster lair: Home of giant spiders, ankhegs, blights, or similar forest-dwelling threats.
  Vision or dream sequence: The tree appears in a prophetic dream or mystical vision.
  Healing or wish-granting site: The tree bestows boons, heals, or answers deep questions.

M004.jpg (Portrayed Area: Fortress and bridge)
Gemma3:
  Dungeon Crawl
LLaVA:
  Dungeon crawl
  Exploration of a haunted castle
  Stealthy infiltration of a fortress
  Escaping a prison or dungeon
  A quest for hidden treasure
  A battle between rival factions
  A race against time to save a kingdom
  A mission to rescue a captured princess
  A journey through a magical portal
  A test of courage
Pixtral:
  Dungeon Crawl
  Escape Room
  Stealth Mission
  Investigation
  Trap Disarmament
Gemini:
  Infiltrating an Evil Cult’s Lair
  Rescuing a Prisoner from a Fortified Dungeon
  Assaulting a Cliffside Fortress
  Stopping a Ritual to Summon/Control a Monster
  Exploring an Abandoned, Monster-Infested Stronghold
  Sabotaging an Enemy War Machine or Operation
  Escaping from a Perilous Prison
  Investigating the Source of a Strange Magical Phenomenon
GPT:
  Infiltrating a cultist stronghold hidden deep underground
  Rescuing prisoners from a dungeon surrounded by monstrous tentacles
  Investigating an ancient, forgotten fortress sitting atop a slumbering creature
  Defending the fortress from a monstrous attack breaking through the floor
  Escaping from a collapsing stronghold as tentacles breach the walls
  Searching for magical artifacts hidden within a forbidden underground sanctum
  Negotiating or allying with a cult that worships an eldritch being
  Conducting a heist on a fortress protected by unnatural guardians
  A ritual gone wrong unleashing tentacled horrors from beneath
  Exploring a fortress built on the remains of a titanic, ancient creature
  A final boss encounter with an eldritch, tentacled entity
  Reclaiming a corrupted outpost for a noble house or order
  Uncovering the secrets behind the fortress’s construction and its hidden dangers
  Escorting an NPC through a treacherous, monster-infested stronghold
  Sealing a rift to another dimension opening beneath the fortress

M005.jpg (Portrayed Area: Ruins with stream)
Gemma3:
  Dungeon Crawl
LLaVA:
  Dungeon crawl
  Fantasy adventure
  Horror-themed game
  Steampunk setting
  Science fiction adventure
  Mystery solving
  Puzzle-solving game
  Exploration game
  Survival game
  Role-playing game
  Board game
  Strategy game
  Tactical game
  Turn-based game
Pixtral:
  Dungeon Crawl
  Escape Room
  Underground Exploration
  Investigative Missions
Gemini:
  Ancient Ruin Exploration
  Cultist Lair Assault
  Underdark/Subterranean Journey
  Monster Hunt
  Lost Civilization Discovery
  Necromancer’s Stronghold
  Shadowfell Excursion
  Retrieval Mission
  Guardian of a Forbidden Place
  Escape from a Collapsing Dungeon
  Ritual Disruption
GPT:
  Exploration of Ancient Ruins
  Dungeon Delve
  Underdark Adventure
  Magical River Crossing
  Rescue Mission
  Cultist Lair
  Haunted Expedition
  Artifact Retrieval
  Fugitive Chase
  Environmental Challenge
  Secret Society Base
  Waterborne Encounter
  Interdimensional Portal

M006.jpg (Portrayed Area: Dungeon complex)
Gemma3:
  Escape Room
LLaVA:
  Dungeon crawl
  Fantasy adventure
  Science fiction adventure
  Steampunk adventure
  Horror adventure
  Mystery adventure
  Puzzle-solving adventure
  Exploration game
  Role-playing game
  Board game
  Strategy game
  Tactical game
  City-building game
  Survival game
Pixtral:
  Dungeon Crawl
  Escape Room
  Mystery Investigation
Gemini:
  Classic Dungeon Crawl
  Lost Temple or Sanctuary
  Ancient Vault or Treasury
  Wizard’s Labyrinth or Magical Research Facility
  Cultist Stronghold/Lair
  Prison Complex
  Exploration of an Abandoned Fortress/City Core
  Infiltration and Espionage Mission
  Monster Lair Complex
  Quest to Activate/Deactivate an Ancient Device
  Gauntlet/Trial Site
  Underground Bunker/Shelter
GPT:
  Classic dungeon crawl with traps, monsters, and treasure
  Exploring an ancient, long-forgotten underground city
  Infiltrating the lair of a villain or cult
  Recovering a lost artifact from a crypt or tomb
  Navigating a labyrinth filled with puzzles and challenges
  Mapping out an old, abandoned wizard’s stronghold
  Escaping from imprisonment in an underground complex
  Investigating rumors of hauntings or strange occurrences
  Conducting a rescue mission for captured allies or townsfolk
  Searching for the hidden vault of a powerful sorcerer
  Battling rival adventurers or treasure hunters
  Encountering and negotiating with a secret underground society
  Uncovering the mystery behind a sealed magical portal
  Fighting through the defenses of an undead king’s throne room
  Delving into the ruins to learn secrets of an ancient civilization
  Clearing out a den of monsters threatening the surface
  Preventing the summoning of a dangerous extraplanar entity
  Discovering and dealing with a living dungeon (sentient architecture)
  Retrieving lost knowledge from a hidden library or archive
  Surviving a supernatural cataclysm that has engulfed the dungeon

M007.jpg (Portrayed Area: Cave network and forest)
Gemma3:
  D&D 5e
LLaVA:
  Exploring a mysterious underwater cave system
  Navigating through a maze of underwater tunnels and chambers
  Searching for hidden treasures or artifacts within the cave system
  Encountering dangerous sea creatures and overcoming challenges
  Solving puzzles and riddles to progress through the cave system
  Escaping from a submerged temple or ruins
Pixtral:
  Exploration Quests
  Navigation Challenges
  Pirate Adventures
  Maritime Commerce
Gemini:
  Dungeon Crawl
  Coastal Cave Exploration
  Smuggler’s Den/Pirate Hideout
  Monster Lair
  Search and Rescue
  Infiltration Mission
  Treasure Hunt
  Escape from Captivity
  Geological Survey/Exploration
  Ritual Site
  Amphibious Assault
GPT:
  Hidden pirate or smuggler cave base
  Lost treasure hunt on a remote island
  Monster lair or nest exploration
  Shipwreck survivors seeking shelter
  Secret cult or ritual site
  Rescue mission for captured NPCs
  Forgotten tomb or ancient ruins
  Lair of aquatic or amphibious creatures
  Bandit stronghold or hideout
  Gathering rare herbs or resources for a quest
  Underwater or coastal infiltration scenario
  Battle between rival pirate factions
  Investigation of mysterious disappearances
  Protecting island inhabitants from an invading force
  Discovery of a natural wonder or magical anomaly
  Navigating a maze of tunnels to escape pursuers
  First-contact adventure with isolated island dwellers
  A haunted or cursed cave system
  Race to claim a newly discovered cave’s riches
  Expedition to study unique island flora or fauna

M008.jpg (Portrayed Area: Circular bar)
Gemma3:
  Diplomacy/Negotiation
LLaVA:
  Pirate adventure
  Treasure hunt
  Medieval fantasy
  Dungeon crawl
  Board game
  Role-playing game
  Strategy game
  Party game
  Themed event
  Restaurant or bar setting
  Historical reenactment
  Cultural celebration
Pixtral:
  Role-Playing Campaigns
  Dungeons & Dragons adventures
  Pathfinder scenarios
  Fantasy and medieval role-playing games
  Board Games
  Strategy games with tactical combat
  Miniature wargames
  Fantasy-themed board games
  Tabletop Simulations
  Virtual tabletop role-playing sessions
  Online gaming scenarios
  Digital strategy games
Gemini:
  Tavern Brawl
  Information Gathering
  Ambush/Assassination Attempt
  Stealth/Heist
  Secret Meeting
  Festival Stall
  Defending the Establishment
  Investigation Hub
  Recruitment Drive
  Celebration/Gathering Spot
GPT:
  Meeting an important contact or informant in a lively tavern or public house
  Gathering information or rumors from locals and travelers
  Holding a negotiation, peace talk, or clandestine meeting
  Participating in or breaking up a heated tavern brawl
  Investigating suspicious activity, theft, or a murder at a busy establishment
  Entertaining a crowd or taking part in a storytelling or bardic competition
  Setting up an ambush for a villainous NPC
  Surviving an unexpected attack or raid unfolding inside the tavern
  Hosting a festival, market day, or other community celebration
  Recruiting adventurers or mercenaries for a quest
  Solving a puzzle or riddle as part of a secretive society’s initiation
  Escaping from guards or pursuing enemies by blending into the crowd

Conclusion

Multi-modal large language models exhibit a range of quality. Product discovery applications will likely require models with at least late-2024 or 2025 state-of-the-art performance. For VTT-style maps, models can yield lists of rooms and items found within a map, although individual entries may be irrelevant to the purpose and the lists often contain hallucinated or misinterpreted items. For GM-style maps, models extract labels with high recall and precision, but other details may be imprecise. Our tests showed that all models were poor at understanding scale or grid systems, although some could detect the existence of a grid. Similarly, reasoning about movement through the map may be beyond current LLMs, although we suspect this area needs more rigorous study. LLMs scored highly on generating scenario or story ideas tied to a map, suggesting this might be a straightforward product discovery mechanism.

Notes

The dataset was not incorporated as training material into any model.