Using LLMs to Explain Historical Code: FLOW-MATIC Investigation
The new generation of code-assistance tools powered by Large Language Models (LLMs) may be useful for efficiently categorizing and translating historical software corpora. In this qualitative study, we examine LLM performance in explaining and translating FLOW-MATIC programs. We also test their reliability by injecting OCR-like errors into the text. Released in 1958, FLOW-MATIC targeted business applications and used a heavily English-inspired syntax that later influenced COBOL. We find that the models perform inconsistently when describing the key aspects of a program but fare better when translating business logic into SQL. LLMs appear robust against OCR-like errors in the source code, but we caution against relying on the current technology.
Motivation
Historical source code may contain insights into the development of algorithms and programming practices, and may shed light on the computational needs and pursuits of organizations. However, this same source code is often poorly cataloged, unavailable in digital form (e.g. it exists only on paper tape or punch cards), and usually written in archaic programming languages, all of which frustrates efforts to understand and interpret it. To “triage” piles of artifacts and obtain at least a limited understanding of their contents, researchers might use an LLM to help explain and translate the code.
Why FLOW-MATIC?
Developed by Dr. Grace Hopper and her team at Remington Rand from around 1955 to 1958, FLOW-MATIC was an early language targeted at business applications. Internally, the language was called B-0, and it experimented with an English-like syntax. Using approximately 33 verbs together with an external data description language, programmers could implement sequential data-processing logic (Sammet 1969).
We chose FLOW-MATIC for this study because:
- The language is sufficiently obscure that there is unlikely to be any special-case logic for the language, and
- The manual (Rand 1958) contains three highly documented programs that perform typical business logic, and
- The English syntax may be generalizable to text mining of programming or process specifications.
Although FLOW-MATIC is an obscure and dead programming language (even Rosetta Code doesn’t include any examples), these programs and associated commentary may have been included in the AI’s background corpus. The manual is available online as a PDF and (Sammet 1969) reprints Problem 1. We have also seen snippets of the code incorporated into various online articles on the language and Grace Hopper’s history. However, we think it is safe to assume no engineering effort has been made to improve FLOW-MATIC related answers.
Methodology
We extracted three FLOW-MATIC programs from a PDF of (Rand 1958) and manually corrected transcription errors to the best of our ability. Then, using the AI Assistant tool in IntelliJ, we sent select prompts to different AI models, attaching the FLOW-MATIC programs as part of the request.
For the corruption studies, we ran the original source code through our corrupt.py script. This tool randomly selects a corruption from a list and applies it to the text. If a corruption does not apply (i.e. it produces no edits), a different corruption is selected, and the process repeats until the number of edits matches the desired count. (A minimal sketch of this procedure appears in the Robustness section below.)
Source code for the programs, the corrupted sources, the AI model output, and our corruption script are available in the GitLab repository.
Testing LLMs is tricky: they are stochastic and highly sensitive to variations in prompts. As an unapologetically qualitative study, our task specification is vague and we lack clear, rigorous performance metrics.
Explaining and Translating Historical Code
For this task, we asked the LLM to produce an English summary of a program. We are less interested in a breakdown of individual lines than in succinct statements of the key aspects of the business logic, the data model, and the pre- and post-conditions of the input and output.
Problem 1 - Join Inventory and Prices
The first program described in (Rand 1958) joins an ordered sequence of inventory records with an ordered sequence of price records, yielding a file containing priced inventory and a file containing unpriced inventory. The program requires 18 steps and would have been accompanied by a series of packets describing the input and output data formats.
(0) INPUT INVENTORY FILE-A PRICE FILE-B ; OUTPUT PRICED-INV FILE-C UNPRICED-INV FILE-D ; HSP D .
(1) COMPARE PRODUCT-NO (A) WITH PRODUCT-NO (B) ; IF GREATER GO TO OPERATION 10 ; IF EQUAL GO TO OPERATION 5 ; OTHERWISE GO TO OPERATION 2 .
(2) TRANSFER A TO D .
(3) WRITE-ITEM D .
(4) JUMP TO OPERATION 8 .
(5) TRANSFER A TO C .
(6) MOVE UNIT-PRICE (B) TO UNIT-PRICE (C) .
(7) WRITE-ITEM C .
(8) READ-ITEM A ; IF END OF DATA GO TO OPERATION 14 .
(9) JUMP TO OPERATION 1 .
(10) READ-ITEM B ; IF END OF DATA GO TO OPERATION 12 .
(11) JUMP TO OPERATION 1 .
(12) SET OPERATION 9 TO GO TO OPERATION 2 .
(13) JUMP TO OPERATION 2 .
(14) TEST PRODUCT-NO (B) AGAINST ZZZZZZZZZZZZ ; IF EQUAL GO TO OPERATION 16 ; OTHERWISE GO TO OPERATION 15 .
(15) REWIND B .
(16) CLOSE-OUT FILES C , D .
(17) STOP . (END)
Notable aspects of this program:
- There are two input files, inventory and prices. The output consists of a file with joined inventory data and pricing data and a secondary output of inventory lacking a price.
- Product Numbers (PRODUCT-NO) and Unit Price (UNIT-PRICE) are the only two fields referenced within the code.
- Input files are assumed to be sorted. If this assumption is violated, the unpriced output will be full of false positives.
- Each product in this version will have exactly one inventory record. This constraint is relaxed in later programs.
- The program uses self-modifying code: step 9 is overwritten in step 12 (a rough Python sketch of the overall merge logic follows below).
HSP D refers to the high-speed printer present on the UNIVAC. The instruction “redirects” output to the printer rather than to a tape.
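As a reading aid, the following is a minimal Python sketch of the merge logic as we understand it. The record layout (dicts keyed by product_no and unit_price) and the omission of the printer output, the rewind, and the ZZZZZZZZZZZZ sentinel test are our own simplifying assumptions, not part of the original program.

```python
# A rough sketch of Problem 1's merge logic (hypothetical record layout).
# Both input sequences are assumed to be sorted by product number.

def merge_inventory_with_prices(inventory, prices):
    priced, unpriced = [], []               # FILE-C and FILE-D
    price_iter = iter(prices)
    price = next(price_iter, None)          # READ-ITEM B
    for item in inventory:                  # READ-ITEM A
        # Advance the price file while its key is behind the inventory key.
        while price is not None and price["product_no"] < item["product_no"]:
            price = next(price_iter, None)
        if price is not None and price["product_no"] == item["product_no"]:
            # Matched: copy the unit price onto the inventory record.
            priced.append({**item, "unit_price": price["unit_price"]})
        else:
            # No match (or price file exhausted): unpriced output.
            unpriced.append(dict(item))
    return priced, unpriced
```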
LLM Summaries
Using the prompt “Explain the code at the current caret”, all three LLM models produced a detailed explanation covering each line of code. All three models referenced ‘ZZZZZZZZZZZZ’ as a special marker. The final summaries from each model (below) all note that entries are matched between an inventory file and a pricing file and that prices may be missing.
- openai-gpt-4o
- “This sequence executes a comparison loop to match items between an inventory and a pricing file, writing matched items with prices to one output file and unmatched items to another. It handles end-of-file conditions elegantly, ensuring all items are processed properly.”
- google-chat-gemini-pro-1.5
- “In essence, this code implements a matching algorithm, similar to a merge join in database systems. It efficiently processes inventory and pricing data, handling scenarios where prices might be missing for certain items.”
- openai-chat-gpt
- “Overall, the code snippet describes a simple matching process between inventory items and their prices, handling cases where prices may be missing in the price file.”
Translation into SQL
We used the prompt “Translate the code at the current caret into SQL” to translate Problem 1. A good translation should capture the business intent, namely a join, as well as features of the data model.
We observed:
- openai-gpt-4o
- The output features two inserts based on joins (an implicit inner join and a left join) into priced and unpriced tables. The output includes a ‘quantity’ field which is not present in the original code, although an inventory data record is highly likely to include such a field.
- google-chat-gemini-pro-1.5
- The output uses procedural SQL features and attempts to match the original code feature-by-feature, which requires more lines than the original FLOW-MATIC version. The only fields referenced are the product number and the unit price. It includes an ‘operation_9_target’ variable for emulating the self-modifying code, but the variable is only written, never read. The code also includes a useless block that checks for the ‘ZZZZZZZZZZZZ’ end-of-data marker. This version also declares product numbers as VARCHAR(255) and prices as DECIMAL(10,2), for which there is no a priori support.
- google-chat-gemini-flash-1.5
- Similar to the openai-gpt-4o output, except the inner join is explicit and the code does not include a ‘quantity’ field. Of the three outputs, this is the cleanest version.
Translation may require creating entities not present in the original, such as table names, and making formatting changes, such as dashes to underscores, to produce idiomatic output. Ideally, however, any content inserted on the basis of background knowledge rather than the source itself should be annotated.
Problem 2 - Treat Duplicate Inventory as Errors
Problem 2 extends Problem 1 by allowing duplicate inventory records. If a duplicate record is found, the record is sent to an error file. We performed some limited testing using Problem 2, but it provided little additional insight beyond Problems 1 and 3, so we omit it in the interest of space.
Problem 3 - Join Inventory at Multiple Sites with Prices
Problem 3 extends Problem 1 by allowing duplicate inventory records. Unlike in Problem 2, duplicate inventory records are meaningful and quantities are summed across records. (Conceptually, this represents inventory held at multiple sites, with the program calculating the global quantity.) Problem 3 also adds an ‘extended price’ to the output. Within the manual, X-1 is used to demonstrate how a program may call out to non-FLOW-MATIC code, even though the logic (multiplying the quantity by the price) could have been accomplished within the standard code.
(0) INPUT INVENTORY FILE-A PRICE FILE-B ; OUTPUT PRICED-INV FILE-C UNPRICED-INV FILE-D ; HSP D .
(1) COMPARE PRODUCT-NO (A) WITH PRODUCT-NO (B) ; IF GREATER GO TO OPERATION 21 ; IF EQUAL GO TO OPERATION 5 ; OTHERWISE GO TO OPERATION 2 .
(2) TRANSFER A TO D .
(3) SET OPERATION 13 TO GO TO OPERATION 18 .
(4) JUMP TO OPERATION 8 .
(5) TRANSFER A TO C .
(6) MOVE UNIT-PRICE (B) TO UNIT-PRICE (C) .
(7) SET OPERATION 13 TO GO TO OPERATION 14 .
(8) MOVE PRODUCT-NO (A) TO PRODUCT-NO (W) ; QUANTITY (A) TO QUANTITY (W) .
(9) READ-ITEM A ; IF END OF DATA GO TO OPERATION 23 .
(10) COMPARE PRODUCT-NO (A) WITH PRODUCT-NO (W) ; IF EQUAL GO TO OPERATION 11 ; OTHERWISE GO TO OPERATION 13 .
(11) X-1 ADD QUANTITY (A) TO STORED QUANTITY (W) .
(12) JUMP TO OPERATION 9 .
(13) JUMP TO OPERATION 14 .
(14) MOVE QUANTITY (W) TO QUANTITY (C) .
(15) X-1 COMPUTE EXTENDED PRICE AND INSERT IN C ITEM
(16) WRITE-ITEM C ,
(17) JUMP TO OPERATION 1 .
(18) MOVE QUANTITY (W) TO QUANTITY (D)
(19) WRITE-ITEM D .
(20) JUMP TO OPERATION 17 .
(21) READ-ITEM B ; IF END OF DATA GO TO OPERATION 1 .
(22) JUMP TO OPERATION 1 .
(23) EXECUTE OPERATION 13 THROUGH OPERATION 17 .
(24) TEST PRODUCT-NO (B) AGAINST ZZZZZZZZZZZZ ; IF EQUAL GO TO OPERATION 26 ; OTHERWISE GO TO OPERATION 25 .
(25) REWIND B .
(26) CLOSE-OUT FILES C , D .
(27) STOP . (END)
For the reader’s reference, the first X-1 routine (label 11) is defined as:
B0W001
A-A001
C0W001
Using relative addresses, these instructions:
- Load the temporary value (W, offset 1) into the register
- Add the value from file A (offset 1) to the register
- Copy the accumulated value back to the temporary value (W, offset 1)
The second X-1 code (label 15) is defined as:
L0C001
P0C002
J0C003
- Load into the L register from file C at offset 1
- Multiply (product) the L register by the contents of file C at offset 2, storing the result in the A and X registers
- Store the result into file C (offset 3)
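To make the effect of these inserts concrete, the following is a rough Python rendering of the two X-1 routines. The field names are our own assumptions; the actual offsets refer to positions within record layouts defined outside the source code.

```python
# Approximate effect of the two X-1 machine-code inserts
# (hypothetical field names; records modeled as dicts).

def x1_accumulate_quantity(w_item, a_item):
    # Label 11: load QUANTITY from W, add QUANTITY from A, store back into W.
    w_item["quantity"] += a_item["quantity"]

def x1_extended_price(c_item):
    # Label 15: multiply QUANTITY by UNIT-PRICE and insert the result
    # into the C item as the extended price.
    c_item["extended_price"] = c_item["quantity"] * c_item["unit_price"]
```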
The layout of the data and the definitions of the extended functions are located in ancillary files rather than in the source code itself. Thus, unless the LLM “cheated” by having read the manual, intuiting the computation of an extended price must rely on background knowledge of the likely implementation.
LLM Summaries
Similar to Problem 1, each model produced a detailed explanation covering each line. However, unlike Problem 1, their summaries of Problem 3 lack detail. Two of the explanations (openai-gpt-4o and google-chat-gemini-pro-1.5) simply describe the program as doing something with inventory and pricing, lacking any mention of aggregating data between inventory records or how joins are implemented. The third mentions the aggregation of inventory but drops references to pricing data.
- openai-gpt-4o
- “The code provided appears to be using a structured way to process inventory and price files.”
- google-chat-gemini-pro-1.5
- “This code snippet represents a pseudocode or a simplified representation of a program logic, likely for processing inventory and pricing data.”
- google-chat-gemini-flash-1.5
- “Operations 8-12 form a loop that reads inventory items, and aggregates quantities for each unique product number by storing the running total quantity in temporary storage (W). When a different product number is encountered, the loop breaks, and the aggregated quantity will be processed and written to the appropriate output file. This technique efficiently combines quantities for the same product before writing to the output file, avoiding redundant entries.”
None of the models provide the same quality of summary as they did for Problem 1.
Translation into SQL
Similar to the task in Problem 1, each model was asked to translate the code into SQL. Our observations:
- openai-gpt-4o
- Accumulates inventory across multiple inventory records. Treats extended price as the product of inventory and unit price. Uses inner joins and left joins to differentiate the presence of pricing data. Uses common table expressions.
- google-chat-gemini-pro-1.5
- Very similar to openai-gpt-4o.
- google-chat-gemini-flash-1.5
- Very similar to both except it does not use common table expressions.
While the different models generate different names for tables and vary slightly in their use of SQL, they all use the same joins and model the data in the same way. We consider the output to be high-quality.
Robustness to Source Code Corruption
An LLM is not a compiler. While a compiler is expected to enforce the rules of a language and check consistency within the source code, an LLM matches the source code against a probabilistic model. Since historical source code is unlikely to arrive in pristine digital form, but rather as a decayed backup or print-out, resistance to random corruption is a beneficial quality. Furthermore, if we only wish to triage or quickly check the contents of an archive, we do not need the power of a compiler.
To simulate corruption of the source code, we created copies of the originals with a number of edits applied. The nature of the edit was based on our experiences correcting the programs from the PDF. Thus, the errors simulate optical character recognition errors rather than bit flips. The script chose edits randomly from:
- Changing zero (0) characters to the letter O
- Changing 1 to I
- Changing 8 to B
- Changing ; to :
- Removing a space (e.g. A B to AB)
- Adding a space (e.g. AB to A B)
As the number of edits increased, eventually every candidate character had been swapped and the remaining edits simply added and removed spaces at random locations. The number of edits varied from 81 (approximately 10% of the total characters), to 243, and 486 (approximately 50%).
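The sketch below illustrates this procedure. It is not the corrupt.py from the repository; the helper names and sampling details are our own assumptions, intended only to show the pick-apply-retry loop over the edit types listed above.

```python
import random

# Illustrative sketch of the corruption procedure (not the actual corrupt.py):
# repeatedly pick a random OCR-style edit, apply it, and only count edits
# that actually changed the text, until the requested count is reached.

def swap(old, new):
    def edit(text):
        positions = [i for i, ch in enumerate(text) if ch == old]
        if not positions:
            return text                      # edit does not apply
        i = random.choice(positions)
        return text[:i] + new + text[i + 1:]
    return edit

def remove_space(text):
    positions = [i for i, ch in enumerate(text) if ch == " "]
    if not positions:
        return text
    i = random.choice(positions)
    return text[:i] + text[i + 1:]

def add_space(text):
    i = random.randrange(len(text) + 1)
    return text[:i] + " " + text[i:]

EDITS = [swap("0", "O"), swap("1", "I"), swap("8", "B"), swap(";", ":"),
         remove_space, add_space]

def corrupt(text, num_edits):
    applied = 0
    while applied < num_edits:
        edited = random.choice(EDITS)(text)
        if edited != text:                   # skip edits that did not apply
            text = edited
            applied += 1
    return text
```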
For this exercise, we focused on a single AI model, openai-gpt-4o, and a single original source program, Problem 1. We chose to focus on openai-gpt-4o for convenience.
LLM Summaries
When asked to explain three corruptions of Problem 1, each with 81 edits, the model’s summaries stated in each case that the input files were inventory and pricing data. In only one case did the summary state that the output has priced and unpriced categories. In the other two cases, the summary merely described sequential or batch processing.
In contrast, this model provided the priced-versus-unpriced insight for the original source code, as did the other two AI models. As with the shift from Problem 1 to Problem 3, corruption of the source, like a slight increase in functionality, led to a drop in the quality of the output.
Translation into SQL
Across six versions of Problem 1, three with 243 corruptions and three with 486 corruptions, the model produced the same SQL output consistently:
-- Step 1: Identify and insert unpriced inventory items
INSERT INTO unpriced_inventory (product_no, other_columns)
SELECT i.product_no, i.other_columns
FROM inventory i
LEFT JOIN price p ON i.product_no = p.product_no
WHERE p.product_no IS NULL;
-- Step 2: Identify and insert priced inventory items
INSERT INTO priced_inventory (product_no, unit_price, other_columns)
SELECT i.product_no, p.unit_price, i.other_columns
FROM inventory i
INNER JOIN price p ON i.product_no = p.product_no;
If we contrast this to the model’s output with the original file:
-- SQL for Priced Inventory (FILE C)
INSERT INTO PRICED_INV_TABLE_C (product_no, quantity, unit_price)
SELECT a.product_no, a.quantity, b.unit_price
FROM INVENTORY_TABLE_A a
JOIN PRICE_TABLE_B b ON a.product_no = b.product_no;
-- SQL for Unpriced Inventory (FILE D)
INSERT INTO UNPRICED_INV_TABLE_D (product_no, quantity)
SELECT a.product_no, a.quantity
FROM INVENTORY_TABLE_A a
LEFT JOIN PRICE_TABLE_B b ON a.product_no = b.product_no
WHERE b.product_no IS NULL;
Apart from the changes in table names and the order of outputs, the output from the corrupted sources includes an other_columns column, while the original has a quantity column. Neither quantity nor other_columns exists in the original FLOW-MATIC source code. As a clue to the developer that the desired record may include additional fields, other_columns is more explicit than quantity, although quantity is a believable column for an inventory table.
Using Problem 3 as the base, the model also produced consistent output across six cases with the same variations of edits. This output includes an other_columns field but otherwise does not add additional columns. It also lacks Problem 3’s handling of multiple inventory records, marking a regression in functionality.
Conclusion
LLMs appear to be unreliable as a tool for quickly summarizing programs or translating them into other forms while retaining key aspects of the original logic. While their output can be highly consistent in the presence of random errors, their ability to capture and communicate key aspects of a program can quickly decay to uselessness.
Thus, researchers should be wary of using LLMs as a tool for researching historical source code, although improvements in models and workflows might eventually render LLMs a reliable tool.
We believe future quantitative studies should include:
- Measures of output quality, including key aspects captured, contextual aspects included without being referenced in the source, and aspects added without utility
- Measures of the narrative signal-to-noise ratio
- Rendering to a “canonical” form to simplify bulk analysis
References
FLOW-MATIC Brochure. Courtesy of the Computer History Museum.
(Rand 1958) Remington Rand Univac. 1958. UNIVAC FLOW-MATIC Programming System. archive.org
(Sammet 1969) Sammet, Jean E. 1969. Programming Languages: History and Fundamentals. Englewood Cliffs, N.J.: Prentice-Hall. archive.org