Study of Historical Code

I’ve started studying a larger historical code base. In this post, I want to summarize the sorts of historical questions we might ask of such code, along with notes on how to approach them.

Objective

My objectives for studying and writing about historical source code are to understand and communicate the:

  1. Intent and purpose of the software
  2. Design, engineering trade-offs, and technical decisions
  3. Significance and influence of the code, its other forms, and how it was used
  4. Authorship, the process of development, inspirations, and why it was written

In his speech about writing history, [Knuth] gave the following list (paraphrased):

  1. Understand the process of discovery
  2. Understand the process of failure
  3. Celebrate the contributions of many cultures
  4. Tell historical stories, as the best way to teach
  5. Learn how to cope with life
  6. Become more familiar with the world, and to know how science fits into the overall history of mankind

In contrast to Knuth’s list, mine is less focused on the “lives of scientists” angle, although I am similarly interested in the process of development, the process of failure, and recognizing sources of influence and contribution. For these kinds of studies, I am less interested in the development of particular algorithms or discoveries and more interested in larger-scale engineering efforts, which by nature tend to be more impersonal.

There appear to be very few studies of historical code. [Charoenwet] is motivated by historical analysis rather than algorithmic analysis; however, the paper focuses on a methodological experiment using LDA rather than on reading the source code as text. The field of archaeogaming, which applies archaeological techniques to digital games and worlds, has featured papers on the technical methods used in games (e.g. [Aycock], which analyzes a maze generation algorithm). Thus, as more historical sources come to light, this appears to be a wide-open field for new insights and methods.

Historical Questions

[Wardhaugh] discusses how to read historical mathematics. Paralleling the questions he poses, we can analyze source code in a similar way.

What does it say / do?

The source code describes the computation of some business logic, within some constraints. I suspect we will usually have more than just the source code, and the accompanying material can shed additional light on it.

Older programs are likely to be batch-oriented, reading a stream of records (likely passed in via cards or tape), with variables and control logic provided either in-band or out-of-band. Later programs may be more file-oriented or interactive. These clues may inform a “potsherd”-like typology, letting us date programs by their style much as archaeologists date sites by pottery fragments.
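
To make the batch style concrete, here is a minimal sketch of such a program in modern Python. Everything about the record layout is an assumption made for illustration: 80-column card images, an account id in columns 1-8, an integer amount in columns 9-16, and hypothetical `*SCALE` and `*END` control cards mixed into the data stream (the in-band case).

```python
CARD_WIDTH = 80  # classic punched-card width; the layout below is hypothetical


def run_batch(deck):
    """Total an amount field per account over a stream of card images.

    Assumed layout: columns 1-8 hold an account id, columns 9-16 an
    integer amount. Cards starting with '*' are in-band control cards.
    """
    totals = {}
    scale = 1
    for line in deck:
        # Normalize every record to exactly one card image.
        card = line.rstrip("\n").ljust(CARD_WIDTH)[:CARD_WIDTH]
        if not card.strip():
            continue  # skip blank cards
        if card.startswith("*SCALE"):
            # In-band control: a card in the data stream changes a variable.
            scale = int(card[6:16])
            continue
        if card.startswith("*END"):
            break  # sentinel card ends the run; later cards are ignored
        account = card[0:8].strip()
        amount = int(card[8:16])
        totals[account] = totals.get(account, 0) + amount * scale
    return totals


def data_card(account, amount):
    """Build a data card in the assumed fixed-width layout."""
    return f"{account:<8}{amount:>8}"


if __name__ == "__main__":
    deck = [
        data_card("ACCT0001", 250),
        data_card("ACCT0002", 10),
        "*SCALE 100",
        data_card("ACCT0001", 3),
        "*END",
        data_card("ACCT0009", 1),
    ]
    print(run_batch(deck))  # {'ACCT0001': 550, 'ACCT0002': 10}
```

In the out-of-band variant, `scale` would instead arrive separately from the data, for instance as a parameter to `run_batch`. Features like fixed columns and sentinel cards are the kind of stylistic markers that might help with dating.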

This question is less about whether we “can” determine what the code does and more about making it understandable to modern readers.

Who developed it?

Differing code styles may point to multiple authors or to development over time. A consistent style does not prove a single author, though: it may just indicate that several authors were working from the same guide or were similarly educated or trained.
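
As a rough sketch of how style differences might be quantified, the snippet below computes a few surface features for a piece of source text and a crude distance between two profiles. The feature set and the names `style_profile` and `style_distance` are my own assumptions, not an established stylometric method, and a small distance is at best weak evidence, for the reasons given above.

```python
def style_profile(source):
    """Compute a few crude, language-agnostic style features of source text."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return {"tab_indent": 0.0, "upper_ratio": 0.0, "mean_length": 0.0}
    # Among indented lines, what fraction starts with a tab rather than a space?
    indented = [ln for ln in lines if ln[0] in " \t"]
    tabbed = sum(1 for ln in indented if ln[0] == "\t")
    # Among alphabetic characters, what fraction is upper case?
    letters = [ch for ln in lines for ch in ln if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        "tab_indent": tabbed / len(indented) if indented else 0.0,
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "mean_length": sum(len(ln) for ln in lines) / len(lines),
    }


def style_distance(a, b):
    """Sum of absolute differences over the two ratio features (range 0 to 2)."""
    return (abs(a["tab_indent"] - b["tab_indent"])
            + abs(a["upper_ratio"] - b["upper_ratio"]))


if __name__ == "__main__":
    first = style_profile("\tMOVE A TO B\n\tADD 1 TO C\n")
    second = style_profile("    x = a + b\n    y = x\n")
    print(style_distance(first, second))  # 2.0
```

Comparing profiles file by file, or routine by routine, could flag places where the style shifts and a closer reading is worthwhile.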

Names of authors may be hidden as easter eggs or as magic numbers.
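
One well-known case is the two-byte “MZ” signature at the start of DOS executables, widely attributed to the initials of Mark Zbikowski. A minimal sketch of a check for this kind of clue: given a numeric constant, test whether its bytes read as printable ASCII in either byte order. The function name `ascii_readings` is hypothetical, and a readable result is only a lead to follow up, not evidence of authorship by itself.

```python
def ascii_readings(value, width):
    """Return the printable-ASCII readings of an integer constant, by byte order."""
    readings = {}
    for order in ("big", "little"):
        raw = value.to_bytes(width, order)
        # Keep the reading only if every byte is a printable ASCII character.
        if all(0x20 <= b < 0x7F for b in raw):
            readings[order] = raw.decode("ascii")
    return readings


if __name__ == "__main__":
    print(ascii_readings(0x5A4D, 2))      # {'big': 'ZM', 'little': 'MZ'}
    print(ascii_readings(0xDEADBEEF, 4))  # {}
```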

Answering this question typically requires research beyond the source code. [Wardhaugh] includes a letter which obliquely reveals its writer’s circumstances; source code, though, is rarely narrative.

Still, developers have often expressed frustrations with partners, hardware, or customers in source code comments. If the source code was originally commercial, however, these comments are likely to have been scrubbed.

Authorship may be afflicted with the “most famous person associated” curse, where credit drifts to the best-known name attached to a project, so we need to be careful in interpreting the evidence.

How was it built?

If we have access to source control history, we can infer the timeline with considerable accuracy. If not, and we have no external data on the development, effort models based on lines of code (e.g. COCOMO) can provide a rough estimate.
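
As an illustration, Basic COCOMO estimates effort as a × KLOC^b person-months and schedule as c × effort^d months, with coefficients that depend on the project class. The sketch below uses the published Basic COCOMO coefficients; the choice of class for a given historical program, and how its lines are counted, are assumptions the reader has to make. The model was calibrated on projects of its own era, so the result is a suggestion rather than a reconstruction of the actual timeline.

```python
# Basic COCOMO coefficients (a, b, c, d) for each project class.
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc, mode="organic"):
    """Return (effort in person-months, schedule in months, average staff)."""
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b        # person-months
    schedule = c * effort ** d    # elapsed calendar months
    return effort, schedule, effort / schedule


if __name__ == "__main__":
    # Hypothetical example: a 32,000-line program treated as "organic".
    effort, schedule, staff = basic_cocomo(32, "organic")
    print(f"{effort:.0f} person-months over {schedule:.0f} months, "
          f"about {staff:.1f} people on average")
```

Running the same line count through all three classes gives a range, which is probably a more honest way to present the estimate than a single number.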

Although some code survives as printouts stored in a garage, most code is available only because of a deliberate decision to retain and release it. Filters such as media decay, lack of archiving, and company failures all lead to the loss of code.

Who consumed it?

Constraints of Research

Research may be constrained by the terms under which the code is released, for example the Computer History Museum’s EULA.

References

[Aycock] Aycock, John, and Tara Copplestone. 2018. “Entombed: An Archaeological Examination of an Atari 2600 Game.” The Art, Science, and Engineering of Programming 3 (2). https://doi.org/10.22152/programming-journal.org/2019/3/4.

[Charoenwet] Charoenwet, Wachiraphan. 2018. “A Digital Collection Study and Framework Exploration — Applying Textual Analysis on Source Code Collection.” In 2018 3rd Digital Heritage International Congress (DigitalHERITAGE), 1–8. https://doi.org/10.1109/DigitalHeritage.2018.8810105.

[Knuth] Knuth, Donald, and Len Shustek. 2021. “Let’s Not Dumb down the History of Computer Science.” Communications of the ACM 64 (2): 33–35. https://doi.org/10.1145/3442377.

[Wardhaugh] Wardhaugh, Benjamin. 2010. How to Read Historical Mathematics. Princeton and Oxford: Princeton University Press. https://press.princeton.edu/books/hardcover/9780691140148/how-to-read-historical-mathematics.