COBOL's Map Reduce
COBOL is for Big Data. Well, sort of. A while back, I noticed that the COBOL SORT verb was overpowered. Rather than sorting an array of items or even sorting a file, it included a generalized ability to stream in arbitrary inputs (`() => Stream[T]`, where T is a key/value pair) and to process the outputs in order (`SortedStream[T] => ()`). This power is useful if you are writing a map-reduce program, but excessive for sorting. So, let's look at how this was implemented, why it was part of the design, and what prevented COBOL from jump-starting the distributed map-reduce revolution.
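To make those signatures concrete, here is a minimal Python sketch of the protocol; the names and types are mine, not COBOL's. The input procedure produces a stream of key/value records, the sort machinery orders them, and the output procedure consumes the sorted stream.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, int]  # a key/value pair, like the sort-record schema

def generalized_sort(
    input_procedure: Callable[[], Iterable[Record]],       # () => Stream[T]
    output_procedure: Callable[[Iterator[Record]], None],  # SortedStream[T] => ()
) -> None:
    """Drive the sort: pull all records from the input procedure,
    order them by key, and hand the sorted stream to the output procedure."""
    work = sorted(input_procedure(), key=lambda record: record[0])
    output_procedure(iter(work))
```

The machinery never sees where the input comes from or where the output goes; both ends are opaque callables, which is exactly the property discussed later in this article.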
Many real-world tasks involve mapping each input to a key/value pair and then reducing the sequences of values that share a key to some useful value. The Census, for example, involves transforming each census record into a state/number-of-people pair, and then summing those per-record counts into a population per state. Before automation, census workers performed this task in a distributed fashion by tallying small stacks of records themselves and then combining their tallies to produce the final results.
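A sketch of that census task in Python (the records and field layout are invented for illustration):

```python
records = [
    ("household-1", "WY", 3),
    ("household-2", "AK", 5),
    ("household-3", "WY", 2),
]

# Map: each census record becomes a (state, people-in-record) pair.
pairs = [(state, heads) for _household, state, heads in records]

# Reduce: sum the per-record counts for each state, like a worker
# combining tallies from small stacks of records.
totals = {}
for state, heads in pairs:
    totals[state] = totals.get(state, 0) + heads

print(totals)  # {'WY': 5, 'AK': 5}
```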
When Google’s “MapReduce: Simplified Data Processing on Large Clusters” paper was released, I was surprised by the attention it was getting. In my scientific computing class we had covered the model in the first week, and I think we implemented a version in MPI as one of the initial homework assignments. What I didn’t realize was that the technology greatly democratized the programming model; there was a lot of value in the 80% solution.
COBOL’s design for SORT was meant for map-reduce applications. Although the specification did not state it explicitly, COBOL textbooks of the time used map-reduce examples, just without using the term map-reduce.
COBOL’s SORT Verb
I’ve written a census-like application to demonstrate the map-reduce nature of SORT. COBOL has two kinds of SORT: file sort and table sort. Table sort is straightforward and acts like sorting an array. File sort supports several variants, but `INPUT PROCEDURE` and `OUTPUT PROCEDURE` are the two options that provide the necessary power.
On COBOL Verisimilitude: The SORT verb was added to COBOL 61, so I’ve tried to keep my discussion specific to COBOL of that era. However, my code is in ‘free style’, so the code’s indentation is freer than was allowed at the time. I use mixed case to make the code easier to read, although it is against the standard (the specification’s character set did not include lower case) and ahistorical (computer systems used 6-bit characters). I’m also using PERFORM and in-line code, which weren’t standardized until 1985. I’ve verified the code is at least 1985-compliant by using the `-free -std=cobol85` flags with GnuCOBOL 2.2.0.
The entire program is around 90 lines. (The word frequency C++ program is around 50 lines without comments and whitespace.) About a third of the lines define the program’s I/O and data structures, including file schemas. COBOL unifies the definition of data with the declaration of storage for that data, so you’ll see data moved into the definitions.
`ORGANIZATION LINE SEQUENTIAL` means the file uses some character sequence to delimit records (lines); since I ran this on a Unix system, it’s the newline character. `PIC A(x)` indicates an alphanumeric field that is x characters long; `9(x)` is a numeric field with x digits; and `Z(x)` means leading zeros are replaced by spaces in the field (otherwise, numbers are padded with zeros). File names are hard-coded for simplicity.
```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. MapReduce.

ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT input-file ASSIGN TO "census.txt"
        ORGANIZATION LINE SEQUENTIAL.
    SELECT sort-work-file ASSIGN TO "sort.txt"
        ORGANIZATION LINE SEQUENTIAL.
    SELECT reduce-output-file ASSIGN TO "report.txt"
        ORGANIZATION LINE SEQUENTIAL.

DATA DIVISION.
FILE SECTION.
FD input-file.
01 census-record.
    05 household PIC A(8).
    05 region1   PIC A(2).
    05 heads     PIC 9(3).
SD sort-work-file.
01 reduce-input.
    05 region2            PIC A(2).
    05 heads-by-household PIC 9(3).
FD reduce-output-file.
01 population-by-region-output.
    05 region3         PIC A(2).
    05 heads-by-region PIC Z(4).

WORKING-STORAGE SECTION.
01 ws-census-record.
    05 ws-household PIC A(8).
    05 ws-region1   PIC A(2).
    05 ws-heads     PIC 9(3).
01 ws-eof           PIC A(1).
01 last-region-seen PIC A(2).
01 reduce-eof       PIC A(1).
01 region-tally     PIC 9(4).
```
SORT is the driver for the map-reduce process.
SORT is told the location of a ‘work file’ to be used for its own processing; the underlying SORT machinery does not have to use it, and can perform the work in memory if the job is small enough. Depending on the type of sort work file, the machinery may use different algorithms. The source of data for the sort is the `INPUT PROCEDURE`, `map-function`, which will `RELEASE` records (which must conform to the schema defined for the sort work file); `map-function` is responsible for handling its own I/O. Similarly, the `OUTPUT PROCEDURE`, `reduce-function`, accepts records one by one via `RETURN` from the work file, and writes a record to a report file once it has processed all of a single key. The `end-of-region` procedure is a helper for `reduce-function`.
```cobol
PROCEDURE DIVISION.
main SECTION.
    DISPLAY "Map Reduce in COBOL"
    SORT sort-work-file
        ON ASCENDING KEY region2
        INPUT PROCEDURE IS map-function
        OUTPUT PROCEDURE IS reduce-function.
    STOP RUN
    .
map-function SECTION.
    OPEN INPUT input-file.
    PERFORM UNTIL ws-eof='Y'
        READ input-file INTO ws-census-record
            AT END MOVE 'Y' TO ws-eof
            NOT AT END
                MOVE ws-region1 TO region2
                MOVE ws-heads TO heads-by-household
                RELEASE reduce-input
        END-READ
    END-PERFORM
    CLOSE input-file
    .
reduce-function SECTION.
    OPEN OUTPUT reduce-output-file.
    MOVE ' ' TO last-region-seen.
    PERFORM UNTIL reduce-eof='Y'
        RETURN sort-work-file
            AT END MOVE 'Y' TO reduce-eof
            NOT AT END
                IF last-region-seen IS NOT EQUAL TO ' '
                        AND last-region-seen IS NOT EQUAL TO region2
                    PERFORM end-of-region
                END-IF
                MOVE region2 TO last-region-seen
                MOVE region2 TO region3
                ADD heads-by-household TO region-tally
        END-RETURN
    END-PERFORM.
    PERFORM end-of-region.
    CLOSE reduce-output-file
    .
end-of-region SECTION.
    MOVE region-tally TO heads-by-region.
    WRITE population-by-region-output BEFORE ADVANCING 1 LINE
    END-WRITE.
    MOVE ZERO TO region-tally
    .
```
COBOL was designed for batch, serial processing of records. Given the limitations of the computers of the day, almost all datasets processed would exceed the core storage available (e.g., the IBM 360/50, released in 1965, had only 64 to 512 KB of memory). Processing had to be performed in a streaming fashion.
Since the programmer might not know the number of keys beforehand, sorting the data first allowed the program to handle an arbitrary number of keys with minimal overhead, since it could process all records with the same key in a single pass. COBOL lacked an associative-array mechanism at the time, so even if you knew the keys beforehand, programming out-of-order processing would have been painful. Data could be sorted by a different program (or a device) beforehand, but there wasn’t a good way to express “this program needs sorted data”. Since the programmer and operator tended to be separate roles, removing this potential error mode was desirable, as computer time was very expensive.
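The single-pass scheme over sorted input can be sketched in Python. Like the reduce-function in the COBOL program above, it needs only a ‘last key seen’ variable and a running tally, no matter how many distinct keys appear:

```python
def reduce_sorted(pairs):
    """Single pass over key-sorted (key, value) pairs; emits (key, total)
    for each run of equal keys. Needs only O(1) working storage."""
    results = []
    last_key, tally = None, 0
    for key, value in pairs:
        if last_key is not None and key != last_key:
            results.append((last_key, tally))  # end of a run of equal keys
            tally = 0
        last_key = key
        tally += value
    if last_key is not None:
        results.append((last_key, tally))  # flush the final run
    return results

print(reduce_sorted([("AK", 5), ("WY", 3), ("WY", 2)]))
# [('AK', 5), ('WY', 5)]
```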
Finally, writing a sorting algorithm in COBOL was pretty painful given the limited array (table) functionality and the inability to create reusable user functions.
Adding the SORT verb was such a convenience that ACM published an article about it.
Delegation of Concerns and Distributed Processing
Apart from the fact that software engineers are trained to loathe COBOL, why wasn’t it a model for later big data systems?
- COBOL and RDBMS became closely related
- COBOL lacked traction in small systems
- INPUT and OUTPUT PROCEDUREs design prevented delegation of concerns
First, COBOL was an innovator in schematized hierarchical data, and its strength was always in processing records rather than streams of bytes. Relational databases were the natural stores of this data as it moved away from tapes and punched cards. COBOL could delegate computation over the data to the database. After all,
```sql
SELECT region, SUM(heads) FROM census GROUP BY region ORDER BY region
```
is a lot shorter than the program above. An organization faced with lots of data would be more likely to scale the database than the programs accessing it, which were probably just handling business logic and reporting. This is similar to DeWitt and Stonebraker’s early criticism of MapReduce, which they felt lacked novelty versus prior work in RDBMSs and was a step backwards away from schematized data.
Second, although COBOL was available on small systems such as personal computers, its strength was in mainframes. Mainframes specialize in high-throughput, multiplexed I/O, which is accomplished by being very “wide” in processing. This leads to vertical scaling, as opposed to horizontal scaling, as IBM continues to advertise. Culturally and economically (prices are set per CPU), there wasn’t an incentive to change designs to spread work over a multitude of machines.
Third, the protocols between the SORT machinery and the INPUT and OUTPUT procedures prevent (easy) parallelization. The INPUT procedure is entirely responsible for managing its own I/O, so the SORT machinery has no knowledge of what files are being read. An implementation of the INPUT procedure could distribute work itself, although that would be onerous; and if all the records were distributed but then had to be RELEASEd on a single node (for the sort), that node would become a huge bottleneck. The OUTPUT procedure is guaranteed to receive records in order via RETURN, which would lead to high levels of idleness among parallel workers. The OUTPUT procedure could broadcast values to distribute work, but again, that is an onerous responsibility to put on a procedure versus having the system provide the functionality.
However, these issues could be overcome. SORT already has variants that are given the list of input and output files, and it knows the keys used for the sort order, which could be assumed to be the keys for distributed processing. The input and output procedures could be replaced by references to user-defined intrinsic functions (added to the language in 1989), although a version that used procedures should be possible by defining a proper protocol. These functions/procedures would then perform just the mapping and reducing (e.g., `T => U` and `(U, W) => W`), without any direct involvement in I/O and without any guarantees on the order of invocation. A sort-data (SD) variant could be added to specifically support the distributed sorting application.
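Those pure signatures might look like the following Python sketch, where the function names and the comma-separated record layout are my own assumptions: the map function is `T => U`, the reduce function folds one value into an accumulator (`(U, W) => W`), and a runtime, not the user code, would own all I/O, grouping, and ordering.

```python
from functools import reduce

# T => U: turn a raw census line into a (key, value) pair.
# No file handling here; the runtime feeds lines in and collects pairs.
def map_fn(line: str):
    _household, region, heads = line.split(",")
    return (region, int(heads))

# (U, W) => W: fold one mapped value into the running accumulator.
# Order-independent, so the runtime may invoke it in any order.
def reduce_fn(pair, acc: int) -> int:
    _region, heads = pair
    return acc + heads

# A runtime would group mapped pairs by key and fold each group;
# here we fold one group by hand to show the contract.
group = [map_fn("h1,WY,3"), map_fn("h3,WY,2")]
total = reduce(lambda acc, pair: reduce_fn(pair, acc), group, 0)
print(total)  # 5
```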
Given the economics and culture, I don’t see a plausible alternate history where distributed map-reduce COBOL programs rose to prominence in the 1990s. However, COBOL’s map-reduce architecture was quite close to one that could have supported distributed processing.