Quality of early 1970s approximate string matching algorithms
Approximate string matching algorithms (ASMAs) determine whether two strings are the same, “close enough”, or distinct. Spelling error detection and correction were early applications of these algorithms, and today ASMAs (or similar techniques) are used in natural language understanding applications. Hundreds of papers have been written on the subject (1980 survey, 2001 survey), but this post focuses on the subset of algorithms that were considered commercially viable for the field of computer-assisted instruction (CAI), an early hardware/software-as-a-service business.
Background
Between 1985 and 1986, John C. Nesbit wrote a pair of articles[^1][^2] for the Journal of Computer-Based Instruction examining the accuracy of ASMAs. His interest was in using the algorithms for student response matching in computer-assisted instruction. A student response is a typed answer to a question posed by an educational program. For example, for the question “who was the second president of the United States?”, the developer might wish to accept variants such as:
- John Adams
- John A Adams
- John A. Adams
- President Adams
- Pres. Adams
- Adams
If the student responded with “Jon Adams” (misspelling John), the developer may wish for that answer to be accepted, or to coax the student to fix their answer rather than rejecting it outright. Similarly, if the student answered “Thomas Jefferson” (or variants thereof), the developer may want to reject the answer with a suggestive message such as “Jefferson was the third president”. As you are likely intuiting, handling the sheer range of possible variants would be very tedious for the developer unless the CAI system provided labor-saving techniques.
Some CAI systems would go beyond simple spell correction and provide response normalization features. For instance, PLATO would treat “three” and “3” as equivalent, as well as “1/2”, “0.5”, “2/4”, and “50/100”. There were also facilities to treat mathematical expressions such as “100 + (a * b)/(c + d)” as equivalent to any algebraically equivalent form. Given the limited CPU and memory available (and the fact that response matching was a very common operation), these features relied on a number of tricks and heuristics rather than implementing a computer algebra system.
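To make the idea concrete, here is a minimal sketch of numeric response normalization, written in the spirit of the PLATO features described above. It is not PLATO's actual implementation; the function name `normalize_number` and the word list are purely illustrative.

```python
from fractions import Fraction

# Hypothetical sketch of numeric response normalization (not PLATO's implementation).
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_number(response):
    """Return a canonical Fraction for a numeric response, or None if unrecognized."""
    text = response.strip().lower()
    if text in NUMBER_WORDS:
        return Fraction(NUMBER_WORDS[text])
    try:
        return Fraction(text)  # handles "3", "0.5", "2/4", "50/100"
    except ValueError:
        return None

assert normalize_number("three") == normalize_number("3")
assert normalize_number("1/2") == normalize_number("0.5") == normalize_number("50/100")
```

Reducing each response to a canonical form means the developer only has to author one accepted answer instead of enumerating every equivalent spelling of a number.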
Nesbit selected a group of ten algorithms that were either being used for CAI applications or had performed well in previous studies. He implemented and evaluated all of the algorithms on a DEC VAX-11/780 except for PLATO, which ran on its native CDC CYBER 170/720.
There were two operative theories for how words became misspelled. The first theory held that misspelled words were often still correct phonetically. The second held that misspellings come from random errors that drop, insert, substitute, and sometimes transpose characters. Blair, PLANIT, Soundex, and Symonds were based on the phonetic theory. Alberga, EDIST1, EDIST2, and Damerau were based on edit distance. Damerau-Symonds combined the two theories. PLATO projected words into a bitstring via a variety of feature extraction rules.
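To make the phonetic theory concrete, below is a minimal sketch of American Soundex (the 1930 census variant). This is the textbook formulation, not necessarily the exact variant Nesbit implemented.

```python
def soundex(word):
    """American Soundex: first letter plus three digits (e.g. "Robert" -> "R163")."""
    digits = {}
    for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            digits[ch] = d
    word = word.upper()
    code = word[0]
    prev = digits.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue                # H and W do not separate repeated digits
        d = digits.get(ch, "")      # vowels and Y get no digit but reset `prev`
        if d and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]

# Phonetically similar spellings share a code, so "Jon Adams" would match "John Adams".
assert soundex("Jon") == soundex("John") == "J500"
assert soundex("Robert") == soundex("Rupert") == "R163"
```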
Results
Table 1 is based on the four figures from (Nesbit 1986). We extracted the Type I and Type II errors from the figures via visual examination and, based on data about the four datasets (Table 2), computed an F1 score. Since the figures showed a range of values for Alberga, EDIST1, and EDIST2, we took the value closest to the origin. Table 3 summarizes the results for each algorithm using the harmonic mean of the F1 scores from the four datasets.
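For reference, the F1 score is the harmonic mean of precision and recall, computed from the true positive (TP), false positive (FP), and false negative (FN) counts implied by the error rates and the dataset sizes in Table 2:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
\]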
Table 1: Error rates from (Nesbit 1986) plus F1 score
Dataset | Algorithm | Type I Error: FP % | Type II Error: FN % | F1 |
---|---|---|---|---|
Blair | Alberga | 6 | 6 | 0.94 |
Blair | Blair | 24 | 4 | 0.87 |
Blair | Damerau | 27 | 0 | 0.86 |
Blair | Damerau-Symonds | 6 | 4 | 0.95 |
Blair | EDIST1 | 2 | 3 | 0.98 |
Blair | EDIST2 | 2 | 2 | 0.98 |
Blair | PLANIT | 8 | 19 | 0.87 |
Blair | PLATO | 6 | 2 | 0.96 |
Blair | Soundex | 6 | 21 | 0.86 |
Blair | Symonds | 17 | 2 | 0.91 |
Damerau | Alberga | 9 | 4 | 0.88 |
Damerau | Blair | 29 | 3 | 0.66 |
Damerau | Damerau | 18 | 0 | 0.79 |
Damerau | Damerau-Symonds | 9 | 4 | 0.88 |
Damerau | EDIST1 | 2 | 2 | 0.97 |
Damerau | EDIST2 | 2 | 3 | 0.96 |
Damerau | PLANIT | 30 | 21 | 0.60 |
Damerau | PLATO | 12 | 2 | 0.85 |
Damerau | Soundex | 21 | 14 | 0.71 |
Damerau | Symonds | 46 | 2 | 0.49 |
Masters | Alberga | 11 | 8 | 0.93 |
Masters | Blair | 34 | 4 | 0.86 |
Masters | Damerau | 28 | 0 | 0.90 |
Masters | Damerau-Symonds | 14 | 5 | 0.93 |
Masters | EDIST1 | 7 | 4 | 0.96 |
Masters | EDIST2 | 5 | 4 | 0.97 |
Masters | PLANIT | 19 | 18 | 0.85 |
Masters | PLATO | 13 | 4 | 0.94 |
Masters | Soundex | 12 | 21 | 0.85 |
Masters | Symonds | 22 | 3 | 0.91 |
Nesbit | Alberga | 29 | 21 | 0.81 |
Nesbit | Blair | 59 | 2 | 0.74 |
Nesbit | Damerau | 47 | 2 | 0.82 |
Nesbit | Damerau-Symonds | 20 | 17 | 0.86 |
Nesbit | EDIST1 | 21 | 17 | 0.86 |
Nesbit | EDIST2 | 14 | 12 | 0.90 |
Nesbit | PLANIT | 17 | 37 | 0.75 |
Nesbit | PLATO | 27 | 8 | 0.88 |
Nesbit | Soundex | 17 | 20 | 0.85 |
Nesbit | Symonds | 27 | 7 | 0.89 |
Table 2: Datasets used in (Nesbit 1986)
Dataset | Words | Misspellings | Total | Description |
---|---|---|---|---|
Blair | 117 | 117 | 234 | Common misspellings |
Damerau | 41 | 44 | 85 | Newspaper errors |
Masters | 179 | 320 | 499 | Grades 8, 12, 16 |
Nesbit | 213 | 524 | 737 | Grades 2-6 |
Table 3: Harmonic Mean of F1 Scores
Algorithm | Harmonic Mean | Publication Year |
---|---|---|
Alberga | 0.89 | 1967 |
Blair | 0.77 | 1960 |
Damerau | 0.84 | 1964 |
Damerau-Symonds | 0.90 | 1970 |
EDIST1 | 0.94 | 1974 |
EDIST2 | 0.95 | 1983 |
PLANIT | 0.75 | 1966 |
PLATO | 0.91 | 1972 / 1985[^3] |
Soundex | 0.81 | 1930[^4] |
Symonds | 0.75 | 1970 |
In descending order, the best performing algorithms by quality were EDIST2, EDIST1, PLATO, Damerau-Symonds, Alberga, Damerau, Soundex, Blair, PLANIT, and Symonds (the last two essentially tied).
Conclusion
By the early 1970s, edit-distance-based algorithms were showing their qualitative superiority to phonetically based algorithms. Edit-distance-based algorithms were also more amenable to multiple languages, while the phonetic algorithms evaluated were all tuned for English text (although all the datasets used English examples). However, edit distance algorithms were more costly to compute than the others in the sample (except Damerau, which restricted the error count to one). EDIST1 and EDIST2 both required computing the edit distance, which cost \(O(mn)\) time and \(O(mn)\) space, where \(m\) and \(n\) are the lengths of the two strings, although a solution with \(O(m)\) space was found in 1980 (Navarro 2001). Solutions with \(O(kn)\) time were found in the 1980s and 1990s, where \(k\) is the error threshold. Today, edit distance is the standard technique for handling errors in information retrieval, conversational analytics, and bioinformatics.
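As a concrete illustration of the cost discussed above, here is a minimal sketch of the classic Levenshtein dynamic program, keeping only two rows of the table so that space drops to \(O(m)\). This is the textbook formulation, not Nesbit's EDIST1 or EDIST2 code.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic O(mn)-time dynamic program.

    Only two rows of the DP table are kept at a time, giving O(m) space.
    """
    prev = list(range(len(a) + 1))            # distances from prefixes of a to ""
    for j, cb in enumerate(b, start=1):
        curr = [j]                            # distance from "" to b[:j]
        for i, ca in enumerate(a, start=1):
            curr.append(min(
                curr[i - 1] + 1,              # delete a[i-1]
                prev[i] + 1,                  # insert b[j-1]
                prev[i - 1] + (ca != cb),     # substitute (free when equal)
            ))
        prev = curr
    return prev[len(a)]

assert edit_distance("kitten", "sitting") == 3
assert edit_distance("Jon", "John") == 1
```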
The PLATO algorithm stands apart in the group due to its excellent performance, both in terms of quality and computational speed. Its design was inspired by a numerical taxonomy approach and uses techniques from cluster analysis. Due to the novelty and effectiveness of the approach, we will discuss details of the implementation in a future production.
[^1]: Nesbit, John C. “Approximate string matching in response analysis.” Journal of Computer-Based Instruction 12.3 (1985): 71-75. archive.org

[^2]: Nesbit, John C. “The accuracy of approximate string matching algorithms.” Journal of Computer-Based Instruction 13.3 (1986): 80-83. archive.org

[^3]: The PLATO matching algorithm evolved over time. 1972 was the date of the first publication describing its design in the TUTOR language, although some version of the algorithm may date back as far as 1965. The algorithm that was tested by Nesbit was based on the 1985/1986 version.

[^4]: Soundex was first patented in 1918. The U.S. Census uses a version called “American Soundex” which was defined in 1930.