Rosetta Code

Rosetta Code is a programming chrestomathy wiki: a site that collects solutions to the same tasks in many different programming languages. The site has been around since 2007 and now has 1,100+ tasks and 100,000+ code submissions across 900+ languages. To help other researchers, I’m publishing an export of the code samples as a SQLite database via DBHub.io and the source code via GitLab.

Chrestomathy?

The OED defines chrestomathy as:

A collection of choice passages from an author or authors, esp. one compiled to assist in the acquirement of a language.

For example, if you were interested in how languages differed in their APIs for renaming files, you might find the rename a file task and look at some examples:

Lisp:

(rename-file "input.txt" "output.txt")
(rename-file "docs" "mydocs")
(rename-file "/input.txt" "/output.txt")
(rename-file "/docs" "/mydocs")

OCaml:

let () =
  Sys.rename "input.txt" "output.txt";
  Sys.rename "docs" "mydocs";
  Sys.rename "/input.txt" "/output.txt";
  Sys.rename "/docs" "/mydocs";

Smalltalk:

'input.txt' asFilename renameTo: 'output.txt'.
'docs' asFilename renameTo: 'mydocs'.
'/input.txt' asFilename renameTo: '/output.txt'.
'/docs' asFilename renameTo: '/mydocs'

These three examples (in Lisp, OCaml, and Smalltalk) demonstrate similarities and differences in how the languages treat the REPL/program entry point, namespaces and modules for functions, string definition, and statement parsing.

Licensing

The Rosetta Code content is licensed under the GNU Free Documentation License 1.2. Based on my reading of the terms, that would cover the contents of the database as well.

The script is released under the MIT License.

Database Overview

The database is hosted by DBHub.io and is available here. You can browse and analyze the data within the site or download the database (282 MB).

DBHub.io is a cloud storage and analytics solution for SQLite databases provided by the same folks as DB Browser for SQLite.

I also have an approximately ten percent sample of the data hosted here. The sample is 27 MB.

The database has three tables: task, language, and submission.

The task table describes what each program should do. Its content field is close to the raw data, in case you need to look at the source.

The language table lists the language keys used in the submissions or code blocks. The keys are drawn directly from the Rosetta Code site, where some language names have been normalized for compatibility; C++ and C# are cpp and csharp, for example.

The submission table holds the code samples. Each submission is linked to a specific task and language, and may include a description (drawn from the preface) and output to provide extra context. The database only contains code samples that were explicitly marked as code; because the wiki is loosely structured, some code samples are embedded in explanatory text without any markup, and the export script does not attempt to find them.
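If you download the database, a quick way to see the actual tables and columns is to ask SQLite itself. The sketch below is only an illustration: it runs against a tiny in-memory stand-in whose columns are my assumptions, and the helper name describe_tables is hypothetical. The README remains the reference for the real schema.

```python
import sqlite3
from typing import Dict, List


def describe_tables(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each table name to its column names, read from SQLite's catalog."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# Stand-in database so the sketch runs without the download.
# These columns are assumptions, not the published schema.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE task (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE language (id INTEGER PRIMARY KEY, key TEXT);
    CREATE TABLE submission (
        id INTEGER PRIMARY KEY, task INTEGER, language INTEGER, code TEXT
    );
    """
)
schema = describe_tables(demo)
for name, columns in schema.items():
    print(name, columns)
```

To run it against the real data, replace the stand-in with sqlite3.connect() pointed at wherever you saved the downloaded file.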

Example: Hello World

The classic Hello World! programs are in task id 1514. To fetch the ten longest code examples for printing Hello, World, you can use the query:

SELECT language.key, code 
	FROM submission 
	JOIN language ON submission.language=language.id 
	WHERE task=1514 
	ORDER BY length(code) DESC 
	LIMIT 10

As you might imagine, the longest programs are for joke languages and very low-level languages like assembly.
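The same query can be run from Python with the standard sqlite3 module. This is a minimal sketch: the function name longest_submissions and the in-memory demo rows are hypothetical, and the demo tables include only the columns the query above refers to.

```python
import sqlite3
from typing import List, Tuple

# Same query as above, with the task id and limit passed as parameters.
QUERY = """
SELECT language.key, code
    FROM submission
    JOIN language ON submission.language = language.id
    WHERE task = ?
    ORDER BY length(code) DESC
    LIMIT ?
"""


def longest_submissions(
    conn: sqlite3.Connection, task_id: int, limit: int = 10
) -> List[Tuple[str, str]]:
    """Return (language key, code) pairs for a task, longest code first."""
    return conn.execute(QUERY, (task_id, limit)).fetchall()


# Tiny stand-in database (made-up rows) so the sketch runs without the download.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE language (id INTEGER PRIMARY KEY, key TEXT);
    CREATE TABLE submission (
        id INTEGER PRIMARY KEY, task INTEGER, language INTEGER, code TEXT
    );
    """
)
demo.executemany(
    "INSERT INTO language (id, key) VALUES (?, ?)", [(1, "python"), (2, "cpp")]
)
demo.executemany(
    "INSERT INTO submission (task, language, code) VALUES (?, ?, ?)",
    [
        (1514, 1, 'print("Hello world!")'),
        (1514, 2, '#include <iostream>\nint main() { std::cout << "Hello world!"; }'),
        (9999, 1, "pass"),  # different task, so the query filters it out
    ],
)

rows = longest_submissions(demo, 1514)
for key, code in rows:
    print(key, len(code))
```

Passing the task id as a parameter makes it easy to reuse the function for other tasks.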

Script

The rosetta2sqlite script is written in Python, targeting version 3.8 or higher, with no external dependencies. The script is structured as a fairly standard extract-transform-load script and includes unit tests. The script uses data classes and type hints, both of which I’ve found to be very useful in my professional work.
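To give a feel for that shape, here is a small extract-transform-load sketch using only the standard library, with a data class and type hints. It is not the code from rosetta2sqlite. The Submission class, the function names, the assumed markup (code wrapped in syntaxhighlight tags with a lang attribute), and the single-table layout are all simplifications made up for illustration.

```python
import re
import sqlite3
from dataclasses import astuple, dataclass
from typing import Iterable, Iterator, List, Tuple

# Assumption: code is marked up as <syntaxhighlight lang="key">...</syntaxhighlight>.
CODE = re.compile(
    r'<syntaxhighlight lang="([^"]+)">(.*?)</syntaxhighlight>', re.DOTALL
)


@dataclass(frozen=True)
class Submission:
    task: str
    language: str
    code: str


def extract(wikitext: str) -> Iterator[Tuple[str, str]]:
    """Extract: yield (language key, raw code) for each explicitly marked block."""
    for match in CODE.finditer(wikitext):
        yield match.group(1), match.group(2)


def transform(task: str, raw: Iterable[Tuple[str, str]]) -> List[Submission]:
    """Transform: normalize keys, trim surrounding newlines, drop empty blocks."""
    return [
        Submission(task, lang.lower(), code.strip("\n"))
        for lang, code in raw
        if code.strip()
    ]


def load(conn: sqlite3.Connection, submissions: Iterable[Submission]) -> None:
    """Load: write the submissions into a simplified single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submission (task TEXT, language TEXT, code TEXT)"
    )
    conn.executemany(
        "INSERT INTO submission VALUES (?, ?, ?)", [astuple(s) for s in submissions]
    )
    conn.commit()


# Made-up page text in the assumed markup.
PAGE = """Rename a file.
=={{header|OCaml}}==
<syntaxhighlight lang="ocaml">Sys.rename "input.txt" "output.txt"</syntaxhighlight>
=={{header|Smalltalk}}==
Code that is not marked up, like this line, is skipped.
<syntaxhighlight lang="smalltalk">
'input.txt' asFilename renameTo: 'output.txt'.
</syntaxhighlight>
"""

conn = sqlite3.connect(":memory:")
submissions = transform("Rename a file", extract(PAGE))
load(conn, submissions)
print(conn.execute("SELECT language, code FROM submission").fetchall())
```

Keeping each stage as a plain function over simple values makes the stages easy to unit test in isolation.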

The README file in the project contains more details on the database schema and input format.

Processing the full export takes slightly less than two hours.

Why no external dependencies?

There are some Python libraries that might have simplified parsing the MediaWiki text, but none passed my personal threshold for supply chain risk for this project. Research software tends to decay rapidly because it is usually secondary to the author’s purpose. An ETL script is even more “one-and-done”: my purpose only needs a snapshot of the data, not a continual feed. With no external dependencies, maintenance (either by me or by those who fork the script for their own needs) is greatly simplified:

  1. The only updates needed are to the language and the standard library
  2. All the code is in one place
  3. There is no bit rot from Python’s ever-evolving approach to dependencies