Rosetta Code is a programming chrestomathy wiki: a site that collects solutions to the same tasks in many programming languages. The site has been around since 2007 and now has 1,100+ tasks and 100,000+ code submissions across 900+ languages. To help other researchers, I’m publishing an export of the code samples as a SQLite database via DBHub.io and the source code via GitLab.
The OED defines chrestomathy as:
A collection of choice passages from an author or authors, esp. one compiled to assist in the acquirement of a language.
For example, if you were interested in how languages differed in their APIs for renaming files, you might find the rename a file task and look at some examples:
    (rename-file "input.txt" "output.txt")
    (rename-file "docs" "mydocs")
    (rename-file "/input.txt" "/output.txt")
    (rename-file "/docs" "/mydocs")

    let () =
      Sys.rename "input.txt" "output.txt";
      Sys.rename "docs" "mydocs";
      Sys.rename "/input.txt" "/output.txt";
      Sys.rename "/docs" "/mydocs";

    'input.txt' asFilename renameTo: 'output.txt'.
    'docs' asFilename renameTo: 'mydocs'.
    '/input.txt' asFilename renameTo: '/output.txt'.
    '/docs' asFilename renameTo: '/mydocs'
These three examples (in Lisp, OCaml, and Smalltalk) demonstrate similarities and differences in how the languages treat the REPL/program entry point, namespaces and modules for functions, string definition, and statement parsing.
The Rosetta Code content is licensed under the GNU Free Documentation License 1.2. Based on my reading of the terms, that would cover the contents of the database as well.
The script is released under the MIT License.
The database is hosted by DBHub.io and is available here. You can browse and analyze the data within the site or download the database (282 MB).
DBHub.io is a cloud storage and analytics solution for SQLite databases provided by the same folks as DB Browser for SQLite.
I also host a sample of approximately ten percent of the data here. The sample is 27 MB.
The database has three tables: task, language, and submission.
task describes what the program should do. The content field is close to the raw data if you need to look at the source data.
language is a collection of language keys used in the submissions or code blocks. Some language names have been normalized for compatibility; C++ and C# are cpp and csharp, for example. The keys are drawn directly from the Rosetta Code site.
submission is a collection of code samples. A given submission will be linked to a specific task and language, and may include a description (drawn from the preface) and output to provide extra context. The database only contains code samples that were explicitly marked as code; the loose nature of the wiki means that some code samples are embedded in explanatory text without any markup. The export script does not attempt to find these code instances.
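As a rough picture of how the three tables fit together, here is a minimal sketch of a schema that matches the description above. Only the table names and the columns mentioned in this post (id, key, content, task, language, description, code, output) come from the source; the column types, constraints, and the helper names create_empty_database and table_names are assumptions, so check the README for the real layout.

```python
import sqlite3
from typing import List

# Hypothetical schema: table and column names follow the post, but the
# types and constraints are guesses rather than the export's real DDL.
SCHEMA = """
CREATE TABLE task (
    id INTEGER PRIMARY KEY,
    content TEXT                 -- close to the raw wiki text
);
CREATE TABLE language (
    id INTEGER PRIMARY KEY,
    "key" TEXT NOT NULL UNIQUE   -- e.g. 'cpp', 'csharp'
);
CREATE TABLE submission (
    id INTEGER PRIMARY KEY,
    task INTEGER NOT NULL REFERENCES task(id),
    language INTEGER NOT NULL REFERENCES language(id),
    description TEXT,            -- drawn from the preface, may be NULL
    code TEXT NOT NULL,
    output TEXT                  -- may be NULL
);
"""


def create_empty_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create a database with the assumed three-table layout."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(table_names(create_empty_database()))
```

The same table_names helper can be pointed at the downloaded database file to compare the real tables against this sketch.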
Example: Hello World
The classic Hello World! programs are in task id 1514. To fetch the ten longest code examples for printing Hello, World, you can use the query:
    SELECT language.key, code
    FROM submission
    JOIN language ON submission.language=language.id
    WHERE task=1514
    ORDER BY length(code) DESC
    LIMIT 10
As you might imagine, the longest programs are for joke languages and very low-level languages like assembly.
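If you would rather run this kind of query from Python than from the DBHub.io interface, the standard library's sqlite3 module is enough. The sketch below wraps the query in a function with the task id and limit as parameters. The function names and the tiny in-memory stand-in database (with made-up rows) are hypothetical and exist only so the example runs without the download.

```python
import sqlite3
from typing import List, Tuple

# Same query as above, with the task id and row limit as parameters.
LONGEST_QUERY = """
SELECT language.key, code
FROM submission
JOIN language ON submission.language = language.id
WHERE submission.task = ?
ORDER BY length(code) DESC
LIMIT ?
"""


def longest_submissions(
    conn: sqlite3.Connection, task_id: int, limit: int = 10
) -> List[Tuple[str, str]]:
    """Return (language key, code) pairs, longest code first."""
    return conn.execute(LONGEST_QUERY, (task_id, limit)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the real database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE language (id INTEGER PRIMARY KEY, "key" TEXT)')
    conn.execute(
        "CREATE TABLE submission "
        "(id INTEGER PRIMARY KEY, task INTEGER, language INTEGER, code TEXT)"
    )
    conn.executemany(
        'INSERT INTO language (id, "key") VALUES (?, ?)',
        [(1, "python"), (2, "c")],
    )
    conn.executemany(
        "INSERT INTO submission (task, language, code) VALUES (?, ?, ?)",
        [
            (1514, 1, 'print("Hello world!")'),
            (1514, 2, '#include <stdio.h>\nint main(void) { puts("Hello world!"); }'),
            (7, 1, "pass"),  # belongs to a different task, so it is filtered out
        ],
    )
    return conn


if __name__ == "__main__":
    for key, code in longest_submissions(demo_connection(), 1514):
        print(key, len(code))
```

Against the real data, replace demo_connection() with sqlite3.connect() on the path of the downloaded file.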
The rosetta2sqlite script is written in Python, targeting version 3.8 or higher, with no external dependencies. The script is structured as a fairly standard extract-transform-load script and includes unit tests. The script uses data classes and type hints, both of which I’ve found to be very useful in my professional work.
The README file in the project contains more details on the database schema and input format.
Processing the full export takes slightly less than two hours.
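To give a feel for the extract-transform-load shape described above, here is a minimal sketch using only the standard library, data classes, and type hints. It is not the rosetta2sqlite code. It assumes language sections start with a =={{header|Name}}== line and that explicitly marked code sits inside syntaxhighlight tags with a lang attribute; the regular expressions, the Submission class, the simplified one-table layout, and the sample page are all hypothetical.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

# Assumed wiki markup; the real export may use other conventions.
HEADER = re.compile(r"^==[ \t]*\{\{header\|([^}]+)\}\}[ \t]*==[ \t]*$", re.MULTILINE)
CODE = re.compile(r'<syntaxhighlight lang="([^"]+)">(.*?)</syntaxhighlight>', re.DOTALL)


@dataclass(frozen=True)
class Submission:
    task: str
    language: str
    code: str


def extract(wikitext: str) -> Iterator[Tuple[str, str]]:
    """Yield (header name, section body) pairs from one task page."""
    # re.split with one capture group returns
    # [preamble, name1, body1, name2, body2, ...].
    parts = HEADER.split(wikitext)
    for name, body in zip(parts[1::2], parts[2::2]):
        yield name.strip(), body


def transform(task: str, sections: Iterable[Tuple[str, str]]) -> Iterator[Submission]:
    """Keep only code that is explicitly marked up; loose text is ignored."""
    for _name, body in sections:
        for match in CODE.finditer(body):
            yield Submission(task, match.group(1).lower(), match.group(2).strip())


def load(conn: sqlite3.Connection, submissions: Iterable[Submission]) -> int:
    """Insert submissions into a simplified single table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submission (task TEXT, language TEXT, code TEXT)"
    )
    rows = [(s.task, s.language, s.code) for s in submissions]
    conn.executemany("INSERT INTO submission VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


SAMPLE_PAGE = '''Rename a file.
=={{header|OCaml}}==
<syntaxhighlight lang="ocaml">Sys.rename "input.txt" "output.txt"</syntaxhighlight>
=={{header|Python}}==
Loose text that mentions os.rename without any markup is skipped.
<syntaxhighlight lang="python">import os
os.rename("input.txt", "output.txt")</syntaxhighlight>
'''

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    found = transform("Rename a file", extract(SAMPLE_PAGE))
    print(load(connection, found), "submissions loaded")
```

Keeping each stage as a small generator means a page can be parsed and tested in isolation, which suits a script that ships with unit tests and no third-party parser.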
Why no external dependencies?
There are some Python libraries that might have simplified parsing the MediaWiki text, but none passed my personal threshold for supply chain risk for this project. Research software tends to decay rapidly as it is usually secondary to the author’s purpose. Further, as an ETL script, it is even more “one-and-done” because my purpose only needs a snapshot of data, not a continual feed. Thus, maintenance (either by me or those who fork it for their own needs) is greatly simplified:
- The only updates needed are to the language and the standard library
- All the code is in one place
- There is no bit rot from Python’s ever-evolving approach to dependencies