While you read along, why not try evaluating programs through our interactive interpreter? As usual, the source code for all the work done in this post is available at this blog’s Google Code page. For more information on Racket, see the Quick Introduction to Racket or the more extensive Racket Guide.
Our Very First Language
Recall from our introduction that our theoretical design cycle for new features in Chai is as follows:
- Add new syntax forms
- Decide how those forms should be interpreted
So let us implement the most basic language we can imagine, with just two language forms: one for numbers and one for adding numbers. By starting with something so simple, we will have a skeleton of a parser and interpreter to which we may add new features incrementally, and we won’t get bogged down in the details of more complicated ideas.
So let’s decide what programs will look like!
To rigorously define what expressions will look like, we use a nice standard for descriptions called Extended Backus-Naur Form (EBNF). For those familiar with the theory of computation, this is very closely related to the notations for defining context-free grammars. EBNF notation is quite easy to learn by imitation, and we will teach it by example. Our first attempt at a syntax tree might look something like this:
expr = number | (+ number number)
Here the pipe symbol, |, should be mentally replaced with the word “or.” For now, every program will consist of a single “expr,” which is short for expression. In words, an expression can either be a number, or a sum of two numbers. For the sake of this language, we allow “number” to be anything that Racket considers a number (we will get to this soon), and we ignore additional whitespace in between these tokens.
Finally, one notices the odd placement of parentheses and the plus symbol. The parentheses represent an application of a binary operation, and the position of the plus is a notational standard called Polish notation. These aspects of the syntax may seem odd at first, but they happen to make our lives much simpler by eliminating all the hard parts of parsing expressions. And, of course, every logician knows the advantage of Polish notation: there is absolutely no ambiguity in how to read an expression. These subtle details rear their ugly heads in compiler design, and we may come back to them in the distant future. But for now we ask that the reader accept it, and get used to it, because it turns out the entire Racket language (and Lisp family of languages) is based on this notation.
Here are a three examples of well-formed programs we could write in this early version of Chai:
7 (+ 14 9) (+ 4 10 )
Unfortunately, there are some very simple programs we cannot yet write in Chai. For instance, the following program does not fit into our specification:
(+ 1 (+ 2 3))
Rigorously speaking, this form is not included in the above EBNF program syntax, because the arguments to a sum may only be numbers, and not other expressions. With the obvious self-referential modification, we fix this.
expr = number | (+ expr expr)
Now we may chain additions indefinitely. With the appropriate extension, we could easily extend this to include all binary operations we desire. For instance:
binop = + | - | * | / | ^ | < | ... expr = number | (binop expr expr)
We will do this in the future, but for now, let’s translate these two syntactic forms into Racket code.
Define-Type, and Type-Case
In order to implement our syntax tree, we’d like some sort of internal representation for a language form. Of course, we could just package everything as a list of strings, but our foresight tells us this would get old quick. A structured solution is relatively straightforward; and the code itself turns out to look just like our EBNF tree, but with some additional names for things. In short: we want a datatype that represents each language form, and encapsulates the types of its arguments. In the language of compilers, such a datatype is called an abstract syntax tree (sometimes abbreviated to AST). Here is our first AST for Chai, implemented in Racket.
;; a chai expression (define-type chai-expr [num (val number?)] [sum (lhs chai-expr?) (rhs chai-expr?)])
We explain the notation for those of us not so familiar with Racket: the semicolons start a comment line, which is not part of the program. The variable names in Racket (also called identifiers) allow for many characters that other languages prohibit: hyphens, slashes, question marks, etc., are all valid to use in identifiers. So for those experienced with other languages, don’t mistakenly think these are subtraction operations! Finally, the (define-type chai-expr …) language form defines a new type called ‘chai-expr’, and the following sub-expressions are subtypes. For Java and C/C++ users, this is essentially a shorthand for a bunch of public classes which all inherit the chai-expr interface (an empty interface, that is). The define-type form was specifically designed for creating these abstract syntax trees, and it does a pretty good job of it. Here, each subtype looks like
[type-name (field-1 type-1?) (field-2 type-2?) ...]
where ‘type-name’ and ‘field-j’ are identifiers for all j, and ‘type-j?’ are functions which accept one argument, and return true if the argument is of a specific type. For instance, ‘number?’ is a function which accepts one argument and returns true if and only if the argument is a number (as defined by Racket). For user-defined types, Racket creates these ‘type?’ functions automatically. In addition, Racket creates functions for us to create new instances of these types and access their fields.
This is easier to explain with examples. For instance, I could create some objects and then access their various fields:
> (define y (num 1)) > (num-val y) 1 > (define x (sum (num 2) (num 3))) > (sum-lhs x) (num 2) > (num-val (sum-rhs x)) 3
So ‘num-val’ accesses the ‘val’ field of the ‘num’ type, and so forth for each type we create. This is fine and dandy, but most of the time we won’t know whether our given piece of data is a ‘num’ or a ‘sum’ type. Instead, we will just know it is a ‘chai-expr’ and we’ll have to figure out which subtype it corresponds to. Luckily, Racket has our back again, and provides us with the ‘type-case’ language form. We might use it like this:
> (define my-expr (num 5)) > (type-case chai-expr my-expr [num (val) (string-append "I got a num! It was " (number->string val))] [sum (lhs rhs) "I got a sum! This is rad!"] [else "Getting here is an existential crisis."]) "I got a num! It was 5"
The first argument (for us, ‘chai-expr’) describes which type to inspect, the second is the argument of that type (here, a ‘num’ object), and the subsequent [bracketed] clauses provide cases for each subtype one wants to consider, optionally with an ‘else’ clause which we included superfluously. The Racket interpreter determines which clause is appropriate (here, the ‘num’ clause), and binds the actual arguments of the input (in this case, 5) to the parenthetic field identifiers (in this case, (val)), so that one may use them in the following expression (here, the call to string-append).
As it turns out, define-type and type-case is all the machinery we need to get things working. But before we continue, I should mention that these two functions are not native to Racket. In fact, they come from a package called plai, and they were created using Racket’s nice system for macros. In other words, in Racket one can write new language forms for Racket! We won’t cover those here, but any programming enthusiasts out there might have a lot of fun with exploring the possibilities therein.
Once we can translate an expression into branches of our abstract syntax tree, we will find that writing the actual interpreter is extremely easy. So let’s do the hard part first: parsing.
Of course, our choice of Racket-like syntax was in part because parsing such a syntax is relatively easy. In particular, Racket has a nice function that takes a string and converts it into a list of symbols, strings, and numbers (and other certain primitive data types). For instance,
> (read (open-input-string "(hello (there! 7 (+ 2 4)))")) '(hello (there! 7 (+ 2 4)))
Here the leading quote is shorthand for Racket’s “quote” function. The official name for a quoted expression (and what “read” outputs) is s-expression. An s-expression is either a simple value (number, string of characters, boolean, or symbol), or a list of s-expressions. All words within a quoted expression are interpreted as symbols For more on quote, see the Racket Guide’s section on it.
So with an appropriate call to read, we can take a user’s input string and get an organized list of values. In our source code, we implement a more complex read function, which is available in the “chai-utils.rkt” source file on this blog’s Google Code page. With some additional checks to ensure there is exactly one expression to be read, we call this function “read/expect-single.” Its details are decidedly uninteresting, but if the reader is curious, one may find its internals displayed in the aforementioned source file. Similarly, we wrote a function called “expr->string” which accepts an s-expression and prints it out as a string.
As the reader might anticipate, once we have our inputs in the form of a list, parsing becomes a recursive cake-walk. Specifically, we would start with a shell of a function:
;; parse-expr: s-expr -> chai-expr ;; parse an s-expression into a chai expression, or throw an error ;; if it is not well-formed (define (parse-expr input-expr) (cond [(number? input-expr) <do something>] [(list? input-expr) <do something>] [else (error (string-append "parse: unsupported language form: " (expr->string input-expr)))]))
So parse-expr accepts an s-expression, and spits out a well-formed chai-expression, defaulting to an error. Here we use the “cond” language form, which is the Racket analogue to “switch” in C/C++/Java. For everyone else, it allows us to string together a number of conditions without writing cumbersomely many nested if/then/else statements. In each branch of the cond, we narrow down what our possible expression could be. If it satisfies number?, then our expression is a num, and if it satisfies list?, it is likely a sum. From here filling in the <do something> parts is easy: we simply construct the appropriate types:
;; parse-expr: s-expr -> chai-expr ;; parse an s-expression into a chai expression, or throw an error ;; if it is not well-formed (define (parse-expr input-expr) (cond [(number? input-expr) (num input-expr)] [(list? input-expr) (sum (parse-expr (second input-expr)) (parse-expr (third input-expr)))] [else (error (string-append "parse: unsupported language form: " (expr->string input-expr)))]))
Here, “second” and “third” extract the second and third elements of the list, and parse-expr recursively evaluates the arguments (as we noted above when we said a sum had the syntax (+ expr expr)). However, it appears we missed something big: what if the list doesn’t have three elements in it? Someone could try to run the program “(square 7),” expecting a result of 49. This is certainly not a sum, but as of now our parser doesn’t make any distinctions. So we need to add a few more checks. Here is the complete parser, with all appropriate conditions checked and combined using the “and” function:
;; parse-expr: s-expr -> chai-expr ;; parse an s-expression into a chai expression, or throw an error ;; if it is not well-formed (define (parse-expr input-expr) (cond [(number? input-expr) (num input-expr)] [(and (list? input-expr) (eq? (first input-expr) '+) (eq? 3 (length input-expr))) (sum (parse-expr (second input-expr)) (parse-expr (third input-expr)))] [else (error (string-append "parse: unsupported language form: " (expr->string input-expr)))]))
We check to make sure the first thing in the list is the symbol ‘+, and that the list has exactly three elements. Then we can be sure that the user meant to put in a sum.
The parse-expr function, along with our following “interp” function and the tests for both functions, will be stored in the “chai-basic.rkt” source code file on this blog’s Google Code page.
At this point, we’ve parsed our expressions into the chai-expr datatype, and so now a simple application of type-case is all we need to interpret them. Indeed, the interp function practically writes itself:
;; interp: chai-expr -> number ;; interpret a chai-expression (define (interp input-expr) (type-case chai-expr input-expr [num (val) val] [sum (lhs rhs) (+ (interp lhs) (interp rhs))]))
We don’t need to do any more conditional checking, because we know that anything fed to interp is well-formed. Later, specifically once we add variable references, interp will become much more interesting.
Here the plus function is Racket’s plus. Of course, it seems a bit silly to use + to interpret +, but remember that the point of this series is not to write a language from the ground up, but to get things rolling as quickly as possible, so that we may analyze the more interesting features of programming language semantics. Simply put, arithmetic is boring. We include it simply for familiarity, and because it makes good fodder for writing test cases to ensure our interpreter acts as it should.
Finally, we add one additional function which executes the entire parse/interp chain:
;; evaluate/chai: string -> any ;; perform entire the parse/interp chain (define (evaluate/chai input) (interp (parse input)))
And (at the top of our source file), we “provide” the function so that other Racket programs may access it, specifically with the command (require “chai-basic.rkt”). All of our interpreters in this series will adhere to the same externally-facing interface.
So there we have it! If the reader downloads the source files, he can interpret expressions through Racket’s interactive interpreter. Additionally, the reader is invited to visit our website, where we have set up a program to receive and evaluate chai-programs through the internet. In the future we will store all of our online interpreters here, so one will be able to access Chai at all stages of its development.
Next time, we will finish off a full set of arithmetic operations, and start looking at variables. Until then!