As usual, we encourage the reader to follow along with our online interpreter for this version of Chai. In addition, the source code for this version of Chai is available on this blog’s Google Code page. So go ahead: experiment, discover, and play!
Variables, Variables, Variables!
A land without variables would be quite ugly. Runtimes would be sluggish, superfluous arithmetic expressions would litter every corner, and the architects (we coders) would be quite melancholy. Want to reuse a value or function? Forget about it. Want to inspect a randomly generated number more than once? Yeah right. Want to have code that someone else can actually read? Dream on.
Luckily, we’re on an exodus to a better place. A land of loops, first-class functions, and all sorts of other neat language forms. Let’s begin by implementing variables and variable references.
Substitution, Schmubstitution
The first naive approach to variable references would go something like this: whenever the programmer binds a variable, say “x” is bound to 7, the interpreter goes ahead and replaces every occurrence of the symbol “x” with the value 7. So a (pseudocode) program like
bind x to 7 bind y to 8 x + y
would first evaluate to
bind y to 8 7 + y
and then to
7 + 8
and finally result in 15. Of course, this is fine for simple arithmetic expressions, but what if we want to write a program like
bind x to 1 print x bind x to (x + 1) print x ...
This kind of thing happens all the time in (procedural) programs, specifically when we’re looping over a counting variable. With our current substitution method, this program would evaluate (with a grain of sense) to something counter-intuitive:
print 1 bind x to (1 + 1) print 1 ...
This is obviously not what the programmer has in mind. So what we really want with a variable is to capture the current value of a variable at the time it is referenced. In other words, we want to keep track of the scope of our variables. Plainly speaking, a variable’s scope is the part of the program where its values may be accessed. Usually, and for the remainder of this post, such “parts of programs” will be nested expressions. Note that we will go into more depth about scope in future posts, in particular the difference between dynamic scope and lexical scope, but for now the difference is moot, because our programs consist of but a single expression.
Of course there are ways to avoid conflicting substitutions, but they are more work than the solution we will arrive at anyway. So let’s skip banal direct substitution method and go straight to a more civilized discussion.
Environments
Environments will be our alternative to substitution. An environment is simply a mapping from variable names to values which is updated as a program runs and new variable bindings are introduced. In other words, as our program evaluates it has a new piece of data floating around with it. Returning to our problematic program above, it might evaluate as follows:
bind x to 1 print x bind x to (x + 1) print x
In the first step, we add “x” to the environment and give it a value of one:
[environment: x -> 1] print x bind x to (x+1) print x
When the print statement is reached, the interpreter looks up the appropriate value of $ x$ in the environment, and uses that value. After the subsequent bind statement, the evaluation looks like
[environment: x -> 2, x -> 1] print x
In the environment, we simply use the first binding of $ x$ as our replacement value. The astute reader might ask: “Why bother keeping the first binding of $ x$ around at all?” This is a reasonable question in the simple pseudocode language we have here, but once we implement our syntax form for Chai, we will find virtue in this peculiar decision. In general, the argument in favor of environments will become stronger and stronger as we add more features to Chai. So without further ado, let us design the syntax form for variable bindings and references.
New Syntax: with and ref
Variable references are easy. We simply add a new language that consists of any valid identifier (here, a symbol as judged by Racket). Our EBNF syntax tree now includes the line:
expr = ... | variable-reference
where a variable reference is a string of characters which is recognized as an identifier by Racket. Note that for our purposes we don’t actually care about these kinds of details, and the Racket specification is fairly liberal with identifiers. This means we can use all of the nice descriptive variable names like “number->string”. The specification is more explicit when we implement the additional line of code in our AST:
[ref (id symbol?)]
For binding new variables, we will make variable scope explicit. In particular, we want the programmer to be responsible for telling the interpreter where variables are in scope. This extra burden on the programmer is not so bad. In most languages, the scope of a variable is determined by a set of rules, and here we just trade memorization for explicitness and ease of writing the interpreter. So now we introduce the “with” form, which binds a single variable and describes its scope. Its EBNF:
expr = ... | (with variable-name expr expr)
The first sub-expression corresponds to the value of the new variable, and the second sub-expression is the “body” of the with. In particular, the result of the “with” expression is the result of the body expression, after adding these new bindings. In our interpreter, the body will be the only place that references to this particular variable are granted. For example,
(with x (+ 3 4) (* x x))
evaluates to 49, but the following program, while well-formed, might sensibly raise an error upon evaluation:
(with x 2 (with y 3 z))
Since “z” is a free variable. Here the environment only contains “x” and “y”, and “z” is unbound. Another language designer (say, a designer of Mathematica) might reasonably wish to run programs with free variables evaluating trivially to themselves. Unfortunately, doing this right requires a lot of overhead which we are simply not equipped to handle yet. To a mathematician, for instance, the program “(+ 1 (+ 2 z))” should simplify to “(+ 3 z)”, but our primitive expressions are not set up to allow for simplification. There’s a reason Mathematica is so expensive! So we will take the easy route and raise an error whenever we encounter an unbound variable.
In Racket, our abstract syntax tree now includes the line:
[with (var symbol?) (bound-expr chai-expr?) (body chai-expr?)]
Note that as our language becomes more complex, we can make these language forms more convenient for the programmer. For instance, we may want to bind any number of expressions in a single “with,” instead of nesting an expression for each (way too much typing). This will come in due time. But first, let’s translate all of our ideas into code.
The Parser
Our entire abstract syntax tree now looks like this:
;; a chai expression, as defined in our blog post (define-type chai-expr [num (val number?)] [bool (val boolean?)] [prim (id symbol?) (args list?)] [ref (id symbol?)] [with (var symbol?) (bound-expr chai-expr?) (body chai-expr?)])
Implementing this in our parser is straightforward, but we have some structural changes from last time:
;; parse:unsupported: s-expr -> error ;; raise an error for unsupported language forms (define (parse:unsupported input-expr) (error (string-append "parse: unsupported language form: " (s-expr->string input-expr)))) ;; parse-expr: s-expr -> chai-expr ;; parse an s-expression into a chai expression, or ;; throw an error if it is not well-formed (define (parse-expr input-expr) (cond [(number? input-expr) (num input-expr)] [(eq? 'true input-expr) (bool #t)] [(eq? 'false input-expr) (bool #f)] [(symbol? input-expr) (ref input-expr)] [(list? input-expr) (let ([first-sym (first input-expr)] [list-len (length input-expr)]) (cond [(primitive-symbol? first-sym) (prim first-sym (map parse-expr (rest input-expr)))] [(and (equal? 'with first-sym) (equal? list-len 4)) (let ([id (second input-expr)] [bound-expr (parse-expr (third input-expr))] [body-expr (parse-expr (fourth input-expr))]) (if (symbol? id) (with id bound-expr body-expr) (error 'parse "bad variable name: ~a" id)))] [else (parse:unsupported input-expr)]))] [else (parse:unsupported input-expr)]))
Where the highlighted lines are new. Since we tire of writing (list? input-expr) and (first input-expr), etc., we extract those pieces first (anticipating we may need them for future languagge forms), and then do our usual check for a primitive operation. Finally, we recursively parse the “third” and “fourth” expressions, the binding expression and body expression, respectively, and pack them into a “with” type, raising an error if the variable which we’re trying to bind is not a symbol. For instance, we certainly wouldn’t want to allow the user to rebind things like numbers or parentheses. Though that’s not actually possible with this implementation, we should sensibly signal an error in the parser, since such an attempt is not well-formed.
With the relevant, added test cases, we see our parser is correct, and we turn to the interpreter.
The Interpreter
Now that we need to carry around an environment, we’d like to change our interpreter’s signature to:
;; interp: chai-expr environment -> chai-expr ;; interpret a chai-expression (define (interp input-expr environment) ... )
This raises the obvious question: how are we going to represent an environment? There are many ways, but in particular we don’t want to bind ourselves to a particular implementation (no pun intended). So before we get to the interpreter, let’s develop a “chai-environment.rkt” module which provides an interface for manipulating environments.
This interface should provide functions to create a new set of bindings, add a binding to a list of bindings, and lookup a binding by its identifier. This corresponds neatly to three functions:
;; new-environment: -> environment ;; create a new, empty environment (define (new-environment) ... ) ;; add-binding: environment symbol any -> environment ;; add a new binding to the environment (define (add-binding old-env id val) ... ) ;; lookup-binding: environment id -> any ;; lookup a binding, raising an error if the requested binding is not found. (define (lookup-binding envrionment sought-id) ... )
We will again use define-type to create a data-type for bindings, and type-case to access them.
;; the bindings type, a linked list of binding values (define-type environment [binding (id symbol?) (value any/c) (rest environment?)] [empty-env])
So an “environment” is just the head of a list, containing both a key/value pair, and a list of successive bindings. Implementing the above functions is now easy. For the first two, we just have
;; new-environment: -> environment ;; create a new, empty environment (define (new-environment) (empty-env)) ;; add-binding: environment symbol any -> environment ;; add a new binding to the environment (define (add-binding old-env id val) (binding id val old-env))
We just invoke the appropriate constructor, and in the future if we decide to change our environment structure (say, look things up with hash-tables), the rest of our code is unaffected by the change.
Finally, the look-up function requires actually requires a little bit of logic:
;; lookup-binding: environment id -> any ;; lookup a binding, throwing an error if the requested ;; binding is not found. (define (lookup-binding env sought-id) (type-case environment env [empty-env () (error 'lookup-binding "id ~a is unbound in the environment" sought-id)] [binding (id val rest-env) (if (eq? id sought-id) val (lookup-binding rest-env sought-id))]))
In particular, we recursively search through the list of bindings, comparing each id to the sought-id, and returning the associated value if it is found. Otherwise, we raise an error.
Bringing this back to our interpreter, we need to alter our interp function to accept a new parameter:
;; interp: chai-expr environment -> chai-expr ;; interpret a chai-expression (define (interp input-expr env) ... )
Now, when we see a variable reference, we just use our lookup function to extract the correct value. In the type-case, this adds the new line:
[ref (id) (lookup-binding env id)]
And finally, the with case looks like
[with (var bound-expr body-expr) (let ([value (interp bound-expr env)]) (interp body-expr (add-binding env var value)))]
where we recursively interpret the new binding, and then interpret the body expression with an augmented environment containing the new variable reference.
So, our entire interp function is as follows:
;; interp: chai-expr environment -> chai-expr ;; interpret a chai-expression (define (interp input-expr env) (type-case chai-expr input-expr [num (val) val] [bool (val) val] [ref (id) (lookup-binding env id)] [with (var bound-expr body-expr) (let ([value (interp bound-expr env)]) (interp body-expr (add-binding env var value)))] [prim (id args) (let* ([interpreted-args (map (λ (arg) (interp arg env)) args)] [operation (type-check/get-op id interpreted-args)]) (operation interpreted-args))]))
Note that in order to maintain our other language forms (in particular, prim), we need to alter our recursive calls to interp to pass along the environment appropriately.
And that’s all there is to it! As usual, we have a host of test cases which prove the correctness of the program, and this code is available on this blog’s Google Code page. And, of course, the reader is invited to evaluate programs at this blog’s interactive online interpreter. Happy coding!
Observations, and a Peek into the Future
Before we close, we note some interesting features of Chai so far. First, we can do some weird things with variables. In particular, while we can’t yet overwrite bindings of primitive operations, we can use primitive symbols as variable names. Here is a bizarre program that actually runs:
(with + 7 (+ + +))
Technically, we haven’t “overwritten” the plus primitive operator, since it still functions as we expect it to. In the future, we may experiment with allowing the programmer to locally overshadow primitive operations with their own functions (if only just for fun).
Second, we recognize a very important choice we made in designing the with form. Specifically, in the interpreter, we first interpret the bound expression, and then bind it to the variable. The astute reader might ask, why bother? Why don’t we just leave the expression unevaluated until something references it? For instance, this would allow us to write programs like:
(with x (/ 1 0) (+ 2 3))
Which would run without error. It turns out that this choice has a name! It’s called “laziness,” and it’s used as the default in some significant programming languages, including Haskell. Haskell has a number of other very interesting features, especially pertaining to the study of logic and semantics. For instance, the Haskell compiler refuses to run a program unless it can prove that the program outputs the correct type at each step, and hence guarantee that the program cannot perform certain misbehaviors. We find this all quite fascinating, and add it to our lengthy list of Wonderful Things to Learn and Implement.
Until then!