Chai – Functions with Applications

As usual, we invite the reader to follow along with the source code from this blog’s Google Code page, and to test programs with the interactive online interpreter.

One interpretation of a "Chai function."

Reserved Words, Values, and Ifs

In our last version of Chai we noticed a few things that needed fixing, so we’ll address those before moving on. First, consider the following chai program:

(with true false (or true true))

The user, being as odd as users will be, wants to bind the value “false” to the variable whose name is “true,” in a dirty attempt to trick somebody. The last version of Chai (see, Chai – Environments and Variables) interpreted this expression as “true,” and rightfully so in this mathematician’s opinion, since truth is not something which can be changed!

On the other hand, something just doesn’t feel right about this program. We can either allow such programs to be written (and wag a stern finger at all those who do!), or forbid the program from continuing once we find such a transgression. The latter makes more sense, and it introduces reserved words into Chai. As of the end of this post, our list of reserved words contains:

true false with if end pair fun

And we will add to this list as necessary. It’s ambiguous whether we should include things like +, -, *, /, etc. Someone may want to redefine what “plus” means, and that might not be an awful thing, say, once every two thousand years. For now, we will leave it alone. At least, we may stop someone from writing the appalling program:

(with with 1 (with with (with with with with) with))

All this requires to our code is that during parsing, whenever we inspect a variable binding (or list of variable bindings, or list of function arguments), we must ensure none are reserved words. This is straightforward enough, and the interested reader may inspect the source code.

Next, we realized ahead of time the need to discriminate between the input to our interpreter and its output. This is a direct result of our desire to add support for functions and basic lists. Now we have a new EBNF for “Chai values”:

chai-value = number
           | boolean
           | string
           | (pair chai-value chai-value)
           | end

One might notice that we have added strings, pairs (lists are nested series of pairs), and end, a null-value, named so to signify the terminating value of a list. In addition, we’ve added primitives for a whole host of operations, including: =, !=, <=, >=, <, >, pair (pair construction), end (list termination), first and rest (pair extraction), and end? (testing if a list is empty). We will certainly add more in the future, but for now this is enough.

In Racket code, we have the following new data type:

(define-type chai-value
  [num-v (val number?)]
  [bool-v (val boolean?)]
  [str-v (val string?)]
  [pair-v (head chai-value?) (tail chai-value?)]
  [end-v])

The new discrimination between Chai expressions and Chai values affects every branch of our interpreter and in every primitive operation, but only in how we accept and package up our data types. The interested reader can check out the changes in the source code; we will not discuss them here. (But we will mention that none of the changes occur in the type-checker. Nice.)

Finally, we’ve added “if” expressions for control flow. In the EBNF, an if looks like:

expr = ...
     | (if expr expr expr)

Where the first expression is the “test” expression, and the second is the “then” expression, and the third is the “else” expression. This translates directly to Racket as:

(define-type chai-expr
   ...
   [if? (test chai-expr?) (then-branch chai-expr?) (else-branch chai-expr?)])

Because the symbol ‘if’ conflicts with the Racket language form, we rename it ‘if?’; note that in Chai programs, the language form is still “if”. The parser is straightforward for if expressions, so we’ll jump straight to the interpreter. The new clause is:

(define (interp input-expr env)
  (type-case chai-expr input-expr
    ...
    [if? (test-expr then-branch else-branch)
         (let ([test-result (interp test-expr env)])
           (type-case chai-value test-result
             [bool-v (test-value) (interp (if test-value then-branch else-branch) env)]
             [else (error 'interp "test expression was not a boolean: ~a"
                          (chai-expr->string test-expr))]))]))

Note that we need to ensure the result of evaluating the test-expression is a boolean, and we ignore the unevaluated branch. For instance, the following program will run in Chai:

(if true 7 (/ 1 0))

even when the else branch will throw an error as a standalone program.

Functions

Finally, we’ve reached a very important milestone in Chai: figuring out what to do with functions. Let’s start with the basic syntax. In EBNF, a function has the form

expr = ...
     | (fun (vars ...) expr)

Where “vars …” indicates the presence of zero or more identifiers representing input variables, which are collectively referred to as “vars”. For instance, here is a Chai program which evaluates to the identity function:

(fun (x) x)

In Chai, we will treat functions the same way we treat any other primitive value. One can pass them around as if they were a number. For instance, here is a function which accepts one argument x, and returns a function which accepts one argument y and adds x to y:

(fun (x) (fun (y) (+ x y)))

Applying this function to a number returns a new function. In some languages, functions are required to be defined “at the top level,” i.e., given special names and put in special places. Since Chai expressions still consist of a single expression, we don’t have anywhere else to put them. So if one wants to “name” a function one can, but as it is most functions will be “anonymous.”

Speaking of function applications, they look like:

expr = ...
     | (expr expr ...)

Where the first expression should result in a function, and the remaining zero or more results should result in arguments to the function. Note that syntactically, these “shoulds” are irrelevant.

So here is a program which applies the previous example, which we name “make+=”, to some values:

(with make+= (fun (x) (fun (y) (+ x y)))
   (with +=7 (make+= 7)
     (pair (+=7 9) (pair (+=7 14) end))))

The advanced readers will recognize this as an explicit form of “currying” (accepting a partial list of arguments). We may in the future decide to have Chai support implicit currying, so that we may rewrite the above program as something like:

(with +=7 (+ 7)
  (pair (+=7 9) (pair (+=7 14) end)))

But for now let’s focus on the implementation.

The Parser

In the parser, our new branches look like:

[(and (equal? 'fun first-thing)
      (equal? 3 list-len)
      (list? (second input-expr)))
 (let ([vars (check-vars (second input-expr))]
       [body (parse-expr (third input-expr))])
   (function vars body))]
[else
 (app (parse-expr first-thing)
      (map parse-expr (rest input-expr)))]))] 

To elaborate, these two branches are wrapped within the condition that the expression is a list. If the first element of the list is not a reserved keyword like “with”, “if”, or “fun”, we assume the expression is a function application. Indeed, we could have some weird expressions that evaluate to functions, so we can’t discriminate any further here.

For functions, we have to ensure some things about the list of input variables. For instance, we can’t allow the user to repeat the same variable name twice, or use any reserved words, so we encapsulate both of those checks into a function called “check-vars”:

;; check-vars: listof symbol -> listof symbol
;; check a list of variable names for invalidity
(define (check-vars names)
  (begin
    (map (λ (name) (when (reserved? name) (parse:bad-identifier name))) names)
    (when (has-duplicates? names) (error 'parse "list of variables has duplicates: ~a" names))
    names))

We leave the implementation of has-duplicates? as an exercise to the reader (with the solution in chai-utils.rkt), and reserved? is just a membership test in a list of symbols (in chai-ast.rkt).

While it has its details, the parser is the easy part of implementing functions. We didn’t actually have to make any important decisions. We’ll see the semantics are a bit more complicated.

Scope, and Closures

In general, the scope of a variable x refers to the places in the program where we can inspect the value of x. For instance, in the following program x is only in scope within the “with” expression.

(with y (+ x y)
  (with x 7) y)

So, in fact, the reference to x before the nested with statement is invalid, and this program will raise an error upon evaluation. Note that y is in scope in the nested with. Recalling how our interpreter works, this is obvious, because we only augment the environment with new variables after encountering their definition in a with.

To illustrate the confusion arising in discussions about scope, we have to investigate functions. In particular, the definition of a function can refer to a variable defined somewhere else, but the body of the function isn’t evaluated until it’s actually called. Since functions can be passed around like numbers, it raises ambiguity as to which variable we should refer to during a function application.

For instance, we could write the following program and ask what it should do:

(with my-function (with x 33
                     (fun (y) (+ x y)))
   (with x 44 (my-function 55)))

Should this program evaluate to 88 or 99? Before continuing, make a guess as to what will happen (and try to argue why your choice is the best option), and verify it by running the program through our interactive interpreter.

In fact, there is no completely “correct” answer to this question, and it highlights the differences between lexical and dynamic scope. In the latter, a variable will always refer to the most recent binding which is in scope. In other words, if Chai had dynamic binding, the program above would evaluate to 99. On the other hand, if we implemented lexical scope, “my-function” would retain the information about what variables were in scope during its definition, and so the resulting call would evaluate to 88.

The engaged reader will have by now verified that Chai has lexical scope, and not dynamic scope (in the future, we may provide an alternative version of Chai which has dynamic scope). Most contemporary programming languages implement lexical scope, because it just so happens that otherwise the logic flow in a program is harder to determine just by reading it. However, some popular languages originally had dynamic scope, and later decided to tack on support for lexical scoping. Sometimes lexical scope is referred to as static scope.

In order to support lexical scope in Chai, we have to now discriminate between a function definition, and the function definition combined with the variables that are in scope during definition. In other words, we need the result of interpreting a function definition to contain both pieces of information. In particular, we call this piece of data a closure.

Now the chai-value datatype has an additional branch:

chai-value = number
           | boolean
           | string
           | (pair chai-value chai-value)
           | end
           | (closure environment (vars...) chai-expr)

Where the environment is our datatype for variable bindings, the vars… is the same list of function variables from our function definitions, and the chai-expr is the body expression, in unevaluated form. (Indeed, we can’t evaluate the body until it’s applied to some arguments!)

So this adds one more branch to our type-case:

(define-type chai-value
   ...
   [closure-v (env environment?) (vars (listof symbol?)) (body chai-expr?)])

Now we are ready to interpret functions, and see what we can do with them.

The Interpreter

In fact, interpreting function definitions is easy! Since we have already been keeping track of the environment all along, we can package it up in a closure in a single line:

(define (interp input-expr env)
  (type-case chai-expr input-expr
     ...
    [function (vars body) (closure-v env vars body)]))

Applying a function is quite a bit more work. In particular, we have two things we need to check before believing that the user knows how to write programs:

  • Is the first thing in a function application a function? (Does it evaluate to a closure?)
  • Does the number of provided arguments match the number of accepted arguments?

If neither of these are the case, we need to raise an error. Assuming these requirements are met, we can perform the actual function application as follows:

(let* ([interped-args (map (λ (arg) (interp arg env)) args)]
       [new-env (foldl add-binding closure-env vars interped-args)])
  (interp body new-env))

where args, vars, closure-env, and body are all the expected parts of our function application and closure. Note that we are very specific to augment the closure environment with the new bindings for the function arguments. This is the key of lexical scope, whereas dynamic scope would use the standard environment.

With all of the case-checking, the branch in our interpreter is a bit longer:

(define (interp input-expr env)
  (type-case chai-expr input-expr
    ...
   [app (clo args)
         (let ([expected-clo (interp clo env)])
           (type-case chai-value expected-clo
              [closure-v (closure-env vars body)
                         (let ([arg-count (length args)]
                               [var-count (length vars)])
                           (when (not (eq? arg-count var-count))
                             (error 'interp "expected ~a args for a function application, but got ~a" var-count arg-count))
                           (let* ([interped-args (map (λ (arg) (interp arg env)) args)]
                                  [new-env (foldl add-binding closure-env vars interped-args)])
                             (interp body new-env)))]
              [else (error 'interp "invalid function in function application: ~a" (chai-expr->string expected-clo))]))]))

Try to follow the logic above. Even for the author, determining what someone else’s code does is a nontrivial process, so don’t worry if it’s overwhelming at first. Just remember that most of it is error-checking, and the key part is the three lines where we actually perform the function application. Of course, to get an even better understanding, try to run programs in our interactive interpreter which hit every error case, and compare that with the flow of logic in the code above.

Recursion, But Not Quite Easy Enough

The experienced reader might be asking herself “Why haven’t we implemented any loops yet? We can’t really do anything interesting without loops.” And she is correct! A programming language cannot be Turing-complete (informally, have the true power of a computer) if it can’t branch out, or make programs which run indefinitely.

As of now, we couldn’t, say, write the function map (see our functional programming primer, A Taste of Racket, for a definition). If we try, we get a nasty error:

(with map (fun (f list)
            (if (end? list)
                end
                (pair (f (first list)) (map f (rest list)))))
  (map (fun (x) (+ 1 x)) (pair 7 (pair 6 (pair 5 end)))))

It complains that “map” is not bound in the current environment, which is a shame, because map is a really great function to have around, and we can’t build it yet (well, that’s not quite true, but it wouldn’t be as straightforward as what we have above).

It turns out that even though we can’t quite do this kind of recursion with names, we can still achieve infinite loops in Chai! We challenge the reader to figure out how to do it, and see what our interactive interpreter does in response. As a hint, the shortest solution the author has found is a mere 33 characters long, including spaces and parentheses, and it only requires functions and applications.

Some Further Issues to Address

We recall saying we would like the user to be able to redefine what “+” means via a with expression. However, upon trying to evaluate the following program which “adds” two functions, we get an error:

(with + (fun (f g) (fun (arg) (+ (f arg) (g arg))))
  ((+ (fun (x) (+ x 1)) (fun (y) (* y  2))) 7))

In order to fix this, we need to stop distinguishing between primitive operations and functions which are bound to symbols in our environment. In other words, we want to pre-populate the environment with bindings for primitive functions, which can be overwritten by the programmer. This will simplify our code significantly, since we can move even more of the logic outside of the interpreter and parser, and think of everything as a function application. This is something we have to look forward to in the future.

As long as we’re thinking of everything as a function application, why don’t we simplify Chai even further so that we only have functions and applications? No numbers, no primitives, no conditionals. What sort of language would we get? We’ll see just what happens with a little intermezzo next time. Of course, we will only do this temporarily to explore its theoretical properties, and then we’ll return to the civilized world.

Until then!

Chai – Environments and Variables

As usual, we encourage the reader to follow along with our online interpreter for this version of Chai. In addition, the source code for this version of Chai is available on this blog’s Google Code page. So go ahead: experiment, discover, and play!

The Dr. Racket programming environment (not to scale, color, font, or positioning) 🙂

Variables, Variables, Variables!

A land without variables would be quite ugly. Runtimes would be sluggish, superfluous arithmetic expressions would litter every corner, and the architects (we coders) would be quite melancholy. Want to reuse a value or function? Forget about it. Want to inspect a randomly generated number more than once? Yeah right. Want to have code that someone else can actually read? Dream on.

Luckily, we’re on an exodus to a better place. A land of loops, first-class functions, and all sorts of other neat language forms. Let’s begin by implementing variables and variable references.

Substitution, Schmubstitution

The first naive approach to variable references would go something like this: whenever the programmer binds a variable, say “x” is bound to 7, the interpreter goes ahead and replaces every occurrence of the symbol “x” with the value 7. So a (pseudocode) program like

bind x to 7
bind y to 8
x + y

would first evaluate to

bind y to 8
7 + y

and then to

7 + 8

and finally result in 15. Of course, this is fine for simple arithmetic expressions, but what if we want to write a program like

bind x to 1
print x
bind x to (x + 1)
print x
...

This kind of thing happens all the time in (procedural) programs, specifically when we’re looping over a counting variable. With our current substitution method, this program would evaluate (with a grain of sense) to something counter-intuitive:

print 1
bind x to (1 + 1)
print 1
...

This is obviously not what the programmer has in mind. So what we really want with a variable is to capture the current value of a variable at the time it is referenced. In other words, we want to keep track of the scope of our variables. Plainly speaking, a variable’s scope is the part of the program where its values may be accessed. Usually, and for the remainder of this post, such “parts of programs” will be nested expressions. Note that we will go into more depth about scope in future posts, in particular the difference between dynamic scope and lexical scope, but for now the difference is moot, because our programs consist of but a single expression.

Of course there are ways to avoid conflicting substitutions, but they are more work than the solution we will arrive at anyway. So let’s skip banal direct substitution method and go straight to a more civilized discussion.

Environments

Environments will be our alternative to substitution. An environment is simply a mapping from variable names to values which is updated as a program runs and new variable bindings are introduced. In other words, as our program evaluates it has a new piece of data floating around with it. Returning to our problematic program above, it might evaluate as follows:

bind x to 1
print x
bind x to (x + 1)
print x

In the first step, we add “x” to the environment and give it a value of one:

[environment: x -> 1]
print x
bind x to (x+1)
print x

When the print statement is reached, the interpreter looks up the appropriate value of x in the environment, and uses that value. After the subsequent bind statement, the evaluation looks like

[environment: x -> 2, x -> 1]
print x

In the environment, we simply use the first binding of x as our replacement value. The astute reader might ask: “Why bother keeping the first binding of x around at all?” This is a reasonable question in the simple pseudocode language we have here, but once we implement our syntax form for Chai, we will find virtue in this peculiar decision. In general, the argument in favor of environments will become stronger and stronger as we add more features to Chai. So without further ado, let us design the syntax form for variable bindings and references.

New Syntax: with and ref

Variable references are easy. We simply add a new language that consists of any valid identifier (here, a symbol as judged by Racket). Our EBNF syntax tree now includes the line:

expr = ...
     | variable-reference

where a variable reference is a string of characters which is recognized as an identifier by Racket. Note that for our purposes we don’t actually care about these kinds of details, and the Racket specification is fairly liberal with identifiers. This means we can use all of the nice descriptive variable names like “number->string”. The specification is more explicit when we implement the additional line of code in our AST:

[ref (id symbol?)]

For binding new variables, we will make variable scope explicit. In particular, we want the programmer to be responsible for telling the interpreter where variables are in scope. This extra burden on the programmer is not so bad. In most languages, the scope of a variable is determined by a set of rules, and here we just trade memorization for explicitness and ease of writing the interpreter. So now we introduce the “with” form, which binds a single variable and describes its scope. Its EBNF:

expr = ...
     | (with variable-name expr expr)

The first sub-expression corresponds to the value of the new variable, and the second sub-expression is the “body” of the with. In particular, the result of the “with” expression is the result of the body expression, after adding these new bindings. In our interpreter, the body will be the only place that references to this particular variable are granted. For example,

(with x (+ 3 4) (* x x))

evaluates to 49, but the following program, while well-formed, might sensibly raise an error upon evaluation:

(with x 2 (with y 3 z))

Since “z” is a free variable. Here the environment only contains “x” and “y”, and “z” is unbound. Another language designer (say, a designer of Mathematica) might reasonably wish to run programs with free variables evaluating trivially to themselves. Unfortunately, doing this right requires a lot of overhead which we are simply not equipped to handle yet. To a mathematician, for instance, the program “(+ 1 (+ 2 z))” should simplify to “(+ 3 z)”, but our primitive expressions are not set up to allow for simplification. There’s a reason Mathematica is so expensive! So we will take the easy route and raise an error whenever we encounter an unbound variable.

In Racket, our abstract syntax tree now includes the line:

[with (var symbol?) (bound-expr chai-expr?) (body chai-expr?)]

Note that as our language becomes more complex, we can make these language forms more convenient for the programmer. For instance, we may want to bind any number of expressions in a single “with,” instead of nesting an expression for each (way too much typing). This will come in due time. But first, let’s translate all of our ideas into code.

The Parser

Our entire abstract syntax tree now looks like this:

;; a chai expression, as defined in our blog post
(define-type chai-expr
  [num (val number?)]
  [bool (val boolean?)]
  [prim (id symbol?) (args list?)]
  [ref (id symbol?)]
  [with (var symbol?) (bound-expr chai-expr?) (body chai-expr?)])

Implementing this in our parser is straightforward, but we have some structural changes from last time:

;; parse:unsupported: s-expr -> error
;; raise an error for unsupported language forms
(define (parse:unsupported input-expr)
  (error (string-append "parse: unsupported language form: "
                        (s-expr->string input-expr))))

;; parse-expr: s-expr -> chai-expr
;; parse an s-expression into a chai expression, or
;;  throw an error if it is not well-formed
(define (parse-expr input-expr)
  (cond [(number? input-expr) (num input-expr)]
        [(eq? 'true input-expr) (bool #t)]
        [(eq? 'false input-expr) (bool #f)]
        [(symbol? input-expr) (ref input-expr)]
        [(list? input-expr)
         (let ([first-sym (first input-expr)]
               [list-len (length input-expr)])
           (cond [(primitive-symbol? first-sym)
                  (prim first-sym (map parse-expr (rest input-expr)))]
                 [(and (equal? 'with first-sym) (equal? list-len 4))
                  (let ([id (second input-expr)]
                        [bound-expr (parse-expr (third input-expr))]
                        [body-expr (parse-expr (fourth input-expr))])
                    (if (symbol? id) (with id bound-expr body-expr)
                        (error 'parse "bad variable name: ~a" id)))]
                 [else (parse:unsupported input-expr)]))]
        [else (parse:unsupported input-expr)]))

Where the highlighted lines are new. Since we tire of writing (list? input-expr) and (first input-expr), etc., we extract those pieces first (anticipating we may need them for future languagge forms), and then do our usual check for a primitive operation. Finally, we recursively parse the “third” and “fourth” expressions, the binding expression and body expression, respectively, and pack them into a “with” type, raising an error if the variable which we’re trying to bind is not a symbol. For instance, we certainly wouldn’t want to allow the user to rebind things like numbers or parentheses. Though that’s not actually possible with this implementation, we should sensibly signal an error in the parser, since such an attempt is not well-formed.

With the relevant, added test cases, we see our parser is correct, and we turn to the interpreter.

The Interpreter

Now that we need to carry around an environment, we’d like to change our interpreter’s signature to:

;; interp: chai-expr environment -> chai-expr
;; interpret a chai-expression
(define (interp input-expr environment)
   ... )

This raises the obvious question: how are we going to represent an environment? There are many ways, but in particular we don’t want to bind ourselves to a particular implementation (no pun intended). So before we get to the interpreter, let’s develop a “chai-environment.rkt” module which provides an interface for manipulating environments.

This interface should provide functions to create a new set of bindings, add a binding to a list of bindings, and lookup a binding by its identifier. This corresponds neatly to three functions:

;; new-environment: -> environment
;; create a new, empty environment
(define (new-environment) ... )

;; add-binding: environment symbol any -> environment
;; add a new binding to the environment
(define (add-binding old-env id val) ... )

;; lookup-binding: environment id -> any
;; lookup a binding, raising an error if the requested binding is not found.
(define (lookup-binding envrionment sought-id) ... )

We will again use define-type to create a data-type for bindings, and type-case to access them.

;; the bindings type, a linked list of binding values
(define-type environment
  [binding (id symbol?) (value any/c) (rest environment?)]
  [empty-env])

So an “environment” is just the head of a list, containing both a key/value pair, and a list of successive bindings. Implementing the above functions is now easy. For the first two, we just have

;; new-environment: -> environment
;; create a new, empty environment
(define (new-environment)
  (empty-env))

;; add-binding: environment symbol any -> environment
;; add a new binding to the environment
(define (add-binding old-env id val)
  (binding id val old-env))

We just invoke the appropriate constructor, and in the future if we decide to change our environment structure (say, look things up with hash-tables), the rest of our code is unaffected by the change.

Finally, the look-up function requires actually requires a little bit of logic:

;; lookup-binding: environment id -> any
;; lookup a binding, throwing an error if the requested
;;  binding is not found.
(define (lookup-binding env sought-id)
  (type-case environment env
    [empty-env () (error 'lookup-binding
                         "id ~a is unbound in the environment"
                         sought-id)]
    [binding (id val rest-env) (if (eq? id sought-id) val
                               (lookup-binding rest-env sought-id))]))

In particular, we recursively search through the list of bindings, comparing each id to the sought-id, and returning the associated value if it is found. Otherwise, we raise an error.

Bringing this back to our interpreter, we need to alter our interp function to accept a new parameter:

;; interp: chai-expr environment -> chai-expr
;; interpret a chai-expression
(define (interp input-expr env) ... )

Now, when we see a variable reference, we just use our lookup function to extract the correct value. In the type-case, this adds the new line:

[ref (id) (lookup-binding env id)]

And finally, the with case looks like

[with (var bound-expr body-expr)
      (let ([value (interp bound-expr env)])
        (interp body-expr (add-binding env var value)))]

where we recursively interpret the new binding, and then interpret the body expression with an augmented environment containing the new variable reference.

So, our entire interp function is as follows:

;; interp: chai-expr environment -> chai-expr
;; interpret a chai-expression
(define (interp input-expr env)
  (type-case chai-expr input-expr
    [num (val) val]
    [bool (val) val]
    [ref (id) (lookup-binding env id)]
    [with (var bound-expr body-expr)
          (let ([value (interp bound-expr env)])
            (interp body-expr (add-binding env var value)))]
    [prim (id args)
          (let* ([interpreted-args (map (λ (arg) (interp arg env)) args)]
                 [operation (type-check/get-op id interpreted-args)])
            (operation interpreted-args))]))

Note that in order to maintain our other language forms (in particular, prim), we need to alter our recursive calls to interp to pass along the environment appropriately.

And that’s all there is to it! As usual, we have a host of test cases which prove the correctness of the program, and this code is available on this blog’s Google Code page. And, of course, the reader is invited to evaluate programs at this blog’s interactive online interpreter. Happy coding!

Observations, and a Peek into the Future

Before we close, we note some interesting features of Chai so far. First, we can do some weird things with variables. In particular, while we can’t yet overwrite bindings of primitive operations, we can use primitive symbols as variable names. Here is a bizarre program that actually runs:

(with + 7 (+ + +))

Technically, we haven’t “overwritten” the plus primitive operator, since it still functions as we expect it to. In the future, we may experiment with allowing the programmer to locally overshadow primitive operations with their own functions (if only just for fun).

Second, we recognize a very important choice we made in designing the with form. Specifically, in the interpreter, we first interpret the bound expression, and then bind it to the variable. The astute reader might ask, why bother? Why don’t we just leave the expression unevaluated until something references it? For instance, this would allow us to write programs like:

(with x (/ 1 0) (+ 2 3))

Which would run without error. It turns out that this choice has a name! It’s called “laziness,” and it’s used as the default in some significant programming languages, including Haskell. Haskell has a number of other very interesting features, especially pertaining to the study of logic and semantics. For instance, the Haskell compiler refuses to run a program unless it can prove that the program outputs the correct type at each step, and hence guarantee that the program cannot perform certain misbehaviors. We find this all quite fascinating, and add it to our lengthy list of Wonderful Things to Learn and Implement.

Until then!

Chai – Arithmetic and Organization

As you read along, why not try writing some programs and seeing how they evaluate in this version of Chai? You can do so at our live web server.

Organization

Last time, we implemented a simple interpreter for a language which exclusively computed sums of numbers. Before we continue with more interesting language forms, we need to give ourselves some meat to work with.

Specifically, we need to add more primitive operations (+, -, *, /, and, or, not, …), and modularize our files so we can keep track of things better. After we cover the new syntax forms, this post will be admittedly be technical; we’ll spend most of our efforts here designing look-up tables, and implementing simple type-checking for primitive operations. Afterward, we will be able to add new primitive operations and data types without modifying our interpreter, and only slightly modifying our parser. The point of all this is to hide the messy code which handles primitives, so that we can better focus on more interesting language forms. For those readers without a neurotic desire to organize code (and hence no desire to read the entire post), just remember that in the future, we will add new primitives without detailing the implementation. For reference, the source code is available on this blog’s Google Code page.

The first change we’d like to make is to move our abstract syntax tree to a separate file, “chai-ast.rkt.” In order to have all the generated functions (constructors and field accessors) available for the rest of our files to use, we need a special “provide” command:

(provide (all-defined-out))
(require plai)

;; a chai-expression
(define-type chai-expr
  [num (val number?)]
  [sum (lhs chai-expr?) (rhs chai-expr?)]))

So (all-defined-out) is a special function which provides all functions defined in this file. Now, instead of having separate language form for each primitive operation (say, a sum, a diff, a mult, a div, etc), we can abstract this to a “primitive” form. In other words, our “sum” line changes to this:

 [prim (id symbol?) (args list?)] 

We also recognize that we will want to include unary primitives, and ternary primitives, so we have an identifying symbol and a list of arguments. In other words, our new EBNF syntax looks like:

expr = number
     | (prim expr expr...) 

prim = + | - | * | /

Where the “expr…” notation means we allow zero or more following things which are exprs. For now, we do not have any null-ary operations (with zero arguments). Finally, we’d like to include boolean-valued expressions and boolean algebra as well, so this extends our syntax to:

expr = number
     | true
     | false
     | (prim expr expr...) 

prim = + | - | * | / | and | or | not

Note that the syntax does not keep track of and information associated with the operations, like type or arity. Syntactically, it is perfectly okay to write the program: “(not 1 2 3)”. This is well-formed, because the first thing is a primitive symbol, and the other things are valid arguments. We will leave the semantics of these primitive operations to the interpreter.

Implementing all of this in our abstract syntax tree is straightforward:

;; a chai expression
(define-type chai-expr
  [num (val number?)]
  [bool (val boolean?)]
  [prim (id symbol?) (args list?)])

Now let us turn to the parser, which is about as simple as it was last time. Again, we assume that we have already converted an input string into an s-expression (via the utility functions in chai-utils.rkt).

;; parse-expr: s-expr -> chai-expr
;; parse an s-expression into a chai expression,
;;  or throw an error if it is not well-formed
(define (parse-expr input-expr)
  (cond [(number? input-expr) (num input-expr)]
        [(eq? 'true input-expr) (bool #t)]
        [(eq? 'false input-expr) (bool #f)]
        [(and (list? input-expr)
              (primitive-symbol? (first input-expr)))
         (prim (first input-expr)
               (map parse-expr (rest input-expr)))]
        [else (error (string-append
                       "parse: unsupported language form: "
                       (s-expr->string input-expr)))]))

The first three cases of the “cond” expression are trivial. In the fourth, we introduce a new function, which determines whether the first thing in our s-expression is a primitive symbol. Of course, we actually need to write this function, but before we do that let’s come up with a good way to structure a table with information about primitives.

We create a new file called “chai-primitives.rkt” just for this purpose. In it, we’ll have a look-up table with some information about primitives:

;; a helper function for the primitive operation table
(define (make-op op) (λ (arglist) (apply op arglist)))

;; primitive operation lookup-table
;; symbol -> (list op num-args (list type1? type2? ...))
(define primitives-table
  (hasheq '+ (list (make-op +) 2 (list number? number?))
          '- (list (make-op -) 2 (list number? number?))
          '* (list (make-op *) 2 (list number? number?))
          '/ (list (make-op /) 2
                   (list number?
                         (λ (x) (if (equal? 0 x)
                                    (error "division by zero!")
                                    (number? x)))))
          'or (list (λ (args) (ormap (λ (x) x) args))
                    2 (list boolean? boolean?))
          'and (list (λ (args) (andmap (λ (x) x) args))
                    2 (list boolean? boolean?))
          'not (list (make-op not) 1 (list boolean?))))

“hasheq” creates a new hash table where the “eq?” function is used to compare keys. In this case, the keys are Racket symbols, and “eq?” happens to work very well with symbols. Associated to each  symbol is a list of three things. First, the operation that will be applied to the arguments. Since each function has to be applied to a packaged list of values, we will construct the “operation” via make-op (with the exception of and and or, for quirky Racket reasons), which creates a function that applies the given operation to a list of arguments. Second, we have a number representing the allowed number of arguments; and finally, we have a list of functions. In this list, the nth function is used to determine whether the nth argument has the correct type (and they are applied in order). This allows us to package important information here, such as the condition that the second argument of division is never zero. This sort of ho-hum logic would severely clutter our interpreter.

But instead of allowing our other files to access the primitives table willy-nilly, we should create an extra layer of protection, which will make it easier for us to maintain our code in the future. Specifically, we want a function which determines if a given symbol is a key in the table, and a function to retrieve the contents of the table. We do this as follows:

;; primitive-symbol?: symbol -> boolean
;; return true if the given symbol is a primitive
(define (primitive-symbol? id)
  (hash-has-key? primitives-table id))

;; get-prim: symbol -> function number (list function)
;; lookup the data associated with a primitive
(define (get-prim id)
  (apply values (hash-ref primitives-table id)))

The bit with “(apply values … )” takes the items in the list resulting from the hash-lookup, and returns them as multiple values. Note that in Racket, functions can have multiple return values. As we will see, this decision makes our code shorter. Instead of writing code to extract the pieces of the list, we just require the caller to bind all three return values to variables at the same time. Our foresight tells us that when we are type-checking, we will need all of this information at the same time, so requiring us to bind them all is no hindrance. Providing these two functions gives us all we need to continue with our interpreter.

The actual interpreter is quite simple now. There are only three cases:

;; interp: chai-expr -> chai-expr
;; interpret a chai-expression
(define (interp input-expr)
  (type-case chai-expr input-expr
    [num (val) val]
    [bool (val) val]
    [prim (id args)
      (let* ([interped-args (map interp args)]
             [operation (type-check/get-op id interped-args)])
        (operation interpreted-args))]))

At this point, we note that at some point we will want to perform type-checking and retrieve the required primitive operation given its identifier. We also recognize that this code belongs elsewhere. So we defer the logic to a function called “type-check/get-op”, which accepts the primitive operation identifier, and the list of arguments (after they’ve themselves been interpreted!), and returns the function which may then be applied to a list of values, throwing an error if there is an incorrect number of arguments or if any has the wrong type. The rest of the logic that goes into the interpreter is quite clear.

Type Checking

We write a ghostly shell of a type-checker as follows:

;; type-check/get-op: symbol list -> function
;; check the types and number of the arguments for the primitive
;;  'id', returning the primitive operation on success, and
;;  throwing an error on failure
(define (type-check/get-op id args)
  (let-values ([(op num-args-allowed arg-types) (get-prim id)])
     op))

For starters, we just return the operation as it is. Note that let-values allows us to bind all three results of the call to “get-prim” simultaneously.

To fill in the gaps, we should count the number of arguments in the “args” variable, and check that against “num-args-allowed”:

;; type-check/get-op: symbol list -> function
(define (type-check/get-op id args)
  (let-values
      ([(num-args-received) (length args)]
       [(op num-args-allowed arg-types) (get-prim id)])
    (if (not (equal? num-args-received num-args-allowed))
        (error 'type-check
               "~a expected ~a args, but received ~a."
               id num-args-allowed num-args-received)
        op)))

Note that we need the additional parentheses around (num-args-received) in the binding clause, because let-values expects each clause to have a list of identifiers. In addition, the “error” function accepts a formatting string similar to the kind encountered in C’s printf function. However, here the “~a” variable allows one to write out whatever value one wants. The details of Racket’s flavor of formatted strings is in the documentation here. All we need to know here is that “~a” works for strings and numbers.

Now that we’ve checked that there are the correct number of arguments, let us ensure they have the right type. To do this, we simply apply each element of “arg-types” (a list of boolean-valued functions) to the corresponding element of “args”:

(andmap (λ (a b) (a b)) arg-types args)

Here “(λ (a b) (a b))” is an anonymous function which accepts two arguments, and applies the first argument as a function call on the second argument. If we “map” this lambda expression over the list of arg-types and args, we will get a list of booleans representing whether the arguments have the correct types! Going slightly further, “andmap” performs such a “map,” and returns true if and only if each value in the resulting map is true (probably short-cutting by terminating upon finding a single false value). Thus, this expression is true precisely when the arguments type-check!

Putting this back into our type-check/get-op function, we are finished:

;; type-check/get-op: symbol list -> function
(define (type-check/get-op id args)
  (let-values ([(num-args-received) (length args)]
               [(op num-args-allowed types) (get-prim id)])
    (if (not (equal? num-args-received num-args-allowed))
        (error 'type-check
               "~a expected ~a args, but received ~a."
               id num-args-allowed num-args-received)
        (let ([good-args? (andmap (λ (a b) (a b)) types args)])
          (if good-args?
              op
              (error 'type-check
      "one of the arguments to ~a has the wrong type: ~a"
      id (string-join (map chai-expr->string args) ", ")))))))

[We apologize for the sloppy indenting in the last two lines; it is better than forcing the reader to scroll sideways. For a more pleasant indentation scheme (with colors, parenthesis matching, and more!), view the file in DrRacket, the standard Racket IDE.]

So that’s it! Note now that we may add in new primitive operations without changing any of the code in the parser, type-checker, or interpreter! From now on, when we need a new primitive, we will simply add it and mention its name. We will not bother the reader with its implementation, because it is nothing more than another entry in the look-up table, which is out of sight and out of mind.

Adequately many test cases are provided in the source code, so that the user may verify the interpreter’s correctness. Of course, as we repeat time and time again, the full source code for this post is available on this blog’s Google Code page. Additionally, we have a server accepting and evaluating your every program! Just amble right on over to the Chai Interpreters page to give it a swing.

Next time,  we’ll get started with the first big consideration in our language: how to handle scope. We’ll introduce a language form that binds values to variables, and come up with a system for referencing them. Until then!

Chai – The Most Basic Interpreter

While you read along, why not try evaluating programs through our interactive interpreter? As usual, the source code for all the work done in this post is available at this blog’s Google Code page. For more information on Racket, see the Quick Introduction to Racket or the more extensive Racket Guide.

Our Very First Language

A journey of a thousand miles starts with adding numbers.

Recall from our introduction that our theoretical design cycle for new features in Chai is as follows:

  • Add new syntax forms
  • Decide how those forms should be interpreted

So let us implement the most basic language we can imagine, with just two language forms: one for numbers and one for adding numbers. By starting with something so simple, we will have a skeleton of a parser and interpreter to which we may add new features incrementally, and we won’t get bogged down in the details of more complicated ideas.

So let’s decide what programs will look like!

To rigorously define what expressions will look like, we use a nice standard for descriptions called Extended Backus-Naur Form (EBNF). For those familiar with the theory of computation, this is very closely related to the notations for defining context-free grammars. EBNF notation is quite easy to learn by imitation, and we will teach it by example. Our first attempt at a syntax tree might look something like this:

expr = number
     | (+ number number)

Here the pipe symbol, |, should be mentally replaced with the word “or.” For now, every program will consist of a single “expr,” which is short for expression. In words, an expression can either be a number, or a sum of two numbers. For the sake of this language, we allow “number” to be anything that Racket considers a number (we will get to this soon), and we ignore additional whitespace in between these tokens.

Finally, one notices the odd placement of parentheses and the plus symbol. The parentheses represent an application of a binary operation, and the position of the plus is a notational standard called Polish notation. These aspects of the syntax may seem odd at first, but they happen to make our lives much simpler by eliminating all the hard parts of parsing expressions. And, of course, every logician knows the advantage of Polish notation: there is absolutely no ambiguity in how to read an expression. These subtle details rear their ugly heads in compiler design, and we may come back to them in the distant future. But for now we ask that the reader accept it, and get used to it, because it turns out the entire Racket language (and Lisp family of languages) is based on this notation.

Here are a three examples of well-formed programs we could write in this early version of Chai:

7
(+ 14 9)
(+   4   10 )

Unfortunately, there are some very simple programs we cannot yet write in Chai. For instance, the following program does not fit into our specification:

(+ 1 (+ 2 3))

Rigorously speaking, this form is not included in the above EBNF program syntax, because the arguments to a sum may only be numbers, and not other expressions. With the obvious self-referential modification, we fix this.

expr = number
     | (+ expr expr)

Now we may chain additions indefinitely. With the appropriate extension, we could easily extend this to include all binary operations we desire. For instance:

binop = + | - | * | / | ^ | < | ...

expr = number
     | (binop expr expr)

We will do this in the future, but for now, let’s translate these two syntactic forms into Racket code.

Define-Type, and Type-Case

In order to implement our syntax tree, we’d like some sort of internal representation for a language form. Of course, we could just package everything as a list of strings, but our foresight tells us this would get old quick. A structured solution is relatively straightforward; and the code itself turns out to look just like our EBNF tree, but with some additional names for things. In short: we want a datatype that represents each language form, and encapsulates the types of its arguments. In the language of compilers, such a datatype is called an abstract syntax tree (sometimes abbreviated to AST). Here is our first AST for Chai, implemented in Racket.

;; a chai expression
(define-type chai-expr
  [num (val number?)]
  [sum (lhs chai-expr?) (rhs chai-expr?)])

We explain the notation for those of us not so familiar with Racket: the semicolons start a comment line, which is not part of the program. The variable names in Racket (also called identifiers) allow for many characters that other languages prohibit: hyphens, slashes, question marks, etc., are all valid to use in identifiers. So for those experienced with other languages, don’t mistakenly think these are subtraction operations! Finally, the (define-type chai-expr …) language form defines a new type called ‘chai-expr’, and the following sub-expressions are subtypes. For Java and C/C++ users, this is essentially a shorthand for a bunch of public classes which all inherit the chai-expr interface (an empty interface, that is). The define-type form was specifically designed for creating these abstract syntax trees, and it does a pretty good job of it. Here, each subtype looks like

[type-name (field-1 type-1?) (field-2 type-2?) ...]

where ‘type-name’ and ‘field-j’ are identifiers for all j, and ‘type-j?’ are functions which accept one argument, and return true if the argument is of a specific type. For instance, ‘number?’ is a function which accepts one argument and returns true if and only if the argument is a number (as defined by Racket). For user-defined types, Racket creates these ‘type?’ functions automatically. In addition, Racket creates functions for us to create new instances of these types and access their fields.

This is easier to explain with examples. For instance, I could create some objects and then access their various fields:

> (define y (num 1))
> (num-val y)
1
> (define x (sum (num 2) (num 3)))
> (sum-lhs x)
(num 2)
> (num-val (sum-rhs x))
3

So ‘num-val’ accesses the ‘val’ field of the ‘num’ type, and so forth for each type we create. This is fine and dandy, but most of the time we won’t know whether our given piece of data is a ‘num’ or a ‘sum’ type. Instead, we will just know it is a ‘chai-expr’ and we’ll have to figure out which subtype it corresponds to. Luckily, Racket has our back again, and provides us with the ‘type-case’ language form. We might use it like this:

> (define my-expr (num 5))
> (type-case chai-expr my-expr
    [num (val) (string-append 
                 "I got a num! It was "
                 (number->string val))]
    [sum (lhs rhs) "I got a sum! This is rad!"]
    [else "Getting here is an existential crisis."])
"I got a num! It was 5"

The first argument (for us, ‘chai-expr’) describes which type to inspect, the second is the argument of that type (here, a ‘num’ object), and the subsequent [bracketed] clauses provide cases for each subtype one wants to consider, optionally with an ‘else’ clause which we included superfluously. The Racket interpreter determines which clause is appropriate (here, the ‘num’ clause), and binds the actual arguments of the input (in this case, 5) to the parenthetic field identifiers (in this case, (val)), so that one may use them in the following expression (here, the call to string-append).

As it turns out, define-type and type-case is all the machinery we need to get things working. But before we continue, I should mention that these two functions are not native to Racket. In fact, they come from a package called plai, and they were created using Racket’s nice system for macros. In other words, in Racket one can write new language forms for Racket! We won’t cover those here, but any programming enthusiasts out there might have a lot of fun with exploring the possibilities therein.

The Parser

Once we can translate an expression into branches of our abstract syntax tree, we will find that writing the actual interpreter is extremely easy. So let’s do the hard part first: parsing.

Of course, our choice of Racket-like syntax was in part because parsing such a syntax is relatively easy. In particular, Racket has a nice function that takes a string and converts it into a list of symbols, strings, and numbers (and other certain primitive data types). For instance,

> (read (open-input-string "(hello (there! 7 (+ 2 4)))"))
'(hello (there! 7 (+ 2 4)))

Here the leading quote is shorthand for Racket’s “quote” function. The official name for a quoted expression (and what “read” outputs) is s-expression. An s-expression is either a simple value (number, string of characters, boolean, or symbol), or a list of s-expressions. All words within a quoted expression are interpreted as symbols For more on quote, see the Racket Guide’s section on it.

So with an appropriate call to read, we can take a user’s input string and get an organized list of values. In our source code, we implement a more complex read function, which is available in the “chai-utils.rkt” source file on this blog’s Google Code page. With some additional checks to ensure there is exactly one expression to be read, we call this function “read/expect-single.” Its details are decidedly uninteresting, but if the reader is curious, one may find its internals displayed in the aforementioned source file. Similarly, we wrote a function called “expr->string” which accepts an s-expression and prints it out as a string.

As the reader might anticipate, once we have our inputs in the form of a list, parsing becomes a recursive cake-walk. Specifically, we would start with a shell of a function:

;; parse-expr: s-expr -> chai-expr
;; parse an s-expression into a chai expression, or throw an error
;;  if it is not well-formed
(define (parse-expr input-expr)
  (cond [(number? input-expr) <do something>]
        [(list? input-expr) <do something>]
        [else (error (string-append
                       "parse: unsupported language form: "
                       (expr->string input-expr)))]))

So parse-expr accepts an s-expression, and spits out a well-formed chai-expression, defaulting to an error. Here we use the “cond” language form, which is the Racket analogue to “switch” in C/C++/Java. For everyone else, it allows us to string together a number of conditions without writing cumbersomely many nested if/then/else statements. In each branch of the cond, we narrow down what our possible expression could be. If it satisfies number?, then our expression is a num, and if it satisfies list?, it is likely a sum. From here filling in the <do something> parts is easy: we simply construct the appropriate types:

;; parse-expr: s-expr -> chai-expr
;; parse an s-expression into a chai expression, or throw an error
;;  if it is not well-formed
(define (parse-expr input-expr)
  (cond [(number? input-expr) (num input-expr)]
        [(list? input-expr) (sum (parse-expr (second input-expr))
                                 (parse-expr (third input-expr)))]
        [else (error (string-append
                       "parse: unsupported language form: "
                       (expr->string input-expr)))]))

Here, “second” and “third” extract the second and third elements of the list, and parse-expr recursively evaluates the arguments (as we noted above when we said a sum had the syntax (+ expr expr)). However, it appears we missed something big: what if the list doesn’t have three elements in it? Someone could try to run the program “(square 7),” expecting a result of 49. This is certainly not a sum, but as of now our parser doesn’t make any distinctions. So we need to add a few more checks. Here is the complete parser, with all appropriate conditions checked and combined using the “and” function:

;; parse-expr: s-expr -> chai-expr
;; parse an s-expression into a chai expression, or throw an error
;;  if it is not well-formed
(define (parse-expr input-expr)
  (cond [(number? input-expr) (num input-expr)]
        [(and (list? input-expr)
              (eq? (first input-expr) '+)
              (eq? 3 (length input-expr)))
         (sum (parse-expr (second input-expr))
              (parse-expr (third input-expr)))]
        [else (error (string-append
                       "parse: unsupported language form: "
                       (expr->string input-expr)))]))

We check to make sure the first thing in the list is the symbol ‘+, and that the list has exactly three elements. Then we can be sure that the user meant to put in a sum.

The parse-expr function, along with our following “interp” function and the tests for both functions, will be stored in the “chai-basic.rkt” source code file on this blog’s Google Code page.

The Interpreter

At this point, we’ve parsed our expressions into the chai-expr datatype, and so now a simple application of type-case is all we need to interpret them. Indeed, the interp function practically writes itself:

;; interp: chai-expr -> number
;; interpret a chai-expression
(define (interp input-expr)
  (type-case chai-expr input-expr
    [num (val) val]
    [sum (lhs rhs) (+ (interp lhs) (interp rhs))]))

We don’t need to do any more conditional checking, because we know that anything fed to interp is well-formed. Later, specifically once we add variable references, interp will become much more interesting.

Here the plus function is Racket’s plus. Of course, it seems a bit silly to use + to interpret +, but remember that the point of this series is not to write a language from the ground up, but to get things rolling as quickly as possible, so that we may analyze the more interesting features of programming language semantics. Simply put, arithmetic is boring. We include it simply for familiarity, and because it makes good fodder for writing test cases to ensure our interpreter acts as it should.

Finally, we add one additional function which executes the entire parse/interp chain:

;; evaluate/chai: string -> any
;; perform entire the parse/interp chain
(define (evaluate/chai input) (interp (parse input)))

And (at the top of our source file), we “provide” the function so that other Racket programs may access it, specifically with the command (require “chai-basic.rkt”). All of our interpreters in this series will adhere to the same externally-facing interface.

(provide evaluate/chai)

So there we have it! If the reader downloads the source files, he can interpret expressions through Racket’s interactive interpreter. Additionally, the reader is invited to visit our website, where we have set up a program to receive and evaluate chai-programs through the internet. In the future we will store all of our online interpreters here, so one will be able to access Chai at all stages of its development.

Next time, we will finish off a full set of arithmetic operations, and start looking at variables. Until then!