The Journey Begins
We want to design a programming language!
That is certainly a lofty ambition, but one might ask, why bother? There are programming languages already out there that are way better than anything one mathematician-slash-programmer could do. Surprisingly, the answer is not so that we can have a programming language! Instead, we follow our inner mathematicians; we wish to study the structure of programming languages.
By doing so, we will learn how and why programming languages work, and we might even get some experience writing programs along the way. For the non-programmers out there, don’t be afraid. The process of designing a language is gradual, and as a result we’ll repeatedly practice the most basic features of our language, and then slowly add more features until we have a lean mean number-crunching machine. If one has a bit of programming knowledge already, then the journey through this series of posts should provide a good intuition for other programming languages, helping one determine the right questions to ask when a program runs amok. For the mathematicians, this series will stray a bit from our theoretical comfort zones, but we will use mathematical tools from computing theory to describe and analyze languages and programs. So this series should have a little something for everyone.
A Rough Outline
In order to study programming languages we need to study the issues which are common to all languages, and then decide how to customize those to fit our needs. At the highest level, a programming language has two parts: syntax and semantics. The logicians out there are already familiar with such words, but we will define them here.
A language’s syntax is the set of rules for producing well-formed programs. Syntaxes show up everywhere. In the language of arithmetic, the expression “1+” is not well-formed, because we broke the syntactical rule of the addition operator having a number before and after it. In the language of currency, one cannot lend a friend “5$2.01” dollars, because we all know the dollar-sign goes at the beginning (or end) of a number. For programs, our syntax will consist of rules for arithmetic, binding variables, constructing functions, and a number of other constructions with names we will learn later. Every form of communication has a syntax, and so forming syntactically correct statements is a prerequisite for conveying their meaning.
Second, a language’s semantics are the rules for evaluating well-formed programs. For us, this will include the precedence rules of arithmetic (and other) operations, variable substitution methods, scope, types, and many other issues. The study of semantics is mathematically the richest area of programming language theory. Researchers define and compare semantic models, and describe their relationships with underlying mathematical concepts, such as categories or models (both very rich and popular contemporary fields). For the majority of this series, we will stay away from such a rigorous mathematical analysis, mainly because the author is not familiar enough with these topics (yet). In any case, we will spend most of our time worrying about how to implement the correct semantic rules in our language.
And that raises the obvious question: how do we “make” a programming language do anything? The answer is (of course) we write a program! Specifically, we will design our new language, which we designate as the source language, by writing a program to interpret it. This program, which we henceforth dub an interpreter, will be written in a language we are familiar with, which we designate as the implementation language. In other words, our programming language will be completely determined by what the interpreter does to well-formed programs. If we design it correctly, then this will coincide with our theoretical idea of what it should do.
Note that writing an interpreter is one of two possible approaches. We could instead write a program called a compiler that translates our source language into machine code, which is specific to a physical computer. After compiling, the user may then ditch the compiler, and use the compiled code as she wishes (with our approach, she needs the interpreter every time she wants to run the program). Compilers have a number of other, more technical problems associated with them (see our post on Register Allocation [upcoming]), but it has the advantage of producing programs which run quickly. Since we are only interested in studying how languages are built, and not in running fast programs, we can live without a compiler and forget its issues. In the future, we may do another series on designing a compiler, during which we focus on the special challenges faced down that deeper rabbit hole.
So as we start to design our programming language, we will follow a recipe for each new feature:
- Add new syntax forms
- Decide how those forms should be interpreted (determine their semantics)
- In code:
- Accept a program in textual form
- Translate the textual form into an intermediate representation, while checking that it is well-formed (parse the input)
- Interpret the intermediate representation (run the program)
- Output the results to the user
As we write these interpreters, we will make the source code available on this blog’s Google Code page. However, as a special treat, we will also set up a web server for the reader to practice writing programs! In fact, this web server is already up and running. Later, that page will have links to forms where the user can type in a program, hit “submit,” and the server will run the program and report the result. This will save any non-programmers the trouble of downloading and installing Racket (our implementation language) and running the interpreter from its source. (Of course, the user is always welcome to do so. Racket is a wonderful language to use!)
We Almost Forgot the Most Important Part!
Before we do anything else, we need a name for our language. The names of popular programming languages are generally short, slick, and easy to pronounce. While the Java programming language started a coffee theme which would also be appropriate for mathematicians, this author loves tea, and we find it sad that no popular languages are named after tea. Hence, we name our new language “Chai.”
To sum our discussion up, we are about to embark on a mission: we will write a Chai interpreter in Racket, and learn about programming languages along the way. Next time we kick things off with the simplest version of Chai, one that exclusively features arithmetic. Until then!
It seems to me that writing your own programming language requires that you already have an existing programming language laying around to interpret it. So, how did people write the first programming languages, when there were no familiar languages laying around to interpret their new language?
So way back in the early days of computers there was only one kind of language: assembly language. This was essentially the instructions that were etched into the computer itself (either via vacuum tubes or transistors or whatever), and programmers had to (almost literally) program in binary.
Once the first programming language came around (pretend it was Fortran), programmers wrote the first Fortran compiler in assembly, and made just enough features for it to be Turing complete. After that, they would write another Fortran compiler in their basic version of Fortran. From then on they could write ever more complex Fortran compilers for the same language. This process of writing the first compiler in a different language, and subsequent compilers in the source language, is called bootstrapping. I believe it’s still used today in compiler design.
One can imagine this was a grueling process to being with, which is part of why old-timers in computer science get so much respect. They had to pay way more attention to detail than we young whipper-snappers!
I don’t think that Fortran compilers were written in Fortran in the early days. At least, IBM (which invented Fortran) didn’t do that. As late as the mid-1960s, IBM hired a firm named Digitek to create its System/360 Fortran G compiler, and they used a custom compiler-writing language for this purpose. Certainly Fortran IV (Fortran G was extended Fortran IV) would have been a horrible language for writing a compiler, there was no character data type, for example (you did weird type-cheating with integers).
I’m not sure who did the first X-in-X compiler. Lisp compilers were written in Lisp, from the earliest days; possibly the language Neliac, published in the early 1960s, was one of the first, as well. (Neliac was an offshoot of ALGOL 58, developed under the auspices of the U.S. Navy; writing Neliac in Neliac resulted in a somewhat portable compiler. Maurice Halstead wrote a book on it; the book is long out of print, but I recollect that some or all of it was scanned, and exists somewhere on the Intertubes.)
The first genuinely practical, high-profile X-in-X was Martin Richards’ BCPL compiler, first published in 1969. By design, the language and the compiler were both highly portable (especially to word-oriented machines), and the language spread widely. Ritchie and Thompson’s C language and compiler were strongly influenced by BCPL; in their ACM Turing Award lecture, they demonstrate how you can hide a trojan horse in an X-in-X implementation in such a way that no trace of it exists at the source code level.
Talk about a wealth of knowledge! I admit that, being the young whipper-snapper that I am, I used Fortran facetiously because it’s the oldest language I’ve heard of, and I know absolutely nothing of its history. But are you saying that most compilers were written in special compiler-writing languages? I would hazard that the compiler-writing languages must’ve had compilers written in assembly.
Technical question aside, I’d just like to thank you for the wonderful bit of history, not to mention that your comment on the trojan horse concept has already led me to read a number of articles on the subject, the first of which is the wikipedia page (for others reading this comment). Absolutely amazing 🙂
My pleasure, back before I retired, my students used to complain that I spent too much time talking about `history’ if I happened to mention that at one time computers were made out of thingies called `transistors’. History is useful, if only to find mistakes that were so horrible that people want to repeat them 🙂
As for the implementation language for most compilers, it was, as you say, mostly done in assembly language. I’m sorry if I gave the impression that it was otherwise. In most cases, no other software tools existed, and of course programmers were trying to fit code into microscopic memory spaces, so code was very carefully hand-optimized.
I don’t know much about Digitek’s technique, but it appears to have been semi-interpretive. They had high hopes for it, but it sort of fell apart when they tried to do the PL/I compiler for Multics using their technique. Digitek’s compiler never materialized, but the Multics team wrote their own subset compiler using TMG, a compiler-compiler language that I could never make sense of. The story of the Multics compiler is told at http://www.multicians.org/pl1.html, and there are some notes on TMG at http://www.multicians.org/tmg.html.
On the subject of Fortran-in-Fortran compilers, I had forgotten LRLTRAN, an extended Fortran developed at Lawrence Radiation Lab; they published a paper on it in CACM in 1969. As I recall, LRL wrote one or more OSes in it.
One other influential X-in-X compiler was Wirth’s Pascal-P system (circa 1974), which compiled Pascal to an interpretive virtual language called PCODE. Thus a kit with the source for Pascal-P and Pascal-P compiled to PCODE, when supplemented with a PCODE interpreter, would give you Pascal on any machine. (There may be traces of PCODE in Microsoft’s current Visual Basic systems, they used the same technique.)