Writing a Compiler is Surprisingly Easy (part 1)
Ever since I was a teenager I wanted to create my own systems programming language. Such a programming language would certainly have to be compiled to native code, which meant I'd have to write a compiler.
Even though I managed to write several half-working parsers, I'd always fail at the stage of generating assembly code, as the task turned out to be too complex.
In this blog I intend to show my teenage self how writing a code generator is, in fact, not complex at all, and can be fully done in a couple of weekends (as long as we make some simplifying assumptions).
I will assume that you already know that representing programs within a program is surprisingly easy.
Where do we even start?
The goal today is to translate a high level language to x86-64 machine code. Our high-level language will have integer literals, variables, negation, and addition.
To make our job simpler, we will translate the high level code to x86-64 assembly code, and use a pre-existing assembler (like the one inside GCC) to translate that to machine code.
Let's get started. Here is how we will represent integer literals:
struct int_literal { int value; };
Compiling them is straightforward:
void compile_int_literal(struct int_literal* e) {
    printf("mov $%d, %%rax\n", e->value);
}
printf uses the % character to denote formatting commands, so a double %% is used to print the character verbatim.
Here I used the x86-64 mov instruction. This instruction moves data from one place to another. In the notation used in this blog, the flow of data is left-to-right (i.e. the instruction works as mov source, destination).
In this case we are moving an immediate value (a numeric constant written directly in the code, notated with a $ preceding it) into the "a" register (a little space of data that lives directly inside the CPU, written as %rax).
The x86-64 mov instruction can be used in several ways ("addressing modes"): immediate-to-register (the one we just used), register-to-memory, memory-to-register, register-to-register, etc. We will see some of these later on.
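Just to make these modes concrete before moving on, here is a throwaway snippet (not part of our compiler; the operands are arbitrary) that prints one mov per mode. Every one of these forms will show up in the code we generate later:

#include <stdio.h>

// Purely illustrative: one mov per addressing mode, with made-up operands.
int main(void) {
    printf("mov $42, %%rax\n");        // immediate-to-register
    printf("mov %%rax, -8(%%rsp)\n");  // register-to-memory
    printf("mov -8(%%rsp), %%rax\n");  // memory-to-register
    printf("mov %%rax, %%rcx\n");      // register-to-register
    return 0;
}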
I want to note that, at this point, many design decisions have already been made. Most notably:
- Our compiler will output AT&T-style assembly code to stdout, using printf
- The result of compiling an expression will be assembly code that computes and stores the value of that expression in %rax
The last point is in no small part responsible for making the compiler so simple, but it also means that the code we'll generate is not very efficient.
For the initiated, some obvious things that we could add to mitigate this are a peephole optimizer and a register allocator. But these would force us to implement some intermediate representation and make our compiler a lot more complicated, so we are not going to.
Let's move on.
How can we implement variables?
The way we will represent variables in our compiler is with an integer. Yes, not with a string that corresponds to the name of the variable, but an index that represents where that variable will live in memory. Indeed, all variables will be stored in our computer's main memory. In particular, they will live in a place called the stack.
The stack is a wonderful mechanism that allows us to quickly store some values in memory and safely discard them once we don't need them. In memory, it is simply laid out like an array of 64-bit integers.
The way it works is extremely simple: the CPU has a register -- known as the stack pointer %rsp -- that always tells us the first slot in the stack that is available to be used. Every slot after it will be free, and every slot before it will be occupied. As with everything low-level, this doesn't happen magically, and we will be responsible for keeping that register up to date.
We are going to be a bit naughty and not update the stack pointer for now. This is mostly okay as long as our code does not perform any function calls.
With this in mind, we can represent variables using an index that tells us how many slots *after* the first available slot the variable is stored.
struct variable { int slot; };
In a more complete compiler there would be a previous step that takes care of assigning a stack slot to each declared variable, and translating variable names to stack slots. We wouldn't expect the language user to type in the slots by hand.
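To give an idea of what that step could look like, here is a minimal sketch. None of it exists in the compiler we are building: the name slot_of, the MAX_VARIABLES limit, and the linear scan are all made up for illustration.

#include <string.h>

// Hypothetical name-to-slot assignment: the first time we see a variable name
// it gets the next free slot; afterwards, looking it up returns the same slot.
// (No bounds checking -- this is just a sketch.)
#define MAX_VARIABLES 64

static const char* variable_names[MAX_VARIABLES];
static int variable_count = 0;

int slot_of(const char* name) {
    for (int i = 0; i < variable_count; ++i)
        if (strcmp(variable_names[i], name) == 0)
            return i;                        // already assigned
    variable_names[variable_count] = name;   // assign a fresh slot
    return variable_count++;
}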
Compiling them is simple enough, but we do need to learn some more x86-64 assembly to fully grasp it.
void compile_variable(struct variable* e) {
    int slot = e->slot;
    printf("mov %d(%%rsp), %%rax\n", -8 * (slot + 1));
}
Here we use a different version of the mov instruction. This one looks like mov number(register1), register2, and it means "take the value at address number+register1, and store it in register2" (mem-to-reg mode).
This is how we implement access into the stack. The stack pointer holds the address of the first free slot, and we add an offset to access the slot that holds the desired variable.
Besides that, there is a funny-looking bit of math up there (unpacked in the small sketch after this list). There are three things to know about the stack:
- each slot holds a 64-bit value. This equals 8 bytes, which is why we multiply by 8
- the stack grows downwards. This means that the occupied slots are at higher addresses than the free slots. This explains why the 8 is negative instead of positive
- there is a +1 in there because I lied. The stack pointer actually points at the last occupied slot, instead of the first free slot
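Putting those three facts together, the offset calculation could live in a tiny helper. This is not part of the code above, just the same expression with the reasoning spelled out:

// Byte offset, relative to %rsp, of a given stack slot.
// Slot 0 lives at -8(%rsp), slot 1 at -16(%rsp), and so on.
int slot_offset(int slot) {
    return -8 * (slot + 1);  // 8 bytes per slot, the stack grows downwards,
                             // +1 because %rsp points at the last occupied slot
}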
Negation
The representation of negation is similarly straightforward, and should be apparent to those who read the previous blog.
struct negation { struct expression* target; };
I will omit the definition of struct expression for now, as it would distract from the main points of this section.
To compile negations we will take advantage of the fact that compiling an expression will produce code that stores its result in %rax. This means that if we emit some code that negates %rax right after the code that computes the value to be negated, we will have successfully computed the desired negation.
Luckily, x86-64 has an instruction that does just this.
void compile_negation(struct negation* e) {
    compile_expression(e->target);
    printf("neg %%rax\n");
}
But just to get used to the kind of puzzles we have to solve to compile some more complicated operations, let's avoid using neg. x86-64 has a sub instruction that subtracts one operand from another. Let's try using that.
In particular, let's:
- compute the target expression (leaving the result in %rax)
- put a zero into %rcx
- subtract %rax from %rcx (leaving the negated result in %rcx)
- put the negated result in %rax
void compile_negation(struct negation* e) {
    compile_expression(e->target);
    printf("mov $0, %%rcx\n");
    printf("sub %%rax, %%rcx\n");
    printf("mov %%rcx, %%rax\n");
}
We use %rcx instead of %rbx because %rbx is a callee-saved register in x86-64, meaning we are not allowed to use it without first storing its old content in the stack, and later restoring its value. This is mandated by the x86-64 calling conventions. This is too cumbersome, so we just use the nearest non-callee-saved register.
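For the curious, here is a hypothetical variant that insists on using %rbx, just to show the bookkeeping involved. In our setup it would also be broken: push writes to -8(%rsp), exactly where variable slot 0 lives, which is one more reason to simply reach for %rcx.

// Hypothetical: negation using the callee-saved %rbx, so we must save and
// restore its old value around our use of it.
void compile_negation_using_rbx(struct negation* e) {
    compile_expression(e->target);
    printf("push %%rbx\n");       // save the caller's %rbx
    printf("mov $0, %%rbx\n");
    printf("sub %%rax, %%rbx\n");
    printf("mov %%rbx, %%rax\n");
    printf("pop %%rbx\n");        // restore it
}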
While not too complicated, this illustrates the kind of hoops we will sometimes have to jump through in order to compile more advanced operations. I don't think this makes compiling hard, but it can get pretty tedious.
Also, note the use of the reg-to-reg addressing mode.
Addition
Like in the previous blog, we will represent additions as follows:
struct addition { struct expression* left_term; struct expression* right_term; };
The general idea here will be to first compile the left term, then the right term, then emit an add instruction that adds their results. The main problem that arises is that the result of the left term will be lost while we compute the right term.
A simple fix might be to move the result to a different register, like this:
void compile_addition(struct addition* e) {
    compile_expression(e->left_term);
    printf("mov %%rax, %%rcx\n");
    compile_expression(e->right_term);
    printf("add %%rcx, %%rax\n");
}
Unfortunately, this will not work when the right term also stores something in %rcx in an intermediate step. Instead, we will get some help from our good friend, the stack.
We will store that intermediate value in the stack, and read it back after the right term is done computing. If we take care that the code we generate for the right term doesn't write into the same stack slot, then we can be sure that the value will be preserved.
The mechanism to prevent re-using the same stack slot is a simple counter. Since we also store variables in the stack, its value must be greater than any slot that's been assigned to a variable. For now let's just initialize it with some large number, like 10.
int temp_counter = 10;

void compile_addition(struct addition* e) {
    compile_expression(e->left_term);
    int slot = temp_counter++;  // allocate a new slot
    printf("mov %%rax, %d(%%rsp)\n", -8 * (slot + 1));
    compile_expression(e->right_term);
    printf("add %d(%%rsp), %%rax\n", -8 * (slot + 1));
    temp_counter--;  // restore the counter
}
Here we finally see a mov that uses reg-to-mem addressing mode. Also, note that add can also use mem-to-reg mode.
Putting it All Together
The final piece of the puzzle, for now, is the struct expression data type, and its corresponding compile_expression function. These are pretty much trivial given what we've seen so far, but I'll type them out for completeness' sake.
enum expression_tag {
    EXPRESSION_INT_LITERAL,
    EXPRESSION_VARIABLE,
    EXPRESSION_NEGATION,
    EXPRESSION_ADDITION,
};

struct expression {
    enum expression_tag tag;
    union {
        struct int_literal as_int_literal;
        struct variable as_variable;
        struct negation as_negation;
        struct addition as_addition;
    };
};

void compile_expression(struct expression* e) {
    switch (e->tag) {
        case EXPRESSION_INT_LITERAL: compile_int_literal(&e->as_int_literal); break;
        case EXPRESSION_VARIABLE: compile_variable(&e->as_variable); break;
        case EXPRESSION_NEGATION: compile_negation(&e->as_negation); break;
        case EXPRESSION_ADDITION: compile_addition(&e->as_addition); break;
    }
}
Testing it out
At this point, we are capable of compiling simple arithmetic expressions. To test this out, we can write a small program like this one:
int main() {
    // var0 + (-var1 + 42)
    struct expression* e =
        addition(
            variable(0),
            addition(
                negation(variable(1)),
                int_literal(42)));
    compile_expression(e);
}
int_literal, variable, negation and addition are some helpers that build up the corresponding expressions.
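I won't dwell on them, but a minimal version (heap-allocating each node and never freeing it, which is fine for a toy like this) could look like this:

#include <stdlib.h>

// Constructor helpers: allocate an expression node and fill in its tag.
struct expression* int_literal(int value) {
    struct expression* e = malloc(sizeof(*e));
    e->tag = EXPRESSION_INT_LITERAL;
    e->as_int_literal.value = value;
    return e;
}

struct expression* variable(int slot) {
    struct expression* e = malloc(sizeof(*e));
    e->tag = EXPRESSION_VARIABLE;
    e->as_variable.slot = slot;
    return e;
}

struct expression* negation(struct expression* target) {
    struct expression* e = malloc(sizeof(*e));
    e->tag = EXPRESSION_NEGATION;
    e->as_negation.target = target;
    return e;
}

struct expression* addition(struct expression* left, struct expression* right) {
    struct expression* e = malloc(sizeof(*e));
    e->tag = EXPRESSION_ADDITION;
    e->as_addition.left_term = left;
    e->as_addition.right_term = right;
    return e;
}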
Running that program produces the following output:
mov -8(%rsp), %rax
mov %rax, -88(%rsp)
mov -16(%rsp), %rax
neg %rax
mov %rax, -96(%rsp)
mov $42, %rax
add -96(%rsp), %rax
add -88(%rsp), %rax
Then, to be able to run it, we can just add some assembly that will take two arguments and store them in stack slots 0 and 1, and return after executing the code.
.global foo
foo:
    mov %rdi, -8(%rsp)
    mov %rsi, -16(%rsp)
    mov -8(%rsp), %rax
    mov %rax, -88(%rsp)
    mov -16(%rsp), %rax
    neg %rax
    mov %rax, -96(%rsp)
    mov $42, %rax
    add -96(%rsp), %rax
    add -88(%rsp), %rax
    ret
Finally, we hook into this code from C, and check that it returns the right thing:
#include <stdio.h>
#include <stdint.h>

int64_t foo(int64_t a, int64_t b);

int main() {
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            printf("expected: %d, got: %ld\n", i - j + 42, foo(i, j));
        }
    }
}
To do this we compile using GCC and run it in a terminal:
$ gcc main.c foo.s -o main
$ ./main
expected: 42, got: 42
expected: 41, got: 41
expected: 40, got: 40
... and so on ...
Conclusion
Writing a compiler is not as hard as it seems if we are willing to keep it simple. If we avoid introducing complexity ourselves, its main source is understanding the target architecture, and not the compilation process itself.
In the next parts we will look at how to compile classic control flow constructs such as if and while, as well as function calls and pointers.
About me
My name is Sebastian. I'm a competitive programmer (ICPC World Finalist) and coach. I also enjoy learning about programming language theory and computer graphics.