[EN] Writing a Compiler is Surprisingly Easy (part 1)

Ever since I was a teenager I wanted to create my own systems programming language. Such a programming language would certainly have to be compiled to native code, which meant I'd have to write a compiler.

Even though I managed to write several half-working parsers, I'd always fail at the stage of generating assembly code, as the task proved too complex.

In this blog I intend to show my teenage self that writing a code generator is, in fact, not complex at all, and can be done in full in a couple of weekends (as long as we make some simplifying assumptions).

I will assume that you already know that representing programs within programs is surprisingly easy.

Where do we even start?

The goal today is to translate a high-level language to x86-64 machine code. Our high-level language will have integer literals, variables, negation, and addition.

To make our job simpler, we will translate the high level code to x86-64 assembly code, and use a pre-existing assembler (like the one inside GCC) to translate that to machine code.

Let's get started. Here is how we will represent integer literals:

struct int_literal {
	int value;
};

Compiling them is straightforward:

void compile_int_literal(struct int_literal* e) {
	printf("mov $%d, %%rax\n", e->value);
}
printf uses the % character to denote formatting commands, so a double %% is used to print the character verbatim.

Here I used the x86-64 mov instruction. This instruction moves data from one place to another. In the notation used in this blog, the flow of data is left-to-right (i.e. the instruction works as mov source, destination).

In this case we are moving an immediate value (a numeric constant written directly in the code, notated with a $ preceding it) into the "a" register (a small piece of storage that lives directly inside the CPU, written as %rax).

The x86-64 mov instruction can be used in several ways ("addressing modes"): immediate-to-register (the one we just used), register-to-memory, memory-to-register, register-to-register, etc. We will see some of these later on.
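
Just to make the syntax concrete, here is a throwaway snippet (not part of the compiler; print_mov_examples is a made-up name for illustration) that emits one instruction in each of those modes:

#include <stdio.h>

void print_mov_examples(void) {
	printf("mov $7, %%rax\n");         // immediate-to-register
	printf("mov %%rax, %%rcx\n");      // register-to-register
	printf("mov %%rax, -8(%%rsp)\n");  // register-to-memory
	printf("mov -8(%%rsp), %%rax\n");  // memory-to-register
}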

I want to note that, at this point, many design decisions have already been made. Most notably:

  • Our compiler will output AT&T-style assembly code to stdout, using printf
  • The result of compiling an expression will be assembly code that computes and stores the value of that expression in %rax

The last point is in no small part responsible for making the compiler so simple, but it also means that the code we'll generate is not very efficient.

For the initiated, some obvious things that we could add to mitigate this are a peephole optimizer and a register allocator. But these would force us to implement some intermediate representation and make our compiler a lot more complicated, so we are not going to.

Let's move on.

How can we implement variables?

The way we will represent variables in our compiler is with an integer. Yes, not with a string that corresponds to the name of the variable, but an index that represents where that variable will live in memory. Indeed, all variables will be stored in our computer's main memory. In particular, they will live in a place called the stack.

The stack is a wonderful mechanism that allows us to quickly store some values in memory and safely discard them once we don't need them. In memory, it is simply laid out like an array of 64-bit integers.

The way it works is extremely simple: the CPU has a register -- known as the stack pointer, %rsp -- that always tells us the first slot in the stack that is available to be used. Every slot after it will be free, and every slot before it will be occupied. As with everything low-level, this doesn't happen magically, and we will be responsible for keeping that register up to date.

We are going to be a bit naughty and not update the stack pointer for now. This is mostly okay as long as our code does not perform any function calls.

With this in mind, we can represent variables using an index that tells us how many slots *after* the first available slot the variable is stored.

struct variable {
	int slot;
};
In a more complete compiler there would be a previous step that takes care of assigning a stack slot to each declared variable, and translating variable names to stack slots. We wouldn't expect the language user to type in the slots by hand.
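
In case you're curious what that step could look like, here is a minimal sketch, assuming a hypothetical scope struct and slot_for helper (neither appears in the rest of this post):

#include <string.h>

// Hypothetical helper: map a variable name to its stack slot,
// assigning the next free slot the first time we see the name.
struct scope {
	const char* names[64]; // declared names, indexed by slot (no bounds checks)
	int count;
};

int slot_for(struct scope* s, const char* name) {
	for (int i = 0; i < s->count; ++i)
		if (strcmp(s->names[i], name) == 0)
			return i;          // already has a slot
	s->names[s->count] = name; // give it the next free one
	return s->count++;
}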

Compiling them is simple enough, but we do need to learn some more x86-64 assembly to fully grasp it.

void compile_variable(struct variable* e) {
	int slot = e->slot;
	printf("mov %d(%%rsp), %%rax\n", -8 * (slot + 1));
}

Here we use a different version of the mov instruction. This one looks like mov number(register1), register2, and it means "take the value at address number+register1, and store it in register2". (mem-to-reg mode)

This is how we implement access into the stack. The stack pointer holds the address of the first free slot, and we add an offset to access the slot that holds the desired variable.

Besides that, there is a funny-looking bit of math up there. There are three things to know about the stack (a quick sanity check of the math follows the list):

  • each slot holds a 64-bit value. This equals 8 bytes, which is why we multiply by 8
  • the stack grows downwards. This means that the occupied slots are at higher addresses than the free slots. This explains why the 8 is negative instead of positive
  • there is a +1 in there because I lied. The stack pointer actually points at the last occupied slot, instead of the first free slot.
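
As that sanity check, here is a tiny (made-up) snippet and the lines it would print, matching the formula above:

void offset_example(void) {
	struct variable v0 = { .slot = 0 };
	struct variable v1 = { .slot = 1 };
	compile_variable(&v0); // prints: mov -8(%rsp), %rax
	compile_variable(&v1); // prints: mov -16(%rsp), %rax
}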

Negation

The representation of negation is similarly straightforward, and should be apparent to those who read the previous blog.

struct negation {
	struct expression* target;
};
I will omit the definition of struct expression for now, as it would distract from the main points of this section.

To compile negations we will take advantage of the fact that compiling an expression will produce code that stores its result in %rax. This means that if we emit some code that negates %rax right after the code that computes the value to be negated, we will have successfully computed the desired negation.

Luckily, x86-64 has an instruction that does just this.

void compile_negation(struct negation* e) {
	compile_expression(e->target);
	printf("neg %%rax\n");
}

But just to get used to the kind of puzzles we have to solve to compile some more complicated operations, let's avoid using neg. x86-64 has a sub instruction that subtracts one operand from another. Let's try using that.

In particular, let's:

  • compute the target expression (leaving the result in %rax)
  • put a zero into %rcx
  • subtract %rax from %rcx (leaving the negated result in %rcx)
  • put the negated result in %rax
void compile_negation(struct negation* e) {
	compile_expression(e->target);
	printf("mov $0, %%rcx\n");
	printf("sub %%rax, %%rcx\n");
	printf("mov %%rcx, %%rax\n");
}

We use %rcx instead of %rbx because %rbx is a callee-saved register in x86-64, meaning we are not allowed to use it without first storing its old content in the stack, and later restoring its value. This is mandated by the x86-64 calling conventions.

This is too cumbersome, so we just use the nearest register that is not callee-saved.

While not too complicated, this illustrates the kind of hoops we will sometimes have to jump through in order to compile more advanced operations. I don't think this makes compiling hard, but it can get pretty tedious.

Also, note the use of the reg-to-reg addressing mode.

Addition

Like in the previous blog, we will represent additions as follows:

struct addition {
	struct expression* left_term;
	struct expression* right_term;
};

The general idea here will be to first compile the left term, then the right term, then emit an add instruction that adds their results. The main problem that arises is that the result of the left term will be lost while we compute the right term.

A simple fix might be to move the result to a different register, like this:

void compile_addition(struct addition* e) {
	compile_expression(e->left_term);
	printf("mov %%rax, %%rcx\n");
	compile_expression(e->right_term);
	printf("add %%%rcx, %%rax\n");
}

Unfortunately, this will not work when the right term also stores something in %rcx in an intermediate step. Instead, we will get some help from our good friend, the stack.

We will store that intermediate value in the stack, and read it back after the right term is done computing. If we take care that the code we generate for the right term doesn't write into the same stack slot, then we can be sure that the value will be preserved.

The mechanism to prevent re-using the same stack slot is a simple counter. Since we also store variables in the stack, its value must be greater than any slot that's been assigned to a variable. For now let's just initialize it with some large enough number, like 10.

int temp_counter = 10;
void compile_addition(struct addition* e) {
	compile_expression(e->left_term);
	int slot = temp_counter++; // allocate a new slot
	printf("mov %%rax, %d(%%rsp)\n", -8 * (slot + 1));
	compile_expression(e->right_term);
	printf("add %d(%%rsp), %%rax\n", -8 * (slot + 1));
	temp_counter--; // restore the counter
}
Here we finally see a mov that uses reg-to-mem addressing mode. Also, note that add can also use mem-to-reg mode.

Putting it All Together

The final piece of the puzzle, for now, is the struct expression data type, and its corresponding compile_expression function. These are pretty much trivial given what we've seen so far, but I'll type them out for completeness' sake.

enum expression_tag {
	EXPRESSION_INT_LITERAL,
	EXPRESSION_VARIABLE,
	EXPRESSION_NEGATION,
	EXPRESSION_ADDITION,
};
struct expression {
	enum expression_tag tag;
	union {
		struct int_literal as_int_literal;
		struct variable as_variable;
		struct negation as_negation;
		struct addition as_addition;
	};
};

void compile_expression(struct expression* e) {
	switch (e->tag) {
	case EXPRESSION_INT_LITERAL:
		compile_int_literal(&e->as_int_literal);
		break;
	case EXPRESSION_VARIABLE:
		compile_variable(&e->as_variable);
		break;
	case EXPRESSION_NEGATION:
		compile_negation(&e->as_negation);
		break;
	case EXPRESSION_ADDITION:
		compile_addition(&e->as_addition);
		break;
	}
}

Testing it out

At this point, we are capable of compiling simple arithmetic expressions. To test this out, we can write a small program like this one:

int main() {
	// var0 + (-var1 + 42)
	struct expression* e =
		addition(
			variable(0),
			addition(
				negation(variable(1)),
				int_literal(42)));
	compile_expression(e);
}
int_literal, variable, negation and addition are some helpers that build up the corresponding expressions.
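
If you want to follow along, here is a minimal sketch of two of those helpers; the other two follow the same pattern. I'm assuming plain malloc with no error handling:

#include <stdlib.h>

struct expression* int_literal(int value) {
	struct expression* e = malloc(sizeof *e);
	e->tag = EXPRESSION_INT_LITERAL;
	e->as_int_literal.value = value;
	return e;
}

struct expression* addition(struct expression* left, struct expression* right) {
	struct expression* e = malloc(sizeof *e);
	e->tag = EXPRESSION_ADDITION;
	e->as_addition.left_term = left;
	e->as_addition.right_term = right;
	return e;
}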

Running that main function produces the following output:

mov -8(%rsp), %rax
mov %rax, -88(%rsp)
mov -16(%rsp), %rax
neg %rax
mov %rax, -96(%rsp)
mov $42, %rax
add -96(%rsp), %rax
add -88(%rsp), %rax

Then, to be able to run it, we can wrap the generated code in a small assembly routine that takes two arguments, stores them in stack slots 0 and 1, and returns after executing the generated code. (Since the result ends up in %rax, which is where the x86-64 calling convention expects the return value, a plain ret is enough to hand it back.)

.global foo
foo:
	mov %rdi, -8(%rsp)
	mov %rsi, -16(%rsp)

	mov -8(%rsp), %rax
	mov %rax, -88(%rsp)
	mov -16(%rsp), %rax
	neg %rax
	mov %rax, -96(%rsp)
	mov $42, %rax
	add -96(%rsp), %rax
	add -88(%rsp), %rax

	ret

Finally, we hook into this code from C, and check that it returns the right thing:

#include <stdio.h>
#include <stdint.h>
int64_t foo(int64_t a, int64_t b);
int main() {
	for (int i = 0; i < 10; ++i) {
		for (int j = 0; j < 10; ++j) {
			printf("expected: %d, got: %ld\n", i-j+42, foo(i, j));
		}
	}
}

To do this we compile using GCC and run it in a terminal:

$ gcc main.c foo.s -o main
$ ./main
expected: 42, got: 42
expected: 41, got: 41
expected: 40, got: 40
... and so on ...

Conclusion

Writing a compiler is not as hard as it seems if we are willing to keep it simple. If we avoid introducing complexity ourselves, the main source of difficulty is understanding the target architecture, not the compilation process itself.

In the next parts we will look at how to compile classic control flow constructs such as if and while, as well as function calls and pointers.


