[EN] Representing Programs Within Programs is Surprisingly Easy

Suppose that you want to create a brand new programming language that will change the world. Or that you want to auto-formatter that gets indentation just right. Or maybe you want to write a pre-processor for an existing language, to hack some missing feature into it.

Whatever it may be, there are many reasons to write programs that manipulate other programs as data and the first step is to figure out how you are going to represent that data within the program.

My goal in this article is to explain one of the most standard approaches (Abstract Syntax Trees) and show how they are actually really easy to understand and implement.

This article has many code examples in the C language.

What kind of language are we defining?

Let's start with a simple example: a language with variables, numbers, and addition. Note that this language does not have statements, but only expressions that evaluate to simple values. This means that we will be able to represent expressions such as 0, 1+1, x+y, and even a+(b+1).

We will use different data types for each type of expressions. These data types need to store just enough information to encode the behavior of the expression, and not it's full textual representation.

i.e. we will store only the information that is needed to execute this code, but not enough to reconstruct the corresponding string.
This is a design decision. If you were writing a code formatter, you would also need to store that kind of information!

Let's start with integer literals. What kind of information do we need? That's right. We only need to store the actual value of the literal, as that is enough to encode the behavior of such an expression.

Then, the most straightforward representation possible would be something like this:

struct int_literal {
	int value;
};

And that is completely sufficient for our purposes (in fact, this representation can get you quite far, even as far as a full compiler).

For a variable, let's say its name is enough to represent it. We could store more information, like whether it's a global variable, a pointer to its declaration (if our language had such things), etc., but let's try to keep it simple here.

struct variable {
	char* name;
};

Now, the last piece of the puzzle is addition. The sum of two other expressions. And what is it that we need to store? Well, it's precisely those two other expressions.

Here is when we run into a bit of a problem. An addition could have any type of expression as its operands. It could be a literal plus a literal, a variable plus a variable, combination of the two or, more crucially, it could have other additions as operands.

To handle this, we need some data type that can represent any kind of expression. Since we don't have one right now, let's pretend we do, and then try to invent it.

struct addition {
	struct expression* left_term;
	struct expression* right_term;
};

This is often the bit that is confusing to most people. We want addition to have two fields, which can be of variety of different types depending on runtime-provided data (perhaps we could say the fields are polymorphic), and the type of the fields will sometimes be an addition itself (perhaps we could say the data is recursive).

To implement this so-called struct expression, we need to design a data type that can hold either a struct int_literal, a struct variable or a struct addition.

This might seem impossible, but let's see how this is actually pretty easy to do, even in C. A neat way to do that in C is using a tagged union.

Unions in C

A union in C is just a chunk of memory large enough to store any one of a set of data types. For example if we had a type A that is 10 bytes large, and a type B that is 15 bytes large, the union of those two types would be 15 bytes.

The idea is that an object of such a type will sometimes hold a value of type A and sometimes hold a value of type B. Let's see a contrived example:

// 12 bytes
struct a {
	char data[12];
};

// 8 bytes in a 32-bit machine, 16 bytes in a 64-bit machine
struct b {
	void* x;
	void* y;
};

// sizeof(union c) is the greater of sizeof(struct a) and sizeof(struct b)
// 12 bytes in a 32-bit machine, 16 bytes in a 64-bit machine
union c {
	struct a as_a;
	struct b as_b;
};

Tagged Unions

Although an object of type union c will sometimes hold a struct a and sometimes a struct b, there is no way to query which one it's holding at runtime. For this, we add an auxiliary tag value, that will store this information.

union c {
	struct a as_a;
	struct b as_b;
};

enum c_tag { C_A, C_B };

struct tagged_c {
	enum c_tag tag;
	union c data;
};

Constructing one of these values is straightforward, if a bit long-winded.

struct tagged_c make_a(void* x, void* y) {
	struct tagged_c result;
	result.tag = C_A;
	result.data.as_a.x = x;
	result.data.as_a.y = y;
	return result;
}

struct tagged_c make_b(char* data) {
	// ... analogous code ...
}

Then, we can write code that processes either As or Bs by looking at the tag:

void process(struct tagged_c* data) {
	switch(data->tag) {
	case C_A:
		process_a(data->data.as_a);
		break;
	case C_B:
		break;
		process_b(data->data.as_b);
	}
}

Back to expressions

This knowledge can be applied directly to expressions. We just need to create a tagged union that can hold any one of the three expression types.

enum expression_tag {
	EXPRESSION_INT_LITERAL,
	EXPRESSION_VARIABLE,
	EXPRESSION_ADDITION,
};
struct expression {
	enum expression_tag tag;
	union {
		struct int_literal as_int_literal;
		struct variable as_variable;
		struct addition as_addition;
	};
};

Here I used some syntactic sugar to avoid declaring a separate union type.
This will also let us access the as_xxx fields directly instead of having an intermediate data field as before.

To make the construction of these values convenient, I recommend writing some helpers like the following:

struct expression* addition(struct expression* left_term, struct expression* right_term) {
	struct expression* result = malloc(sizeof(struct expression));
	result->tag = EXPRESSION_ADDITION;
	result->as_addition.left_term = left_term;
	result->as_addition.right_term = right_term;
	return result;
}
struct expression* int_literal(int value) { /* analogous code */ }
struct expression* variable(char* name) { /* analogous code */ }

Then, we can create expressions freely:

// 0
struct expression* example1 = int_literal(0);

// 1+1
struct expression* example2 = addition(int_literal(1), int_literal(1));

// x+y
struct expression* example3 = addition(variable("x"), variable("y"));

// a+(b+1)
struct expression* example4 = addition(variable("a"), addition(variable("b"), int_literal(1)));

How do we use these expressions?

As alluded to earlier, these expressions are a recursive data type. This means that it will usually be most convenient to use recursion when processing expressions.

For example, let's write a simple expression printer.

void print(struct expression* e) {
	switch (e->tag) {
	case EXPRESSION_INT_LITERAL:
		printf("%d", e->as_int_literal.value);
		break;
	case EXPRESSION_VARIABLE:
		printf("%s", e->as_variable.name);
		break;
	case EXPRESSION_ADDITION:
		printf("(");
		print(e->as_addition.left_term);
		printf("+");
		print(e->as_addition.right_term);
		printf(")");
		break;
	}
}

While this can take a while to get used to, it must be noted that most recursion on ASTs is quite mechanical. It usually it comes in form of "do some preprocessing, recurse over all subexpressions, do some postprocessing", and the code ends up being quite uniform, so it can be picked up quite quickly, even without fully understanding recursion.

But does this scale?

This basic design can be taken pretty darn far. It's not difficult at all to create new types of expressions and add them to the tagged union:

struct anonymous_function {
	// array of parameters
	int parameter_count;
	char** parameters;

	struct expression* body;
};

struct function_call {
	struct expression* target;

	// array of arguments
	int argument_count;
	struct expression** arguments;
};

enum expression_tag {
	// .. old tags ..
	EXPRESSION_FUNCTION_CALL,
	EXPRESSION_ANONYMOUS_FUNCTION,
};
struct expression {
	enum expression_tag tag;
	union {
		// .. old types ..
		struct anonymous_function as_anonymous_function;
		struct function_call as_function_call;
	};
};

And we can easily extend any function that uses the tagged union, like our printer from before:

void print(struct expression* e) {
	switch (e->tag) {
	// .. old cases ..
	case EXPRESSION_ANONYMOUS_FUNCTION:
		printf("fun (");
		for (int i = 0; i < e->as_anonymous_function.parameter_count; ++i) {
			if (i > 0) printf(", ");
			print(e->as_anonymous_function.parameters[i]);
		}
		printf(") => ");
		print(e->as_anonymous_function.body);
		break;
	case EXPRESSION_FUNCTION_CALL:
		print(e->as_function_call.target);
		printf("(");
		for (int i = 0; i < e->as_function_call.argument_count; ++i) {
			if (i > 0) printf(", ");
			print(e->as_function_call.arguments[i]);
		}
		printf(")");
		break;
	}
}

Conclusion

Although programs have the less-common feature of being recursive in nature, this can be handled simply and in a mechanical fashion. In my opinion, this implies that representing programs within programs is no more difficult than any other data-modelling task.

I hope this blog will encourage you to write programs that manipulate programs!

While we have shown how representing programs in programs is fairly easy we still have some questions to answer, which I tackle in other blogs:

How do we get an AST out of a plain-text program? (Writing parsers is surprisingly easy, not out yet)
How do I compile a program in AST form to assembly? (Writing a compiler is surprisingly easy)

About me

My name is Sebastian. I'm a competitive programmer (ICPC World Finalist) and coach. I also enjoy learning about programming language theory and computer graphics.

Social links:

Search This Blog

Sebastian Mestre's Algorithmic Adventures