
Memory allocation

The main thing I want to explain here is the different ways a program can allocate memory. While I wanted to keep this tutorial code-free, as I tried to explain it I realized that this topic couldn't really be understood without seeing some code. So I'm going to show you some code in the C language. Also, before we get to memory allocation, I'm going to show you some more basic code samples, and I'll end up explaining lots of other things along the way, most of which will apply to all programming languages.

You'll need a C compiler installed. Many Linux distributions come with one preinstalled, but if yours doesn't, try installing a package called `gcc`. (GCC, the GNU Compiler Collection, includes the most commonly used C compiler on Linux.)

Now, here's just about the simplest program you could write:

// Lines that start with '//' are called comments. They have no effect on the
// code; they're just for explaining things.

// This line tells the compiler to check the file 'unistd.h' for information
// about the basic system calls and how to use them. 'unistd' presumably stands
// for Unix standard - C people *love* their abbreviations.
#include <unistd.h>
// On Linux, unistd.h and many similar files are probably in /usr/include.
// If you're curious, open /usr/include/unistd.h in a text editor. It'll
// look like gibberish to you, but it contains the information the compiler
// needs in order to know how to generate machine code that makes system calls.

// This is the header for our *main function*, which is basically where the
// program starts. Don't worry about the 'int' part for now. I will explain
// it later, though!
// After this opening '{', everything until the corresponding '}' at the end
// of the file is part of the main function.
int main() {
	// Now we're going to write some text to the terminal with the 'write'
	// system call. You may wonder what the hell these numbers are, but don't
	// worry, I'll explain them in a minute.
	write(1, "hello\n", 6);
	// The '\n' here stands for "newline". Without it, whatever you write next
	// would be on the same line as 'hello'. Therefore, you generally want to
	// put \n at the end of things you print.
}

To see it in action:

Copy this source code into a text file and save it, with a name like `test.c`. (It's conventional for source code files to have an extension that indicates what language they're written in.)

Open a terminal, navigate to where your `test.c` is saved, and run this: `gcc test.c`. This will compile the program into an executable. `test.c` won't be touched, but the executable will be saved as a separate file called `a.out` (the name is for historical reasons).

Run it: `./a.out`.

Okay! Now that you've seen it, it's time for me to explain some more about the source code, starting with the weird numbers given to `write`.

The first number is the file descriptor you want to write to. 1 means stdout. (0 would mean stdin, and 2 would mean stderr.) Note that this code never called `open` to get a file descriptor; you don't have to do that for the three standard streams.

The other number, the 6 at the end, is the length, measured in bytes, of the data you're writing ("hello\n").

Now you might wonder, why the hell do you have to tell it the length when you already told it the data you want to write? Why can't it just look at it and see how long it is? Explaining this is going to take a while but it's important. Let's start here: what do you think "hello\n" really is from the computer's perspective?

If you've learned about ASCII, you know that it's 6 bytes in a row: the byte for h, the byte for e, the byte for l, the byte for l again, the byte for o, and the byte for \n.

But it's not just those bytes. You see, because of the way machine code works, function parameters basically have to have a fixed size. But `write` needs to be able to work with different amounts of data. If it were designed to take a list of 6 bytes to write, you wouldn't be able to use it to write "goodbye\n". So, it can't just take the data directly. Instead what it takes is a *pointer* to the data.

A pointer is actually a number, but it doesn't get used for math - instead, it gets interpreted as a memory address that some data is stored at. Imagine your computer's memory as a long row of numbered boxes, each one of which holds one byte. The "hello\n" is stored across 6 consecutive boxes. If the 'h' is in box number 4294967296, then the 'e' is in box number 4294967297, and so on, up to the '\n' in box number 4294967301.

As for where I got the number 4294967296? It's just a big number I made up. The compiler and the operating system between them pick the memory address your data is stored at; it doesn't really matter what they pick, which is why C doesn't make you choose it yourself.

So, the actual middle parameter you're passing to `write` is the memory address that the first byte of "hello\n" is stored at. That tells it where your data starts, but not where it ends. That's why you have to also tell it the length.

Recap: when you call `write(1, "hello\n", 6);`, you're telling it to go to some memory address, take 6 bytes starting from there, and write them to the file descriptor 1.

Phew. Now you're probably getting a feel for how difficult low-level programming is, and why people want high(er)-level languages, where you can just write `print("hello")` without worrying about memory addresses and file descriptor numbers. To be fair, programming in C isn't quite as hard as I've made it look: there is an easier way to print things than this. I just wanted to show you how the raw syscall works first.

There are a couple more fundamental coding concepts I want to show you that will apply to all languages.

Variables

A variable is a named memory location that you store some data in, and you can change that data (hence the name "variable"). Variables are used basically everywhere in all languages. I'll show you how to create, use, and modify a variable.

// We're using a different #include this time, because I'm going to use that
// "easier way to print things" I mentioned instead of directly using the
// write syscall. The easier way comes from a file called stdio.h (short for
// standard IO; IO means input/output).
#include <stdio.h>

int main() {
	// This creates a variable named 'number'. The 'int' stands for integer,
	// which tells the compiler that we're going to use this variable as an
	// integer (whole number).
	int number;
	// This stores the value '5' in number.
	number = 5;
	// Here's the easier way to print things. It uses the write syscall behind
	// the scenes. The %d means we want it to print the decimal representation
	// of `number` instead of interpreting it as an ASCII character.
	printf("%d\n", number);

	// Now let's change the value and print it again.
	number = 6;
	printf("%d\n", number);

	// You can also create a variable and give it a value at the same time:
	int other_number = 7;
	printf("%d\n", other_number);
}

Run this the same way: paste it into a text file like `test2.c`, compile it with `gcc test2.c`, and run the resulting executable with `./a.out`. You should see 5, then 6, then 7.

Functions

In source code, a *function* is basically a compartmentalized, named part of a program. It usually does something specific and reusable. For example, that `printf` thing is a function; its purpose is to print things to stdout without the programmer having to manually figure out the length, and to let you use it on things other than pointers.

Let's create our own function that adds two integers together, but prints them as it's doing it.

#include <stdio.h>

// I'm calling the function 'visible_add' because it adds things while showing
// it to you. The 'int a' and 'int b' indicate that this function takes two
// parameters, called a and b, and both of them are ints.
// Also, the 'int' at the beginning means that this function has a *result* or
// 'return value', which is also an int (and in our case will be the sum of the
// two numbers it adds).
int visible_add(int a, int b) {
	// We use the %d thing twice because we have two numbers we want it to print.
	printf("adding %d + %d\n", a, b);
	// This returns the value a + b to wherever this function was called from.
	return a + b;
}

int main() {
	// Call the 'visible_add' function with 5 and 6 as its parameters, and
	// store its return value in a new variable called 'result'.
	int result = visible_add(5, 6);
	printf("result is %d\n", result);

	// We can call visible_add as many times and in as many places as we want,
	// without having to write out all of the code inside it every time.
	result = visible_add(2, 80);
	printf("result is %d\n", result);
}

Run this one!

Exit status

Brief aside to answer this question: if the 'int' at the beginning of visible_add means that it returns an int, why does main also have 'int' at the beginning? Because it also returns an int, just implicitly. When the main function has no 'return' statement in it, C compilers basically insert `return 0;` at the end of it.

Whatever the main function returns is considered the 'exit status' of the program, and is given to the operating system to indicate success or failure. Usually, a program exiting with 0 means it succeeded, and any other number means it failed. Some programs use different non-0 numbers to indicate different kinds of failure. Other programs just always exit 1 if they fail.

Now, it's finally time to talk about memory allocation. A running program's memory is divided into a few sections. When you need to store a piece of data in memory, there are generally three of these sections you can choose from, with different properties: the stack, the data segment, and the heap.

The stack

The stack stores information about currently-running functions. The information stored about each one is called a stack frame and includes things like the values of variables inside that function, and where the function was called from, so the program knows where to return to when the function finishes. In the sample above, the stack frame for visible_add stores the values of `a` and `b`, and the stack frame for main stores the value of `result`.

When the program starts, only the main function is on the stack. When it encounters a call to visible_add, it pauses main and pushes a stack frame for visible_add onto the stack, then starts running visible_add. When visible_add returns, it pops (removes) the stack frame for visible_add and returns to main, right where it left off.

All the variables you've seen so far have lived on the stack.

The program's command-line arguments and environment variables are also stored in this area of memory, just below the main function's stack frame. I'm not going to show how to access them here though, because that's beside the point and would be specific to C - other languages do it differently.

The stack has a fixed maximum size, but that size is platform-dependent. Generally you can assume it's big enough that you don't have to worry about it, but in rare situations (such as an infinite recursion bug, which leads to the program endlessly allocating more stack frames) a program can crash with a "stack overflow", which means it ran out of stack space.

Fun tidbit: stack overflow is the namesake of the most popular Q&A site for programming.

The data segment

The data segment is for things that exist outside of any functions. It is initialized before the program starts. Things stored there are called 'global', and any function can access them. That also means that if any function changes them, the change will affect the entire program.

Here's a demonstration:

#include <stdio.h>

int data_segment_thingy = 5;

// 'void' means this function doesn't return anything.
void blah() {
	data_segment_thingy = 6;
}

int main() {
	// Show the value of data_segment_thingy before and after calling blah.
	printf("%d\n", data_segment_thingy);
	blah();
	printf("%d\n", data_segment_thingy);
}

Observe that calling blah changes the value of data_segment_thingy, and that change affects main.

The heap

The heap is another section of a running program's memory. Like the stack and unlike the data segment, things on the heap get created inside functions, not when the program starts. But like the data segment and unlike the stack, things on the heap *don't* get deleted when the function that created them ends.

And unlike both the data segment and stack, the heap can grow at run-time without any particular limit (other than the whole computer's memory capacity). If you try to allocate heap memory and there isn't enough space, the program will do a syscall to ask the operating system for more memory.

In low-level languages like C, when you allocate space on the heap, you have to manually free it later. If you don't, you have a bug called a memory leak, which means your program's memory usage goes up and never goes down. If the memory leak is in a part of the code that runs repeatedly, this can mean the program hogs more and more memory the longer it runs.

In higher-level languages, the language will manage the heap for you, automatically freeing things when it detects they're no longer used, but this usually means some performance overhead (the program has to do complex analysis while running to figure out what pieces of data are no longer used). This automatic heap management is called garbage collection.

I'm not going to bother showing you how to use the heap in C, because it's beside the point, but you can look it up if you're curious or if you choose to learn C.

Pros and cons

What are the pros and cons of each place you can allocate memory?

Data segment:

Fast, because the space is made when the program starts and never needs to be moved or copied.
Things here must have a fixed size.
Things here never get freed.
Different parts of the program can't have their own, separate copies of things in the data segment.

Stack:

Fast, because the program automatically knows where to put it (top of the stack), and it's right next to the other things the program is probably accessing at the same time, which improves performance (a concept called cache locality).
Things here must have a fixed size.
Things here are freed when the function ends, so they can't outlive that function (except through return values, which are copied).
Every function call makes a new stack frame, so there's no risk of different parts of the program overwriting each other's variables.
The stack has a limited size, so you can't store very large things on it.

Heap:

Slow, because the program has to do complex logic at run-time to find a place on the heap to fit your data, and may have to do a syscall to get more memory. Things here also consume a little extra space because the program has to also keep track of where each thing is and how big it is.
Things here can be resized.
Things here are freed when you free them. In low-level languages, this makes the heap harder to use and very mistake-prone because you're responsible for freeing things. In high-level languages, it makes the heap even slower because the program has to do complex analysis while running to figure out when to free things. The upside is that controlling when things are freed makes the heap the most flexible of the three.

The text segment

This isn't a place you can allocate memory. I just thought that after explaining the other three, I should explain the last major section of process memory.

The 'text' segment contains the machine instructions of the program. It is of course a misnomer because machine instructions are not human-readable text, but that's what it's called.

Segmentation fault

A segmentation fault or segfault is when a program tries to access memory that the operating system hasn't assigned to it. Usually this results in the operating system killing the program abruptly, which makes debugging hard because there's usually no error message beyond "Segmentation fault" and no indication of which line of code caused the problem.

In low-level languages, it's easy to get this to happen by creating a pointer that points to an address you haven't allocated anything at, and then trying to dereference (follow) that pointer. It often happens by accident when you use the heap, because it's easy to forget that some part of the program is still holding a pointer to something and free it prematurely. Freed memory may be handed back to the operating system, so trying to use it afterward may result in a segfault.

In high-level languages, this basically never happens, unless you're using a library written in a low-level language.