Dynamic Memory Allocation based on OS Virtual Memory API

Disclaimer: I’m not an operating system expert. If I’m talking bullshit anywhere below, please leave a comment.

Motivation

Entity Database

In Usagi Engine, the memory used by the Entity Database is allocated from a Pool Allocator. The Database uses this memory for two key components of the architecture: the metadata of Entities and the data of Components.

Both Entities and Components are managed in Pages (not the same as pages in virtual memory) whose size is set by a compile-time parameter of the Database. This means that whenever a new Entity is created on a new page, storage for the metadata of all the Entities on that page is also allocated. It also means that if none of the Entities on a Page has a certain type of Component, no storage is allocated for that Component. However, if one Entity has that Component, storage is allocated for all Entities on the same page, including Entities with no Components at all.
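To make the layout concrete, here is a minimal sketch (my own simplified illustration, not the actual Usagi Engine code) of how a Page might lazily allocate Component storage:

```cpp
#include <array>
#include <cstddef>
#include <memory>

constexpr std::size_t PageSize = 32; // Entities per Page; a compile-time parameter

struct EntityMeta { /* component masks, flags, ... */ };

template <typename Component>
struct ComponentPage
{
    // Lazily allocated: nullptr until some Entity on this Page gains the
    // Component, after which slots exist for all PageSize Entities.
    std::unique_ptr<std::array<Component, PageSize>> storage;
};
```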

The underlying assumption behind this design is that the Entities created by a certain System tend to experience similar modifications throughout their lifetimes.

When a System is invoked to perform its update task, it usually iterates over a range of Entities filtered by some Entity View, which internally holds a pair of iterators over the pages inside the Pool Allocator. This is fine as long as no new Entities are being inserted. If an insertion does happen, it may cause the Pool Allocator to expand its underlying buffer, invalidating the iterators we are using to visit the Entities. Inserting Components does not suffer from this problem, because Components are always accessed by indices.
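A minimal demonstration of the hazard (my own toy example, not engine code):

```cpp
#include <vector>

int main()
{
    std::vector<int> entities{1, 2, 3};
    auto it = entities.begin();
    // If this push_back exceeds the capacity, the vector reallocates and
    // `it` now dangles into the freed buffer.
    entities.push_back(4);
    // int x = *it; // undefined behavior after a reallocation
}
```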

Pool Allocator

To be clear, the Pool Allocator uses std::vector to store the pages. Besides the problem of iterator validity, the performance impact of the vector's expansion deserves some more attention.

The C++ standard merely states that the vector container must provide amortized constant time for insertion at the end, without mandating the exact strategy:

26.3.11.1 Class template vector overview [vector.overview]

A vector is a sequence container that supports (amortized) constant time insert and erase operations at the end; insert and erase in the middle take linear time. Storage management is handled automatically, though hints can be given to improve efficiency.

C++17 Standard Draft N4713

Standard library implementations such as Microsoft’s C++ Standard Library and libc++ meet the requirement by growing the underlying buffer geometrically when the size of the vector exceeds its capacity (libc++ doubles the capacity, while Microsoft’s implementation grows it by a factor of 1.5). When this happens, the vector first allocates a new, larger buffer and uses move semantics (when available) to transfer the old elements into it. The old buffer is deallocated after the transfer succeeds.
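You can observe the growth strategy of your own standard library with a small experiment:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    std::size_t cap = v.capacity();
    for(int i = 0; i < 1000; ++i)
    {
        v.push_back(i);
        if(v.capacity() != cap)
        {
            cap = v.capacity();
            // Each line marks a reallocation: the old elements were moved
            // to a freshly allocated, larger buffer.
            std::cout << "size=" << v.size() << " capacity=" << cap << '\n';
        }
    }
}
```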

The problem is that most games have a soft real-time requirement: meeting the deadline for presenting each frame. We cannot let the expensive allocate-copy-deallocate operation happen at unpredictable times, even if it provides amortized constant time complexity for insertion, because it will certainly cause a missed vertical sync, making the player feel a lag and be unhappy.

While C provides the malloc(), free(), and realloc() functions, C++ only adds the two operators new and delete to support object-oriented programming. There is nothing for manipulating raw buffers in a C++ way. Although the std::allocator used by vector allocates uninitialized storage, it doesn’t provide any mechanism to reallocate or resize existing storage either.
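For comparison, C’s realloc() may grow a buffer in place when possible, something std::allocator cannot express (a minimal sketch):

```cpp
#include <cstdlib>

int main()
{
    void *buf = std::malloc(1024);
    // realloc() may extend the block in place, or move it to a new
    // address; the caller cannot request "in-place only" growth.
    void *bigger = std::realloc(buf, 4096);
    if(bigger) buf = bigger;
    std::free(buf);
}
```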

Even if we had such an interface for reallocation, it would not be enough, because we want references to the elements to remain valid after the expansion of the storage: we are reallocating the storage while iterating it. This means the base address of the buffer must not change. If it changes, not only must we update the current pair of iterators, but other Systems with ongoing iterations may also access the wrong memory. That would practically forbid the parallel execution of multiple Systems.

Sure, we can defer the insertion of Entities until all Systems have finished executing. But doing so requires extra memory management and becomes a performance bottleneck if the number of insertions is large.

Better Approach?

There is a reason that game developers tend not to use the standard library: the abstractions come at a cost. So how about bypassing the standard library altogether and exploiting the virtual memory system of the operating system directly? The operating system has full power over the mapping between the virtual addresses used in our program and the physical memory they may consume. If we can manipulate that mapping, we can expand the buffer at virtually no cost. Instead of allocating a chunk of memory like std::vector does, we allocate a range of virtual addresses from the operating system without consuming actual physical memory, and only inform the operating system of the range we are actually going to use.

Virtual Memory

Modern operating systems usually support Memory Overcommitment and Demand Paging, which provide the foundations of our allocator design.

Memory Overcommitment

Overcommitting means that the total memory reserved by all processes exceeds the size of physical memory, or sometimes physical memory plus swap space. It is an invention to increase the utilization of physical memory: usually not all processes use all the memory they reserve, so other processes can take advantage of the free space.

Note that overcommitment might not be available on some systems, so we had better only reserve the virtual addresses without committing the memory.

Demand Paging

Demand paging means that the operating system allocates physical pages lazily. When we allocate a chunk of memory, that range of virtual addresses is recorded as used and thus contributes to the size of committed memory. However, no physical memory is actually consumed until the program accesses the corresponding page. When such an access happens, it causes a page fault that is handled by the operating system, which allocates a physical page for it.

Strategy

Let’s review our objectives. We want a memory allocator which:

  • Expands and shrinks dynamically with a low cost
  • Never changes the base address of the buffer
  • Only consumes physical memory when necessary

This memory allocator sounds just like a stack allocator. It is one, except that we explicitly inform the operating system about our usage of the memory.

Virtual Address Space

The first decision to make is the size of the address space we are going to reserve. On a 64-bit machine, virtual addresses are quite a plentiful resource: the user-mode address space is 128 TB on both 64-bit Linux and 64-bit Windows.

Meanwhile, a typical consumer computer today has 16~32 GB of memory (some rare monsters have 64 GB or more). By comparison, we should never be afraid of running out of address space.

Since we are going to use the allocator with the Entity Database, let’s do some calculations. Suppose each Component is 512 bytes in size, which is quite large, and there are 1024 types of Components in total. If we reserve 1 GB for each Component type, only 1 TB of address space will be used. With 1 GB = 1024 * 1024 * 1024 bytes, each type can store 1024 * 1024 * 2 = 2,097,152 Components, which is more than enough.
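The same arithmetic expressed as compile-time checks:

```cpp
#include <cstddef>

constexpr std::size_t component_size   = 512;        // bytes per Component
constexpr std::size_t per_type_reserve = 1ull << 30; // 1 GB per Component type
constexpr std::size_t component_types  = 1024;

static_assert(per_type_reserve / component_size == 2'097'152,
    "Components storable per type");
static_assert(component_types * per_type_reserve == 1ull << 40,
    "total reserved address space: 1 TB");
```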

The Allocator Design

The allocator simply reserves a range of virtual addresses during initialization. When it is requested to change size, it notifies the operating system of the newly used range and decommits the unused range.
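In code, the interface could look like this (a sketch with names of my own choosing; the platform-specific implementations follow in the next sections):

```cpp
#include <cstddef>

class VirtualMemoryAllocator
{
    char *mBase = nullptr;       // base of the reserved range; never changes
    std::size_t mReserved = 0;   // total reserved address space in bytes
    std::size_t mCommitted = 0;  // bytes currently committed and usable

public:
    // Reserve the whole address range up front without committing anything.
    explicit VirtualMemoryAllocator(std::size_t reserved_bytes);

    // Grow or shrink the committed prefix of the range. The base address
    // never changes, so existing pointers and iterators remain valid.
    void resize(std::size_t committed_bytes);

    char * base_address() const { return mBase; }

    ~VirtualMemoryAllocator(); // releases the whole reservation
};
```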

Implementation on Linux

In Linux, the virtual address space of a process is divided into Virtual Memory Areas (VMAs). This information is stored in the internal process structure as both a doubly-linked list and a red-black tree. Linux provides the mmap() family of functions for managing virtual memory. mmap() creates a mapping of a virtual address range, which corresponds to a VMA, so that when any address within the range is accessed, a page fault occurs and the handler allocates a physical page for it.

In Linux, a sysctl parameter, vm.overcommit_memory, controls the policy regarding overcommitment. If the setting disables overcommitment, we cannot reserve enough address space unless the PROT_NONE flag is specified, which prevents the allocation from being counted in the commit size. If overcommitment is allowed, we can instead specify the MAP_NORESERVE flag. In this case the allocation contributes to the commit size, but the kernel essentially ignores whether there will be enough space for it.
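In code, reserving the address space might look like this (a sketch; error handling omitted):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Reserve a range of addresses without committing memory. PROT_NONE keeps
// the range out of the commit accounting, so this works even when
// overcommitment is disabled.
char * reserve_address_space(std::size_t size)
{
    void *base = mmap(
        nullptr, size,
        PROT_NONE,                    // inaccessible, not committed
        MAP_PRIVATE | MAP_ANONYMOUS,  // not backed by any file
        -1, 0
    );
    return base == MAP_FAILED ? nullptr : static_cast<char *>(base);
}
```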

There are some related links regarding reserving memory:

We can learn from the last link that we can call mmap(PROT_NONE) to decommit a range of an existing mapping, and mprotect(PROT_WRITE | PROT_READ) to commit the memory. The email also claims that madvise() isn’t effective at decommitting memory.
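Translated into code, committing and decommitting might look like this (a sketch based on the approach above; error handling omitted):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Make [addr, addr + size) accessible. Physical pages still arrive lazily
// through demand paging on first touch.
bool commit(void *addr, std::size_t size)
{
    return mprotect(addr, size, PROT_READ | PROT_WRITE) == 0;
}

// Return the physical pages to the OS while keeping the addresses
// reserved, by replacing the range with a fresh PROT_NONE mapping.
bool decommit(void *addr, std::size_t size)
{
    return mmap(addr, size, PROT_NONE,
        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) != MAP_FAILED;
}
```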

During any operation of the mmap() family, the kernel first searches the VMA tree to locate the corresponding entry. This suggests that the execution time of mmap() grows logarithmically with the number of mappings; however, this is only my guess. The number of mappings is also limited by vm.max_map_count, so you cannot create them indefinitely.

To better control the performance hit caused by page faults, we can also supply the MAP_POPULATE flag to mmap() when committing a range of addresses. This is called prefaulting the pages, which causes physical pages to be allocated for the address range immediately.
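As a sketch, a committing variant that prefaults:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Commit with prefaulting: MAP_POPULATE asks the kernel to allocate the
// physical pages immediately instead of on first touch.
bool commit_prefaulted(void *addr, std::size_t size)
{
    return mmap(addr, size, PROT_READ | PROT_WRITE,
        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
        -1, 0) != MAP_FAILED;
}
```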

It is also worth noting that Linux can relocate a range of mapped addresses using mremap(), which may cause the base address to change if the old location cannot fit the larger size. But this is currently irrelevant to us.

Another note: pages with PROT_NONE access can be used as guard pages to catch buffer overflows.
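For example (a sketch):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Revoke access to a single page so any stray access faults immediately
// instead of silently corrupting adjacent memory. Note that with the
// reserve/commit scheme above, the uncommitted tail of the reservation
// already behaves like one big guard region.
bool install_guard_page(void *page, std::size_t page_size)
{
    return mprotect(page, page_size, PROT_NONE) == 0;
}
```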

Implementation on Windows

Interestingly, committing and decommitting memory is comparatively easier on Windows, since its function for allocating virtual memory, VirtualAlloc(), provides two flags, MEM_RESERVE and MEM_COMMIT, to explicitly state whether you want to commit the memory immediately. If we only want to reserve address space, we can use VirtualAlloc(MEM_RESERVE, PAGE_NOACCESS). To decommit a range of memory, VirtualFree(MEM_DECOMMIT) does the job. Windows finally has something neater this time!
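The Windows equivalents of the earlier helpers, again as a sketch with error handling omitted:

```cpp
#include <windows.h>

// Reserve address space without committing any memory.
char * reserve_address_space(SIZE_T size)
{
    return static_cast<char *>(
        VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_NOACCESS));
}

// Commit a sub-range of the reservation. Physical pages are still
// allocated lazily on first touch.
bool commit(void *addr, SIZE_T size)
{
    return VirtualAlloc(addr, size, MEM_COMMIT, PAGE_READWRITE) != nullptr;
}

// Return the physical pages to the OS, keeping the addresses reserved.
bool decommit(void *addr, SIZE_T size)
{
    return VirtualFree(addr, size, MEM_DECOMMIT) != 0;
}
```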

The VirtualAlloc() function internally calls NtAllocateVirtualMemory(), which is a system call inside ntoskrnl.exe (the other part of the kernel is win32k.sys, which handles the GUI subsystem, btw). From its implementation, we can see that it also operates on a tree structure, but an address allocation record is called a Virtual Address Descriptor (VAD) this time. If you want to read neater source code, ReactOS, which aims to be a free and open-source alternative to Windows, might be a good choice. It does essentially the same thing regarding the allocation.

Unlike in Linux, once a range of virtual memory is allocated on Windows, there is nothing like mremap() to move it to another location or change its size. You must be decisive about your allocation size.

Another difference is that there isn’t a limit on how many VirtualAlloc() allocations you can make. But the more you make, the slower it gets.

Bitmap Allocator?

Since we can now selectively commit and decommit ranges of memory, it seems we could build a bitmap allocator that, when releasing a chunk of memory, simply returns the physical pages to the operating system. Instead of suffering from fragmentation of physical memory, we let the virtual addresses fragment instead.
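A rough sketch of the idea, assuming the Linux commit/decommit calls from earlier (names are hypothetical; this is not implemented in the engine):

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <vector>

class BitmapPageAllocator
{
    char *mBase;                  // start of a range reserved via mmap(PROT_NONE)
    std::size_t mPageSize;
    std::vector<bool> mCommitted; // one bit per page of the reserved range

public:
    BitmapPageAllocator(char *base, std::size_t page_size, std::size_t pages)
        : mBase(base), mPageSize(page_size), mCommitted(pages, false) {}

    // Commit the first free page and return its address, or nullptr.
    void * allocate_page()
    {
        for(std::size_t i = 0; i < mCommitted.size(); ++i)
        {
            if(mCommitted[i]) continue;
            char *page = mBase + i * mPageSize;
            if(mprotect(page, mPageSize, PROT_READ | PROT_WRITE) != 0)
                return nullptr;
            mCommitted[i] = true;
            return page;
        }
        return nullptr; // address range exhausted
    }

    // Give the physical page back to the OS; the virtual address stays
    // reserved and merely becomes a hole in the address space.
    void free_page(void *ptr)
    {
        const std::size_t i =
            (static_cast<char *>(ptr) - mBase) / mPageSize;
        mmap(ptr, mPageSize, PROT_NONE,
            MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mCommitted[i] = false;
    }
};
```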

Maybe one day we will get back to this topic.

Yukino
