A new memory manager for JNode

Implementation of a new memory manager for JNode has begun. The new design is meant to lay a foundation for a more concurrent memory manager and more efficient garbage collection. To get there, the new memory manager and its heaps are being built around finer-grained heap structures that allow as much concurrency as possible. First, though, I'll talk about the changes being made to go from JNode's current MM to the new one.

From Old to New

The current MM in JNode uses a heap that is allocated in 16 MB blocks. The blocks are chained into a linked list that grows as more heap space is required. This has been changed to give JNode a single large heap that occupies all of the available memory. The new MainHeap design still uses the allocation bitmap from the old design. The GC uses this bitmap to verify whether a given value 'is an object'. The bitmap has 1 bit for every 8-byte-aligned address within the heap, and all allocated objects are 8-byte aligned. I'll say more about this when I talk about stack maps.
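
To make the bitmap idea concrete, here is a minimal sketch of how a 1-bit-per-8-bytes allocation bitmap could look. The names (AllocationBitmap, heapStart) are my own illustration, not JNode's actual classes; it only assumes the 8-byte alignment described above.

    // Hypothetical sketch: 1 bit per 8-byte-aligned slot in the heap.
    final class AllocationBitmap {
        private final long heapStart;   // base address of the heap
        private final long[] bits;      // 1 bit per 8-byte slot

        AllocationBitmap(long heapStart, long heapSize) {
            this.heapStart = heapStart;
            this.bits = new long[(int) ((heapSize / 8 + 63) / 64)];
        }

        private int slotOf(long address) {
            return (int) ((address - heapStart) >>> 3); // 8-byte granularity
        }

        void markAllocated(long address) {
            int slot = slotOf(address);
            bits[slot >>> 6] |= 1L << (slot & 63);
        }

        // Used by the GC to test whether a value points at a real object.
        boolean isObject(long address) {
            if (address < heapStart || ((address - heapStart) & 7) != 0) {
                return false; // not 8-byte aligned within this heap
            }
            int slot = slotOf(address);
            return (bits[slot >>> 6] & (1L << (slot & 63))) != 0;
        }
    }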

The new MM will also separate objects by size. Objects larger than a given threshold are allocated directly in the MainHeap, while objects at or below the threshold are allocated into array-like structures called FixedSizeHeaps (FSHs). FSHs are created for each object size, rounded up to a multiple of 8 bytes, and are linked together to form a list for that particular object size. With a threshold of 32 bytes there would be 5 lists of FSHs, for 0/8/16/24/32 bytes. Aligning objects like this may not be space efficient, but for development it is easier to make stable. Many other JVM/GC implementations have found that the heap is largely populated (40-70%) with objects in the 0-32 byte range, and that this small subset of a heap's objects is also relatively short-lived; generational GC algorithms grew out of those observations. Because these small objects tend to have short lifespans, they also bring a lot of fragmentation, which can be compacted more easily once a GC cycle is complete.
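
As a rough illustration of the size separation, here is a hedged sketch of how an allocation request could be routed. The names (Heap, SizeClassRouter) are hypothetical; the only assumptions are the 32-byte threshold and 8-byte alignment mentioned above.

    // Hypothetical sketch: route allocations by size class.
    interface Heap {
        long allocate(int size); // returns the address of the new object
    }

    final class SizeClassRouter {
        static final int THRESHOLD = 32; // bytes; small/large boundary
        static final int ALIGN = 8;      // bytes; object alignment

        private final Heap mainHeap;
        private final Heap[] fixedSizeHeaps; // one FSH list per size class: 0/8/16/24/32

        SizeClassRouter(Heap mainHeap, Heap[] fixedSizeHeaps) {
            this.mainHeap = mainHeap;
            this.fixedSizeHeaps = fixedSizeHeaps;
        }

        long allocate(int size) {
            if (size > THRESHOLD) {
                return mainHeap.allocate(size);              // large objects go to the MainHeap
            }
            int aligned = (size + ALIGN - 1) & ~(ALIGN - 1); // round up to a multiple of 8
            return fixedSizeHeaps[aligned / ALIGN].allocate(aligned);
        }
    }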

With these new heap designs will also come a better synchronization model. Currently the MM is single threaded. The new heap designs allow finer-grained locking, greatly reducing the amount of time locks are held. For example, with the object size separation, large objects can be allocated concurrently with small objects, and the further separation of small objects by size (0/8/16/24/32) lets allocations of different sizes happen concurrently. We could take this a step further and allow concurrent allocation of objects of the same size, but for now we will not go that far. Synchronization brings its own realm of problems and bugs, many of which are very difficult to detect, especially in an unstable environment.
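
One way the per-size-class locking could look, again only as a sketch reusing the hypothetical Heap interface from the previous example: each size class gets its own lock, so an 8-byte allocation never waits on a 24-byte allocation.

    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical sketch: one lock per size class instead of one global MM lock.
    final class LockedSizeClass {
        private final ReentrantLock lock = new ReentrantLock();
        private final Heap heap; // the FSH list for this size class

        LockedSizeClass(Heap heap) {
            this.heap = heap;
        }

        long allocate(int size) {
            lock.lock();           // only blocks allocators of this same size class
            try {
                return heap.allocate(size);
            } finally {
                lock.unlock();
            }
        }
    }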

The last major portion being enhanced is the garbage collection. JNode's GC is a mark-sweep collector that fully Stops The World (STW): thread scheduling is disabled while the GC runs and re-enabled when it completes. This is required because in the mark-sweep algorithm it is the job of the mark phase to create a 'snapshot' that tells the sweep phase what is garbage and what is not. If other threads were allowed to run, this snapshot would be altered out from under the collector. A write barrier is used to let the runtime threads update the snapshot while it is being built during the mark phase. This does add overhead to the runtime threads, but it is only active during the mark phase, which would otherwise require STW. The bonus for us is that the compiler part of the write barrier is already in place, along with a WriteBarrier implementation; our only job is to make it work with the mark phase and then improve the WriteBarrier implementation.
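
To show the mechanism rather than JNode's actual WriteBarrier, here is a hedged sketch of a snapshot-style write barrier: while the mark phase is active, mutator threads record the reference they are about to overwrite so the marker can still trace it. The class and method names are illustrative only.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sketch of a write barrier that keeps the mark-phase snapshot consistent.
    final class MarkPhaseWriteBarrier {
        static volatile boolean markPhaseActive = false;
        private static final Deque<Object> rememberedRefs = new ArrayDeque<Object>();

        // Called by compiled code before every reference store: obj.field = newValue
        static void beforeReferenceStore(Object oldValue) {
            if (markPhaseActive && oldValue != null) {
                synchronized (rememberedRefs) {
                    rememberedRefs.push(oldValue); // keep the old target visible to the marker
                }
            }
        }

        // Drained by the collector during marking.
        static Object takeRememberedRef() {
            synchronized (rememberedRefs) {
                return rememberedRefs.isEmpty() ? null : rememberedRefs.pop();
            }
        }
    }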

As these design ideas become more stable, we will start to branch into changes that involve more of JNode's subsystems, like the scheduler and the compiler. The problem with trying to apply more efficient algorithms is the same problem that was faced when CPUs tried to perform floating-point calculations without an FPU. The JIT compiler is our FPU, and currently the only support we have from it is for the write barrier. The good part is that many of the changes needed at the compiler level require much the same implementation as the write barrier, just at different locations.

For now I'll leave it at this, as there is still much work to be done to make these ideas stable before work can continue in other areas.