Updates on the MM

As I posted in my previous blog entry, work had begun on improving JNode's memory management. We had successfully implemented our initial design of using a single heap instead of multiple linked heaps, along with a special structure for efficient storage of small objects. The problem I found once the new system was working was telling whether anything had actually improved, especially when I started reducing the memory manager's locks to a finer-grained level. In theory, allocation speed would be greatly increased, but we had no simple way of measuring it, as multiple attempts at getting benchmarking applications to run on JNode had failed. Without this it was impossible to tell what sort of effect our new system was having on the runtime performance of JNode, which is as important as, or more important than, the allocation speed of the memory manager.

As such, I have begun to develop just that: a benchmarking suite aimed at testing different parts of JNode. The suite is being developed with no direct ties to JNode, using only the Java API, so that it can be run on other JVMs to give a reasonable comparison of how JNode performs against other JVM implementations. Although I am calling it a benchmark suite for now, my goal is to design it in such a way that any sort of test can be created and run with it, including regression testing, something JNode also needs.

Apart from that, we have made some non-performance observations about the new implementation. One thing that quickly became clear to me was that neither the linked list of heaps nor the single main heap was helping with the more serious issues of the MM:

1) Locality of Reference
2) Lock Contention
3) Data Isolation

Locality of reference is extremely important to the performance of applications, especially for frequently used information. The reason CPU cache, which is relatively small, gives so much extra performance is based on the observation that many applications spend the largest portion of their runtime executing a small subset of their instructions. The same is true of an application's data. To make the best use of the cache, frequently used data should be kept as sequential as possible. Our biggest issue at the moment is that all threads across the system allocate their data into the same heap. If a single thread is allocating, this is not much of an issue, but when multiple threads allocate multiple objects concurrently, each thread's data becomes interleaved with the others', causing a sort of fragmentation and spreading each thread's data over a larger address space. Given this, it would seem we need changes to the allocation pattern of the memory manager so as to provide some separation.
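
To make the cache argument a bit more concrete, here is a small, self-contained Java sketch (not part of JNode, and only a rough illustration) that walks the same array sequentially and then with a large stride. On most hardware the sequential pass is noticeably faster, purely because of cache locality; a serious measurement would of course use a proper harness.

```java
// Rough, standalone illustration of locality of reference (not JNode code).
// Timings are only indicative; a real measurement would use a benchmark harness.
public class LocalityDemo {
    private static final int SIZE = 1 << 24;   // ~16M ints, larger than typical caches
    private static final int STRIDE = 4096;    // jump far enough apart to defeat the cache

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < SIZE; i++) {        // sequential: cache-friendly
            sum += data[i];
        }
        long t1 = System.nanoTime();

        for (int start = 0; start < STRIDE; start++) {   // strided: cache-hostile
            for (int i = start; i < SIZE; i += STRIDE) {
                sum += data[i];
            }
        }
        long t2 = System.nanoTime();

        System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println(sum); // keep the JIT from discarding the loops
    }
}
```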

Synchronization is a major issue for the MM. With the current implementation we have a very safe environment to develop in: the lock covers the allocation completely, so there is little to no risk of multithreading bugs. But this is obviously detrimental to allocation performance. By isolating the critical section of an allocation down to a few instructions within the heap, which performs the actual allocation and returns an address at which the object can be instantiated, we can limit the amount of time the lock is held. But this still leaves a constant bottleneck. A more intuitive approach would be a design that eliminates the need for locking in the most common case. Again, as with locality of reference, it would seem that if some separation were provided, then the contention for locks would be reduced.
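
As a sketch of what "no locking in the common case" might look like, here is a hypothetical per-thread allocation buffer: each thread bump-allocates out of its own chunk without synchronization, and only touches a lock when its chunk runs out and a new one has to be claimed from the shared heap. The class and method names here are mine, not JNode's.

```java
// Hypothetical sketch of a thread-local bump-pointer allocator.
// The names and the backing 'GlobalHeap' are illustrative, not JNode's actual API.
public final class ThreadLocalAllocator {
    private static final int CHUNK_SIZE = 64 * 1024; // refill granularity (assumed)

    /** Stand-in for the shared heap; refills are the only synchronized path. */
    interface GlobalHeap {
        long claimChunk(int size); // returns the start address of a fresh chunk
    }

    private final GlobalHeap heap;
    private long cursor;  // next free address in the current chunk
    private long limit;   // end of the current chunk

    ThreadLocalAllocator(GlobalHeap heap) {
        this.heap = heap;
    }

    /** Common case: a simple bump of the cursor, no lock taken. */
    long allocate(int size) {
        long addr = cursor;
        if (addr + size <= limit) {
            cursor = addr + size;
            return addr;
        }
        return allocateSlow(size);
    }

    /** Rare case: the chunk is exhausted, so claim a new one from the shared heap. */
    private long allocateSlow(int size) {
        synchronized (heap) {                // contention only happens here
            int chunkSize = Math.max(size, CHUNK_SIZE);
            long chunk = heap.claimChunk(chunkSize);
            cursor = chunk + size;
            limit = chunk + chunkSize;
            return chunk;
        }
    }
}
```

A nice side effect of giving each thread its own allocator instance is that a thread's objects end up packed together in its chunk, which also speaks to point 1 above.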

All of this leads up to my final point, and that is proper data separation. In modern operating systems, every application that is executed exists in its own little world. Each process has its own heap for its information, and one process cannot directly access the information within another process. Almost every application is designed this way. At the application level this would seem simple: any thread that is allocating within a given application allocates into the space provided by its own heap. In theory this would help with points 1 & 2, but what about information that is not allocated by an application yet is still used by individual applications? While the locking bottleneck would be considerably smaller for allocations performed by applications, you would still have the same bottleneck in the OS threads, as they would all be allocating into a single 'System' heap. Having thought about it for a while, it seemed that a partial solution would be to make sure that threads using the system heap are conscious of their memory usage and do not create excessive amounts of garbage by constantly allocating new objects, but instead make use of factory-type implementations to re-use as much as possible.
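
As a rough illustration of that kind of re-use, here is a minimal object-pool style factory (my own sketch, not anything that exists in JNode): a system thread borrows a buffer, uses it, and hands it back, instead of allocating a fresh one each time and leaving garbage behind in the system heap.

```java
import java.util.ArrayDeque;

// Minimal sketch of a re-use oriented factory for a system-level thread.
// Purely illustrative; it is not how JNode's services are written today.
public final class BufferFactory {
    private final ArrayDeque<byte[]> pool = new ArrayDeque<>();
    private final int bufferSize;

    public BufferFactory(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Hand out a recycled buffer when possible; allocate only when the pool is empty. */
    public synchronized byte[] acquire() {
        byte[] buf = pool.poll();
        return (buf != null) ? buf : new byte[bufferSize];
    }

    /** Return a buffer so the next caller does not create fresh garbage. */
    public synchronized void release(byte[] buf) {
        pool.push(buf);
    }
}
```

A system thread wrapping its work in acquire()/release() pairs keeps the number of live objects in the system heap roughly constant instead of growing it until the next collection.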

In JNode we have the Isolate to help with application-level data separation. Any thread that attempts to allocate memory while running within an Isolate will allocate to an address within that Isolate's heap. All other allocations go to the system heap. Although this does provide a cleaner separation, we need to keep in mind that not all data an application uses will have been allocated by it; some of it will be created by the OS. There may also be data allocated by an application that the OS needs to access. This matters more from a garbage collection point of view. Not only does this help improve locality of reference by keeping application-specific data less fragmented, it also reduces lock contention. Synchronization would be required when heaps grow, acquiring blocks of memory from the MemoryBlockManager. Within an Isolate we still have the issue of multi-thread multi-object allocation, though in general it is not as big an issue as it is at the global, operating-system level.
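
In pseudo-Java, the routing described above might look something like the following. The names here (Heap, IsolateContext and its accessor) are placeholders I am using for illustration; they are not JNode's real API.

```java
// Illustrative routing of an allocation request to the right heap.
// Heap, IsolateContext and their methods are placeholder names, not JNode's API.
public final class AllocationRouter {
    interface Heap {
        long allocate(int size);
    }

    interface IsolateContext {
        Heap heap();
    }

    private final Heap systemHeap;

    AllocationRouter(Heap systemHeap) {
        this.systemHeap = systemHeap;
    }

    /**
     * A thread running inside an Isolate allocates into that Isolate's heap;
     * everything else falls through to the system heap.
     */
    long allocate(IsolateContext currentIsolate, int size) {
        Heap target = (currentIsolate != null) ? currentIsolate.heap() : systemHeap;
        return target.allocate(size);
    }
}
```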

The last observation I would like to talk about is the FixedSizeHeap (FSH). As explained previously, the FSH was an attempt to separate the large numbers of small objects from the small number of large objects. The idea was to reduce the fragmentation caused when garbage collection was performed and you ended up with small fragments of free space only large enough to fit small objects into. Over time this would, in theory, lead to degrading allocation performance. What I came to notice when analyzing the data was that many of these small objects were created and referenced from within collections, most commonly reference arrays. From this point of view it would seem that such a structure helps the locality of reference of object collections. But in another, perhaps less common or obvious way, we were potentially hurting locality of reference: if a group of small objects of various sizes was created, there was a good chance their addresses would end up very far apart. It became apparent that while the structure of the FSH was quite useful for objects referenced from collections, it had a negative impact on objects that fit within the size requirements but were not within collections.

A common case we could take advantage of is the creation of a reference array that is populated with references to newly created objects. When the reference array is created, a portion of memory would also be 'soft' allocated in anticipation that it will soon be populated with objects referenced from the array. In this sense, a reference array would behave, in memory, more like a primitive array, whereas currently it behaves more like a linked list, in that its data can be spread out over a much larger address space than necessary. Of course, with the reduction in multi-thread multi-object allocation fragmentation that would come from using Isolates and per-Isolate heaps, the benefits of this structure would be reduced. In a single-threaded Isolate, or any Isolate where the bulk of the allocation is done by a single thread, the data will be as sequential as the application developer designed it to be. Perhaps with a better overall design there will be little need for the FSH structure except in some possible edge cases.
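
To make the 'soft' allocation idea slightly more concrete, here is a hypothetical sketch of a heap call that reserves a contiguous region right behind a new reference array for the objects that will fill it. Everything here (the ReservingHeap name, its methods, the assumed 8-byte reference size) is invented for illustration and is not how the current heap works.

```java
// Hypothetical sketch of 'soft' allocation behind a reference array.
// ReservingHeap and its methods are invented names used only for illustration.
public final class SoftAllocationSketch {
    interface ReservingHeap {
        /** Allocate the array itself and reserve extra bytes directly after it. */
        long allocateArrayWithReservation(int arrayBytes, int reservedBytes);

        /** Try to place an object inside the reservation that follows the given array. */
        long allocateNear(long arrayAddr, int objectBytes);
    }

    /**
     * Allocate a reference array of 'length' slots and place its future elements
     * in the reserved space behind it, so the array and its elements stay contiguous.
     */
    static long[] allocateArrayAndElements(ReservingHeap heap, int length, int elementBytes) {
        int arrayBytes = length * 8;   // assumed 8-byte references
        long arrayAddr = heap.allocateArrayWithReservation(arrayBytes, length * elementBytes);

        long[] elementAddrs = new long[length];
        for (int i = 0; i < length; i++) {
            // Each element lands next to its siblings instead of wherever the
            // global free list happens to have a hole of the right size.
            elementAddrs[i] = heap.allocateNear(arrayAddr, elementBytes);
        }
        return elementAddrs;
    }
}
```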