A New Garbage Collection Algorithm

A while ago I had a suggestion for garbage collection that relied on the MMU of modern processors. EPR didn't like this, wanting a simpler and more generic solution. So this is a second attempt.


Deep breath, here goes....

Introduction


There are two goals for an OS Java GC:

  1. Efficiency: ideally less than 10% of execution time spent garbage collecting.
  2. Interactivity: it's no good if the system halts for 5 minutes to GC, preventing use of the system. This is obviously completely inappropriate for a desktop system.

I hope people agree with me so far. So what is the state of play with current algorithms? They mostly seem to be based on the generational GC principle. That is, only GC the memory allocated since the last GC, unless the system is out of memory, in which case collect the generation before that, and so on.

The problem with this approach is that it delays the inevitable. Eventually a full system GC is required, and this halts the system for a large amount of time. Also, a simple write barrier is required on pointer writes, so that the roots into the latest generation can be detected.

Having said this, generational GC is a good system that works in most environments: it's efficient, and modern VMs use it. In my opinion generational GC would work very well with an OS VM if the pause time for the occasional full memory GC could be mitigated. This is what my proposal concerns.

Overview


So let's interleave a slow-running full-system GC with normal program operation. The problem is that the directed connectivity graph of the program is constantly changing. Requiring too much co-operation between the mutator (the running threads) and the GC slows down the mutator. You might end up with a fine-grained GC algorithm with no pause times, but the whole system would run slowly.

I see the answer as a compromise. Break the memory into chunks. The GC can halt the entire system while it collects a chunk. The bigger the chunk, the bigger the pause time, but the more efficient the overall system. The advantage of this approach is that the integration of the mutator with the GC is very small; in fact, no larger than would be required by a traditional generational GC algorithm.

Trapping intra-block pointer writes


So, elaborating on the chunk idea, what is required is that we trap all pointer references between chunks. By doing this we have a set of all possible roots to a chunk. For efficiency's sake let's assume all chunks are the same size, which is a power of 2. There are no gaps between the chunks, and each chunk starts at byte n * chunksize. Location 0 in memory is a bad page, to trap null pointer exceptions. What we're essentially talking about is memory blocks.

It's possible to trap intra-block pointers with three instructions on every pointer write. A smart compiler can cut down how often even this check is done, using the observation that pointer copying local to an object can never trigger the case. This assumes that objects never cross block boundaries. There are exceptions to this, for instance large objects bigger than the block size, but these can be handled separately.

The code to trap intra-block pointer writes looks like the following:


xor sourcePointer, destPointer     ; sourcePointer is a scratch copy; only the differing bits remain
and sourcePointer, blockSizeMask   ; blockSizeMask = ~(blockSize - 1): keep only the block-number bits
jnz codeToHandleIntraBlockPointers ; non-zero means the write crosses a block boundary

As people can see, including the jump it's only three instructions on the x86 (I think!!!). This check only has to be triggered when a member pointer of an object is set, not for intermediate local variables.
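
For those who don't read x86, here is a minimal Java model of the same check, assuming 1 MiB blocks and treating pointers as raw long addresses. The names WriteBarrier, onPointerStore and recordCrossBlockPointer are illustrative, not JNode code:

// Hypothetical Java model of the write barrier; a real VM would emit this
// inline, operating on raw addresses rather than longs.
class WriteBarrier {
    static final long BLOCK_SIZE = 1L << 20;           // assumed 1 MiB blocks, a power of 2
    static final long BLOCK_MASK = ~(BLOCK_SIZE - 1);  // keeps only the block-number bits

    static void onPointerStore(long sourceAddress, long destAddress) {
        // The two addresses lie in different blocks exactly when their
        // block-number bits differ: the xor/and/jnz sequence above.
        if (((sourceAddress ^ destAddress) & BLOCK_MASK) != 0) {
            recordCrossBlockPointer(sourceAddress, destAddress);
        }
    }

    static void recordCrossBlockPointer(long sourceAddress, long destAddress) {
        // See "Storing intra-block pointer writes" below.
    }
}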

Storing intra-block pointer writes


Pointers that are detected as being intra-block need to be stored for later analysis. What this document proposes is to have a system-wide array of pointer lists with as many elements in it as there are blocks. The size of this array would be determined by the following equation:

size of array = amount of system memory / size of block

Each element in the array corresponds to a block in memory. Each array element contains a list of pointers pointing to elements held in the corresponding block.

The address of the source pointer is added to the linked list pointed to by the array element that corresponds to the block containing the destination object. The effect of this is that each block now has a set of addresses of pointers pointing into it. Of course there can be duplicates, but the critical thing is that this list is time-ordered. That's very important.
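
As a concrete illustration of the equation above, 256 MB of system memory with 1 MB blocks gives an array of 256 list heads. A minimal sketch of the structure follows; the class and method names are invented, not JNode APIs, and a real implementation would also need per-list synchronization between the mutators and the scanning thread:

import java.util.ArrayList;
import java.util.List;

// Illustrative remembered-set structure: one time-ordered list of source
// pointer addresses per block of memory.
class BlockRememberedSets {
    static final long BLOCK_SIZE = 1L << 20;   // assumed 1 MiB blocks, a power of 2
    final List<List<Long>> pointersInto;       // size of array = system memory / block size

    BlockRememberedSets(long systemMemoryBytes) {
        int blocks = (int) (systemMemoryBytes / BLOCK_SIZE);
        pointersInto = new ArrayList<>(blocks);
        for (int i = 0; i < blocks; i++) {
            pointersInto.add(new ArrayList<>());
        }
    }

    // Called by the write barrier: remember where the pointer write happened,
    // filed under the block that contains the destination object.
    void record(long sourceAddress, long destAddress) {
        int destBlock = (int) (destAddress / BLOCK_SIZE);
        pointersInto.get(destBlock).add(sourceAddress);  // appending keeps the list time-ordered
    }
}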

Now, the elements in these lists do not get modified by the mutator after they are created. This means that a thread running in parallel can scan these lists and do the following:

  • Remove obsolete references
  • Remove duplicates

This will trim the list down to the bare root set of a given block. The time taken to do the above is proportional to the size of the list, which is directly proportional to the number of intra-block pointer writes. Essentially what we're doing is delaying the processing of the lists so we can process a longer list in one go. This increases the chance of duplicates in the list and therefore can make the process a lot more efficient. We can also run the list-scanning algorithms on different threads (and therefore processors) and possibly schedule more of the work at system idle time.

But how are duplicates and obsolete references in the list detected?

Firstly, to detect duplicates. On modern architectures objects are generally aligned on machine-word boundaries. This means, for instance, that on a 32-bit architecture, where a word is 4 bytes, the bottom 2 bits of all pointers will be zero. These bits could be used as an information store. When scanning the list, the pointers can be marked. If a pointer is already marked, don't add it to the new list that's being built.

Secondly, what about obsolete references? This is simple: if the pointer now points into some other block, it's obsolete.
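
A sketch of the trimming pass, using a hash set in place of the low-bit marking described above, and a hypothetical readPointerAt standing in for a raw memory read:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Trims one block's incoming-pointer list down to a bare root set by
// dropping duplicates and obsolete entries.
class ListTrimmer {
    static List<Long> trim(List<Long> incoming, int block, long blockSize) {
        Set<Long> seen = new HashSet<>();  // stands in for marking the low bits of each entry
        List<Long> roots = new ArrayList<>();
        for (long sourceAddress : incoming) {
            long target = readPointerAt(sourceAddress);  // the pointer's current value
            if (target / blockSize != block) {
                continue;  // obsolete: the pointer now refers into some other block
            }
            if (seen.add(sourceAddress)) {
                roots.add(sourceAddress);  // first occurrence kept; later duplicates skipped
            }
        }
        return roots;
    }

    static long readPointerAt(long address) {
        return 0;  // placeholder: a real VM would read raw memory here
    }
}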

So the algorithm so far is this. Run a thread in parallel with the mutators to cut down the list of pointers into a certain block. This incoming pointer list for the block will keep growing, so we can say that as soon as a certain percentage of the list is processed, all the mutators are halted. The next step is to process the remaining part of the list for the block we are about to garbage collect. Contained in that list we should now have a valid set of roots for the block. We can garbage collect the block and move all the reachable objects to the new heap that's growing. Fix up the roots to point to the new object locations, and the old block is ready to be recycled.
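
Tying the pieces together, the per-block cycle might look like the following outline. Every name here is illustrative, and the stop-the-world and copying machinery are only stubs for parts the proposal has not yet pinned down:

import java.util.List;

// Illustrative outline of one block-collection cycle.
class BlockCollector {
    static final long BLOCK_SIZE = 1L << 20;  // assumed 1 MiB blocks

    void collectBlock(int block, List<Long> incoming) {
        // 1. A parallel thread trims most of the list while the mutators run.
        // 2. Once, say, 90% is processed, halt the mutators.
        stopTheWorld();
        // 3. Trim the remainder: what is left is the block's root set.
        List<Long> roots = ListTrimmer.trim(incoming, block, BLOCK_SIZE);
        // 4. Evacuate reachable objects into the growing new heap and
        //    fix up the roots to the new locations.
        copyReachableAndFixRoots(roots);
        // 5. The old block now holds no live objects and can be recycled.
        recycleBlock(block);
        resumeTheWorld();
    }

    void stopTheWorld() { }
    void resumeTheWorld() { }
    void copyReachableAndFixRoots(List<Long> roots) { }
    void recycleBlock(int block) { }
}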

The more observant readers will have asked about intra-block cycles. The technique I use is to have two categories of reached objects:

  1. strongly reached
  2. weakly reached

The idea of strongly reached objects is that there was a provable link to the object sometime during the last cycle. Weakly reached objects could either be junk or referenced. Strongly reached objects are copied out to a new heap. Weakly reached objects can either be left in place or copied to a weakly reached heap. When there are no references from the strongly reached heap to anywhere else, we know we can stop garbage collecting.

... more to come ...

A Suggestion

May I suggest porting the LZMA compression algorithm to JNode, compressing the files and then simply deactivating the code and placing it in a custom-made file shown as (obviously) the trash bin? And, when emptying, the code is destroyed entirely, and when restoring, the code is decompressed and reactivated?

I don't understand ...

I don't understand what you are saying. Does it have anything to do with garbage collection?

Take a look at the MC2 collector.

Here's a link to an article on the MC2 collector. It has a lot of characteristics that would make it good for JNode: incremental, low pause times, efficient space usage.

http://www.cs.umass.edu/~emery/pubs/04-15.pdf

There's also a 200-page book on MC2 listed on Amazon for ~$70.

http://www.amazon.com/High-performance-copying-garbage-collection-overhe...

Corrections ... and critique.

Where this proposal talks about "intra-block pointers" and "intra-block cycles", I think it should say "inter-block". The prefix "intra-" means within, whereas "inter-" means between.

In GC terminology, a process that traps intra-block (sic) writes is called a "write barrier". These may be implemented in software (as above), using VM support hardware, or as part of a physical (or virtual) machine's instruction set; e.g. in microcode. (Obviously, the last approach is not open to us in JNode.)

The technique described for recording the intra-block (sic) writes seems to be a form of remembered set [Ungar], albeit with a rather inefficient implementation (if I understand it correctly). A better approach might be to implement the remembered sets using a "sequential storage buffer" [Hudson & Diwan].
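
For contrast, a sequential store buffer is essentially a bump-pointer array that the write barrier appends to and the collector drains in bulk. A minimal sketch of the general shape (not the actual [Hudson & Diwan] implementation; all names are illustrative):

// Minimal sequential store buffer: the write barrier appends the address of
// each interesting pointer store; the collector drains entries in bulk.
class SequentialStoreBuffer {
    private final long[] slots;
    private int next = 0;

    SequentialStoreBuffer(int capacity) {
        slots = new long[capacity];
    }

    void record(long storedAtAddress) {
        if (next == slots.length) {
            drain();  // buffer full: hand the batch to the collector
        }
        slots[next++] = storedAtAddress;
    }

    void drain() {
        // A real collector would fold these addresses into its remembered sets.
        next = 0;
    }
}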

But the real worries with this proposal are:

  • the percentage of inter-block pointers (and hence the cost) is hard to predict,
  • the description is very light on the issues of synchronizing mutators and collectors,
  • the vitally important issue of inter-block cycles is not fully explained.

GC textbook

Can I suggest that anyone who is interested in garbage collection should purchase and read the definitive textbook on the subject:

"Garbage Collection: Algorithms for Automatic Dynamic Memory Management"
Richard Jones & Rafael Lins
John Wiley & Sons, 1996.
ISBN 0471941484.

GC by Jones: Is the Best

It is the best book I have ever read on garbage collection.

Mr. Richard Jones is known as the "Garbage Collection Guru" in the computing world.