Proposal: Garbage collection
Sorry if this post should doesnt belong here, I couldn't find a better place to put it, perhaps if we have a proposal section so people can post proposals to be debated.
The main content of this message has already been posted to the 'help wanted' section but I was ignored so I'll try again . What i'm detailing here is a more complete solution.
A while ago I had a suggestion for garbage collection that relied on the MMU units of modern processors. EPR didn't like this for want of a simple/generic solution. So this is a second attempt.
Deep breath, here goes....
There are two goals of a OS Java GC:
1. Efficiency ideally less than 10% of execution time spent garbage collecting.
2. Interactivity. No good if the system halts for 5 minutes GC and preventing use of the system. This is obviously completely inappropiate for a desktop system.
I hope people agree with me so far. So what is the state of play with current algorithms? They mostly seem to be based on the Generational GC principle. That is only GC the memory allocated since the last GC unless the system is out of memory, in which case collect the generation before that and so on.
The problem with this is it delays the inevitable. Eventually a full system GC is required. This halts the system for a large amount of time. Also a simple write barrier is required on pointer writes. This is so the roots to the lastest generation can be detected.
Having said this generational GC is a good system that works in most environments, it's efficient and Modern VMs use generational GC. In my opinion generational GC would work very well with a OS VM if the pause time for the occasional full memory GC could be mitigated. This is what I've been concentrating on.
So lets interleave a slow running full system GC with normal program operation. The problem is the directed connectivaty graph of the program is constantly changing. By requiring to much co-operation between the mutator (running threads) and the GC slows down the mutator. You might end up with a fine grained GC algorithm with no pause times but the whole system would run slowly.
I see the answer as a compromise. Break the memory into chunks. The GC can halt the entire system while it collects the chunk. The bigger the chunk the bigger the pause time but the more efficient the overall system. The advantage of this approach is the integration of the mutator with the GC is very small. In fact no larger than would be required with a traditional generational GC algorithm.
So elaborating on the chunk idea, what is required is that we trap all pointer references between chunks. by doing this we have a set of all possible roots to a chunk. For efficiencies sake lets assume all chunks are the same size and are a power of 2. There are no gaps between the chunks and the first starts at byte n * chunksize. Location 0 is memory is a bad page to trap null pointer exceptions. what we're essentially talking about is memory blocks.
Its possible to trap intra-block pointers with three instructions on every pointer write. A smart compiler can cut down the times even this check is done by using the notion that pointer copying local to an object can never trigger the case. This assumes that objects never cross block boundaries. There are exceptions for this, for instance large objects bigger than the blocksize but these can be handled separately.
The code to trap intra block pointer writes looks like the following:
xor sourcePointer, destPointer
or result, blockSizeMinus1
jnz codeToHandleIntraBlockPointers
As people can see, including the jump its only three instructions on the x86 (I think!!!)
This only has to be triggered when a member pointer of one class is set. Not for intermediate local variables.
But is the pointers are in different blocks what does the code thats been jumped to do? My suggestion is to have a system wide array of pointers with as many elements in it as blocks. This would be amount of system memory / blocksize. Each one of these blocks points to say the last element in a linked list. We add the address of the source pointer to the block element that contains the destination pointer. The effect of this is each block now has a set of addresses of pointers pointing into it. Of course there can be duplicates, but the critical thing is this list is time ordered. Thats very important.
Now the elements in these lists do not get modified by the mutator after they are created. This means that a thread running in parrallel can scan these lists and do the following:
1. Remove obsolete references
2. Remove duplicates
This will trim the list down to the bare root set of a given block. The time taken to do the above is proportional to the size of the list which is directly proportional to the number of intra-block pointer writes. Essentially what we're doing is delaying the processing of the lists so we can process a longer list in one go. This increases the change of duplicates in the list and therefore can make the process a lot more efficient. We can also run the list scanning algorithms on different threads therefore processors and possibly schedule more at system idle time.
But how are duplicates in the list detected and obsolete references.
Firstly to detect duplicates. With modern architectures we generally cant locate an object across a machine word. This means for instance that on a 32 bit architecture which is 4 bytes the bottom 3 bits of all pointers will be zero. This could be used as an information store. When scanning the list the pointers can be marked. If a pointer is allready marked dont add it to the new list thats building.
Secondly what about obsolete references? This is simple, if the pointer points to some other block now, its obsolete.
So the algorithm so far is this. Run a paralled thread to the mutators to cut down the pointers into a certain block. This incoming pointer list for the block will keep growing so we can say as soon as a certain percentage of the list is processed all the mutators are halted. The next step is to process the remained part of the list for the block we are just about to garbage collect. Now contained in that list we should have a valid set of roots for the block. We can garbage collect the block and move all the reachable objects to the new heap thats growing. Fix up the roots to point to the new object locations and the old block is now ready to be recycled.
the more observant readers will have asked about intra-block cycles... but more to come....
- Login to post comments
Update
As Ewout kindly pointed out we have a proposals section. I've put the above message there and that will be its home, the URL is: http://jnode.sourceforge.net/portal/book/view/261