Reproducible GC failure

Project:JNode Core
Component:Code
Category:bug report
Priority:minor
Assigned:Peter
Status:closed
Description

I committed the class 'org.jnode.test.core.HeapHammerTest' earlier today. To run it, boot JNode with all plugins, then run the following:

kdb on

org.jnode.test.core.HeapHammerTest

About 3 times out of 4, the test will cause the GC thread output an NPE stacktrace to the console error stream (red). The system then locks up.

I've done some preliminary digging around and it appears to me that the NPE is being thrown in the following line in org.jnode.vm.memmgr.GCMarkVisitor.mark()

vmClass = VmMagic.getObjectType(object);

Now this is a magic method that gets special treatment by the native compiler. But I cannot figure out how it is throwing an NPE ... or how NPEs are thrown in general.

Would anyone care to go on a bug hunt? I'm hoping that this bug will be the one that causes grief for 'startawt' and 'javac'.

#1

I'm attaching my hacked version of GCMarkVisitor for folks to play with. Note this should NOT be committed.

#1

Assigned to:Anonymous» Stephen Crawley

I'm starting to look at this.

#2

I'm starting to make some progress I think. We've known for some time that the problem is caused by unexpected NPEs in the mark phase, and we suspect it was caused by object corruption. Yesterday I was able to establish that the immediate cause of the NPEs was object pointers to things with a flag word of 00000002 and TIB word of 00000000. The TIB was causing the NPEs.

Today, I transplanted the verify visitor method "verifyChild" into the mark visitor, and ran the code a few times. The result is repeatable GC crashes where a child object pointer is not null, but not an object. In each case the parent object has class "org.apache.log4j.ConsoleAppender". I'm now trying to figure out what is specially about this class and/or the way it is being used.

#3

It might be useful to know which static or instance field of the ConsoleAppender points to an invalid object.

#4

Frustration. The GC failures stopped happening when I included a copy of the log4j source code in the build path. I backed that out, but now I cannot even get the failures to occur in a clean sandbox.

#5

Just after end of boot, I am launching a 'gc' and get the error given in attached screenshot.
It sounds like to be a similar bug ...

AttachmentSize
gc_bug.png46.05 KB

#7

It could well be a similar bug, or even the same one. But your example is hard to diagnose. The parent object is an Object[] which gives us no clues about where the bad object may be coming from.

Do you think there is any chance that I could reproduce your example from a clean checkout? (No uncommitted code, no build tweaks?)

#8

I have made some tests too.

In all my tests it was always class "org.apache.log4j.ConsoleAppender" that has missing childs.

What I observed is the following: If I run the HeapHammerTest at one point it will trigger a gc run. This will sweep instances of the classes "org.apache.log4j.PatternLayout", "org.apache.log4j.helpers.OnlyOnceErrorHandler", "org.apache.log4j.helpers.QuietWriter". These instances are the ones referenced in the ConsoleAppender instance in question.

Why are these three instances treated as garbage and freed or why isn't the ConsoleAppender?
I do not have an answer to that (yet) but I made some other observations:

If the gc gets called a second time it does _not_ always fail during preverify but sometimes during the mark phase (for the same reason as it normaly fails in (pre)verify). Imho it should fail in verify right after sweep or in the next preverify phase but never in mark.

I saw that the mark stack overflows three times (even right after jnode bootup) and I'm not quite sure but there might be a bug in the handling of mark stack overflows. However I increased the mark stack size until all objects fit into it and even than the 3 named classes above got swept. So even if there's a bug in overflow handling it's not the only one.

Btw, without the verify stuff JNode might still run some time without a crash as the object only gets marked as free in the heap but the content will not be erased until a new object allocation in the same slot. I.e. you might survive more than two gc cycles without the extended verification.

#9

One thing that is different about the classes is that they come from a (rather old) external JAR file. It is likely that the classfile version numbers are pre-Java 1.6.

#10

Priority:critical» minor
Assigned to:Stephen Crawley» Peter

I reassign that bug to me and set it to minor as it is not a real bug (see below). My local version is already working fine. I can run the HeapHammerTest several times, startawt, run HeapHammerTest in the graphical console,...

The bug itself is very simple. It is not in the garbage collector itself but in the Verifier. The interessting thing is that ConsoleAppender inherits from AppenderSkeleton which has a finalize method. What happens? Let's only look at the mark-sweep-cleanup cycle: Since after each shell command at least one instance of ConsoleAppender gets garbage, ConsoleAppender and all its childs will not get marked. Sweep sees that and marks the objects garbage. Cleanup than sees the garbage and frees it from the heap and sets the object color back to white (i.e. free/not-visited). But not for objects that need Finalization (i.e. ConsoleAppender). They get specially marked and at some point the FinalizerThread will call finalize, set a special bit in the header and the _next_ cleanup cycle will remove the object.

The HeapHammerTest allocated very much memory in a short time and ConsoleAppender and its childs were marked garbage in the first cycle. In the second cycle the Verifier walks the Heap and ConsoleAppender is still on the Heap as it will be swept in the next clean phase. The Verifier checks for the object color, but the finalize bit is not part of the color! So the Verifier tries to verify objects that are already garbage and badly fails on the childs as they are already removed from the heap in the first gc-cycle.

The simple fix is:

+ if (VmMagic.isFinalized(object)) {
+ return true;
+ }

in the beginning of Verifier's visit method.

While debuging the code I found some other parts of the code that are very hard to read and understand. Mainly due to lack of documentation. Also the heap-verification could be improved a bit. I will make a clean patch (without my local debug stuff in it) in a few days for review.

In the meantime feel free to fix the three lines above, I will not have time before monday to commit/send a patch.

#11

Well done! Does that also solve this bug: http://www.jnode.org/node/1402 ?

Kind regards
Jacob Kofod

#12

Where is that Verifier class ? I can't find it in jnode sources. Is that part of openjdk ?

#13

I think Peter is talking about the class:

    org.jnode.vm.memmgr.def.GCVerifyVisitor

#14

Status:active» fixed

I commited a fix for the Verifier bug. And I commited another fix for the MarkVisitor. The problem is the following:

Values on the stack of each Thread are considered to be in the rootset (e.g. a newly created object in a local variable). Since we don't have a map of the stack we have to treat every 32bit value on the stack as a potential reference. As you can see in GCMarkVisitor#visit inside the condition

} else if (rootSet || (gcColor == GC_GREY)) {

yellow (i.e. garbage) and white (not visited yet) objects get treated the same. This should not happen. By definition the rootset can not be garbage so JNode should die at that point. But with the above said we simply have to ignore that for the moment imo since any Integer on the stack might point to a garbage object in the heap.
I tested it heavily, different memsizes, different command orders and it seems stable.

Stephen, don't get mad at me because of further spamming the console Smiling But as it's still a goal to make the debug stuff optional for the console too (like we have it for the serial port) I hope it's ok for the moment.

Jacob, right. I tested and I could not see a difference between running the gc before startawt or not. I again tried different memory setups, one problem remains: Memory fragmentation. For setups <= 400MB the heap is so fragmented that a 2MB array for an offscreen framebuffer can not be allocated anymore.

#15

Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.