" Lock not owned by us " bug
Hi
I want to discuss about the "famous" bug "Lock not owned by us".
This bug was reported for the first time in this topic Bug
The last days I've investigated this issue and i want to share with you my findings for a few reasons :
- maybe i am on the wrong path
- maybe someone else will see the solution
- i want to solve this issue because it is annoying
I want to review the bug:
We have this block
synchronized (this) {
...
// call something that throws an exception
...
} // end synchronized
If an exception occurs in the middle of the block we have a panic error  like this:
 Lock not owned by us:java.util.Date00000001 12AA05D8 12AA05D0 00000000 00043E00 7FFFFE00 Debug stacktrace: org.jnode.vm.VmStackReader::debugStackTrace
, org.jnode.vm.VmStackReader::debugStackTrace
, org.jnode.shell.command.DateCommand::test12
, org.jnode.shell.command.DateCommand::
, org.jnode.shell.command.DateCommand::main
, org.jnode.vm.VmReflection::invoke
, java.lang.reflect.Method::invoke
, gnu.java.security.action.InvokeAction::run
, org.jnode.vm.VmAccessController::doPrivileged
, java.security.AccessController::doPrivileged
, org.jnode.shell.ThreadCommandInvoker$ThreadCommandRunner::run
, java.lang.Thread::run
, org.jnode.shell.CommandThread::run
, org.jnode.vm.scheduler.VmThread::runThread             
So lets start with the beginning.
This is my class that I've used for test ( the date command )
public class DateCommand {
public static Help.Info HELP_INFO = new Help.Info("date", "prints the current date");
    /**
     * Sets up any instance variables and processes the command arguments.
     *
     * @param args
     *                The arguments to work with.
     * @throws Exception
     */
    public DateCommand(String[] args) throws Exception {
        System.out.println(new Date());
test12();
}
public void test12() throws Exception {
Date lock = new Date();
        synchronized (lock) {
            test123();
}
}
public void test123() throws Exception {
throw new Exception("Test");
}
    /**
     * Displays the system date
     *
     * @param args
     *                No arguments.
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        new DateCommand(args);
    }
}
This error appears in the class MonitorManager method  public static void monitorExit(Object object)
This method basically removes the lock of the object .(It does a lot of things : decrement the counter and so on).
The lock of an object is kept in the header of the object. In this header the VM keeps the ID of the thread and the counter. Every time when monitorexist is called the counter is decrement. When the counter is 0 the ID of the thread is removed . This error appears because this method is called twice for my object . On the first call the ID of the thread is removed from the object header . When this method is called for the second time the condition that verifies if the ID of correct thread is in the header object fail and oopps error.
The method monitorexist is called when virtual machine reads a byetcode called monitorexit.
The parameter of the method are from the top of the stack.
Now let's look at bytecode of my method test12:
public void test12()   throws java.lang.Exception;
  Code:
   0:   new     #38; //class java/util/Date
   3:   dup
   4:   invokespecial   #40; //Method java/util/Date."":()V
   7:   astore_1
   8:   aload_1
   9:   dup
   10:  astore_2
   11:  monitorenter
   12:  aload_0
   13:  invokevirtual   #54; //Method test123:()V
   16:  aload_2
   17:  monitorexit
   18:  goto    24
   21:  aload_2
   22:  monitorexit
   23:  athrow
   24:  return
  Exception table:
   from   to  target type
    12    18    21   any
    21    23    21   any
As you see we have 2 monitorexit in the bytecode. When i saw this for the first time i told myself: "This is the error : 2 monitorexist bytecode 2 calling of the method monitExit". But i was wrong . After reading carefully this code i realized that in a normal situation only one monitorexit bytecode is executed ( which is normal ).
At line 13 the method test123 is called. If no exception occurs the VM pushes the lock(or reference ??) on the stack and calls monitorexit (line 16,17). And then go to return. But if an exception occurs the VM will look in the exception table and tries to find a match for this exception and finds it (the fist line in exception table )12    18    21   any
So then it will execute the lines 21,22 which is a monitor exit. And it throws the exception (line 23) to the next method ( at least this is what i think). So this block from 21-23 is put by the compiler to be sure that when an exception occurs the monitor is exited. So in a normal situation only one monitorexit must be called. 
Now what's happening in our code:
When the exception occurs at line 13 the VM tries to find a match and it finds it in the table exception (the fist line). Then go to line 21 and execute monitorexist and then throw the exception (line 23). Now it tries to find again a match in the same exception table of the same method and it finds it and go again to line 21 and go to monitor exit and => panic error. I've tracked this behavior in the method VMSystem.findThrowableHandler.
This is what i've found until now. Please shoot if I'm wrong. If you see the solution, I'll give you a beer.
I will continue my investigation and let you know.
Best regards
Andrei Dore
- Login to post comments
 
  
 
 
 

Exception table
After a discussion with Peda i want to share some conclusion
What it is very strange in the bytecode is the second entry table in exception table :
Exception table:
from to target type
12 18 21 any
21 23 21 any
The second seems to be a little bit recursive :
But after some discussion with Peda and some googling I've found 2 links :
Link 1
Link 2
The second link explain why the second entry it is so weird
So it seems that the problem how it is execute the last athrow in the byte code.
Let's we look the spec.
Spec
Method void onlyMe(Foo)
0 aload_1 // Push f
1 astore_2 // Store it in local variable 2
2 aload_2 // Push local variable 2 (f)
3 monitorenter // Enter the monitor associated with f
4 aload_0 // Holding the monitor, pass this and...
5 invokevirtual #5 // ...call Example.doSomething()V
8 aload_2 // Push local variable 2 (f)
9 monitorexit // Exit the monitor associated with f
10 return // Return normally
11 aload_2 // In case of any throw, end up here
12 monitorexit // Be sure to exit monitor...
13 athrow // ...then rethrow the value to the invoker
Exception table:
From To Target Type
4 8 11 any
(It is strange why in this example we don't have the second entry in exception table )
The comment at the line 13 is : // ...then rethrow the value to the invoker
So the problems seems to be that we don;t "interpret" correct the last athrow: we must throw the exception to the invoker and not to try to find the same correct exception table . But we must find this in spec
Thanks Peda for time to discuss about this bug.
Best regards
Andrei Dore
Exactly with the "synchronized" keyword
I seen this Panic due the current workings with the javax.usb many times. I also seen whenever i used the synchronized keyword for doing some works, this panic is happening.
Thanks for your investigation.Keep it up.
Thanks
Tanmoy
Good Point
Roughly this is the point where I got with my investigation last time. I also observed that compiling a code with ecj and javac will produce silightly difefrent bytcodes and while one isn't working the other is working, though both version are valid Java bytecodes. IMHO this should be the right path for finding the bug which should be lying in the bytecode to native compiler. One point is to verify how the basic blocks are indentified in the bytecode and if the bytecode offsets for the basic blocks are correct. If this is OK then the emiting of the native code is the next place to look.
Testing the same code in Jikes - RVM
Hi
I've tried to see how the guys from Jikes RVM handle this problem .
First I've tried to find in the code something special that treats this special case . The code is very similar with jnode code and the hunting was very easy . But unfortunately i didn't find a special treatment for this case. (Maybe a i miss something )
Then I've tried to debug to see what's happening when the RVM VM run my code. When test123 method throws the exception in test12 method the VM find the "catch block" in the table exception ( it was the first entry ) .And when the athrow is executed VM tries again to find a catch in the table exception but it doesn't find because ( this is the strange part ) the line that throws the exception is not in the table exception . If i look at the bytecode i said that it is a little impossible.
public void test12() throws java.lang.Exception;
Code:
0: new #2; //class java/lang/Object
3: dup
4: invokespecial #1; //Method java/lang/Object."":()V
7: astore_1
8: aload_1
9: dup
10: astore_2
11: monitorenter
12: aload_0
13: invokevirtual #3; //Method test123:()V
16: aload_2
17: monitorexit
18: goto 26
21: astore_3
22: aload_2
23: monitorexit
24: aload_3
25: athrow
26: return
Exception table:
from to target type
12 18 21 any
21 24 21 any
That's my finding so far
The hunting continues
Best regards
Andrei Dore
Any progress on this?
Have you made any more progress in the bug hunt?
No
This week i was a little busy with another stuff. Today i will try again to hunt the beast
Best regards
Andrei Dore
Jikes RVM
I will try again to look how Jikes RVM handle this situation ( maybe i miss something last time ) . If somebody has some skills in Jikes RVM i appreciate a new pair of eyes
Best regards
Andrei Dore
Eclipse compiler and Sun compiler
I've test both and it seems that the Sun compiler generate the good exception tables and it works in JNode
So if you want for moment to work be sure that you compile with Sun compiler
Best regards
Andrei Dore
Your workaround seems to work for me
My cursory testing is telling me that your workaround does indeed get rid of the "Lock not owned by us" kernel panics.
For the record, I used the "build.sh" script to do a clean followed by a build. Then I booted the JNode image and ran the following:
dhcp ethCOMPLETE
cat -u ftp://ftp.gnu.org/README
Previously the "cat" command was giving me a kernel panic. Now it is giving me stack traces for Java exceptions ... but no kernel panic.
Saga continues
Hi
Yesterday was a full day for this bug
I write again the example:
0: new #38; //class java/util/Date
3: dup
4: invokespecial #40; //Method java/util/Date."":()V
7: astore_1
8: aload_1
9: dup
10: astore_2
11: monitorenter
12: aload_0
13: invokevirtual #54; //Method test123:()V
16: aload_2
17: monitorexit
18: goto 24
21: aload_2
22: monitorexit
23: athrow
24: return
Exception table:
from to target type
12 18 21 any
21 23 21 any
QaDes come with good observation . HE said that the interval in a exception entry in the exception table it is something like this [start_pc,end_pc). With this observation now the exception table make sense. Peda also find this in spec . Then we verify if our code treat that interval in this approach . Unfortunately the code was correct.
In the VMSystem.findThrowableHandler it is a line
match = ceh.isInScope(address);
This line verify this interval . The code it is correct so in this situation the argument must be wrong . The end_pc address and address of the line that throw the exception could be wrong .
Peda make a interesting experiment and he discover if we insert some dummy opcode between monitorexit and athrow it works ( no panic)
We track this addresses(end_pc , athrow) ( after the bytecode is transform in native code) and we discover that the difference it is 1 byte . So the end_pc address it is higher with 1 byte than athrow address. This is not normal situation .In a normal situation this address must be equal ( as we see in the exception table).
We make assumption that the athrow address it is wrong. We find that VMSystem.findThrowableHandler method it is called in an asm code : vm.asm file
vm_athrow_notrace_pop_eip:
pop ADX ; return address
lea ADX,[ADX-5] ; The call to this method is a 5 byte instruction
push AAX ; save exception
push ADX ; save address
; Setup call to SoftByteCodes.findThrowableHandler
push AAX ; exception
push ABP ; frame
push ADX ; address
mov AAX, vm_findThrowableHandler
INVOKE_JAVA_METHOD
; eax now contains the handler address of the exception, move it to ebx
mov ABX,AAX
pop ADX ; restore address
pop AAX ; restore exception
We imagine that this code must be the root of evil.
For fun we replace the 5 from lea instruction with 4 and 6 . With 6 doesn't work but with 4 works (no panic error) ( I must also verify the address to see what happening )
 (no panic error) ( I must also verify the address to see what happening ) 
This is it for now .
Team of this bug : Qades, Peda and me
Best regards
Andrei Dore
To give a bit more
To give a bit more information about the asm code: The ByteCodeVisitor uses so called "softbytecode" to compile an athrow statement. It's basically a setup of moving the Exception-ref to EAX (i.e. first register) and than does a jumptable call into the asm code (vm.asm) as Andrei wrote above.
No the original intention as I understood it: The "call" instruction will push EIP onto the stack (It is the address of the opcode directly following the "call" instruction). So "pop ADX" will take that EIP and put it in the ADX register. "lea ADX,[ADX-5]" basically is a decrementation of ADX by 5, i.e. ADX = ADX - 5. Watching the comment I would suggest that this code removes 5 bytes what is supposed to be the size of the "call" opcode. So ADX should point to the beginning of the athrow statement.
If I write some test code to see the asm output, I can see, that the call instruction occupies 6 bytes, not 5, so Andrei tested a JNode version that does "lea ADX,[ADX-6]". This version simply dies on bootup.
UPDATE: That is not correct, it behave the same incorrect way as the "5" version.
Like Andrei said, we did some other tests, with manually created bytecode. I will only take some lines of the examples to show the interresting stuff:
...
21: aload_2
22: monitorexit
23: athrow
24: return
Exception table:
from to target type
21 23 21 any
This is code as compiled by eclipse. ECJ tends to start with 'aload'ing the exception, monitorexit and rethrow the exception. I.e. the "to" address in the exception handler matches the bytecode number of the athrow statement. In contrast to that, SUN tends to aload the exception directly before it is rethrown, that looks like that:
...
21: monitorexit
22: aload_2
23: athrow
24: return
Exception table:
from to target type
21 22 21 any
So the handler also only covers the monitorexit for both, sun and ecj. But the difference is the aload right before the athrow, which moves the athrow to a higher place as before. This way we gain enougth extra-space before the athrow, i.e. the address is larger and thus it is not in the scope of the handler.
We did the "other way round" too and tested the "from" address handling, something like that:
21: monitorenter
22: ...
23: athrow
24: ...
25: monitorexit...
Exception table:
from to target type
23 2x 25 any
So the athrow statement should be covered by the handler. But it doesn't get handled in JNode.
We verified the translation from "from" and "to" address in bytecode format to the native code addresses, and that is ok. The code should be absolutly valid.
So in my opinion the bug has to be in the calculation of the address, i.e. the "call" and "pop ADX" stuff. Perhaps someone has an other idea when looking at the code. Like I said, I'd expect an value of 6 if anything at all. Because omitting the "lea" call should work too in my understanding, but it didn't in the tests. And the version with "lea ADX,[ADX-4]" even passed the manual testcases. So even for ecj compiled stuff Andrei didn't see a panic.
What does that mean? Any asm-experts to explain the code?
Ok, the problem is mainly
Ok, the problem is mainly solved, and I'll keep you updated what happened:
First of all, the assembler code expects the EIP to be on top of the stack and the exception reference to be in EAX.
If we look at the ByteCodeVisitor we see in visit_athrow, that the code checks, if the exception is in EAX. In most cases I tested, the exception is on the stack, and thus that code writes a "pop eax" which occupies 1 byte in the native stream. The next instruction is the call it self, where I did an mistake yesterday. The "call" used in exception handling is only 3 bytes long. Thus 1+3=4 bytes.
So in most cases the athrow statement gets translated to 4bytes mashine code, but, at least theroetically 3bytes might be enough (to name it: in cases where the exception allready has been in EAX).
The address in ADX should in the end be an address within the athrow statement (the original intention was to use the instruction beginning). ADX-5 is too much, since I showed above that in most cases we have 4 bytes. That's why the current fix works in most situations. But I'm almost sure that l1a might output code for athrow with only 3 bytes, which will lead us to the same problem as before. So with ADX-3 we should match any case. Which leads me to the idea of simply taking ADX-1, then we're on the save side for sure.
I hope this explanation is understandable
and thanks andrei for the good work!
The checked-in fix works
The checked-in fix works for me
Suggestions
(I haven't been following the details of this bug hunt, but ...)
You could try adding some debug code to see what address is actually being captured in ADX, and whether that address really is pointing to the start of a 5 byte call instruction. Also check that the Java method being dispatched to is expecting the address of the call instruction, and not the address of (say) the instruction before or after it.
HTH.
The Java method being
The Java method being dispatched is expecting the address, we're sure about that. You can easily verify that in VmSystem#findThrowableHandler, which gets called by the asm code.
About ADX, I don't know much about assembly, so I wouldn't be able to write that testcode in asm, but Andrei debuged the address as passed to the java code. The problem is, we can't verify the value because I/we have no clue how to get the _real_ address of the instruction.
Part of my suggestion was ...
Part of my suggestion was to dereference the value in ADX to see if it points to the instruction you expect it to ...