" Lock not owned by us " bug

Hi

I want to discuss about the "famous" bug "Lock not owned by us".

This bug was reported for the first time in this topic Bug

The last days I've investigated this issue and i want to share with you my findings for a few reasons :
- maybe i am on the wrong path
- maybe someone else will see the solution
- i want to solve this issue because it is annoying

I want to review the bug:

We have this block

synchronized (this) {
...
// call something that throws an exception
...
} // end synchronized

If an exception occurs in the middle of the block we have a panic error like this:
Lock not owned by us:java.util.Date00000001 12AA05D8 12AA05D0 00000000 00043E00 7FFFFE00 Debug stacktrace: org.jnode.vm.VmStackReader::debugStackTrace
, org.jnode.vm.VmStackReader::debugStackTrace
, org.jnode.shell.command.DateCommand::test12
, org.jnode.shell.command.DateCommand::
, org.jnode.shell.command.DateCommand::main
, org.jnode.vm.VmReflection::invoke
, java.lang.reflect.Method::invoke
, gnu.java.security.action.InvokeAction::run
, org.jnode.vm.VmAccessController::doPrivileged
, java.security.AccessController::doPrivileged
, org.jnode.shell.ThreadCommandInvoker$ThreadCommandRunner::run
, java.lang.Thread::run
, org.jnode.shell.CommandThread::run
, org.jnode.vm.scheduler.VmThread::runThread

So lets start with the beginning.

This is my class that I've used for test ( the date command )

public class DateCommand {

public static Help.Info HELP_INFO = new Help.Info("date", "prints the current date");

/**
* Sets up any instance variables and processes the command arguments.
*
* @param args
* The arguments to work with.
* @throws Exception
*/
public DateCommand(String[] args) throws Exception {
System.out.println(new Date());

test12();

}

public void test12() throws Exception {

Date lock = new Date();

synchronized (lock) {
test123();

}

}

public void test123() throws Exception {

throw new Exception("Test");

}

/**
* Displays the system date
*
* @param args
* No arguments.
* @throws Exception
*/
public static void main(String[] args) throws Exception {

new DateCommand(args);
}

}

This error appears in the class MonitorManager method public static void monitorExit(Object object)
This method basically removes the lock of the object .(It does a lot of things : decrement the counter and so on).

The lock of an object is kept in the header of the object. In this header the VM keeps the ID of the thread and the counter. Every time when monitorexist is called the counter is decrement. When the counter is 0 the ID of the thread is removed . This error appears because this method is called twice for my object . On the first call the ID of the thread is removed from the object header . When this method is called for the second time the condition that verifies if the ID of correct thread is in the header object fail and oopps error.

The method monitorexist is called when virtual machine reads a byetcode called monitorexit.
The parameter of the method are from the top of the stack.

Now let's look at bytecode of my method test12:

public void test12() throws java.lang.Exception;
Code:
0: new #38; //class java/util/Date
3: dup
4: invokespecial #40; //Method java/util/Date."":()V
7: astore_1
8: aload_1
9: dup
10: astore_2
11: monitorenter
12: aload_0
13: invokevirtual #54; //Method test123:()V
16: aload_2
17: monitorexit
18: goto 24
21: aload_2
22: monitorexit
23: athrow
24: return
Exception table:
from to target type
12 18 21 any
21 23 21 any

As you see we have 2 monitorexit in the bytecode. When i saw this for the first time i told myself: "This is the error : 2 monitorexist bytecode 2 calling of the method monitExit". But i was wrong . After reading carefully this code i realized that in a normal situation only one monitorexit bytecode is executed ( which is normal ).
At line 13 the method test123 is called. If no exception occurs the VM pushes the lock(or reference ??) on the stack and calls monitorexit (line 16,17). And then go to return. But if an exception occurs the VM will look in the exception table and tries to find a match for this exception and finds it (the fist line in exception table )12 18 21 any
So then it will execute the lines 21,22 which is a monitor exit. And it throws the exception (line 23) to the next method ( at least this is what i think). So this block from 21-23 is put by the compiler to be sure that when an exception occurs the monitor is exited. So in a normal situation only one monitorexit must be called.

Now what's happening in our code:

When the exception occurs at line 13 the VM tries to find a match and it finds it in the table exception (the fist line). Then go to line 21 and execute monitorexist and then throw the exception (line 23). Now it tries to find again a match in the same exception table of the same method and it finds it and go again to line 21 and go to monitor exit and => panic error. I've tracked this behavior in the method VMSystem.findThrowableHandler.

This is what i've found until now. Please shoot if I'm wrong. If you see the solution, I'll give you a beer.
I will continue my investigation and let you know.

Best regards
Andrei Dore

Exception table

After a discussion with Peda i want to share some conclusion

What it is very strange in the bytecode is the second entry table in exception table :

Exception table:
from to target type
12 18 21 any
21 23 21 any

The second seems to be a little bit recursive :

But after some discussion with Peda and some googling I've found 2 links :

Link 1
Link 2

The second link explain why the second entry it is so weird

So it seems that the problem how it is execute the last athrow in the byte code.

Let's we look the spec.

Spec

Method void onlyMe(Foo)
0 aload_1 // Push f
1 astore_2 // Store it in local variable 2
2 aload_2 // Push local variable 2 (f)
3 monitorenter // Enter the monitor associated with f
4 aload_0 // Holding the monitor, pass this and...
5 invokevirtual #5 // ...call Example.doSomething()V
8 aload_2 // Push local variable 2 (f)
9 monitorexit // Exit the monitor associated with f
10 return // Return normally
11 aload_2 // In case of any throw, end up here
12 monitorexit // Be sure to exit monitor...
13 athrow // ...then rethrow the value to the invoker
Exception table:
From To Target Type
4 8 11 any

(It is strange why in this example we don't have the second entry in exception table )

The comment at the line 13 is : // ...then rethrow the value to the invoker

So the problems seems to be that we don;t "interpret" correct the last athrow: we must throw the exception to the invoker and not to try to find the same correct exception table . But we must find this in spec

Thanks Peda for time to discuss about this bug.

Best regards
Andrei Dore

Exactly with the "synchronized" keyword

I seen this Panic due the current workings with the javax.usb many times. I also seen whenever i used the synchronized keyword for doing some works, this panic is happening.

Thanks for your investigation.Keep it up.
Thanks
Tanmoy

Good Point

Roughly this is the point where I got with my investigation last time. I also observed that compiling a code with ecj and javac will produce silightly difefrent bytcodes and while one isn't working the other is working, though both version are valid Java bytecodes. IMHO this should be the right path for finding the bug which should be lying in the bytecode to native compiler. One point is to verify how the basic blocks are indentified in the bytecode and if the bytecode offsets for the basic blocks are correct. If this is OK then the emiting of the native code is the next place to look.

Testing the same code in Jikes - RVM

Hi

I've tried to see how the guys from Jikes RVM handle this problem .

First I've tried to find in the code something special that treats this special case . The code is very similar with jnode code and the hunting was very easy . But unfortunately i didn't find a special treatment for this case. (Maybe a i miss something )

Then I've tried to debug to see what's happening when the RVM VM run my code. When test123 method throws the exception in test12 method the VM find the "catch block" in the table exception ( it was the first entry ) .And when the athrow is executed VM tries again to find a catch in the table exception but it doesn't find because ( this is the strange part ) the line that throws the exception is not in the table exception . If i look at the bytecode i said that it is a little impossible.

public void test12() throws java.lang.Exception;
Code:
0: new #2; //class java/lang/Object
3: dup
4: invokespecial #1; //Method java/lang/Object."":()V
7: astore_1
8: aload_1
9: dup
10: astore_2
11: monitorenter
12: aload_0
13: invokevirtual #3; //Method test123:()V
16: aload_2
17: monitorexit
18: goto 26
21: astore_3
22: aload_2
23: monitorexit
24: aload_3
25: athrow
26: return
Exception table:
from to target type
12 18 21 any
21 24 21 any

That's my finding so far
The hunting continues

Best regards
Andrei Dore

Any progress on this?

Have you made any more progress in the bug hunt?

No

This week i was a little busy with another stuff. Today i will try again to hunt the beast

Best regards
Andrei Dore

Jikes RVM

I will try again to look how Jikes RVM handle this situation ( maybe i miss something last time ) . If somebody has some skills in Jikes RVM i appreciate a new pair of eyes

Best regards
Andrei Dore

Eclipse compiler and Sun compiler

I've test both and it seems that the Sun compiler generate the good exception tables and it works in JNode

So if you want for moment to work be sure that you compile with Sun compiler

Best regards
Andrei Dore

Your workaround seems to work for me

My cursory testing is telling me that your workaround does indeed get rid of the "Lock not owned by us" kernel panics.

For the record, I used the "build.sh" script to do a clean followed by a build. Then I booted the JNode image and ran the following:

dhcp ethCOMPLETE
cat -u ftp://ftp.gnu.org/README

Previously the "cat" command was giving me a kernel panic. Now it is giving me stack traces for Java exceptions ... but no kernel panic.

Saga continues

Hi

Yesterday was a full day for this bug

I write again the example:

0: new #38; //class java/util/Date
3: dup
4: invokespecial #40; //Method java/util/Date."":()V
7: astore_1
8: aload_1
9: dup
10: astore_2
11: monitorenter
12: aload_0
13: invokevirtual #54; //Method test123:()V
16: aload_2
17: monitorexit
18: goto 24
21: aload_2
22: monitorexit
23: athrow
24: return
Exception table:
from to target type
12 18 21 any
21 23 21 any

QaDes come with good observation . HE said that the interval in a exception entry in the exception table it is something like this [start_pc,end_pc). With this observation now the exception table make sense. Peda also find this in spec . Then we verify if our code treat that interval in this approach . Unfortunately the code was correct.

In the VMSystem.findThrowableHandler it is a line
match = ceh.isInScope(address);
This line verify this interval . The code it is correct so in this situation the argument must be wrong . The end_pc address and address of the line that throw the exception could be wrong .

Peda make a interesting experiment and he discover if we insert some dummy opcode between monitorexit and athrow it works ( no panic)

We track this addresses(end_pc , athrow) ( after the bytecode is transform in native code) and we discover that the difference it is 1 byte . So the end_pc address it is higher with 1 byte than athrow address. This is not normal situation .In a normal situation this address must be equal ( as we see in the exception table).

We make assumption that the athrow address it is wrong. We find that VMSystem.findThrowableHandler method it is called in an asm code : vm.asm file

vm_athrow_notrace_pop_eip:
pop ADX ; return address
lea ADX,[ADX-5] ; The call to this method is a 5 byte instruction

push AAX ; save exception
push ADX ; save address

; Setup call to SoftByteCodes.findThrowableHandler
push AAX ; exception
push ABP ; frame
push ADX ; address
mov AAX, vm_findThrowableHandler
INVOKE_JAVA_METHOD
; eax now contains the handler address of the exception, move it to ebx
mov ABX,AAX

pop ADX ; restore address
pop AAX ; restore exception

We imagine that this code must be the root of evil.

For fun we replace the 5 from lea instruction with 4 and 6 . With 6 doesn't work but with 4 works Smiling (no panic error) ( I must also verify the address to see what happening )

This is it for now .

Team of this bug : Qades, Peda and me

Best regards
Andrei Dore

To give a bit more

To give a bit more information about the asm code: The ByteCodeVisitor uses so called "softbytecode" to compile an athrow statement. It's basically a setup of moving the Exception-ref to EAX (i.e. first register) and than does a jumptable call into the asm code (vm.asm) as Andrei wrote above.
No the original intention as I understood it: The "call" instruction will push EIP onto the stack (It is the address of the opcode directly following the "call" instruction). So "pop ADX" will take that EIP and put it in the ADX register. "lea ADX,[ADX-5]" basically is a decrementation of ADX by 5, i.e. ADX = ADX - 5. Watching the comment I would suggest that this code removes 5 bytes what is supposed to be the size of the "call" opcode. So ADX should point to the beginning of the athrow statement.

If I write some test code to see the asm output, I can see, that the call instruction occupies 6 bytes, not 5, so Andrei tested a JNode version that does "lea ADX,[ADX-6]". This version simply dies on bootup.
UPDATE: That is not correct, it behave the same incorrect way as the "5" version.

Like Andrei said, we did some other tests, with manually created bytecode. I will only take some lines of the examples to show the interresting stuff:

...
21: aload_2
22: monitorexit
23: athrow
24: return
Exception table:
from to target type
21 23 21 any

This is code as compiled by eclipse. ECJ tends to start with 'aload'ing the exception, monitorexit and rethrow the exception. I.e. the "to" address in the exception handler matches the bytecode number of the athrow statement. In contrast to that, SUN tends to aload the exception directly before it is rethrown, that looks like that:

...
21: monitorexit
22: aload_2
23: athrow
24: return
Exception table:
from to target type
21 22 21 any

So the handler also only covers the monitorexit for both, sun and ecj. But the difference is the aload right before the athrow, which moves the athrow to a higher place as before. This way we gain enougth extra-space before the athrow, i.e. the address is larger and thus it is not in the scope of the handler.

We did the "other way round" too and tested the "from" address handling, something like that:

21: monitorenter
22: ...
23: athrow
24: ...
25: monitorexit...
Exception table:
from to target type
23 2x 25 any

So the athrow statement should be covered by the handler. But it doesn't get handled in JNode.

We verified the translation from "from" and "to" address in bytecode format to the native code addresses, and that is ok. The code should be absolutly valid.

So in my opinion the bug has to be in the calculation of the address, i.e. the "call" and "pop ADX" stuff. Perhaps someone has an other idea when looking at the code. Like I said, I'd expect an value of 6 if anything at all. Because omitting the "lea" call should work too in my understanding, but it didn't in the tests. And the version with "lea ADX,[ADX-4]" even passed the manual testcases. So even for ecj compiled stuff Andrei didn't see a panic.

What does that mean? Any asm-experts to explain the code? Smiling

Ok, the problem is mainly

Ok, the problem is mainly solved, and I'll keep you updated what happened:

First of all, the assembler code expects the EIP to be on top of the stack and the exception reference to be in EAX.

If we look at the ByteCodeVisitor we see in visit_athrow, that the code checks, if the exception is in EAX. In most cases I tested, the exception is on the stack, and thus that code writes a "pop eax" which occupies 1 byte in the native stream. The next instruction is the call it self, where I did an mistake yesterday. The "call" used in exception handling is only 3 bytes long. Thus 1+3=4 bytes.
So in most cases the athrow statement gets translated to 4bytes mashine code, but, at least theroetically 3bytes might be enough (to name it: in cases where the exception allready has been in EAX).

The address in ADX should in the end be an address within the athrow statement (the original intention was to use the instruction beginning). ADX-5 is too much, since I showed above that in most cases we have 4 bytes. That's why the current fix works in most situations. But I'm almost sure that l1a might output code for athrow with only 3 bytes, which will lead us to the same problem as before. So with ADX-3 we should match any case. Which leads me to the idea of simply taking ADX-1, then we're on the save side for sure.

I hope this explanation is understandable
and thanks andrei for the good work!

The checked-in fix works

The checked-in fix works for me

Suggestions

(I haven't been following the details of this bug hunt, but ...)

You could try adding some debug code to see what address is actually being captured in ADX, and whether that address really is pointing to the start of a 5 byte call instruction. Also check that the Java method being dispatched to is expecting the address of the call instruction, and not the address of (say) the instruction before or after it.

HTH.

The Java method being

The Java method being dispatched is expecting the address, we're sure about that. You can easily verify that in VmSystem#findThrowableHandler, which gets called by the asm code.
About ADX, I don't know much about assembly, so I wouldn't be able to write that testcode in asm, but Andrei debuged the address as passed to the java code. The problem is, we can't verify the value because I/we have no clue how to get the _real_ address of the instruction.

Part of my suggestion was ...

Part of my suggestion was to dereference the value in ADX to see if it points to the instruction you expect it to ...