Improving the JNode compiler.

JNode needs a better, faster native-code compiler. My gut feeling is that JNode runs Java code roughly an order of magnitude slower than the best Java implementations do. This affects most areas of JNode performance, with the long startup time for the JNode GUI and the cost of running garbage collection being two of the most obvious examples.

So I want us to start thinking about improving the JNode compiler. Here are some questions to start the discussion:

  • Can we improve the existing compiler, or would we do better starting from scratch?
  • Is there an existing compiler we could port? (I looked at Jikes RVM, but I think that we would run into licensing problems. This is not to say that we cannot borrow ideas though ...)
  • Do we want to continue with the current strategy of compiling to native code before executing, or should we adopt the mixed interpret/compile strategy used by Sun and Jikes?
  • Does anyone know a good place to start learning about code generators / optimizers?

A good place to start learning about code generators / optimizers?

Wirth and Gutknecht's work on the Oberon OS is not yet totally outdated.

It is a remarkable way to solve, and to explain how to solve, all the problems one encounters when developing an OS, and still a major source of inspiration.

The last references on Wirth's page on wikipedia point to
http://www-old.oberon.ethz.ch/WirthPubl/ProjectOberon.pdf
and http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf

AlainD

L1A Label Handling

A while back I stated that the label handling in L1A has a high memory consumption.
I thought I'd let everybody know that I made some tests regarding the Labels in L1A. In the old implementation each Label contained a String which is a concatenation of many substrings (like class name, method name, bytecode number, ...) and therefore it used StringBuilder a lot.

So I switched the Label from a String-storing to an int-based auto-incrementing implementation (except for those labels that are needed for linking the nano kernel). As a simple test case I measured the boot-up time, ran the memory command, did a full mauve run, then ran memory again. The result: no time improvement at all and <<1% memory savings. I guess the other l1a classes just have too large a memory footprint to make those StringBuilder savings worthwhile. (Btw, the TextAsm output of JNode was of course unreadable with that change.)
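
To make the experiment concrete, here is a minimal sketch of the two Label flavours (class names and fields are illustrative, not the actual L1A code):

// Old style: every Label builds a unique String key, which means an
// implicit StringBuilder, several appends and a toString per label.
final class StringLabel {
    private final String key;
    StringLabel(String className, String methodName, int bytecodeOffset) {
        this.key = className + "#" + methodName + "#" + bytecodeOffset;
    }
    public int hashCode() { return key.hashCode(); }
    public boolean equals(Object o) {
        return (o instanceof StringLabel) && key.equals(((StringLabel) o).key);
    }
}

// New style: an auto-incrementing int, with no per-label allocation churn
// (at the price of unreadable TextAsm output). A real implementation would
// need a thread-safe counter.
final class IntLabel {
    private static int nextId = 0;
    private final int id;
    IntLabel() { this.id = nextId++; }
    public int hashCode() { return id; }
    public boolean equals(Object o) {
        return (o instanceof IntLabel) && id == ((IntLabel) o).id;
    }
}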

Garbage collector

Hi all,

My first post here...

Congratulations for sharing this wonderful Java operating system with the world!

This is probably well known to you, but I ran JNode 0.2.1 with 41 MB, and I didn't find references to such a small amount of allocated memory in the posts on this web site.
I was able to work with the JNode command line shell with only 41 MB of allocated memory on JNode 0.2.1 by adding System.gc() calls at various places in the FAT and NTFS file systems and adding a few lines of code so that gc() requests are honored at most every two seconds.
The places where I inserted calls to System.gc() are after loops where many objects are created. To detect those loops I used PMD: http://pmd.sourceforge.net/
I used JNode 0.2.1 because it is simpler than the current code base for a newbie. And I removed a lot of things that I don't use, like the ext2 fs.

Unfortunately the bootstrap process still needs roughly 256 MB, probably because I didn't add any System.gc() in the bootstrap code (which is quite unclear to me at the moment).

Jean-Pierre

Welcome & Nice experiments

Hello Jean-Pierre,

welcome to JNode. Nice experiments, and a nice result with only 41 MB!

Regarding bootstrap, it will be hard to limit the memory, as the memory management (and therefore the GC) only starts working at a late stage, after a large amount of memory has already been allocated.

Inserting System.gc() calls of course wouldn't be good for the repository code. But it shows that there's still a huge amount of code that could be optimized.

What I/we would be interested in is spotting these code points. Could you give me some examples of these loops? And did you see obvious points where this could be optimized without gc calls?

Of course we'd prefer that you used the latest version of JNode from svn/git. Anyway, I think much of the core/VM code did not change that much over time, so even patches against 0.2.1 might be interesting.

GC experiment follow up

Hi,

I feel a little stupid with all these dirty experiments, but I inserted a System.gc() in the method allocStack of VmSystem.java and put a constraint of 5 seconds between two calls to gc() in GCManager.java, and now JNode 0.2.1 boots correctly to the command line shell with only 128 MB.
An attempt to boot with 96 MB failed.
For me the main lesson learned is *not* that we must use my trick, and I am not sure either that the GC has to be improved; the main lesson for me is that the way objects are created could be better managed. That was why I initially searched for object creation in loops.
For example, I would like to see something like allocation of local objects on a stack; that way (IMO) it would remove the need to call the garbage collector for them, as they would be removed automatically, and without any time penalty, at the end of the parent object's life.

Jean-Pierre

The modification in VmSystem.java:
protected static Object allocStack(int size) {
    try {
        // Added: request a collection before each stack allocation;
        // the throttle in GCManager (below) keeps this from actually
        // running more than once every few seconds.
        System.gc();
        return Vm.getHeapManager().newInstance(
                systemLoader.loadClass("org.jnode.vm.VmSystemObject", true),
                size);
    } catch (ClassNotFoundException ex) {
        throw (NoClassDefFoundError) new NoClassDefFoundError()
                .initCause(ex);
    }
}

The modification in GCManager.java:
/**
 * Do a garbage collection cycle.
 */
final void gc() {

    // Added: if less than 5 seconds have passed since the last cycle,
    // don't proceed. (lastTime is assumed to be set when a cycle runs.)
    if (VmSystem.currentKernelMillis() < (lastTime + 5000)) {
        return;
    }

    // Prepare
    ...

Escape Analysis

What you suggest is called escape analysis, i.e. trying to determine the scope of an allocated object. If the object's scope is local, you can allocate it on the stack; otherwise you need the heap.
This definitely sounds like a good idea, but you might be underestimating the work that has to be done to check the object's scope. Just have a look at an easy pattern:

public int foo() {
   Bar bar = new Bar();
   bar.doSomething();
   return bar.getInt();
}

Now you could argue that the instance "bar" is local, but is it? Well, you cannot be sure; all three calls above might make the object escape to a global scope. Just think of something like this:

// Bar's constructor..
public Bar() {
  // ... init ...
  GlobalService.registerBar(this);
}

And there are many other usage patterns which make it difficult to determine whether an object is stack-allocatable or not. Escape analysis is a useful thing, and AFAIR HotSpot uses it, but it's not easy to implement, I think.

Escape Analysis 2

(First of all, sorry for the double post.)

Thanks for the link, Peter; it's very clear to me now.

I have a question if you don't mind:
- Does registering a Logger with Log4j in a class constructor make this class impossible to recycle through the garbage collector, as there would always be a valid reference from the Logger to it, even if there is no other reference?

Jean-Pierre

Garbage collector and Log4J

Hi all,

Roughly one year ago I lowered the memory requirements of JNode 0.2.1 by inserting an impractical but aggressive garbage collector call in the code (see above).
It was only for testing purposes, but it lowered the used memory from ~256 MB to 41 MB.
Now I used the same codebase (0.2.1) and removed all references to Log4J (a long and frustrating job). The memory used is just 45 MB (without aggressive garbage collection).
So --for me-- it means that Log4J is preventing garbage collection of nearly 200 MB by maintaining references to otherwise unused memory.

What do you think?

Jean-Pierre

This is a very interesting

This is a very interesting result. Thank you for your work and for posting it. I was thinking in the past of removing the core's dependency on Log4j, but I didn't expect it would have such a dramatic effect.
It would be interesting to figure out where and why Log4j, or the way JNode is using it, leaks memory so significantly.

There are/will be Isolates

Hmm, I'm not aware of the implementation details of Log4j and I didn't check. If Log4j stores the reference to the Logger in a static field or a static collection, the object would indeed never be recycled.

No matter how Log4j handles this, this is a general use case I did not consider until now. Anyway, there's a bit more to it: if you search our forum you'll find several pages explaining and discussing Isolates/Isolation. This is JNode's way to have "processes", and if an Isolate is garbage, all its static fields should be garbage too. I'm saying "should" as I guess this statement holds, but I'm not 100% sure, and it's definitely unimplemented.
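
To illustrate the mechanism being discussed (with made-up names, as I haven't checked Log4j's internals either): a static collection is a GC root, so everything reachable from it survives every collection, no matter how many other references die.

import java.util.HashMap;
import java.util.Map;

// Hypothetical registry in the style of a logger repository: the static
// map pins every registered object (and everything it references) in
// memory until the entry is explicitly removed.
final class StaticRegistry {
    private static final Map<String, Object> ENTRIES = new HashMap<String, Object>();

    static void register(String name, Object obj) {
        ENTRIES.put(name, obj);   // obj can now never be collected
    }

    static void unregister(String name) {
        ENTRIES.remove(name);     // only now can the GC reclaim it
    }
}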

We are way off the original forum topic ...

... but don't forget that natively compiled classes are currently not garbage collected.

... a worthwhile activity.

The most significant improvement of JNode performance could be achieved by the second-level bytecode to native compiler: l2. There have been two attempts at an l2: an unfinished compiler in the source tree called l2, and the unfinished integration of the JikesRVM compiler, which is in a separate branch. L2 could be used for build-time compilation of the JNode core at a high optimization level, and as a second-level optimizing compiler while running under JNode, assuming that adaptive compilation support is also implemented, which involves detecting the most often used methods and compiling them with more advanced optimizations.
Since a JIT compiler cannot operate at maximal speed and also create the most optimal native code at the same time, it's necessary to have the l1 compiler, which operates fast, and the l2 compiler, which creates highly optimized code.
The performance of l1 is much influenced by where l1 is executing. The build process demonstrates that l1 performs well when executed on a production-quality JVM, and it performs poorly when executed under JNode, because under JNode its performance is limited by the suboptimal code it generates (l1 being compiled by l1 for execution under JNode). If we had l2 in place and used it at build time for compiling l1, that would bring an immediate and significant performance improvement for l1 under JNode. The performance of l1 can be improved by optimizing its code and algorithms, but it's also important to note that it is even more important for l1 to be able to cope with all Java bytecode that is considered correct and ready for execution by the current production-quality JVMs. I know from experience that l1 currently often fails when it encounters unreachable blocks in the bytecode, when it has to compile obfuscated classes, and also on bytecode generated by special tools rather than derived from Java sources. Other issues may arise as more real-world code is tested under JNode.
So for l1 I would set the priorities as follows:
- correct operation in terms of accepted bytecode and generated code
- run-time performance
- performance of generated code (can be traded for improved run-time performance)
Before improving the run-time performance of l1, we can perform rigorous profiling by using the various highly accurate and productive profilers available for the Java platform, if we execute l1 on the Sun JVM.

The performance of the garbage collector depends on the quality of the code generated by the bytecode to native compiler, the efficiency of the garbage collection algorithm, and the size of the Java heap. At the moment the usually large size of the heap is a significant factor in GC performance. When isolates support separate heaps, the GC will be able to run on the heap of a single isolate, and will cause shorter pause times compared to the case where it has to collect the whole heap of the complete system.

For L2 it would be interesting to complete the integration of the JikesRVM compiler with JNode. This would give us the best generated code, and it's probably the least-effort approach for getting an optimizing compiler under JNode. It's also the approach with the highest impact on every aspect of JNode performance mentioned above, so this is of the highest priority.

Is there any information about the L2 compiler attempts?

I took a look at L2 in the main tree some time back and couldn't find anything that explained the data structures. Does anyone know what the code design is based on? How much of it works? How much more needs to be implemented?

And what about the Jikes RVM integration? How much work was done on that? Should that be continued or would it be better to start again with a fresh Jikes RVM checkout?

RVM integration nearly finished

The Jikes RVM integration was at a very good level. Most of the stuff worked, and even a boot image compiled with Jikes RVM booted up to some point.
Though there is at least one remaining bug regarding exceptions (I think it happens for default exception handlers) which crashed the VM. Given a working Jikes JIT in the branch, it would not be hard to port it to trunk, even though the branch is old.

I followed and "helped" ansari wherever I could, back when he did the porting work. And I have to say that this was quite complicated for me. But given what I've seen and how I've tested ansari's work, I'd definitely say that finishing the Jikes integration is much, much better/easier than starting fresh.

No clue about L2 at all.

I'm having a go at ...

... completing the Jikes RVM integration. Over the weekend, I've been trying to merge the last 2000 or so revisions in "trunk" into the jikes-rvm branch. [SVN merge + sourceforge is painful :-(]

That merging can be painful

That merging can indeed be painful, but there is a way to avoid it.
Most likely the JikesRVM compiler does not need those 2000+ revisions (mostly unrelated to the JikesRVM integration) to be functional, but it needs a set of bug fixes in the integration layer to make it work. Then it will be much easier to integrate the JikesRVM-related code with trunk.

Yea I thought about that ...

... but since I don't know how to precisely identify those bug fixes, the painful merge is safer. I can use tree diffs to make sure I haven't lost updates in the merge process ... with all of the restarts, etc. that are necessary because of SF + SVN flakiness.

EDIT: another point is that the jikes-RVM branch predates the classlib split, so merging removes the pain of Eclipse builds with not enough memory.

GC is a red-herring?

I'm starting to think that GC is a red herring in this discussion. When I start the bjorne interpreter, I see a long pause while code is compiled ... but no GC. IIRC, even when I run startawt (with 1GB memory) I see at most one GC cycle, and it still takes far too long.

Setting that aside, GC speed is undoubtedly affected by the quality of the compiler used to compile it. But I think in the long run we will find that the GC algorithm used and the details of the implementation are more important.

I've read a fair bit about garbage collectors, and I think that Levente is mistaken about the influence of heap size. In general, the cost of garbage collection is dominated by the cost of finding objects that are not garbage. With a copying collector (generational or not) you simply zero the 'from' space, so the constant of proportionality for garbage objects is very small: arguably zero. The way to get the best performance from a copying collector is to give it as much physical memory as possible, to maximize the ratio of garbage to non-garbage. IIRC, the tricks that you need to do to implement incremental or low-pause GC are not affected by heap size either. With a mark and sweep collector (like JNode's) both garbage and non-garbage cost you, but it is still best to maximize the garbage to non-garbage ratio if your primary aim is to minimize overall GC costs.

The idea of giving each isolate its own (small) heap to reduce GC pauses is interesting, but (AFAIK) untried. (I don't recall seeing any results published based on that approach.) My gut feeling is that it could work in the sense that an isolate which didn't create much garbage could avoid being 'frozen', but this would only be the case if the GC thread could run in parallel. We have to weigh this against the issue that isolate heaps would need to be scrupulously isolated, otherwise the GC has to deal with pointers from one heap into another. And for the algorithmic reasons mentioned above, it may actually increase the overall proportion of the total system's resources that are spent on memory management / garbage collection. In short, a lot of things would need to come right for isolate-local heaps to work ...

How much the gc pause

How much the GC pause depends on the heap size is a function of the GC algorithm. In the case of the garbage collector in the current JNode, the time needed to complete a collection grows with the size of the heap. It's trivial to test this with vmware if it's not clear from the internals of the GC code.

This depends on the GC algorithm

I'll grant you that JNode's current mark and sweep algorithm behaves this way. But a copying collector wouldn't. And an incremental or low-pause collector wouldn't. Read the literature for details.

I don't need you to grant me

I don't need you to grant me what's evident to me and I've been talking about the current GC in JNode.

Clarity + some math

It might have been evident to you what you were talking about. But it wasn't to me. Here is what you wrote:

The performance of the garbage collector depends on the quality of the code generated by the bytecode to native compiler, the efficiency of the garbage collection algorithm, and the size of the Java heap.

That reads as a general statement about garbage collectors, not a specific statement about the behavior of the current JNode garbage collector. If you meant to restrict your statement to just the current JNode garbage collection algorithm, you should not have included the algorithm as one of the three "variables" in the above "equation".

OK, so let's just restrict ourselves to talking about mark and sweep algorithms. The cost of mark and sweep consists of the cost of the mark phase and the cost of the sweep phase. The mark cost is roughly proportional to the number of pointer cells in non-garbage objects. The sweep cost has a component that is proportional to the total number of objects (looking for unmarked objects, clearing mark bits), and a component that is proportional to the number of bytes of garbage (zeroing). When you add the two together, you get an equation that depends on the ratio of garbage to non-garbage objects, as well as on the total heap size.

But the calculation that matters more than the time taken by a single GC is the total GC cost for any given computation / sequence of computations. This is the average cost of running the GC multiplied by the number of times that the GC has to be run. This is where the ratio of garbage to non-garbage really matters.

Suppose that the cost of a GC run is C1 * NG + C2 * G, where C1 and C2 are constants, NG is the amount of non-garbage and G is the amount of garbage.

Now suppose that running application "foo" generates N objects in total, and that (2/21)*N objects are non-garbage at any given time.

If I size the heap so that it can contain (3/21)*N objects, the GC will run 18 times, costing C1*(2/21)*N + C2*(1/21)*N each time, for a total of C1*(36/21)*N + C2*(18/21)*N.

But if I size the heap to contain (20/21)*N objects, then the GC will run just once, costing C1*(2/21)*N + C2*(18/21)*N.

Now this is rather an extreme example, and the math is an over-simplification, but it does (I hope) demonstrate the problem with reducing heap size. The GC runs quicker, but you need to run it more often. Whether you win or lose by reducing heap size depends on the application, its input data, and on the constants C1 and C2 for the garbage collector. But I think that you lose on average.
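
To make the arithmetic above easy to check, here is a tiny self-contained Java version of it (C1, C2 and N are the illustrative values from the example, not measurements):

public class GcCostExample {
    public static void main(String[] args) {
        double c1 = 1.0, c2 = 1.0; // illustrative mark and sweep constants
        double n = 21e6;           // N = total objects allocated by "foo"
        // Small heap, (3/21)*N capacity: 18 runs, each seeing (2/21)*N
        // non-garbage and (1/21)*N garbage.
        double small = 18 * (c1 * (2 * n / 21) + c2 * (n / 21));
        // Large heap, (20/21)*N capacity: one run, (2/21)*N non-garbage
        // and (18/21)*N garbage.
        double large = c1 * (2 * n / 21) + c2 * (18 * n / 21);
        // The mark cost (C1 term) is paid 18 times with the small heap but
        // only once with the large heap; the total sweep cost is the same.
        System.out.println("small heap total cost: " + small);
        System.out.println("large heap total cost: " + large);
    }
}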

Sorry for not being clear

Sorry for not being clear enough, and thank you for the analysis.
With my first post to this thread I just wanted to share my opinion on the bytecode to native compilers, in the hope that it might help.
I'm a bit busy with personal things these days, but I hope to be back to coding JNode soon. I'm sure your research and findings will be useful when (or as soon as) real work is done in these areas.

Jikes RVM

Jikes RVM is licensed under the EPL, which is incompatible with the GPL according to the GNU website. How do you propose we deal with this, Levente?

One solution is to make sure

One solution is to make sure the JikesRVM compiler works with JNode. We can keep its license and provide it as a separate download if there are insurmountable legal issues. The users are free to download, install and use it anyway.

LGPL should not be a problem

As we claim to be LGPL, this should not be a problem with the EPL. This is what Eclipse's SWT is doing with GTK, etc., and that's why there wasn't a Qt port (Qt had been GPL for versions before 4.5). Though I do not want to ignore your concerns; it would be nice if we had professional legal help for things like this.

I guess talking about MRP could be an option, too (for those that are not aware of it, it's a kind of "fork" of JikesRVM made by Ian).

Btw, I especially like this point made by levente:
> The performance of l1 is much influenced by where l1 is executing. The build process demonstrates
> that l1 performs well when executed on a production-quality JVM, and it performs
> poorly when executed under JNode because ...
We should try to get a better overview, and document it, of where we lose performance: what is slow and what is not. E.g. if method lookup is slow as hell, we can optimize the JIT as much as we like without getting a significant performance gain. (Oh, and I'm not sure if method inlining is currently activated in trunk.)

Performance of L1 in build environment

L1 does indeed run fast on a Sun JDK 1.6 JVM, and this is indeed a reflection on the quality of the Sun JDK.

So why is it so much slower on JNode? Here are some possible reasons:

  • L1 running on JNode is executing native code generated by L1. L1 running on Sun is executing code generated by the Sun HotSpot compiler. The (indisputable) fact that L1 generates poor code compared to the Sun HotSpot compiler means that L1 will be slower on JNode.
  • L1 generates a lot more garbage than it needs to, and JNode GC is very slow compared with Sun's GC. (OTOH ... I don't notice lots of GC cycles ... so unless JNode object allocation is also slow, this is probably not the biggest problem. Besides, if L1 produced better code, the GC and memory allocation would probably be faster!)
  • There are likely to be other performance issues in the JNode Core / FS area compared with Sun JDKs on top of the Linux stack; e.g. under-performing FS drivers, inadequate FS caching.
  • We typically run JNode with less memory than is available when building JNode, especially when we factor in memory used by the Linux OS and buffer caches. (OTOH, the fact that we don't see lots of GCs suggests that memory shortage per se is not the problem.)

BTW, I consider that the quality of method dispatching code and decisions about method inlining are clearly JNode compiler issues, not JNode runtime issues. However, the borderline between the compiler and the GC is not so clear-cut. For example, I understand that the current GC implementation requires the compiler to emit extra instructions in loops so that a compute-intensive program will yield control to the GC when required (see the sketch below). A better GC would not require this. Similar tradeoffs arise with read and write barriers.
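
For readers who haven't seen one, this is roughly what a compiler-inserted yield point amounts to, written out as Java source. VmThreadStub and its methods are stand-ins for whatever the JIT actually calls, not the real JNode API; the point is that the check is emitted on loop back-edges by the compiler, never written by the programmer.

// Stand-in for the runtime class the JIT-emitted check would target.
final class VmThreadStub {
    static volatile boolean yieldRequested;
    static boolean yieldPointRequested() { return yieldRequested; }
    static void yieldPoint() { Thread.yield(); } // placeholder action
}

class YieldPointExample {
    static long sum(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
            // --- conceptually emitted by the JIT on the loop back-edge ---
            if (VmThreadStub.yieldPointRequested()) {
                VmThreadStub.yieldPoint(); // GC-safe point: GC/scheduler may run
            }
        }
        return total;
    }
}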

Agreed to your possible

I agree with your possible reasons.

Commenting on your other reply: as I said, I didn't want to ignore your license concerns. I was just saying that we claim to be LGPL (which would be OK with the EPL).

What I do not agree with is the point about yield points and barriers. First off, I guess you made a typo: not the GC, but the JIT inserts the yield points and barriers (and for yield points, the GC has nothing to do with that at all).

> A better GC would not require this

I guess you said this only in regard to yield points and not to barriers (I think we all agree that barriers need to be inserted for a well-performing GC). Imho the cooperative multitasking feature of JNode is a good one, and just as a note: Jikes/MRP does the same! Besides that, for really efficient GC you will probably also need GC-safe points, and with real preemption instead of yield points this would be much harder to implement.

The GC tradeoffs are really complicated ...

(For the record, I didn't say that the GC inserts yield points. Read my comment again. And ... no ... I didn't go back and edit it :-))

This discussion is interesting, but I think we are getting off topic. IMO we should leave the GC algorithm and the GC / compiler interactions alone for now, and focus on getting the L1 and L2 compilers to generate better code in other areas.

Re: LGPL should not be a problem

Well ... if you say so. But I just want us all to be aware that there could be consequences if we make the wrong call on this.

I would love to be able to reuse a quality compiler from some other project.

Target clarification

I put some thoughts into the questions for the discussion you raised.

But after rereading your introduction I found a problem in my understanding: I'm not quite sure whether you are speaking about the performance of the generated code or about the time it takes the JIT to compile the code.

I had this problem in my understanding because, imho, it is a common misunderstanding that the GUI is "slow": the compilation itself takes a huge amount of time, as much of java.awt.* and javax.swing.* has to be compiled before a single line of native code gets executed at all. This can be seen when one starts up the GUI a second time: it will be much faster (if you have enough memory :-))!

The current JIT compiler (l1a, aka level 1a) is meant to be a fast compiler in terms of compilation time, with as good performance of the generated code as possible. So regarding the first question I'd definitely vote for "keep l1a", as imho it is a good compiler. Somewhere on the forum (I can't find it right now) there's a link to a paper explaining it.

I think the way to go is to create an l2 compiler or port jikesOpt to JNode as a second, add-on compiler for hotspots and bootimage creation, but keep l1a as our fast baseline compiler.

As I have no clue whether a new l2 compiler or porting e.g. jikesOpt is easier/better, I'll give some ideas as to how we could improve l1a.
The main problem with l1a is not the l1a compiler itself (namely the X86BytecodeVisitor) but the underlying assembler that outputs the raw native code. There's a huge amount of pointless allocation going on: each bytecode produces about 3 Labels on average, and each Label performs at least one "String + String" operation, which translates to a StringBuffer, two appends (which in turn produce many allocations, as the StringBuffer's initial size is always too small) and another toString. Imho this could be totally avoided with some work and would bring a huge boost (you all know our mm/gc :-)).
There are quite a few SoftByteCodes that get called for difficult Java bytecodes. A really simple one is "ldiv" (long division): each time an "ldiv" bytecode occurs, a Java method is called that calculates "long/long". (A while back we found a bug in that code and rewrote it, which got a great performance boost for ldiv.)
And there are more complicated ones (instance-of, invoke_*, ...) which would be SoftByteCodes in other compilers, too. I never looked deeply at any of those SoftByteCodes, but at least those take a huge amount of time to execute. So perhaps we should try to list them and discuss them in terms of their current complexity; the sketch below shows the general idea behind a SoftByteCode.
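
For those unfamiliar with the idea: a SoftByteCode replaces inline native code for a hard-to-compile bytecode with a call into a Java helper method. The sketch below uses hypothetical names (it is not JNode's actual helper) and shows the shape of such a helper for ldiv; 32-bit x86 has no single instruction for 64-bit division, so the JIT emits a call instead. Note that the helper cannot simply use Java's / on longs (that would recurse into ldiv); a shift-and-subtract loop is one alternative, shown here while ignoring some edge cases near 2^63.

final class SoftByteCodesSketch {
    // Called by JIT-emitted code whenever an "ldiv" bytecode is executed.
    static long ldiv(long num, long den) {
        if (den == 0) {
            throw new ArithmeticException("/ by zero");
        }
        boolean negative = (num < 0) ^ (den < 0);
        long q = udiv(Math.abs(num), Math.abs(den));
        return negative ? -q : q;
    }

    // Restoring shift-and-subtract division using only shifts, compares
    // and subtraction; no '/' that would recurse into ldiv. Can overflow
    // for divisors above 2^62; a real helper has to handle that.
    private static long udiv(long n, long d) {
        long q = 0, r = 0;
        for (int i = 63; i >= 0; i--) {
            r = (r << 1) | ((n >>> i) & 1);
            if (r >= d) {
                r -= d;
                q |= 1L << i;
            }
        }
        return q;
    }
}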

Oh, and a last thing: if we could solve the allocation problem named above, the compiler would compile much faster. And even with current performance I'd tend to keep our strategy and not add an interpreter again (neither one written in Java nor one in assembler, as we had before).

Performance clarification

My gut feeling is that we have both problems: the compiler produces poor code, and it is slow. On the first problem, I remember looking at the generated code and noting that there seemed to be little attempt to use registers, eliminate unnecessary branches and so on. On the second problem we have your evidence; and of course, if the compiler produces poor code, the bootstrapped compiler will be slower as a direct consequence.

Addressing the L1a compiler's allocation inefficiencies could have a significant effect. However, the Sun / Jikes RVM data points suggest to me that an interpreter + compiler strategy would be better in the long term ... assuming that we have the resources (i.e. the right people) to implement it.

I didn't realize that the old interpreter was implemented in assembler. That partly explains why people were keen to get rid of it :-), but there must have been more to it than that. I get the impression from looking at some of the old pages that eliminating the interpreter and precompiling the native classes improved overall performance. However, it sounds like the old JNode strategy was interpret OR compile. The Sun/Jikes strategy is interpret for the first few calls, THEN compile ... using profiling data gathered while interpreting.

Finally, I'm noticing the slow startup effect in other places. For instance, on the machine I'm currently using it takes ~5 seconds to run "dir" the first time and ~30 seconds to run "propset -s jnode.interpreter bjorne". And I've even found that JNode has to run the garbage collector when I run "halt" after doing some simple bjorne interpreter tests.

Interpreter

I'd like to give some more arguments against an interpreter.

First off, the original interpreter was indeed implemented in assembler. AFAIK it emerged from Ewout's JBS/JBS2, and it's also why we call the nanokernel a "kernel": in the beginning the interpreter was built directly into the kernel as a jump-table interpreter. Back when l1a was created, Ewout created the infrastructure to use different compilers for different code; this is mainly the distinction in BootImageBuilder of whether a class/package is marked to be compiled high-level or not. Soon after that, Ewout recognized that compiling _everything_ indeed gives a huge performance boost. [/history]

I guess you and everyone else agree that we want to keep handwritten native code to a minimum, so the assembler interpreter wouldn't be an option at all. But if we write an interpreter in Java itself, its performance depends on the JIT. So if the code quality produced by l1a is bad, the interpreter will be bad.
I am not sure you're correct about Jikes. I know they have an interpreter, but AFAIR you have to configure it; the default operation is to use the Jikes baseline compiler (comparable to l1a) and jikesOpt only for profiled code. About Sun I'd like to add that their interpreter is really good C++ code, and there are great compilers for C++.
All this makes me believe an interpreter for JNode would (at least for now) not be a good idea. Anyway, of course I could be totally wrong. But someone would just have to write the interpreter to prove me wrong.

Re Performance clarification: I too think we have both problems. I again want to point out that these are two different problems and should be treated as such. I.e. anyone participating in this discussion with new ideas, comments, etc. should make clear which is being addressed: JIT performance (the time the JIT needs to compile) or code performance (the performance of the produced code).

As I have found the paper I was talking about yesterday, I'd like to briefly explain the idea behind l1a. I'm not sure if you (Steve) are aware of it, but anyway it might be interesting for others, too. In our Reference Section you can find a link to this paper.
You can call the compiler an online algorithm: it does not do any lookahead, and each bytecode gets translated directly, one by one. But as Java bytecode is stack-based, you'd get "pop;pop;add;push" sequences for bytecodes as simple as iadd. And that's where the paper comes in.
You have a VirtualStack which keeps track of the JVM's operand stack during compilation. You can push/pop Items to/from the VirtualStack. Items can be WordItems or DoubleWordItems; e.g. IntItem is a word, LongItem is a double word. And the cool thing is, each Item keeps track of its actual location: is it in a register, on the stack, or in the heap?
Now have a look at the "iadd" bytecode example from above. What happens is: the JIT pops two Items from the VirtualStack. It checks whether at least one Item is in a register; if none is, code gets emitted to load one value into a free register (there's a RegisterPool to keep track of that). After that, the "add" code gets emitted (depending on the Items' locations, different add operands get emitted). As a final step, the JIT creates a new IntItem for the result and pushes it onto the VirtualStack. As the result of "add" will be in a register, the IntItem on the VirtualStack gets marked as being in a register (so the next bytecode might take advantage of that). Btw, the way free registers are handed to the JIT is simple round robin. This flow is sketched below.
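
Here is that flow as pseudocode in Java syntax (VirtualStack, Item, X86Assembler, RegisterPool and so on are illustrative stand-ins, not l1a's real API):

// Sketch of how the JIT might translate a single "iadd" bytecode.
void visitIadd(VirtualStack vstack, X86Assembler asm, RegisterPool pool) {
    Item right = vstack.pop();
    Item left = vstack.pop();
    // Ensure the left operand is in a register; emit a load if it isn't.
    if (!left.isInRegister()) {
        Register reg = pool.request();      // simple round-robin pool
        asm.emitLoad(reg, left);            // e.g. MOV reg, [esp+offset]
        left.moveToRegister(reg);
    }
    // Emit an add whose operand form matches where "right" lives.
    if (right.isInRegister()) {
        asm.emitAddRegReg(left.getRegister(), right.getRegister());
        pool.release(right.getRegister());
    } else {
        asm.emitAddRegMem(left.getRegister(), right);
    }
    // The result sits in left's register: push a new IntItem marked as
    // register-resident so the next bytecode can take advantage of it.
    vstack.push(Item.createIntItem(left.getRegister()));
}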

And on another note: there's still much room for improvement. Think of a bytecode sequence like "dup,astore,astore" and suppose the initial Item on top of the stack is in a register. At the moment, in such a situation (and there are thousands like this), the "dup" would translate to a "push" to the CPU stack. This happens because the compiler cannot be sure whether the value in the register gets overwritten or not. Even with an online algorithm you could improve this: mark the Item as a "COW" Item (copy on write) and extend the JIT to respect the COW flag. If an asm opcode operates on a register Item that is marked COW, it has to be saved before it gets overwritten.
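
A rough sketch of that COW idea, in the same pseudocode-in-Java style (hypothetical names again): "dup" stops emitting a push, and a single hook spills the shared register just before anything would clobber it.

// "dup" no longer touches the CPU stack: both Items share one register
// and are flagged copy-on-write.
void visitDup(VirtualStack vstack) {
    Item top = vstack.pop();
    top.setCopyOnWrite(true);
    vstack.push(top);
    vstack.push(top.duplicateSharingRegister()); // also COW-flagged
}

// Called by the JIT before it emits any opcode that writes to a register:
// if the register is shared (COW), save the value to the CPU stack first
// and repoint the other Item at that saved copy.
void beforeRegisterWrite(Item item, X86Assembler asm) {
    if (item.isCopyOnWrite()) {
        asm.emitPush(item.getRegister());
        item.siblingItem().relocateToCpuStack();
        item.setCopyOnWrite(false);
    }
}
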

I still think there are tons of other ways to improve l1a. I'll try to make some benchmarking of the current code (especially SoftbyteCodes) so we know the hotspots where to look. But anyway we'd need some more professional benchmarking suites. I have tested to run all I found on JNode, but none of them worked. Not even the simple microbenchmarks (which allready would be a great indicator). I'd really like to have a fixed setup with a good set of benchmark code to keep track of that. Our current performance results are quite old and the two "benchmarks" are rather simple.