Source file encoding

Sorry to bring this topic up yet again, but the latest bug report by Bluebit prompted me to write this entry.

I don't know of any distribution from the last two years or so that does not use UTF-8 as the default encoding for everything. As far as I know the same applies to Mac OS X. I'm not sure how Windows handles this; can anyone shed some light on it?

It would make life much easier if we switched to UTF-8. You could use any editor you like without having to take care to select the correct encoding, you could type and read non-7-bit characters more easily, ... And something I don't understand: why did that bug happen? Isn't \uxxxx supposed to work? Does that mean Levente will never be able to write his name correctly? :)

For reference: source file encoding is specified here and here.

What do you think?

Let's not jump to conclusions

This could be due to non-ASCII characters in a source file, or the source file could be correct and the generation step could be producing a file with non-ASCII characters in it.
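To tell the two cases apart, one could scan both the original source file and the generated file for bytes outside 7-bit US-ASCII. Below is a minimal sketch of such a check; the class name NonAsciiCheck and its method are made up for illustration and are not part of JNode or its build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class NonAsciiCheck {
    /** Returns the offset of the first byte outside 7-bit US-ASCII, or -1 if there is none. */
    public static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Any byte with the high bit set cannot be US-ASCII.
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }

    /** Checks every file named on the command line and reports the result. */
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            int offset = firstNonAscii(Files.readAllBytes(Paths.get(name)));
            System.out.println(name + ": "
                    + (offset < 0 ? "pure US-ASCII" : "non-ASCII byte at offset " + offset));
        }
    }
}
```

Running it on the hand-written input and on the generated .java file would show which of the two actually contains the offending bytes.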

It is certainly true that most distros (and that probably includes Windows) should be able to deal with UTF-8. But UTF-8 may not be the default, or the user or sys admins may have changed or overridden the default; e.g. to make something else work.

So notwithstanding that UTF-8 encoding "ought to work", our safest bet is for the JNode source code and all generated Java code to use / continue to use US-ASCII with \uxxxx's. This avoids problems when people try to build JNode or reuse parts of JNode on platforms that cannot cope with UTF-8 for some reason.
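As a rough illustration of what "US-ASCII with \uxxxx's" means in practice, here is a small sketch that rewrites every non-ASCII character in a string as its escape. The class and method names are hypothetical; this is not how JNode's build does it, and the native2ascii tool shipped with older JDKs performs a similar conversion on whole files.

```java
final class UnicodeEscaper {
    /** Replaces every char above 0x7F with its six-character backslash-u escape. */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                // Works per UTF-16 code unit, so a supplementary character
                // becomes two escapes (one per surrogate), as Java source expects.
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(escapeNonAscii(arg));
        }
    }
}
```

The result contains only 7-bit characters, so it reads the same whatever ASCII-compatible encoding the compiler assumes.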

(This is a bit off-topic, but there is also the issue that JNode itself does not yet support UTF-8 fully. I think that the input drivers do UTF-8 properly now. However, there is definitely a problem displaying non-ASCII characters on the JNode console. I doubt that we could edit a UTF-8 encoded Java file on JNode ... yet.)

You're right

OK, point taken. It was silly to jump to a conclusion so early. Probably javacc turned the \uxxxx into a UTF-8 character which javac then refused to compile.
Though "our safest bet is for [...] all generated Java code to use / continue to use US-ASCII with \uxxxx's" makes me wonder whether that works at all.

JNode's current state is IMHO beside the point. It makes no difference whether you have \uxxxx or a UTF-8 character in your source file: in both cases JNode has to print a Unicode character. You would have to forbid anything outside 7-bit US-ASCII altogether. (This only makes sense for compiling JNode inside JNode.)
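To make that concrete, here is a small sketch (class and method names are made up for the example). It builds the same character once from the bytes a UTF-8 encoded source file would contain and once from the ASCII-only escape, then compares the two at runtime.

```java
import java.nio.charset.StandardCharsets;

final class SameCharacterDemo {
    /** Decodes the UTF-8 byte sequence C3 BC, which is how UTF-8 stores u-umlaut (U+00FC). */
    public static String fromUtf8Bytes() {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xBC };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** The same character written with an ASCII-only escape in the source. */
    public static String fromEscape() {
        return "\u00fc";
    }

    public static void main(String[] args) {
        // Both strings hold the single char U+00FC.
        System.out.println(fromUtf8Bytes().equals(fromEscape())); // prints "true"
    }
}
```

Either way the program ends up with the same Unicode string, so whatever prints it faces the same problem.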

7-bit US-ASCII is just so 1980s email, and as someone with umlauts in his language I tend to dislike ASCII :) OK, Eclipse (and others) can deal with that (it would refuse to save non-ASCII chars), but with e.g. Emacs you have to set the encoding each time, as Emacs is not able to determine the encoding by looking at the sources (UTF-8 is the default).

I still like the idea of switching to UTF-8, and I don't think anyone would be unable to cope with it. But it's not that I couldn't live with ASCII. I just thought that, as Java is a modern Unicode-aware language, a Unicode-capable encoding would be nice (though I wouldn't suggest UTF-16 or any other UTF).

Re: JNode's support for Unicode

You are right. The fact that JNode's text consoles don't display Unicode is not a reason not to use UTF-8 encoding for our source code. But it is kind of ironic that JNode is actually worse than Emacs in this respect.

I might be wrong but I think

I might be wrong, but I think that the \uxxxx form still needs an encoding itself and thus might cause problems too.

So I think we should find the appropriate encoding for common usage (which can include Levente's name) and enforce it for all JNode sources.
For other characters that might be needed, we should be able to use the \uxxxx form.
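On the question of whether the \uxxxx form needs an encoding itself: the escape is written using ASCII characters only, so it should come out as the same bytes in any ASCII-compatible encoding, though not in something like UTF-16. A quick sketch to check this, with a hypothetical class name and covering only the three charsets listed in the code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EscapeEncodingDemo {
    /** True if the text encodes to identical bytes in US-ASCII, ISO-8859-1 and UTF-8. */
    public static boolean sameBytesInAsciiSupersets(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // The six characters of the escape as they appear in a source file.
        System.out.println(sameBytesInAsciiSupersets("\\u00fc")); // prints "true"
        // The raw character itself, which each charset stores differently.
        System.out.println(sameBytesInAsciiSupersets("\u00fc")); // prints "false"
    }
}
```

So the escaped form is only a problem if the file is read with an encoding that is not ASCII-compatible.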

Fabien
