Is FileArgument pathname expansion a bad idea?

Submitted by Stephen Crawley on Fri, 11/16/2007 - 05:16.

JNode-Shell

(After posting this as a blog entry, I realised I should have made it a forum topic ...)

In traditional operating systems, there are two approaches to implementing pathname patterns, or wildcards as they are often called. The UNIX approach is to have the command shell expand the patterns and pass the resulting pathnames to the command. Thus "cat *.java" typically leads to the "cat" command being run with multiple file names as separate arguments. The "cat" command does not get to see the original pattern.

The DOS / VMS approach is for the patterns to be passed intact to the command. It is then up to the command to treat an argument containing wildcard characters literally, expand it into a list of pathnames, or interpret it some other way. (In the case of DCL, there is a command for expanding wildcards that allows you to process them in a script.)
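
To make the contrast concrete, here is a minimal Java sketch of what a command would receive under each approach. The class and method names are made up for illustration, the directory is just an in-memory list of names, and the glob matching is borrowed from java.nio rather than from any real shell.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Contrasts shell-side (UNIX) and command-side (DOS / VMS) handling of one argument. */
final class ExpansionStyles {

    /** Returns the names from 'directory' that match the glob 'pattern', in listing order. */
    static List<String> expand(String pattern, List<String> directory) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matches = new ArrayList<>();
        for (String name : directory) {
            if (matcher.matches(Paths.get(name))) {
                matches.add(name);
            }
        }
        return matches;
    }

    /** UNIX style: the shell expands first, so the command only sees pathnames. */
    static List<String> unixStyleArgs(String arg, List<String> directory) {
        List<String> expanded = expand(arg, directory);
        // Assumption: like Bourne-style shells, pass the pattern through if nothing matches.
        return expanded.isEmpty() ? Arrays.asList(arg) : expanded;
    }

    /** DOS / VMS style: the command receives the pattern intact and decides what to do. */
    static List<String> dosStyleArgs(String arg) {
        return Arrays.asList(arg);
    }

    public static void main(String[] args) {
        List<String> dir = Arrays.asList("A.java", "B.java", "notes.txt");
        System.out.println(unixStyleArgs("*.java", dir)); // prints [A.java, B.java]
        System.out.println(dosStyleArgs("*.java"));       // prints [*.java]
    }
}
```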

The advantages of the UNIX approach are uniformity and simplicity. Wildcards are always treated the same in the context of a given shell. Furthermore, an application command does not need to take account of patterns.

The advantage of the DOS / VMS approach is that it is not a one-size-fits-all solution. Command arguments that do not represent pathnames need not be expanded. Furthermore, there is room to implement command syntaxes like the DCL "copy *.foo *.bar" syntax. This means copy each file with the extension "foo" to a corresponding file with the same base name and the extension "bar". (Incidentally, one could implement this as a UNIX command, with the caveat that the user would need to quote the arguments; e.g. "dclcopy '*.foo' '*.bar'".)

So that's the background.

In JNode, the builtin interpreter understands argument quoting and escapes, as well as redirection and pipeline syntax. But it does not perform pathname expansion. This currently happens when the application code calls "getFiles()" on a FileArgument that has been populated by the Syntax argument parser. This presents a number of problems:

If an application's Syntax doesn't define an argument as a FileArgument, or if it uses "getValues()" rather than "getFiles()", it will see the unexpanded pattern. (OK, that is arguably an application bug. But it is pretty common right now. For example, neither "dir" nor "cat" gets this right.)
Classic Java applications which do their own argument parsing independent of Syntax and friends will not be prepared to do pathname expansion. This means that they will behave differently from "native" JNode applications.
There is a problem if a user needs to include a literal '*' or '?' character in a pathname. By the time the argument string reaches the pathname expansion code, the interpreter will have stripped quotes and '\' escapes. So, for "xyz '*.java'" the application would see the argument "*.java" and would dutifully expand the pattern. Double escapes or escaped quotes (e.g. "xyz \\*.java" or "xyz \'*.java\'") would probably work, but this approach is ugly ... and interpreter specific: see below.
If we want to support a UNIX style shell, having the application perform / control pathname expansion is potentially problematic. For example, if the "foo" application did pathname expansion, then invoking "foo \*.java" from the bjorne shell (with globbing enabled) would run "foo" with one argument. But the foo command would then expand the argument into multiple pathnames ... which is not what you would expect. The user would need to use double escapes to stop this happening. Ugly.
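
The quoting problem in the list above can be reproduced with a toy tokenizer. This is only a sketch: it is not the JNode interpreter, it handles single quotes and backslash escapes only, and the class name is invented. It shows that once quotes are stripped, a quoted and an unquoted pattern become indistinguishable to whatever expands them later.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer (not JNode code) that strips quotes and escapes like a simple interpreter. */
final class QuoteStripping {

    /** Splits a command line on spaces, removing single quotes and backslash escapes. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean haveToken = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '\'') {
                    inQuotes = false;
                } else {
                    current.append(c);
                }
            } else if (c == '\'') {
                inQuotes = true;
                haveToken = true;
            } else if (c == '\\' && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // keep the escaped character, drop the backslash
                haveToken = true;
            } else if (c == ' ') {
                if (haveToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    haveToken = false;
                }
            } else {
                current.append(c);
                haveToken = true;
            }
        }
        if (haveToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both lines yield the same tokens, so later expansion cannot tell them apart.
        System.out.println(tokenize("xyz '*.java'"));
        System.out.println(tokenize("xyz *.java"));
        // The "double escape" workaround leaves a backslash in the argument.
        System.out.println(tokenize("xyz \\\\*.java"));
    }
}
```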

IMO, a good summary of the above is "a horrible mess".

I've thought of two ways to address this. The easy way is to move pattern expansion out of FileArgument.getFiles() and put it into the "redirecting" command interpreter ... where it belongs (IMO).

The other alternative is to try to make FileArgument (and similar) aware of how the original arguments were formed by the interpreter; i.e. whether pathname expansion would have been performed, which if any characters in the argument were quoted / escaped. This will be complicated, and it is not immediately obvious that it can be made to work without user visible anomalies.
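
As a rough idea of what this second alternative might involve, the sketch below keeps a per-character "was quoted" mask alongside each argument, so that only unquoted '*' and '?' act as wildcards. Everything here is hypothetical (the MaskedToken class, its methods, and the choice of regex translation); nothing like it exists in JNode as far as this thread says.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical argument token that remembers which characters were quoted or escaped. */
final class MaskedToken {
    final String text;
    final boolean[] quoted; // quoted[i] is true if text.charAt(i) was quoted or escaped

    MaskedToken(String text, boolean[] quoted) {
        this.text = text;
        this.quoted = quoted;
    }

    /** Parses one shell word, handling single quotes and backslash escapes only. */
    static MaskedToken parse(String word) {
        StringBuilder text = new StringBuilder();
        List<Boolean> mask = new ArrayList<>();
        boolean inQuotes = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '\\' && i + 1 < word.length()) {
                text.append(word.charAt(++i));
                mask.add(true);
            } else {
                text.append(c);
                mask.add(inQuotes);
            }
        }
        boolean[] quoted = new boolean[mask.size()];
        for (int i = 0; i < quoted.length; i++) {
            quoted[i] = mask.get(i);
        }
        return new MaskedToken(text.toString(), quoted);
    }

    /** True if the token contains a '*' or '?' that was neither quoted nor escaped. */
    boolean hasActiveWildcard() {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!quoted[i] && (c == '*' || c == '?')) {
                return true;
            }
        }
        return false;
    }

    /** Builds a regex in which only unquoted '*' and '?' act as wildcards. */
    Pattern toPattern() {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!quoted[i] && c == '*') {
                regex.append("[^/]*");
            } else if (!quoted[i] && c == '?') {
                regex.append("[^/]");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }
}
```

Even in this small form the mask has to travel with the argument through every layer between the interpreter and the command, which hints at why this route is the more complicated one.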

Comments? Ideas?

Some thought

Submitted by Peter on Thu, 11/22/2007 - 22:52.

Hi Stephen,

perhaps I have some useful comments for this "horrible mess" :-)
I'm not sure in the end how to do it, but I don't think we should move the expansion into the interpreter, or at least not completely. First of all, I like the principles of the Argument handling and I think the FileArgument fits very well into it. At the moment it is very basic, but just to give an impression of what could possibly come next: jtar will get all tar files when you call "jtar *" (no .java files, ...). Further, we could do advanced path-level handling. In Linux I know about "/**/*.foo"; it might be my ignorance, but I don't know a way of doing recursive handling (i.e. the expression wouldn't return /a/b/c.foo).
The second reason is that with FileArgument handling the application can decide itself if it wants the expanded or non-expanded version of the argument (and for sure we should fix any app that uses it incorrectly).

I'm not sure about the details, but at the moment we already differentiate between Command.execute and App.main. So could it be a solution to do the expansion for "normal apps" and the "normal" Argument handling for JNode commands?

About your dclcopy: If it is a JNode Command, for that command you would probably use .getValue() instead. So the user doesn't need to escape *.foo in order to get what he expects.

Re: your thoughts ... continued

Submitted by Stephen Crawley on Sun, 11/25/2007 - 04:33.

Re: "jtar *" matching only tar files.

I suppose one could do that, but I don't think that it would improve usability. On the contrary, the new user experience would be like that of someone trying to learn a new "natural" language like English; lots of irregular verbs to learn and a bazillion strange idioms.

Re: "recursive" patterns in UNIX.

UNIX has historically taken the view that wildcards should not match a "/". Instead, you typically use a utility like "find" if you want to perform some action on all files in a tree. For example, "rm `find . -name \*.java`" or "find . -name \*.java | xargs rm" will remove all ".java" files in a tree. (The first version fails if there are too many files ... the second does not.)
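
For readers more at home in Java than in UNIX, the selection part of "find . -name \*.java" can be approximated with java.nio as sketched below. The class name is invented, and the sketch only lists the matching files rather than removing them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Rough Java counterpart of "find ROOT -type f -name GLOB". */
final class FindByName {

    /** Returns all regular files under 'root' whose file name matches the glob, e.g. "*.java". */
    static List<Path> find(Path root, String nameGlob) throws IOException {
        PathMatcher matcher = root.getFileSystem().getPathMatcher("glob:" + nameGlob);
        try (Stream<Path> tree = Files.walk(root)) {
            return tree.filter(Files::isRegularFile)
                       // Match against the last path element only, as "-name" does.
                       .filter(p -> matcher.matches(p.getFileName()))
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

The recursion lives in the tree walk, not in the pattern, which is the division of labour described above.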

But actually, this is NOT an argument for putting file expansion into FileArgument. If we decided to implement a richer (or just different) wildcard "language", we could do this in the shell / interpreter. Indeed it is not unknown for (UNIX) shells to have different wildcard languages. For example, the classic "sh" did not support "[...]", "{,..}" or "~". These expansions were first introduced in "csh".

However, I would suggest that a wildcard syntax that supports recursion is going to be hard to understand and error prone for the average user. The UNIX "find" approach seems to me to be a better idea.

Re: apps "deciding" whether to use the expanded or non-expanded version.

The decision may actually be rather complicated for the app.
The decision making process is not reflected in the syntax that "help" displays to the user.
The decision making process could easily be inconsistent across different apps, leading to confusion, etc. Bear in mind that in the long term we won't be in a position to correct apps that "do the wrong thing" ... because some of them will be third-party code.
Command completion will not be aware of how the application is going to treat wildcard characters, so it cannot complete a filename containing wildcards.

Re: differentiating between Command.execute and Command.main.

Actually, the three existing interpreters are currently NOT aware of the different entry points. The resolution of class entry points occurs in the invokers. The three existing invokers try to hide the differences between the two entry points. Unfortunately, the fact that System.* streams are global means that stream redirection cannot work for some invokers using the Command.main entry point.

Another point is that the Command.main entry point for many JNode commands is just a wrapper for the Command.execute entry point. Thus it is incorrect to assume that the Command.main entry point will not use FileArgument.

Re: "dclcopy".

I think you missed my point. My point was that "dclcopy" is in fact implementable in the context of a UNIX shell. A better example would be the (real) UNIX "find" and "grep" commands where it is common practice to escape wildcard (and other) characters to avoid premature expansion by the shell. This is no problem for an experienced UNIX user.

Re: your thoughts

Submitted by Stephen Crawley on Sun, 11/25/2007 - 03:18.

Taking your points one at a time:

I agree that the Info/Syntax/Parameter/Argument classes are good in principle. Specifically, I see the following really good points:

having a uniform way of expressing the command syntax,
doing argument parsing and completion based on the syntax,
providing the argument -> value bindings to the command as a simple data structure, and
providing command help based on the syntax.

However there are a number of important issues:

The whole argument syntax, parsing and completion is under-specified (for application writers) and under documented (for end users).
The syntax system has a problem with ambiguity. It is easy to define a command's syntax so that there are multiple ways to parse it. For example 'cat -u ftp://foo' matches two different syntaxes.
There are two ways to represent "flag" options, which complicates parsing.
Conversely, there are lots of things that the syntax system cannot express. For example, it cannot express:
- multiple options, where the order doesn't matter; e.g. "cp -f -r ..." should mean the same thing as "cp -r -f ..."
- combining multiple flag options; e.g. "cp -fr ..." should be equivalent to "cp -f -r ...".
- short and long forms of option; e.g. "cp -r ..." could be equivalent to "cp --recursive ..."
- it wouldn't be able to cope with the command line syntax of (say) "cvs" or "svn" or old-style "tar".
The syntax system follows the "old fashioned" model of each application defining its own command line syntax. Maybe it would be better if the syntax was defined separately from the app; e.g. in an XML file. This would allow:
- users to tailor an application's syntax to their own preferences,
- language localization of application syntaxes (and help information),
- shell-specific syntax variants; e.g. a POSIX shell script would expect the "cp" command syntax to be compatible with the POSIX specification of "cp",
- support for different concrete meta-syntaxes; e.g. in the "foo" shell, we might want to run the "cp" command as "someDir := cp /r file1, file2".
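
For reference, the option behaviour that the syntax system reportedly cannot express (order-insensitive flags, combined flags, and long forms) is easy to state in code. The sketch below is not a proposal for the Syntax classes; FlagNormalizer and its table of long forms are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative option normalizer: "-f -r", "-r -f", "-fr" and "-f --recursive" all agree. */
final class FlagNormalizer {

    /**
     * Collects the leading option arguments into a set of single-letter flags.
     * "-fr" is split into 'f' and 'r'; "--name" is looked up in longForms.
     */
    static Set<Character> parseFlags(List<String> args, Map<String, Character> longForms) {
        Set<Character> flags = new TreeSet<>();
        for (String arg : args) {
            if (arg.startsWith("--")) {
                Character flag = longForms.get(arg);
                if (flag == null) {
                    throw new IllegalArgumentException("Unknown option: " + arg);
                }
                flags.add(flag);
            } else if (arg.startsWith("-") && arg.length() > 1) {
                for (char c : arg.substring(1).toCharArray()) {
                    flags.add(c);
                }
            } else {
                break; // the first non-option argument ends the options
            }
        }
        return flags;
    }
}
```

Because the result is a set, the order and grouping of the flags on the command line no longer matter to the command.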

Next point:

While FileArgument fits into the Syntax/Parameter/Argument model, when you add wildcards, this starts to break down. In particular:

The user now needs to know if each individual application is going to:
- treat wildcard characters in a given argument as literal characters,
- expand them in the 'normal' way, or
- treat them in an application specific way.
By contrast, a UNIX shell does wildcard (and other) expansions the same way, irrespective of the command. This is much easier for users to understand.
The application may need to know what the shell has done with respect to expansion; e.g. to avoid double expansion. Coding an application to be aware of the shell behavior is a bad idea, not least because it is a burden on the application writer.
Conversely, the shell may need to know what the application will do so that it can leave escapes in place. For example, we would want "cp a\* b" to copy a file named "a*" to "b". But for this to work, the shell needs to know that it should not strip out the backslash.
Classic Java applications do not use the Syntax/Parameter/Argument classes, and therefore cannot rely on them to do wildcard expansion. But the user probably does need and expect wildcard expansion to be done.
If wildcard expansion is sometimes done by FileArgument and sometimes by the shell, we are open to all sorts of anomalous behavior. For example, suppose that the syntax for "cp" is "cp <FileName> ... <DirName>". Now consider running "cp a *b" in a directory containing files "a" and "b" and a subdirectory "dirb". Wildcard expansion by the shell will bind "a" and "b" to the first parameter and "dirb" to the second one. But wildcard expansion by the Argument objects will either give a command syntax error (since the second parameter is single-valued) or an application error (because "*b" is not an existing directory).
We haven't even started to think how the default JNode shell might handle things like variable expansion, command expansion (backticks), expansion in "here" documents, and so on. How is this going to interact with wildcard expansion by FileArgument and friends?
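
The "cp a *b" anomaly from the list above can be walked through in a few lines. This is a simulation under stated assumptions: the directory is an in-memory list, the binding rule is simply "last word is <DirName>", and the argument-side variant models only the syntax-error outcome. None of the names come from JNode.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulates binding "cp <FileName> ... <DirName>" under the two expansion orders. */
final class CpBinding {

    /** Expands one glob against a listing; an unmatched pattern is passed through as-is. */
    static List<String> expand(String pattern, List<String> directory) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matches = new ArrayList<>();
        for (String name : directory) {
            if (matcher.matches(Paths.get(name))) {
                matches.add(name);
            }
        }
        return matches.isEmpty() ? Arrays.asList(pattern) : matches;
    }

    /** Shell-side: expand every word first, then bind the last word to <DirName>. */
    static String bindAfterShellExpansion(List<String> args, List<String> directory) {
        List<String> words = new ArrayList<>();
        for (String arg : args) {
            words.addAll(expand(arg, directory));
        }
        String dir = words.remove(words.size() - 1);
        return "files=" + words + " dir=" + dir;
    }

    /** Argument-side: bind words to parameters first, then let each parameter expand. */
    static String bindThenExpand(List<String> args, List<String> directory) {
        List<String> files = new ArrayList<>();
        for (String arg : args.subList(0, args.size() - 1)) {
            files.addAll(expand(arg, directory));
        }
        List<String> dirs = expand(args.get(args.size() - 1), directory);
        if (dirs.size() != 1) {
            // Assumed behaviour: a single-valued parameter rejects multiple values.
            throw new IllegalStateException("<DirName> is single-valued but got " + dirs);
        }
        return "files=" + files + " dir=" + dirs.get(0);
    }

    public static void main(String[] args) {
        List<String> dir = Arrays.asList("a", "b", "dirb");
        List<String> cmd = Arrays.asList("a", "*b");
        System.out.println(bindAfterShellExpansion(cmd, dir)); // prints files=[a, b] dir=dirb
        try {
            bindThenExpand(cmd, dir);
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```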

To be continued ...

Couldn't the interpreter

Submitted by Matthias on Sat, 11/24/2007 - 12:29.

Couldn't the interpreter determine whether the command expects expanded or non-expanded files? That way I can make a version of cat, dir or whatever that's simple, and later extend it to also make use of more advanced features. This obviously has an analogy to calling functions. If the interpreter can't find a version of the command that accepts unexpanded pathnames, it could expand it for him and call the function on each pathname.

The problem is ...

Submitted by Stephen Crawley on Sun, 11/25/2007 - 04:41.

... that the POSIX shell spec requires that wildcard expansion is done BEFORE the shell decides which command to execute. It usually doesn't make any difference, but it is not difficult to come up with examples where it does.

Well, as long as it is

Submitted by Peter on Sat, 11/24/2007 - 13:15.

Well, as long as it is _our_ version of cat, dir or whatever we can use the FileArgument anyway. So the command can decide itself if it wants the pattern (.getValue()) or the expanded files (.getFiles()).

We also have an easy way to determine if a command is a JNode command (it extends Command interface) or a standard application (we need to call main(..)).

The problem is in normal java apps. Should an app get the expanded file list or see the pattern? What if "foo" expects a pattern but "bar" a list of files? ...

I don't have an answer to these questions and I see that it isn't that easy to solve, but what I know, and what I wanted to express in my first post is, that I like the principles of our Syntax handling and our FileArgument and that I wouldn't drop it.

BTW, I just remembered a discussion with levente; perhaps he wants to comment on that too. Afair we said that it would make sense to have a syntax description for non-JNode commands too. E.g. if we include an external app, our descriptor might include a BNF string to describe the command's syntax. It wouldn't be an advantage for the command but for the user, because he gets nice tab completion for non-JNode apps as well.
So perhaps it could really be an idea to handle it via syntax too (i.e. we don't expand except it is noted in the BNF syntax string of the command's descriptor)?

BNF descriptions of syntax

Submitted by Stephen Crawley on Sun, 11/25/2007 - 05:02.

(I've addressed most of your points in other posts ...)

Real BNF is a bit clunky; e.g. you can only express repetition through recursion. EBNF is a bit better, because it supports optional and repeated groups on the RHS of a production. Unfortunately both BNF and EBNF are unfamiliar to the majority of programmers, and they are not suitable for end users.

Other alternatives are XML, a formalization of the meta-syntax used in UNIX manual entries, or a formalization of the meta-syntax output by the JNode "help" command.

To my mind, the key issue is not the language (meta-syntax) we use for expressing syntaxes. Rather, the big thing that we need to do is to get (concrete) command syntax specifications out of the application classes. The application should simply provide a set of abstract parameters and the expected type and multiplicity of their values. If an application expects a parameter to be a filename pattern, the parameter type should be set accordingly.
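
A very small sketch of that separation, with all names invented: the application side declares only abstract parameters (name, type, multiplicity), while a table held outside the application maps concrete spellings to those names. Two such tables, one POSIX-like and one using "/r"-style switches, then drive the same command.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical split between abstract parameters and per-shell concrete syntax. */
final class SeparateSyntax {

    enum ParamType { FLAG, FILE, FILE_PATTERN }

    /** What the application declares: name, value type and multiplicity, but no spelling. */
    static final class ParamSpec {
        final String name;
        final ParamType type;
        final boolean multiple;

        ParamSpec(String name, ParamType type, boolean multiple) {
            this.name = name;
            this.type = type;
            this.multiple = multiple;
        }
    }

    /** Abstract parameters of an imaginary "cp" command. */
    static final List<ParamSpec> CP_PARAMS = Arrays.asList(
            new ParamSpec("recursive", ParamType.FLAG, false),
            new ParamSpec("force", ParamType.FLAG, false),
            new ParamSpec("source", ParamType.FILE, true),
            new ParamSpec("target", ParamType.FILE, false));

    /** True if the command declares a FLAG parameter with this abstract name. */
    static boolean isFlag(String name) {
        for (ParamSpec spec : CP_PARAMS) {
            if (spec.type == ParamType.FLAG && spec.name.equals(name)) {
                return true;
            }
        }
        return false;
    }

    /** Maps concrete option spellings to abstract flag names using an external table. */
    static Set<String> flagsSet(List<String> tokens, Map<String, String> concreteSyntax) {
        Set<String> set = new TreeSet<>();
        for (String token : tokens) {
            String name = concreteSyntax.get(token);
            if (name != null && isFlag(name)) {
                set.add(name);
            }
        }
        return set;
    }
}
```

In a real system the table would presumably be loaded from a file (XML or otherwise) rather than built in code; that part is left out here.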

Take a look at Powershell

Submitted by Bluebit on Tue, 11/27/2007 - 12:08.

Powershell uses attributes (annotation in java) to define what input is expected by a command (called cmdlets in Powershell).

Take a look here:
http://msdn2.microsoft.com/en-us/library/ms714433.aspx

Comments on Powershell

Submitted by Stephen Crawley on Wed, 11/28/2007 - 02:17.

The potentially worthwhile thing that Powershell adds (relative to the current JNode model) is use of annotations on get/set methods to hook them up with the command argument parsing scheme.

Interesting, but I think I prefer the JNode approach of accessing parameter values via a binding object. Certainly, using attributes (i.e. Java annotations) won't help in the (my) goal of separating the concrete command syntax(es) from the application code.

Powershell's 'ValueFromPipeline...' attributes are nasty and ad hoc IMO. The rest of the attributes are more or less mirrored by existing JNode Syntax functionality.

Command and annotations

Submitted by Bluebit on Wed, 11/28/2007 - 12:47.

I think that annotations help in the goal of separating the concrete command syntax from the application code. No parsing is needed in the application code when you mark the setter methods with information about what the command expects as input. The shell should then try to parse the commandline and match that with the annotated setters in the command. This would make the command syntax agnostic.

Example:

public class TestCommand extends AbstractCommand
{
    public enum Option1Enum {a, b, c};
    public enum Option2Enum {x, y, z};

    private Option1Enum option1;
    private Option2Enum option2;

    @Parameter
    public void setOption1(Option1Enum selectedOption)
    {
        option1 = selectedOption;
    }

    @Parameter
    public void setOption2(Option2Enum selectedOption)
    {
        option2 = selectedOption;
    }

    public void execute() {
        ...
    }

}

And you could then write this in the commandline:

>testcommand -a -y

or the shell could accept this variant:

>testcommand -ay

it depends only on the shell!

If the command needs a list of files it could look like this:

    private File[] filesToCopy;

    @Parameter
    public void setFilesParameter(File[] files)
    {
        filesToCopy = files;
    }

In the commandline:

>TestCommand *.jar

The shell interprets the pattern and gives the command the list of files matching the pattern. Again it depends on the shell if it wants to accept that kind of pattern.

If a command wants to interpret a pattern on its own you could do something like this:

    @Parameter
    public void setFilesParameter(String filesPattern)
    {
        // interpret filesPattern
    }

It would be the same in the commandline:

>TestCommand *.jar

Annotations do not provide the separation we need

Submitted by Stephen Crawley on Wed, 11/28/2007 - 18:04.

Annotations allow the application programmer to define command line syntax at a higher level of abstraction. This has clear advantages for the application programmer ... versus the classic "roll your own" approach to command line parsing.

However, the definition of the command line syntax remains in the source code ... i.e. in the FooCommand.java file. The syntax cannot be altered without altering the source code, and that impacts other users / uses of the command.

For our purposes, the concrete command syntax specification needs to be separate from the code file for reasons previously stated. I say "needs" advisedly, because without the ability to tailor the command line syntax, some things we are trying to do become a lot more difficult. Specifically, creating a UNIX-like shell and command set for JNode.

Another example

Submitted by Bluebit on Wed, 11/28/2007 - 19:05.

I have a command that accepts a date as the first parameter.

The command would look something like this:

public class FooDateCommand extends AbstractCommand
{
    private DateTime date;

    @Parameter
    public void setDate(DateTime date)
    {
        this.date = date;
    }

    ...

}

One shell only accepts this date format (yyyy-mm-dd) so the command should be called in this way:

>FooDateCommand 2007-11-27

Another shell only accepts this date format (dd-mm-yyyy) so the commandline in this shell looks like this:

>FooDateCommand 27-11-2007

A third shell is maybe very smart and knows what format the user prefers and uses that.

Does it matter for the command? No as long as it gets a DateTime object it's fine.
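
The shell-side conversion described here can be sketched with java.time. LocalDate stands in for the DateTime type used in the example, and the class and method names are made up; the point is only that two different textual formats end up as the same value before the command is called.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Each "shell" applies its own date pattern; the command only ever sees a LocalDate. */
final class ShellDateFormats {

    /** Converts the user's text using the shell's own pattern, e.g. "uuuu-MM-dd". */
    static LocalDate parseForShell(String text, String shellPattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(shellPattern));
    }

    public static void main(String[] args) {
        LocalDate fromFirstShell = parseForShell("2007-11-27", "uuuu-MM-dd");
        LocalDate fromSecondShell = parseForShell("27-11-2007", "dd-MM-uuuu");
        System.out.println(fromFirstShell.equals(fromSecondShell)); // prints true
    }
}
```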

Annotations do provide separation!

Submitted by Bluebit on Wed, 11/28/2007 - 18:22.

The command programmer only defines what input the command takes, he doesn't define the syntax.

The shell should use reflection to look at the command and then try to match what the user is writing on the commandline with the parameters that the command accepts.
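
A minimal, self-contained sketch of that reflection step follows. It assumes a home-made @Parameter annotation with runtime retention and a binding rule of "a flag -x selects the setter whose enum type has a constant named x"; both are illustrative choices, not an existing JNode or Powershell API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

/** Sketch of a shell binding command-line flags to annotated setters via reflection. */
final class AnnotationBinding {

    /** Hypothetical marker annotation; must be retained at runtime to be visible here. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Parameter {}

    enum Option1 { a, b, c }
    enum Option2 { x, y, z }

    /** Cut-down version of the TestCommand example, without a base class. */
    static final class TestCommand {
        Option1 option1;
        Option2 option2;

        @Parameter
        public void setOption1(Option1 selected) { option1 = selected; }

        @Parameter
        public void setOption2(Option2 selected) { option2 = selected; }
    }

    /** Binds flags such as "-a" to the @Parameter setter whose enum type declares that constant. */
    static void bind(Object command, String... flags) throws ReflectiveOperationException {
        for (String flag : flags) {
            String name = flag.substring(1); // drop the leading '-'
            boolean bound = false;
            for (Method m : command.getClass().getMethods()) {
                if (!m.isAnnotationPresent(Parameter.class) || m.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = m.getParameterTypes()[0];
                if (!type.isEnum()) {
                    continue;
                }
                for (Object constant : type.getEnumConstants()) {
                    if (((Enum<?>) constant).name().equals(name)) {
                        m.invoke(command, constant);
                        bound = true;
                    }
                }
            }
            if (!bound) {
                throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        TestCommand cmd = new TestCommand();
        bind(cmd, "-a", "-y");
        System.out.println(cmd.option1 + " " + cmd.option2); // prints a y
    }
}
```

Accepting "-ay" as well would be a change to bind() only, which is the sense in which the spelling is left to the shell.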

Do you know if Annotations work on JNode yet?

Submitted by Stephen Crawley on Wed, 11/28/2007 - 19:25.

Have you tried them?

Yes it should work

Submitted by Bluebit on Wed, 11/28/2007 - 20:14.

According to this http://www.jnode.org/node/638 it should work.
