3. Command snippet model¶
3.1. General Requirements¶
Target Shell
The command snippets handled by cmdse tools should be written in the bash and POSIX shell dialects.
Command Snippet Encoding
Command snippet text should be stored, processed and transported in US-ASCII encoding.
Command Snippet Metadata Encoding
The command snippet metadata text fields should be stored, processed and transported in US-ASCII encoding.
3.2. Command snippet requirements¶
Valid Command Snippet
A command snippet text must have the following characteristics to be valid:
- Must be a sequence of US-ASCII character literals
- Must be valid in the POSIX / bash v4 shell dialect
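As a rough illustration of the two validity conditions above, the following sketch checks a snippet from the shell itself. The function names `is_ascii` and `is_valid_snippet` are hypothetical and not part of cmdse. The syntax check relies on `bash -n`, which reads commands and reports syntax errors without executing them; it assumes a `bash` executable is available on the `PATH` and only covers bash syntax, not strict POSIX conformance.

```shell
#!/bin/sh
# Hypothetical helpers, not part of cmdse.

# Succeeds when the text contains only US-ASCII bytes (0-127).
is_ascii() {
  # Delete every ASCII byte and count what is left over.
  leftover=$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c)
  [ "$((leftover))" -eq 0 ]
}

# Succeeds when the snippet is ASCII-only and parses as bash.
is_valid_snippet() {
  is_ascii "$1" || return 1
  # -n: read commands and check their syntax, but do not execute them.
  printf '%s\n' "$1" | bash -n 2>/dev/null
}

for s in 'ls -la /usr/bin' 'echo "unterminated'; do
  if is_valid_snippet "$s"; then
    echo "valid:   $s"
  else
    echo "invalid: $s"
  fi
done
```

Because the snippet is only parsed and never executed, this kind of check is safe to run on untrusted input.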
Command Parameters
A command parameter is denoted with the POSIX shell positional parameter[5] syntax: $1 .. $9 or their in-braces equivalent, such as ${1}.
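A minimal sketch of how these parameters could be spotted in a snippet, assuming a `grep` that supports the common (but non-POSIX) `-o` option. The function name `list_parameters` is hypothetical, and the match is purely textual: it does not account for quoting, so a `$1` inside single quotes would be reported too.

```shell
# Hypothetical helper: print each positional parameter reference,
# written either as $1..$9 or as ${1}..${9}, one per line.
list_parameters() {
  printf '%s\n' "$1" | grep -oE '[$]([1-9]|[{][1-9][}])'
}

list_parameters 'cp -- "$1" "${2}"'
```

For the snippet above, the two references `$1` and `${2}` are printed on separate lines.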
3.3. Shell processing workflow¶
The way cmdse tools handle snippets and extract a great deal of information for the end-user requires a good understanding of how unix shells process text into commands.
Note
Some paragraphs of this section are greedily copied from The Bourne-Again Shell in “The Architecture of Open Source Applications”.
The first stage of the processing pipeline is input processing (Fig. 3.1): taking characters from the terminal or a file, breaking them into lines, and passing the lines to the shell parser to transform into commands. The lines are sequences of characters terminated by newlines[2]. Such a shell-processable line is referred to as a command.
The second step is parsing. The initial job of the parsing engine is lexical analysis: to separate the stream of characters into words and apply meaning to the result. The word is the basic unit on which the parser operates. Words are sequences of characters separated by metacharacters, which include simple separators like spaces and tabs, or characters that are special to the shell language, like semicolons and ampersands.
The lexical analyzer takes lines of input, breaks them into tokens at metacharacters, identifies the tokens based on context, and passes them on to the parser to be assembled into statements and commands. There is a lot of context involved—for instance, the word for can be a reserved word, an identifier, part of an assignment statement, or other word, and the following is a perfectly valid command:
for for in for; do for=for; done; echo $for
that displays for.
The parser encodes a certain amount of state and shares it with the analyzer to allow the sort of context-dependent analysis the grammar requires. For example, the lexical analyzer categorizes words according to the token type: reserved word (in the appropriate context), word, assignment statement, and so on. In order to do this, the parser has to tell it something about how far it has progressed parsing a command, whether it is processing a multiline string (sometimes called a “here-document”), whether it’s in a case statement or a conditional command, or whether it is processing an extended shell pattern or compound assignment statement.
Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (parse_comsub). This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it’s in a case statement); it took a while to get that right. When expanding a command substitution during word expansion, bash uses the parser to find the correct end of the construct, that is, a right parenthesis.
The parser returns a single C structure representing a command (which, in the case of compound commands like loops, may include other commands in turn) and passes it to the next stage of the shell’s operation: word expansion. The command structure is composed of command objects and lists of words.
Word expansions (Fig. 3.2) are done in a peculiar order, with the last step allowing four expansions to run in parallel. As previously mentioned, command (and process) substitution requires the shell to use the parser and execute the corresponding command in a subshell, using its output to replace the text previously occupied by the construct. This contributes to the intertwined steps and context-dependent analysis involved in shell text processing.
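Two consequences of this processing can be observed directly in bash. The sketch below shows that brace expansion runs before parameter expansion, and that a command substitution runs in a subshell whose variable assignments do not leak back. The variable names are arbitrary.

```shell
# Brace expansion happens before parameter expansion, so {1..$n} is not
# recognized as a sequence and is left as a literal brace expression.
n=3
echo {1..$n}        # prints {1..3} in bash, not "1 2 3"

# Command substitution runs in a subshell: the assignment to x inside
# $(...) does not affect the x of the calling shell.
x=outer
y=$(x=inner; echo "$x")
echo "$x $y"        # prints "outer inner"
```

Neither result can be predicted from the text of the command alone without knowing the expansion order, which is why static analysis has to be conservative.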
3.4. Call expression structure¶
Note
See Section 4 for details on how cmdse tools should parse call expressions.
cmdse tooling will provide a static analysis of given snippets to infer some understanding of the invoked utility executables and their arguments. Given the dynamic nature of unix shell input processing and the context-dependent syntax analysis involved (Section 3.3), there is no guarantee of a perfect match between the information gathered during static analysis and the effective runtime invocations. The “unit of work” used to isolate such runtime invocations is referred to as a call expression. A call expression is a section of the command snippet close to the definition of a bash simple command [1]. Here is a classic example:
ls -la /usr/bin
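To make the notion concrete, here is a deliberately naive sketch that splits the example above into its command identifier and its arguments. The function name `describe_call` is hypothetical, and splitting on whitespace ignores quoting and expansions, so this is only meaningful for trivial expressions like this one.

```shell
# Hypothetical helper: print the command identifier and the arguments
# of a trivial call expression given as a single string.
describe_call() {
  set -f          # disable pathname expansion while splitting
  set -- $1       # naive split on whitespace (IFS); ignores quoting
  set +f
  [ "$#" -gt 0 ] || return 1
  printf 'command identifier: %s\n' "$1"
  shift
  for arg in "$@"; do
    printf 'argument: %s\n' "$arg"
  done
}

describe_call 'ls -la /usr/bin'
```

This reports `ls` as the command identifier, followed by the arguments `-la` and `/usr/bin`.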
3.4.1. Static call expressions¶
When a “context-free” situation is met, the call expression is considered “static”. The elements of such a static call expression are identified after static expansion, that is, after static variable expansions have been performed. A rudimentary formal definition, given a context-free situation, is provided in the listing below (Listing 3.1).
COMMAND-IDENTIFIER = (ALPHA / DIGIT) *(ALPHA / DIGIT / HYPHEN / UNDERSCORE)
ARGUMENT = WORD
CALL-EXPRESSION = *(ASSIGNMENT) COMMAND-IDENTIFIER *(ARGUMENT)
Note
See the Shared definitions for ABNF grammars and Bash V2 grammar documents for the definitions of the tokens this listing depends on.
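The COMMAND-IDENTIFIER rule can be approximated with a shell pattern. The sketch below assumes that ALPHA and DIGIT have their usual ABNF meaning (A-Z, a-z, 0-9) and that HYPHEN and UNDERSCORE denote `-` and `_`; the actual definitions live in the shared grammar documents, and the function name `is_command_identifier` is hypothetical.

```shell
# Hypothetical helper: succeeds when $1 matches COMMAND-IDENTIFIER.
# The function body is a subshell so that LC_ALL does not leak out.
is_command_identifier() (
  LC_ALL=C        # make the A-Z, a-z and 0-9 ranges byte-based
  case $1 in
    # Reject: empty string, bad first character, bad character anywhere.
    '' | [!A-Za-z0-9]* | *[!A-Za-z0-9_-]*) exit 1 ;;
    *) exit 0 ;;
  esac
)

for id in ls my_tool-2 -rf a/b; do
  if is_command_identifier "$id"; then
    echo "identifier:     $id"
  else
    echo "not identifier: $id"
  fi
done
```

Note that `-rf` is rejected because the first character must be a letter or a digit, while `my_tool-2` is accepted.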
To qualify as “static”, a call expression must meet the following constraints:
- the command identifier is not the result of word expansion, unless after a double-dash[3]
- expanding variables and positional parameters are double-quoted to be isolated as a single argument, unless after a double-dash[3]
- command substitutions are double-quoted to be isolated as a single argument, unless after a double-dash[3]
- tilde and path expansions are allowed
- variable expansions can be unquoted for a list of options for example, but a static assignment must be provided in the snippet
An assignment is considered static if it follows these constraints:
- it is not part of a call expression
- it is not embedded in a subshell, such as command or process substitution
- variable and positional parameter expansions are double-quoted
Examples:
# OK, positional parameter quoted
echo "$1"
# Not OK, positional parameter unquoted
echo $1
# OK, positional parameter unquoted after double-dash
grep -- -v $1
# OK, options are unquoted but expanded to a static assignment
MY_OPTS="--summarize --human-readable"
du $MY_OPTS "$1"
# OK
# - options are unquoted but expanded to a static assignment
# - positional parameter unquoted but after double-dash
DU_OPTS="--summarize --human-readable"
du $DU_OPTS -- $1
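The reason behind the quoting constraints can be observed at runtime: an unquoted expansion undergoes word splitting, so the number of arguments received by the command depends on a value that static analysis cannot know. A small sketch, where `count_args` is a hypothetical helper:

```shell
# Hypothetical helper: print the number of arguments it receives.
count_args() { echo "$#"; }

set -- "two words"      # simulate a snippet invoked with $1 = "two words"
count_args "$1"         # prints 1: quoted, always a single argument
count_args $1           # prints 2: unquoted, split into two words

# After a double-dash, arguments are operands whatever they look like:
# here -v is the pattern to search for, not the "invert match" option.
printf '%s\n' '-v' 'other' | grep -c -- -v    # prints 1
```

This is why an unquoted expansion is tolerated after a double-dash: however the value is split, every resulting word is an operand.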
3.4.2. Command identifier¶
A command identifier will ultimately be resolved to a builtin command or a utility name. Within the unix system, the mapping between the command identifier and the utility executable is bijective: there is exactly one executable that can be matched from its identifier and, reciprocally, exactly one identifier that can be matched from an executable[4].
However, from the cmdse perspective, the association must be made with a loosely defined utility interface model and is therefore non-bijective. First, because multiple programs can hold the same utility name. Second, because this mapping is done in the context of analysing a static call expression, and the association will be considered valid for a particular version range of the program supporting some set of options.
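The first-match rule described in footnote [4] can be observed with `command -v`. The sketch below builds two throwaway directories that both contain a hypothetical utility named `hello`, and assumes `mktemp -d` is available (it is widespread but not specified by POSIX).

```shell
# Two throwaway directories, each holding an executable named "hello".
dir_a=$(mktemp -d)
dir_b=$(mktemp -d)
for d in "$dir_a" "$dir_b"; do
  printf '#!/bin/sh\necho "hello from %s"\n' "$d" > "$d/hello"
  chmod +x "$d/hello"
done

# The shell resolves the identifier to the first match along PATH.
# The subshell keeps the modified PATH from leaking into this script.
resolved=$(PATH="$dir_a:$dir_b:$PATH"; command -v hello)
echo "$resolved"        # the copy located under $dir_a

rm -rf "$dir_a" "$dir_b"
```

Both files share the same utility name, yet only one of them is reachable through the identifier at a given time, which illustrates why cmdse cannot assume a single program behind a name.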
3.4.3. Arguments¶
Arguments are words following the command identifier. Discriminating between option expressions and operands and giving semantics to each argument is a central aspect of cmdse to fulfill its pedagogical goal.
3.4.3.1. Option expressions¶
Option expressions resolve to option assignments passed to the program. There is a great variety of possible expressions, see Section 2.2.1.
3.5. Analytic model¶
to be written
3.6. Forks¶
to be written
3.6.1. Sub-command snippet¶
to be written
3.6.2. Command snippet variant¶
to be written
3.6.3. Alias¶
to be written
[1] | bash(1) |
[2] | Four exceptions: a line ending with the escape character \ is continued on the next line; here-documents are read across multiple lines until the provided WORD is matched; compound commands such as the for construct may span multiple lines, which requires some look-ahead line processing before execution; finally, the semicolon ; metacharacter is interpreted as a line delimiter. |
[3] | (1, 2, 3) In a great number of bash builtin commands and unix programs, the double-dash -- signals that every following argument should be treated as an operand. This behavior is implemented by the GNU getopt(3) function, whose documentation states that “the special argument ‘--’ forces an end of option-scanning”. |
[4] | The shell resolves the utility name to the first matching executable found while iterating over the path expressions held in the PATH variable. This executable should therefore be considered the one and only valid executable. |
[5] | See POSIX.1-2008, sec. 2.5.1 |