3. Command snippet model

3.1. General Requirements

Target Shell

Command snippets handled by cmdse tools should be written in the bash or POSIX shell dialects.

Command Snippet Encoding

Command snippet text should be stored, processed, and transported in the US-ASCII encoding.

Command Snippet Metadata Encoding

Command snippet metadata text fields should be stored, processed, and transported in the US-ASCII encoding.

3.2. Command snippet requirements

Valid Command Snippet

A command snippet text must meet the following characteristics to be valid:

  • Must be a sequence of US-ASCII character literals
  • Must be valid POSIX / bash v4 shell dialect

Command Parameters

A command parameter is denoted with the POSIX-shell positional parameter[5] syntax: $1 .. $9, or the braced equivalent ${1} .. ${9}.
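For illustration, a snippet's parameters can be bound positionally when the snippet is executed, for example with bash -c (a sketch; the actual cmdse invocation mechanism may differ):

```shell
# Hypothetical snippet with two command parameters.
snippet='echo "archive=$1 source=$2"'

# With `bash -c`, the first word after the snippet string becomes $0,
# and the remaining words become $1, $2, ...
bash -c "$snippet" snippet backup.tar.gz ./src
# prints: archive=backup.tar.gz source=./src
```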

3.3. Shell processing workflow

The way cmdse tools handle snippets and extract a great deal of information for the end-user requires a good understanding of how unix shells process text into commands.

Note

Some paragraphs of this section are liberally copied from the chapter The Bourne-Again Shell in “The Architecture of Open Source Applications”.

The first stage of the processing pipeline is input processing (Fig. 3.1): taking characters from the terminal or a file, breaking them into lines, and passing the lines to the shell parser to transform into commands. The lines are sequences of characters terminated by newlines[2]. Such a shell-processable line is referred to as a command.
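The line-splitting exceptions listed in footnote [2] are easy to observe: a backslash-newline joins physical lines into one logical line, while a semicolon splits one physical line into several commands.

```shell
# Backslash-newline: two physical lines, one logical line.
echo one \
  two
# prints: one two

# Semicolon: one physical line, two commands.
echo three; echo four
```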

[Diagram: input processing (grab one line) → parsing (lexical analyser identifies tokens delimited by metacharacters; parser makes sense of tokens) → word expansion → word splitting → pathname expansion → command extraction → execution. Command substitution (parse_comsub) re-enters the parser during word expansion and execution.]

Fig. 3.1 Bash processing pipeline

The second step is parsing. The initial job of the parsing engine is lexical analysis: to separate the stream of characters into words and apply meaning to the result. The word is the basic unit on which the parser operates. Words are sequences of characters separated by metacharacters, which include simple separators like spaces and tabs, or characters that are special to the shell language, like semicolons and ampersands.

The lexical analyzer takes lines of input, breaks them into tokens at metacharacters, identifies the tokens based on context, and passes them on to the parser to be assembled into statements and commands. There is a lot of context involved—for instance, the word for can be a reserved word, an identifier, part of an assignment statement, or other word, and the following is a perfectly valid command:

for for in for; do for=for; done; echo $for

that displays for.

The parser encodes a certain amount of state and shares it with the analyzer to allow the sort of context-dependent analysis the grammar requires. For example, the lexical analyzer categorizes words according to the token type: reserved word (in the appropriate context), word, assignment statement, and so on. In order to do this, the parser has to tell it something about how far it has progressed parsing a command, whether it is processing a multiline string (sometimes called a “here-document”), whether it’s in a case statement or a conditional command, or whether it is processing an extended shell pattern or compound assignment statement.

Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (parse_comsub). This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it’s in a case statement); it took a while to get that right. When expanding a command substitution during word expansion, bash uses the parser to find the correct end of the construct, that is, the matching right parenthesis.
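The classic difficulty parse_comsub solves is an unbalanced right parenthesis inside the substitution, as in a case pattern (a bash example; some older or minimal shells may reject the unparenthesized pattern form):

```shell
# The `)` closing the case pattern does NOT terminate the
# command substitution: the parser knows it is inside a
# case statement when it sees it.
echo "$(case x in x) echo matched ;; esac)"
# prints: matched
```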

The parser returns a single C structure representing a command (which, in the case of compound commands like loops, may include other commands in turn) and passes it to the next stage of the shell’s operation: word expansion. The command structure is composed of command objects and lists of words.

[Diagram: brace expansion ({a,b,c}) → tilde expansion (~, ~+, ~-) → then, in parallel: parameter expansion ($PARAM, ${PARAM:...}), arithmetic expansion ($(( EXPRESSION )), $[ EXPRESSION ]), command substitution ($( COMMAND ), `COMMAND`), and process substitution (<(COMMAND)).]

Fig. 3.2 Bash word expansions order

Word expansions (Fig. 3.2) are done in a particular order, with the last step allowing four expansions to run in parallel. As previously mentioned, command (and process) substitution requires the shell to use the parser and execute the corresponding command in a subshell, using its output to replace the text previously occupied by the construct. This contributes to the intertwined steps and context-dependent analysis involved in shell text processing.
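The ordering is observable in practice: brace expansion runs before parameter expansion, so a parameter cannot feed a sequence expression (bash-specific behavior, since brace expansion is not POSIX):

```shell
n=3
# Brace expansion happens first and never sees the value of $n,
# so the sequence expression is left untouched; parameter
# expansion then runs on the literal text.
echo {1..$n}
# prints: {1..3}

# A comma list IS brace-expanded first; each result then
# undergoes parameter expansion.
echo {a,b}$n
# prints: a3 b3
```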

3.4. Call expression structure

Note

See Section 4 for details on how cmdse tools should parse call expressions.

cmdse tooling will provide static analysis of given snippets to infer some understanding of the invoked utility executables and their arguments. Given the dynamic nature of unix shell input processing and the context-dependent syntax analysis involved (Section 3.3), there is no guarantee of a perfect match between the information gathered during static analysis and the effective runtime invocations. The “unit of work” isolating such a runtime invocation is referred to as a call expression. A call expression is a section of the command snippet closely matching the definition of a bash simple command [1]. Here is a classic example:

ls -la /usr/bin

3.4.1. Static call expressions

When a “context-free” situation is met, the call expression is considered “static”. The identification of elements in such a static call expression is done after static expansion, that is, after static variable expansions are performed. A rudimentary formal definition, assuming a context-free situation, is provided in the figure below (Listing 3.1).

Listing 3.1 Static call expression formal ABNF syntax definition
COMMAND-IDENTIFIER       = (ALPHA / DIGIT) *(ALPHA / DIGIT / HYPHEN / UNDERSCORE)
ARGUMENT                 = WORD
CALL-EXPRESSION          = *(ASSIGNMENT) COMMAND-IDENTIFIER *(ARGUMENT)

Note

See the Shared definitions for ABNF grammars and Bash V2 grammar documents for the depending token definitions.
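As an illustration, the following line matches the production of Listing 3.1, with two assignments, one command identifier, and two arguments:

```shell
# *(ASSIGNMENT)  COMMAND-IDENTIFIER  *(ARGUMENT)
LC_ALL=C TZ=UTC  date                -u '+%Y'
# prints the current UTC year, e.g. 2024
```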

To qualify as “static”, a call expression must meet the following constraints:

  • the command identifier is not the result of word expansion, unless after a double-dash[3]
  • variable and positional parameter expansions are double-quoted so each is isolated as a single argument, unless after a double-dash[3]
  • command substitutions are double-quoted to be isolated as a single argument, unless after a double-dash[3]
  • tilde and path expansions are allowed
  • variable expansions may be unquoted (for example, to pass a list of options), provided a static assignment for the variable is present in the snippet

An assignment is considered static if it meets these constraints:

  • it is not part of a call expression
  • it is not embedded in a subshell, such as command or process substitution
  • variable and positional parameter expansions are double-quoted

Examples:

# OK, positional parameter quoted
echo "$1"
# Not OK, positional parameter unquoted
echo $1
# OK, positional parameter unquoted after double-dash
grep -- -v $1
# OK, options are unquoted but expand from a static assignment
MY_OPTS="--summarize --human-readable"
du $MY_OPTS "$1"
# OK
# - options are unquoted but expand from a static assignment
# - positional parameter unquoted but after double-dash
DU_OPTS="--summarize --human-readable"
du $DU_OPTS -- $1
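For contrast, here are examples that fail the constraints above (illustrative; the exact classification is left to the implementation):

```shell
# Not static: the command identifier is itself the result of
# word expansion.
CMD=echo
$CMD hello

# Not static: an unquoted command substitution is word-split
# into an unpredictable number of arguments.
echo $(printf 'a b')
```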

3.4.2. Command identifier

A command identifier will ultimately be resolved to a builtin command or a utility name. Within the unix system, the mapping between a command identifier and a utility executable is bijective: there is exactly one executable that can be matched from its identifier and, reciprocally, exactly one identifier that can be matched from an executable[4].

However, from the cmdse perspective, the association must be made against a loosely defined utility interface model and is therefore non-bijective. First, because multiple programs can bear the same utility name. Second, because this mapping is done in the context of analysing a static call expression, and the association will be considered valid only for a particular version range of the program supporting some set of options.
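The resolution rule of footnote [4] can be demonstrated by shadowing a name on PATH (hello is a hypothetical utility name created only for this demonstration):

```shell
# Two directories, each providing an executable named `hello`.
dir_a=$(mktemp -d); dir_b=$(mktemp -d)
printf '#!/bin/sh\necho from-a\n' > "$dir_a/hello"
printf '#!/bin/sh\necho from-b\n' > "$dir_b/hello"
chmod +x "$dir_a/hello" "$dir_b/hello"

# The shell resolves `hello` to the FIRST match on PATH.
PATH="$dir_a:$dir_b:$PATH" hello
# prints: from-a
```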

3.4.3. Arguments

Arguments are the words following the command identifier. Discriminating between option expressions and operands, and giving semantics to each argument, is a central aspect of cmdse fulfilling its pedagogical goal.

3.4.3.1. Option expressions

Option expressions resolve to option assignments for the program. There is a great variety of expectable expressions; see Section 2.2.1.

3.4.3.2. Operands

Operands are the subject upon which the program operates.
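Putting the argument categories together, an annotated call expression might be classified as follows (the breakdown is illustrative):

```shell
# du      → command identifier
# -s, -h  → option expressions (summarize, human-readable)
# .       → operand: the directory the program operates on
du -s -h .
```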

3.5. Analytic model

to be written

3.6. Forks

to be written

3.6.1. Sub-command snippet

to be written

3.6.2. Command snippet variant

to be written

3.6.3. Alias

to be written


[1]bash(1)
[2]Four exceptions: multiple lines can be processed in a row when they are terminated with the escape character \; here-documents are read over multiple lines until the provided WORD is matched; compound commands such as the for construct may be written on multiple lines, requiring some look-ahead line processing before execution; finally, the semicolon ; metacharacter is interpreted as a line delimiter.
[3](1, 2, 3) In a great number of bash builtin commands and unix programs, the double-dash -- is a signal that any upcoming argument should be treated as an operand. This behavior is implemented by the GNU getopt(3) function, whose documentation states that “the special argument ‘--’ forces an end of option-scanning”.
[4]The shell resolves to the first utility executable matching the utility name while iterating over each path entry held in the PATH variable, so this executable should be considered the one and only valid executable.
[5]See POSIX.1-2008, sec. 2.5.1