4. Call expressions parsing¶

To add semantics, the call expression parser will try to guess the nature of arguments by converting them into tokens. After givent some context, such tokens will be assembled to metadata, which is comprised of:

option assignment expressions and a description of left-side and value parts
standalone option expressions and a description of the underlying option
command operands and a their description

4.1. Parsing workflow¶

The Fig. 4.1 shows an overview of the different steps involved in call expression parsing. Those steps are grouped into higher-level steps (A, B, C). The core of call expression parsing is done in B through tokenization (see Section 4.2 for a better understanding on token typing). But some static bash analysis must be done upstream (A, see Section 3.4 for more details about this step). After parsing, the call expression must be assembled to form a metadata structure (C).

$@startuml !include styles.puml start (A) partition "Static shell parsing (A)" { :(A.1) Static call expression identified by shell static analyser; :(A.2) Call expression expanded; :(A.3) Call expression word-splitted; } (B) partition "Tokenization (B)" { :(B.1) Arguments of call expression converted to a list of tokens with context-free types; :(B.2) Check for the existence of a Program Interface Model (UIM) associated with the command identifier; repeat repeat :(B.3) For each token which type is non-semantic, transform its type to semantic when inference is possible. Use UIM if available; repeat while (at least one transformation\n occured in the last two parses?) is (yes) if (token list contains\nnon-semantic tokens) then (yes) :(B.4) Prompt the user to map the leftmost non-semantic token to a single semantic token ; else (no) (C) detach endif repeat while (token list contains\nnon-semantic tokens) is (yes) } (C) partition "Assembly (C)" { :(C.1) Reassemble tokens into structured option expressions assignments, standalone option expressions and command operands. Reassembling is done with splitting explicit assignments and grouping implicit assignments; :(C.2) Build call expression metadata from assembled expressions; stop } @enduml$

Fig. 4.1 Call expression parsing dataflow¶

4.2. Tokenization¶

The first step consists in creating a list of tokens that maps the command arguments (Fig. 4.1, item B.1). The token types will be updated thanks to basic inference rules and command meta-information. These token types are first assigned to “context-free” tokens (see Table 4.1 for a listing). “Context-free” means that their nature can be captured without the need for information about their siblings or position, and is therefore trivial.

In a second step (Fig. 4.1, item B.3), token types are assigned to “semantic token type” values (Table 4.3) given some inference rules and information extracted from the utility interface model (UIM, Fig. 4.1, item B.2). The underlying algorithm is described in details in Section 4.4.

When semantic type cannot be inferred, a prompt to the user is processed (Fig. 4.1, item B.4).

4.2.1. Context-free tokens typings¶

The Table 4.1 shows a list of the context-free token types. In the last column, a list of semantic type candidates is provided. This list shows which semantic types this context-free type can be transformed to. Some of these context-free token types overlap semantic token types, because they have only one semantic candidate (resolved to self). They are considered “non-ambiguous” and don’t need further transformation.

Table 4.1 Context-free token types¶
Context-free token type	Is option flag?	Examples given in brackets “[]”	Semantic type candidates
`POSIX_SHORT_STICKY_VALUE`	yes	`[-o<int-value>]`	self
`GNU_EXPLICIT_ASSIGNMENT`	yes	`[--option=<value>]`	self
`X2LKT_EXPLICIT_ASSIGNMENT`	yes	`[-option=<value>]`	self
`X2LKT_REVERSE_SWITCH`	yes	`[+option]`	self
`POSIX_END_OF_OPTIONS`	yes	`[--]`	self
`ONE_DASH_LETTER`	yes	`[-o] <value>` `[-o]`	`POSIX_SHORT_ASSIGNMENT_LEFT_SIDE` `POSIX_SHORT_SWITCH`
`ONE_DASH_WORD_ALPHANUM`	yes	[-opq]` [-option]`	`POSIX_GROUPED_SHORT_FLAGS` `X2LKT_SWITCH` `X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE`
`ONE_DASH_WORD`	yes	`[-long-option]` `[-long-option] <value>`	`X2LKT_SWITCH` `X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE`
`TWO_DASH_WORD`	yes	`[--option]`	`GNU_SWITCH` `GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE`
`OPT_WORD`	no[1]	`-o [<value>]` `--option [<value>]` `-option [<value>]` `option`	`OPERAND` `POSIX_SHORT_ASSIGNMENT_VALUE` `GNU_IMPLICIT_ASSIGNMENT_VALUE` `X2LKT_IMPLICIT_ASSIGNMENT_VALUE` `HEADLESS_OPTION`
`WORD`	no	`ls [~/]` `-o /some/file` `--option /some/files` `-option /some/file`	`OPERAND` `POSIX_SHORT_ASSIGNMENT_VALUE` `GNU_IMPLICIT_ASSIGNMENT_VALUE` `X2LKT_IMPLICIT_ASSIGNMENT_VALUE`

4.2.2. Semantic tokens typings¶

Note

See the Section 2.2.1 for details on the existing option expression styles from which a majority of those semantic token types are derived.

The Table 4.3 shows a list of the semantic token types. Those types have a positional model (Table 4.2) from which rules can be inferred. For example of such inferences, in the call expression find . -type file, “file” would be a token which positional model is OPT_IMPLICIT_ASSIGNMENT_VALUE and type X2LKT_IMPLICIT_ASSIGNMENT_VALUE and “-type” a OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE of type X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE.

Table 4.2 Token positional model¶
Positionnal model name	Description	Binding	is “option part”	is “option flag”	is “semantic”
`OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE`	The left side of an implicit option assignment in the form `left-side <value>`.	right	yes	yes	yes
`OPT_IMPLICIT_ASSIGNMENT_VALUE`	The right side of an implicit option assignment in the form `left-side <value>`.	left	yes	no	yes
`STANDALONE_OPT_ASSIGNMENT`	A token option with value assignment.	none	yes	yes	yes
`OPT_SWITCH`	An option switch, that is without value.	none	yes	yes	yes
`COMMAND_OPERAND`	A command operand.	none	no	no	yes
`UNSET`	Positional model unset.	inferred	inferred	inferred	false

In the Table 4.2, the first 5 models are applicable for semantic token types, while the latest is applicable for context-free types. The attributes of the latest are dynamically inferred regarding the set of semantic candidates associated with a token instance. For example, if a context-free type has semantic candidates which positionnal model all have is “option part” set to true, it will infer the attribute to true.

Table 4.3 Semantic token types¶
Semantic token type	Example, given in brackets, “[]”	Positional model
`X2LKT_REVERSE_SWITCH`	`[+option]`	`OPT_SWITCH`
`POSIX_SHORT_SWITCH`	`[-o]`	`OPT_SWITCH`
`POSIX_GROUPED_SHORT_FLAGS`	`[-opq]`	`OPT_SWITCH`
`POSIX_SHORT_ASSIGNMENT_LEFT_SIDE`	`[-o] <value>`	`OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE`
`POSIX_SHORT_ASSIGNMENT_VALUE`	`-o [<value>]`	`OPT_IMPLICIT_ASSIGNMENT_VALUE`
`POSIX_SHORT_STICKY_VALUE`	`[-o<value>]`	`STANDALONE_OPT_ASSIGNMENT`
`X2LKT_SWITCH`	`[-option]`	`OPT_SWITCH`
`X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE`	`[-option] <value>`	`OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE`
`X2LKT_IMPLICIT_ASSIGNMENT_VALUE`	`-option [<value>]`	`OPT_IMPLICIT_ASSIGNMENT_VALUE`
`X2LKT_EXPLICIT_ASSIGNMENT`	`[-option=<value>]`	`STANDALONE_OPT_ASSIGNMENT`
`GNU_SWITCH`	`--option`	`OPT_SWITCH`
`GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE`	`[--option] <value>`	`OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE`
`GNU_IMPLICIT_ASSIGNMENT_VALUE`	`--option [<value>]`	`OPT_IMPLICIT_ASSIGNMENT_VALUE`
`GNU_EXPLICIT_ASSIGNMENT`	`[--option=<value>]`	`STANDALONE_OPT_ASSIGNMENT`
`POSIX_END_OF_OPTIONS`	`[--]`	`OPT_SWITCH`
`OPERAND`	`[<operand>]`	`COMMAND_OPERAND`
`HEADLESS_OPTION`	`[option]`	`OPT_SWITCH`

4.3. Analytic Model¶

$@startuml !include styles.puml class Program { + String projectURL + String commandIdentifier } class ProgramInterfaceModel { Program program OptionScheme optionScheme OptionDescription[] optionDescriptions } enum TokenPositionalModel { + Binding binding = 'UNKNOWN' | 'NONE' | 'LEFT' | 'RIGHT' + Bool isSemantic - Bool isOptionFlag - Bool isOptionPart ..models.. OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE OPT_IMPLICIT_ASSIGNMENT_VALUE STANDALONE_OPT_ASSIGNMENT OPT_SWITCH COMMAND_OPERAND UNSET } enum OptionExpressionVariant { Regex flagRegex TokenType flagType Optional<TokenType> valueType OptionStyle style = 'POSIX' | 'XTOOLKIT' | 'GNU' | 'NONE' .. variants .. POSIX_SHORT_SWITCH POSIX_GROUPED_SHORT_FLAGS POSIX_SHORT_ASSIGNMENT POSIX_SHORT_STICKY_VALUE X2LKT_SWITCH X2LKT_REVERSE_SWITCH X2LKT_IMPLICIT_ASSIGNMENT X2LKT_EXPLICIT_ASSIGNMENT GNU_SWITCH GNU_IMPLICIT_ASSIGNMENT GNU_EXPLICIT_ASSIGNMENT POSIX_END_OF_OPTIONS HEADLESS_OPTION } class OptionScheme { OptionExpressionVariant[] variants } class OptionDescription { + OptionExpressionVariants[] supportedVariants + ValueModel valueModel = 'NONE' | 'OPTIONAL' | 'MANDATORY' + String description + Optional<TokenType> matchDescription(Token token) } class CallExpression { + String commandIdentifier + String[] arguments + String raw + LineRange lines } class Token { + Int argumentPosition + TokenType type + String value + Token boundTo + OptionDescription optionDescription + TokenType[] semanticCandidates + PositionalModel[] posModelCandidates() + Bool isOptionFlag() + Bool isOptionPart() + Bool isBoundToOneOf(Binding[] bindings) + Bool isBoundTo(Binding binding) + Bool matchOptionDescription(OptionDescription[] options) + Bool reduceCandidatesWithScheme(OptionScheme scheme) } class CallExpressionMetadata { CallExpression callExpression OptionExpression[] optionExpressions Operand[] operands Token[] tokens } enum TokenType { + PositionalModel posModel + Bool isSemantic() } enum ContextFreeTokenType { + SemanticTokenType[] semanticCandidates ----- .. ContextFree and Semantic .. X2LKT_REVERSE_SWITCH GNU_EXPLICIT_ASSIGNMENT X2LKT_EXPLICIT_ASSIGNMENT POSIX_END_OF_OPTIONS .. Strictly ContextFree .. ONE_DASH_WORD ONE_DASH_LETTER TWO_DASH_WORD WORD } note "isOption* is resolved to type.posModel.isOption* \nwhen type.posModel is not UNSET or to true when \n '∀c ∈ {posModelCandidates}, c.isOption* = true', false otherwise.\nSeemingly, isBoundToOneOf is resolved to \n'token.type.posModelbinding.binding ∈ {bindings}'\nwhen posModel is not UNSET, otherwise to\n'{bindings} ∩ {token.posModelCandidates} = {bindings}'." as N2 Token .. N2 N2 .. TokenPositionalModel enum SemanticTokenType { + OptionExpressionVariant variant ----- .. ContextFree and Semantic .. X2LKT_REVERSE_SWITCH GNU_EXPLICIT_ASSIGNMENT X2LKT_EXPLICIT_ASSIGNMENT POSIX_END_OF_OPTIONS .. Strictly Semantic .. POSIX_SHORT_SWITCH POSIX_GROUPED_SHORT_FLAGS POSIX_SHORT_ASSIGNMENT_LEFT_SIDE POSIX_SHORT_ASSIGNMENT_VALUE POSIX_SHORT_STICKY_VALUE GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE GNU_IMPLICIT_ASSIGNMENT_VALUE X2LKT_SWITCH X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE X2LKT_IMPLICIT_ASSIGNMENT_VALUE OPERAND HEADLESS_OPTION } class Parser { CallExpressionMetadata parse(CallExpression callExpression) } TokenType <|-- ContextFreeTokenType TokenType <|-- SemanticTokenType OptionDescription o-- ProgramInterfaceModel OptionExpressionVariant o-- OptionScheme OptionExpressionVariant o-- OptionDescription OptionExpressionVariant o--o TokenType OptionScheme o-- ProgramInterfaceModel TokenType "1" *-- "*" Token TokenPositionalModel *-- TokenType OptionDescription "?" o-- "*" Token CallExpression o-- Parser Token o-- CallExpressionMetadata CallExpressionMetadata o-- Parser ProgramInterfaceModel o-- Parser Program o-- ProgramInterfaceModel @enduml$

4.4. Option parsing algorithm¶

This section offers an in-depth look at tokenization (B) step from Fig. 4.1. The parser will hold in memory a list of tokens (Fig. 4.2). Each of these starts with a context-free type. The parser’s job is considered done when all tokens hold a semantic type. To get there, it will proceed with the following steps :

Initiate the token list with the result of mapping arguments to context-free token generation.
Fetch the utility interface model (UIM) if it exists.
Provide the list and the UIM as arguments of the parse function (Fig. 4.3). Such function will do the following:
1. Check for the existence of an POSIX_END_OF_OPTIONS typed token (Fig. 4.4) and convert to operands all remaining tokens to the right.
2. Repeat the following operation until the last two operations didn’t turn out to at least one context-free to semantic conversion:
  
  For each non-semantic token, inferRight (Fig. 4.5) and inferLeft (Fig. 4.6). Those functions will try to infer the semantic type by checking its siblings’. For example, if the left sibling token type is X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE, the only possible type for this token would be X2LKT_IMPLICIT_ASSIGNMENT_VALUE. If the token type is “option part”, use the option descriptions from the UIM to try an exact match (Fig. 4.8). For example, the token is --reverse, and the utility interface model contains an option description that exactly match --reverse. If no exact match is found, check for a pattern match with the option scheme (Fig. 4.9). For example, if the token -pq is encountered, and the program option scheme is “Linux-Standard-Explicit” (see Table 2.2), the only possible mapping for ONE_DASH_WORD will be POSIX_GROUPED_SHORT_FLAGS. Finally, increment conversions if the token type “is semantic”.
Until all tokens are of “semantic” type, prompt the user for a token type annotation and loop back at 3.2.

Fig. 4.3 Parse function

Fig. 4.4 CheckEndOfOptions function

Fig. 4.5 InferRight function

Fig. 4.6 InferLeft function

Fig. 4.7 ConvertToSemantic function

../../_images/matchOptionDescription.png

Fig. 4.8 MatchOptionDescription function

../../_images/reduceCandidatesWithScheme.png

Fig. 4.9 ReduceCandidatesWithScheme function

4.5. Edge cases and extension perspectives¶

Some argument constructs must be anticipated, so here is a list of problematic examples to open to further enhancements:

How to model restricted operands such as in dd(1)? Although they look like headless options, dd operands are “typed”.
How to model commands which operands can be another command, such as find -exec <command> {} ; ?

[1]	Although `HEADLESS_OPTION` is an option, it is very rare and should only be matched when defined in a utility interface model, or reviewed by the user. So, by default we assume a `WORD` is not an option.