4. Call expression parsing
To add semantics, the call expression parser tries to guess the nature of each argument by converting it into a token. Once given some context, these tokens are assembled into metadata, which comprises:
- option assignment expressions and a description of left-side and value parts
- standalone option expressions and a description of the underlying option
- command operands and their description
4.1. Parsing workflow
Fig. 4.1 shows an overview of the steps involved in call expression parsing. These steps are grouped into higher-level steps (A, B, C). The core of call expression parsing is done in B through tokenization (see Section 4.2 for details on token typing). Some static bash analysis must first be done upstream (A, see Section 3.4 for more details about this step). After parsing, the call expression must be assembled into a metadata structure (C).
Fig. 4.1 Call expression parsing dataflow
4.2. Tokenization
The first step consists in creating a list of tokens that maps the command arguments (Fig. 4.1, item B.1). The token types are then refined using basic inference rules and command meta-information. Tokens are first assigned “context-free” types (see Table 4.1 for a listing). “Context-free” means that the type can be determined without any information about the token’s siblings or position, which makes this assignment trivial.
In a second step (Fig. 4.1, item B.3), tokens are assigned “semantic token type” values (Table 4.3) using inference rules and information extracted from the utility interface model (UIM, Fig. 4.1, item B.2). The underlying algorithm is described in detail in Section 4.4.
When a semantic type cannot be inferred, the user is prompted for it (Fig. 4.1, item B.4).
4.2.1. Context-free token typings
Table 4.1 lists the context-free token types. The last column gives the semantic type candidates, that is, the semantic types a given context-free type can be transformed into. Some context-free token types overlap with semantic token types because they have only one semantic candidate (resolved to self). They are considered “non-ambiguous” and don’t need further transformation.
| Context-free token type | Is option flag? | Examples (given in brackets “[]”) | Semantic type candidates |
|---|---|---|---|
| POSIX_SHORT_STICKY_VALUE | yes | [-o<int-value>] | self |
| GNU_EXPLICIT_ASSIGNMENT | yes | [--option=<value>] | self |
| X2LKT_EXPLICIT_ASSIGNMENT | yes | [-option=<value>] | self |
| X2LKT_REVERSE_SWITCH | yes | [+option] | self |
| POSIX_END_OF_OPTIONS | yes | [--] | self |
| ONE_DASH_LETTER | yes | [-o] <value>; [-o] | |
| ONE_DASH_WORD_ALPHANUM | yes | [-opq]; [-option] | |
| ONE_DASH_WORD | yes | [-long-option]; [-long-option] <value> | |
| TWO_DASH_WORD | yes | [--option] | |
| OPT_WORD | no[1] | -o [<value>]; --option [<value>]; -option [<value>]; option | |
| WORD | no | ls [~/]; -o /some/file; --option /some/files; -option /some/file | |
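As a rough illustration of context-free typing, the sketch below assigns a type to a single argument by looking at its spelling only, never at its neighbours. The regular expressions and the `context_free_type` function are assumptions made for this example and are not taken from the project. `OPT_WORD` is left out because the text does not say how it is told apart from `WORD`.

```python
import re

# Ordered (pattern, type) pairs; the first full match wins.
# These patterns are illustrative assumptions, not the project's actual rules.
CONTEXT_FREE_RULES = [
    (re.compile(r"--"), "POSIX_END_OF_OPTIONS"),
    (re.compile(r"--[A-Za-z0-9][\w-]*=.*"), "GNU_EXPLICIT_ASSIGNMENT"),
    (re.compile(r"-[A-Za-z0-9][\w-]*=.*"), "X2LKT_EXPLICIT_ASSIGNMENT"),
    (re.compile(r"\+[A-Za-z0-9][\w-]*"), "X2LKT_REVERSE_SWITCH"),
    (re.compile(r"-[A-Za-z]\d+"), "POSIX_SHORT_STICKY_VALUE"),
    (re.compile(r"--[A-Za-z0-9][\w-]*"), "TWO_DASH_WORD"),
    (re.compile(r"-[A-Za-z0-9]"), "ONE_DASH_LETTER"),
    (re.compile(r"-[A-Za-z0-9]+"), "ONE_DASH_WORD_ALPHANUM"),
    (re.compile(r"-[A-Za-z0-9][\w-]*"), "ONE_DASH_WORD"),
]


def context_free_type(argument: str) -> str:
    """Return a context-free token type name for one command argument."""
    for pattern, token_type in CONTEXT_FREE_RULES:
        if pattern.fullmatch(argument):
            return token_type
    # Anything that does not look like an option flag falls back to WORD.
    return "WORD"


if __name__ == "__main__":
    for arg in ["--", "--option=value", "-option=value", "+option",
                "-o42", "-o", "-opq", "-long-option", "--option", "/some/file"]:
        print(arg, "->", context_free_type(arg))
```

Ordering the rules from most to least specific keeps each pattern simple: a single-letter flag such as `-o` is caught by `ONE_DASH_LETTER` before the more general one-dash patterns are tried.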
4.2.2. Semantic token typings
Note
See Section 2.2.1 for details on the existing option expression styles from which most of these semantic token types are derived.
Table 4.3 lists the semantic token types. Each type has a positional model (Table 4.2) from which rules can be inferred.
As an example of such inferences, in the call expression find . -type file, “file” is a token whose positional model is OPT_IMPLICIT_ASSIGNMENT_VALUE and whose type is X2LKT_IMPLICIT_ASSIGNMENT_VALUE, while “-type” is an OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE of type X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE.
| Positional model name | Description | Binding | is “option part” | is “option flag” | is “semantic” |
|---|---|---|---|---|---|
| OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE | The left side of an implicit option assignment in the form left-side <value>. | right | yes | yes | yes |
| OPT_IMPLICIT_ASSIGNMENT_VALUE | The right side of an implicit option assignment in the form left-side <value>. | left | yes | no | yes |
| STANDALONE_OPT_ASSIGNMENT | A token option with value assignment. | none | yes | yes | yes |
| OPT_SWITCH | An option switch, that is without value. | none | yes | yes | yes |
| COMMAND_OPERAND | A command operand. | none | no | no | yes |
| UNSET | Positional model unset. | inferred | inferred | inferred | no |
In Table 4.2, the first five models apply to semantic token types, while the last one (UNSET) applies to context-free types. The attributes of the last one are inferred dynamically from the set of semantic candidates associated with a token instance. For example, if all the semantic candidates of a context-free type have positional models with is “option part” set to true, the attribute is inferred to be true.
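The inference rule for UNSET attributes can be sketched as follows. This is a minimal illustration, assuming that an attribute resolves to true only when every candidate positional model has it set, and to false otherwise. The `PositionalModel` dataclass and the `infer_*` helpers are hypothetical names. Treating an empty candidate set as false is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PositionalModel:
    """One row of the positional model table (hypothetical representation)."""
    name: str
    binding: str
    is_option_part: bool
    is_option_flag: bool


LEFT_SIDE = PositionalModel("OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE", "right", True, True)
VALUE = PositionalModel("OPT_IMPLICIT_ASSIGNMENT_VALUE", "left", True, False)
SWITCH = PositionalModel("OPT_SWITCH", "none", True, True)
OPERAND = PositionalModel("COMMAND_OPERAND", "none", False, False)


def infer_is_option_part(candidates: Sequence[PositionalModel]) -> bool:
    # True only if every candidate model is an option part.
    # An empty candidate set yields False (assumption of this sketch).
    return bool(candidates) and all(c.is_option_part for c in candidates)


def infer_is_option_flag(candidates: Sequence[PositionalModel]) -> bool:
    # Same rule, applied to the "option flag" attribute.
    return bool(candidates) and all(c.is_option_flag for c in candidates)


if __name__ == "__main__":
    # A token that may be either a switch or the left side of an assignment
    # is an option flag whichever candidate wins.
    print(infer_is_option_flag([LEFT_SIDE, SWITCH]))
    # A token that may be an option value or an operand is not
    # guaranteed to be an option part.
    print(infer_is_option_part([VALUE, OPERAND]))
```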
| Semantic token type | Example (given in brackets “[]”) | Positional model |
|---|---|---|
| X2LKT_REVERSE_SWITCH | [+option] | OPT_SWITCH |
| POSIX_SHORT_SWITCH | [-o] | OPT_SWITCH |
| POSIX_GROUPED_SHORT_FLAGS | [-opq] | OPT_SWITCH |
| POSIX_SHORT_ASSIGNMENT_LEFT_SIDE | [-o] <value> | OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE |
| POSIX_SHORT_ASSIGNMENT_VALUE | -o [<value>] | OPT_IMPLICIT_ASSIGNMENT_VALUE |
| POSIX_SHORT_STICKY_VALUE | [-o<value>] | STANDALONE_OPT_ASSIGNMENT |
| X2LKT_SWITCH | [-option] | OPT_SWITCH |
| X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE | [-option] <value> | OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE |
| X2LKT_IMPLICIT_ASSIGNMENT_VALUE | -option [<value>] | OPT_IMPLICIT_ASSIGNMENT_VALUE |
| X2LKT_EXPLICIT_ASSIGNMENT | [-option=<value>] | STANDALONE_OPT_ASSIGNMENT |
| GNU_SWITCH | [--option] | OPT_SWITCH |
| GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE | [--option] <value> | OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE |
| GNU_IMPLICIT_ASSIGNMENT_VALUE | --option [<value>] | OPT_IMPLICIT_ASSIGNMENT_VALUE |
| GNU_EXPLICIT_ASSIGNMENT | [--option=<value>] | STANDALONE_OPT_ASSIGNMENT |
| POSIX_END_OF_OPTIONS | [--] | OPT_SWITCH |
| OPERAND | [<operand>] | COMMAND_OPERAND |
| HEADLESS_OPTION | [option] | OPT_SWITCH |
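The relation between semantic token types, positional models and bindings can be written down as two lookup tables. The sketch below copies the values from the two tables of this section. The dictionary names and the `describe` helper are hypothetical. Typing “.” as OPERAND in the find example is an assumption, since the text only discusses “-type” and “file”.

```python
# Semantic token type -> positional model (values copied from the tables above).
POSITIONAL_MODEL = {
    "X2LKT_REVERSE_SWITCH": "OPT_SWITCH",
    "POSIX_SHORT_SWITCH": "OPT_SWITCH",
    "POSIX_GROUPED_SHORT_FLAGS": "OPT_SWITCH",
    "POSIX_SHORT_ASSIGNMENT_LEFT_SIDE": "OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE",
    "POSIX_SHORT_ASSIGNMENT_VALUE": "OPT_IMPLICIT_ASSIGNMENT_VALUE",
    "POSIX_SHORT_STICKY_VALUE": "STANDALONE_OPT_ASSIGNMENT",
    "X2LKT_SWITCH": "OPT_SWITCH",
    "X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE": "OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE",
    "X2LKT_IMPLICIT_ASSIGNMENT_VALUE": "OPT_IMPLICIT_ASSIGNMENT_VALUE",
    "X2LKT_EXPLICIT_ASSIGNMENT": "STANDALONE_OPT_ASSIGNMENT",
    "GNU_SWITCH": "OPT_SWITCH",
    "GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE": "OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE",
    "GNU_IMPLICIT_ASSIGNMENT_VALUE": "OPT_IMPLICIT_ASSIGNMENT_VALUE",
    "GNU_EXPLICIT_ASSIGNMENT": "STANDALONE_OPT_ASSIGNMENT",
    "POSIX_END_OF_OPTIONS": "OPT_SWITCH",
    "OPERAND": "COMMAND_OPERAND",
    "HEADLESS_OPTION": "OPT_SWITCH",
}

# Positional model -> binding direction.
BINDING = {
    "OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE": "right",
    "OPT_IMPLICIT_ASSIGNMENT_VALUE": "left",
    "STANDALONE_OPT_ASSIGNMENT": "none",
    "OPT_SWITCH": "none",
    "COMMAND_OPERAND": "none",
}


def describe(semantic_type: str) -> tuple[str, str]:
    """Return (positional model, binding) for a semantic token type."""
    model = POSITIONAL_MODEL[semantic_type]
    return model, BINDING[model]


if __name__ == "__main__":
    # find . -type file, with the types discussed in the text.
    annotated = [
        (".", "OPERAND"),  # assumption: not discussed in the text
        ("-type", "X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE"),
        ("file", "X2LKT_IMPLICIT_ASSIGNMENT_VALUE"),
    ]
    for value, semantic_type in annotated:
        print(value, semantic_type, *describe(semantic_type))
```

The binding column is what makes sibling inference possible: a token bound to the right expects a value after it, and a token bound to the left expects a left side before it.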
4.3. Analytic Model
4.4. Option parsing algorithm
This section offers an in-depth look at the tokenization step (B) from Fig. 4.1. The parser holds in memory a list of tokens (Fig. 4.2). Each of these starts with a context-free type. The parser’s job is considered done when all tokens hold a semantic type. To get there, it proceeds with the following steps:

1. Initialize the token list by mapping each argument to a context-free token.
2. Fetch the utility interface model (UIM) if it exists.
3. Provide the list and the UIM as arguments of the parse function (Fig. 4.3). This function does the following:
   1. Check for the existence of a POSIX_END_OF_OPTIONS typed token (Fig. 4.4) and convert all remaining tokens to its right to operands.
   2. Repeat the following operations until the last two operations produce no context-free to semantic conversion:
      1. For each non-semantic token, inferRight (Fig. 4.5) and inferLeft (Fig. 4.6). Those functions try to infer the semantic type by checking the types of the token’s siblings. For example, if the left sibling token type is X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE, the only possible type for this token is X2LKT_IMPLICIT_ASSIGNMENT_VALUE.
      2. If the token type is “option part”, use the option descriptions from the UIM to try an exact match (Fig. 4.8). For example, the token is --reverse, and the utility interface model contains an option description that exactly matches --reverse.
      3. If no exact match is found, check for a pattern match with the option scheme (Fig. 4.9). For example, if the token -pq is encountered and the program option scheme is “Linux-Standard-Explicit” (see Table 2.2), the only possible mapping for ONE_DASH_WORD is POSIX_GROUPED_SHORT_FLAGS.
      4. Finally, increment the conversion count if the token type “is semantic”.
4. Until all tokens are of “semantic” type, prompt the user for a token type annotation and loop back to 3.2.
Fig. 4.3 Parse function
Fig. 4.4 CheckEndOfOptions function
Fig. 4.5 InferRight function
Fig. 4.6 InferLeft function
Fig. 4.7 ConvertToSemantic function
Fig. 4.8 MatchOptionDescription function
Fig. 4.9 ReduceCandidatesWithScheme function
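The overall shape of the parse function can be sketched as a fixpoint loop. This is a simplified illustration under stated assumptions, not the functions shown in the figures: it covers only the end-of-options check and an inference from the left sibling, and it leaves out UIM matching and option scheme reduction. The `Token` class, the `VALUE_AFTER` table and all function names are hypothetical.

```python
from dataclasses import dataclass

# Left-side type -> value type implied for the token to its right.
# The X2LKT pair comes from the example in the text; the other two
# pairs are assumed by analogy.
VALUE_AFTER = {
    "POSIX_SHORT_ASSIGNMENT_LEFT_SIDE": "POSIX_SHORT_ASSIGNMENT_VALUE",
    "X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE": "X2LKT_IMPLICIT_ASSIGNMENT_VALUE",
    "GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE": "GNU_IMPLICIT_ASSIGNMENT_VALUE",
}

# Types that are context-free only and still need a semantic conversion.
CONTEXT_FREE_ONLY = {
    "ONE_DASH_LETTER", "ONE_DASH_WORD_ALPHANUM", "ONE_DASH_WORD",
    "TWO_DASH_WORD", "OPT_WORD", "WORD",
}


@dataclass
class Token:
    value: str
    type: str

    @property
    def is_semantic(self) -> bool:
        return self.type not in CONTEXT_FREE_ONLY


def check_end_of_options(tokens: list[Token]) -> None:
    """Turn every token to the right of the first '--' into an operand."""
    for i, token in enumerate(tokens):
        if token.type == "POSIX_END_OF_OPTIONS":
            for later in tokens[i + 1:]:
                later.type = "OPERAND"
            return


def infer_from_left_sibling(tokens: list[Token], i: int) -> bool:
    """Convert tokens[i] if its left sibling implies exactly one type."""
    if i == 0:
        return False
    implied = VALUE_AFTER.get(tokens[i - 1].type)
    if implied is None:
        return False
    tokens[i].type = implied
    return True


def parse(tokens: list[Token]) -> list[Token]:
    """Run inference to a fixpoint; return tokens left for the user prompt."""
    check_end_of_options(tokens)
    while True:
        conversions = 0
        for i, token in enumerate(tokens):
            if not token.is_semantic and infer_from_left_sibling(tokens, i):
                conversions += 1
        if conversions == 0:  # a full pass converted nothing: stop
            break
    return [t for t in tokens if not t.is_semantic]


if __name__ == "__main__":
    # find . -type file, assuming "-type" was already resolved elsewhere.
    tokens = [Token(".", "WORD"),
              Token("-type", "X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE"),
              Token("file", "WORD")]
    unresolved = parse(tokens)
    print([(t.value, t.type) for t in tokens])
    print("left for the user:", [t.value for t in unresolved])
```

Counting conversions per pass gives a simple termination condition: each conversion removes one non-semantic token, so the loop ends after at most as many passes as there are tokens, and whatever remains is handed to the user prompt.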
4.5. Edge cases and extension perspectives
Some argument constructs are not handled yet. The following problematic examples are open questions for further enhancements:
- How to model restricted operands such as in dd(1)? Although they look like headless options, dd operands are “typed”.
- How to model commands whose operands can be another command, such as find -exec <command> {} ; ?
[1] Although HEADLESS_OPTION is an option, it is very rare and should only be matched when defined in a utility interface model, or reviewed by the user. So, by default we assume a WORD is not an option.

![@startuml
!include styles.puml
class Program {
+ String projectURL
+ String commandIdentifier
}
class ProgramInterfaceModel {
Program program
OptionScheme optionScheme
OptionDescription[] optionDescriptions
}
enum TokenPositionalModel {
+ Binding binding = 'UNKNOWN' | 'NONE' | 'LEFT' | 'RIGHT'
+ Bool isSemantic
- Bool isOptionFlag
- Bool isOptionPart
..models..
OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE
OPT_IMPLICIT_ASSIGNMENT_VALUE
STANDALONE_OPT_ASSIGNMENT
OPT_SWITCH
COMMAND_OPERAND
UNSET
}
enum OptionExpressionVariant {
Regex flagRegex
TokenType flagType
Optional<TokenType> valueType
OptionStyle style = 'POSIX' | 'XTOOLKIT' | 'GNU' | 'NONE'
.. variants ..
POSIX_SHORT_SWITCH
POSIX_GROUPED_SHORT_FLAGS
POSIX_SHORT_ASSIGNMENT
POSIX_SHORT_STICKY_VALUE
X2LKT_SWITCH
X2LKT_REVERSE_SWITCH
X2LKT_IMPLICIT_ASSIGNMENT
X2LKT_EXPLICIT_ASSIGNMENT
GNU_SWITCH
GNU_IMPLICIT_ASSIGNMENT
GNU_EXPLICIT_ASSIGNMENT
POSIX_END_OF_OPTIONS
HEADLESS_OPTION
}
class OptionScheme {
OptionExpressionVariant[] variants
}
class OptionDescription {
+ OptionExpressionVariants[] supportedVariants
+ ValueModel valueModel = 'NONE' | 'OPTIONAL' | 'MANDATORY'
+ String description
+ Optional<TokenType> matchDescription(Token token)
}
class CallExpression {
+ String commandIdentifier
+ String[] arguments
+ String raw
+ LineRange lines
}
class Token {
+ Int argumentPosition
+ TokenType type
+ String value
+ Token boundTo
+ OptionDescription optionDescription
+ TokenType[] semanticCandidates
+ PositionalModel[] posModelCandidates()
+ Bool isOptionFlag()
+ Bool isOptionPart()
+ Bool isBoundToOneOf(Binding[] bindings)
+ Bool isBoundTo(Binding binding)
+ Bool matchOptionDescription(OptionDescription[] options)
+ Bool reduceCandidatesWithScheme(OptionScheme scheme)
}
class CallExpressionMetadata {
CallExpression callExpression
OptionExpression[] optionExpressions
Operand[] operands
Token[] tokens
}
enum TokenType {
+ PositionalModel posModel
+ Bool isSemantic()
}
enum ContextFreeTokenType {
+ SemanticTokenType[] semanticCandidates
-----
.. ContextFree and Semantic ..
X2LKT_REVERSE_SWITCH
GNU_EXPLICIT_ASSIGNMENT
X2LKT_EXPLICIT_ASSIGNMENT
POSIX_END_OF_OPTIONS
.. Strictly ContextFree ..
ONE_DASH_WORD
ONE_DASH_LETTER
TWO_DASH_WORD
WORD
}
note "isOption* is resolved to type.posModel.isOption* \nwhen type.posModel is not UNSET or to true when \n '∀c ∈ {posModelCandidates}, c.isOption* = true', false otherwise.\nSeemingly, isBoundToOneOf is resolved to \n'token.type.posModelbinding.binding ∈ {bindings}'\nwhen posModel is not UNSET, otherwise to\n'{bindings} ∩ {token.posModelCandidates} = {bindings}'." as N2
Token .. N2
N2 .. TokenPositionalModel
enum SemanticTokenType {
+ OptionExpressionVariant variant
-----
.. ContextFree and Semantic ..
X2LKT_REVERSE_SWITCH
GNU_EXPLICIT_ASSIGNMENT
X2LKT_EXPLICIT_ASSIGNMENT
POSIX_END_OF_OPTIONS
.. Strictly Semantic ..
POSIX_SHORT_SWITCH
POSIX_GROUPED_SHORT_FLAGS
POSIX_SHORT_ASSIGNMENT_LEFT_SIDE
POSIX_SHORT_ASSIGNMENT_VALUE
POSIX_SHORT_STICKY_VALUE
GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE
GNU_IMPLICIT_ASSIGNMENT_VALUE
X2LKT_SWITCH
X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE
X2LKT_IMPLICIT_ASSIGNMENT_VALUE
OPERAND
HEADLESS_OPTION
}
class Parser {
CallExpressionMetadata parse(CallExpression callExpression)
}
TokenType <|-- ContextFreeTokenType
TokenType <|-- SemanticTokenType
OptionDescription o-- ProgramInterfaceModel
OptionExpressionVariant o-- OptionScheme
OptionExpressionVariant o-- OptionDescription
OptionExpressionVariant o--o TokenType
OptionScheme o-- ProgramInterfaceModel
TokenType "1" *-- "*" Token
TokenPositionalModel *-- TokenType
OptionDescription "?" o-- "*" Token
CallExpression o-- Parser
Token o-- CallExpressionMetadata
CallExpressionMetadata o-- Parser
ProgramInterfaceModel o-- Parser
Program o-- ProgramInterfaceModel
@enduml](../../_images/plantuml-5bb3c245a327b7c2f936ef193c902946d116f0e1.png)