4. Call expressions parsing

To add semantics, the call expression parser will try to guess the nature of arguments by converting them into tokens. After givent some context, such tokens will be assembled to metadata, which is comprised of:

  • option assignment expressions and a description of left-side and value parts
  • standalone option expressions and a description of the underlying option
  • command operands and a their description

4.1. Parsing workflow

The Fig. 4.1 shows an overview of the different steps involved in call expression parsing. Those steps are grouped into higher-level steps (A, B, C). The core of call expression parsing is done in B through tokenization (see Section 4.2 for a better understanding on token typing). But some static bash analysis must be done upstream (A, see Section 3.4 for more details about this step). After parsing, the call expression must be assembled to form a metadata structure (C).

@startuml
!include styles.puml

start
(A)
partition "Static shell parsing (A)" {
  :(A.1) Static call expression identified by shell static
  analyser;
  :(A.2) Call expression expanded;
  :(A.3) Call expression word-splitted;
}
(B)
partition "Tokenization (B)" {
  :(B.1) Arguments of call expression
  converted to a list of
  tokens with context-free types;
  :(B.2) Check for the existence of a
  Program Interface Model (UIM)
  associated with the command identifier;
  repeat
  repeat
  :(B.3) For each token which type is non-semantic,
  transform its type to semantic when inference is possible.
  Use UIM if available;
  repeat while (at least one transformation\n occured in the last two parses?) is (yes)
  if (token list contains\nnon-semantic tokens) then (yes)
    :(B.4) Prompt the user to map
    the leftmost non-semantic token
    to a single semantic token ;
  else (no)
    (C)
    detach

  endif
  repeat while (token list contains\nnon-semantic tokens) is (yes)
}
(C)
partition "Assembly (C)" {
  :(C.1) Reassemble tokens into structured option expressions
  assignments, standalone option expressions and command
  operands. Reassembling is done with splitting explicit
  assignments and grouping implicit assignments;
  :(C.2) Build call expression metadata from
  assembled expressions;
  stop
}
@enduml

Fig. 4.1 Call expression parsing dataflow

4.2. Tokenization

The first step consists in creating a list of tokens that maps the command arguments (Fig. 4.1, item B.1). The token types will be updated thanks to basic inference rules and command meta-information. These token types are first assigned to “context-free” tokens (see Table 4.1 for a listing). “Context-free” means that their nature can be captured without the need for information about their siblings or position, and is therefore trivial.

In a second step (Fig. 4.1, item B.3), token types are assigned to “semantic token type” values (Table 4.3) given some inference rules and information extracted from the utility interface model (UIM, Fig. 4.1, item B.2). The underlying algorithm is described in details in Section 4.4.

When semantic type cannot be inferred, a prompt to the user is processed (Fig. 4.1, item B.4).

4.2.1. Context-free tokens typings

The Table 4.1 shows a list of the context-free token types. In the last column, a list of semantic type candidates is provided. This list shows which semantic types this context-free type can be transformed to. Some of these context-free token types overlap semantic token types, because they have only one semantic candidate (resolved to self). They are considered “non-ambiguous” and don’t need further transformation.

Table 4.1 Context-free token types
Context-free token type Is option flag?
Examples
given in
brackets “[]”
Semantic type candidates
POSIX_SHORT_STICKY_VALUE yes [-o<int-value>] self
GNU_EXPLICIT_ASSIGNMENT yes [--option=<value>] self
X2LKT_EXPLICIT_ASSIGNMENT yes [-option=<value>] self
X2LKT_REVERSE_SWITCH yes [+option] self
POSIX_END_OF_OPTIONS yes [--] self
ONE_DASH_LETTER yes
[-o] <value>
[-o]
  • POSIX_SHORT_ASSIGNMENT_LEFT_SIDE
  • POSIX_SHORT_SWITCH
ONE_DASH_WORD_ALPHANUM yes
[-opq]`
[-option]`
  • POSIX_GROUPED_SHORT_FLAGS
  • X2LKT_SWITCH
  • X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE
ONE_DASH_WORD yes
[-long-option]
[-long-option] <value>
  • X2LKT_SWITCH
  • X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE
TWO_DASH_WORD yes [--option]
  • GNU_SWITCH
  • GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE
OPT_WORD no[1]
-o [<value>]
--option [<value>]
-option [<value>]
option
  • OPERAND
  • POSIX_SHORT_ASSIGNMENT_VALUE
  • GNU_IMPLICIT_ASSIGNMENT_VALUE
  • X2LKT_IMPLICIT_ASSIGNMENT_VALUE
  • HEADLESS_OPTION
WORD no
ls [~/]
-o /some/file
--option /some/files
-option /some/file
  • OPERAND
  • POSIX_SHORT_ASSIGNMENT_VALUE
  • GNU_IMPLICIT_ASSIGNMENT_VALUE
  • X2LKT_IMPLICIT_ASSIGNMENT_VALUE

4.2.2. Semantic tokens typings

Note

See the Section 2.2.1 for details on the existing option expression styles from which a majority of those semantic token types are derived.

The Table 4.3 shows a list of the semantic token types. Those types have a positional model (Table 4.2) from which rules can be inferred. For example of such inferences, in the call expression find . -type file, “file” would be a token which positional model is OPT_IMPLICIT_ASSIGNMENT_VALUE and type X2LKT_IMPLICIT_ASSIGNMENT_VALUE and “-type” a OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE of type X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE.

Table 4.2 Token positional model
Positionnal model name Description Binding
is
“option part”
is
“option flag”
is
“semantic”
OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE The left side of an implicit option assignment in the form left-side <value>. right yes yes yes
OPT_IMPLICIT_ASSIGNMENT_VALUE The right side of an implicit option assignment in the form left-side <value>. left yes no yes
STANDALONE_OPT_ASSIGNMENT A token option with value assignment. none yes yes yes
OPT_SWITCH An option switch, that is without value. none yes yes yes
COMMAND_OPERAND A command operand. none no no yes
UNSET Positional model unset. inferred inferred inferred false

In the Table 4.2, the first 5 models are applicable for semantic token types, while the latest is applicable for context-free types. The attributes of the latest are dynamically inferred regarding the set of semantic candidates associated with a token instance. For example, if a context-free type has semantic candidates which positionnal model all have is “option part” set to true, it will infer the attribute to true.

Table 4.3 Semantic token types
Semantic token type
Example, given in brackets, “[]”
Positional model
X2LKT_REVERSE_SWITCH [+option] OPT_SWITCH
POSIX_SHORT_SWITCH [-o] OPT_SWITCH
POSIX_GROUPED_SHORT_FLAGS [-opq] OPT_SWITCH
POSIX_SHORT_ASSIGNMENT_LEFT_SIDE [-o] <value> OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE
POSIX_SHORT_ASSIGNMENT_VALUE -o [<value>] OPT_IMPLICIT_ASSIGNMENT_VALUE
POSIX_SHORT_STICKY_VALUE [-o<value>] STANDALONE_OPT_ASSIGNMENT
X2LKT_SWITCH [-option] OPT_SWITCH
X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE [-option] <value> OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE
X2LKT_IMPLICIT_ASSIGNMENT_VALUE -option [<value>] OPT_IMPLICIT_ASSIGNMENT_VALUE
X2LKT_EXPLICIT_ASSIGNMENT [-option=<value>] STANDALONE_OPT_ASSIGNMENT
GNU_SWITCH --option OPT_SWITCH
GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE [--option] <value> OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE
GNU_IMPLICIT_ASSIGNMENT_VALUE --option [<value>] OPT_IMPLICIT_ASSIGNMENT_VALUE
GNU_EXPLICIT_ASSIGNMENT [--option=<value>] STANDALONE_OPT_ASSIGNMENT
POSIX_END_OF_OPTIONS [--] OPT_SWITCH
OPERAND [<operand>] COMMAND_OPERAND
HEADLESS_OPTION [option] OPT_SWITCH

4.3. Analytic Model

@startuml
!include styles.puml

class Program {
  + String projectURL
  + String commandIdentifier
}

class ProgramInterfaceModel {
  Program program
  OptionScheme optionScheme
  OptionDescription[] optionDescriptions
}

enum TokenPositionalModel {
  + Binding binding = 'UNKNOWN' | 'NONE' | 'LEFT' | 'RIGHT'
  + Bool isSemantic
  - Bool isOptionFlag
  - Bool isOptionPart
  ..models..
  OPT_IMPLICIT_ASSIGNMENT_LEFT_SIDE
  OPT_IMPLICIT_ASSIGNMENT_VALUE
  STANDALONE_OPT_ASSIGNMENT
  OPT_SWITCH
  COMMAND_OPERAND
  UNSET
}

enum OptionExpressionVariant {
 Regex flagRegex
 TokenType flagType
 Optional<TokenType> valueType
 OptionStyle style = 'POSIX' | 'XTOOLKIT' | 'GNU' | 'NONE'
 .. variants ..
 POSIX_SHORT_SWITCH
 POSIX_GROUPED_SHORT_FLAGS
 POSIX_SHORT_ASSIGNMENT
 POSIX_SHORT_STICKY_VALUE
 X2LKT_SWITCH
 X2LKT_REVERSE_SWITCH
 X2LKT_IMPLICIT_ASSIGNMENT
 X2LKT_EXPLICIT_ASSIGNMENT
 GNU_SWITCH
 GNU_IMPLICIT_ASSIGNMENT
 GNU_EXPLICIT_ASSIGNMENT
 POSIX_END_OF_OPTIONS
 HEADLESS_OPTION
}

class OptionScheme {
  OptionExpressionVariant[] variants
}

class OptionDescription {
  + OptionExpressionVariants[] supportedVariants
  + ValueModel valueModel = 'NONE' | 'OPTIONAL' | 'MANDATORY'
  + String description
  + Optional<TokenType> matchDescription(Token token)
}

class CallExpression {
 + String commandIdentifier
 + String[] arguments
 + String raw
 + LineRange lines
}

class Token {
  + Int argumentPosition
  + TokenType type
  + String value
  + Token boundTo
  + OptionDescription optionDescription
  + TokenType[] semanticCandidates
  + PositionalModel[] posModelCandidates()
  + Bool isOptionFlag()
  + Bool isOptionPart()
  + Bool isBoundToOneOf(Binding[] bindings)
  + Bool isBoundTo(Binding binding)
  + Bool matchOptionDescription(OptionDescription[] options)
  + Bool reduceCandidatesWithScheme(OptionScheme scheme)
}

class CallExpressionMetadata {
  CallExpression callExpression
  OptionExpression[] optionExpressions
  Operand[] operands
  Token[] tokens
}

enum TokenType {
  + PositionalModel posModel
  + Bool isSemantic()
}


enum ContextFreeTokenType {
  + SemanticTokenType[] semanticCandidates
  -----
  .. ContextFree and Semantic ..
  X2LKT_REVERSE_SWITCH
  GNU_EXPLICIT_ASSIGNMENT
  X2LKT_EXPLICIT_ASSIGNMENT
  POSIX_END_OF_OPTIONS
  .. Strictly ContextFree ..
  ONE_DASH_WORD
  ONE_DASH_LETTER
  TWO_DASH_WORD
  WORD
}

note "isOption* is resolved to type.posModel.isOption* \nwhen type.posModel is not UNSET or to true when \n '∀c ∈ {posModelCandidates}, c.isOption* = true', false otherwise.\nSeemingly, isBoundToOneOf is resolved to \n'token.type.posModelbinding.binding ∈ {bindings}'\nwhen posModel is not UNSET, otherwise to\n'{bindings} ∩ {token.posModelCandidates} = {bindings}'." as N2
Token  .. N2
N2 .. TokenPositionalModel


enum SemanticTokenType {
  + OptionExpressionVariant variant
  -----
  .. ContextFree and Semantic ..
  X2LKT_REVERSE_SWITCH
  GNU_EXPLICIT_ASSIGNMENT
  X2LKT_EXPLICIT_ASSIGNMENT
  POSIX_END_OF_OPTIONS
  .. Strictly Semantic ..
  POSIX_SHORT_SWITCH
  POSIX_GROUPED_SHORT_FLAGS
  POSIX_SHORT_ASSIGNMENT_LEFT_SIDE
  POSIX_SHORT_ASSIGNMENT_VALUE
  POSIX_SHORT_STICKY_VALUE
  GNU_IMPLICIT_ASSIGNMENT_LEFT_SIDE
  GNU_IMPLICIT_ASSIGNMENT_VALUE
  X2LKT_SWITCH
  X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE
  X2LKT_IMPLICIT_ASSIGNMENT_VALUE
  OPERAND
  HEADLESS_OPTION
}

class Parser {
  CallExpressionMetadata parse(CallExpression callExpression)
}


TokenType <|-- ContextFreeTokenType
TokenType <|-- SemanticTokenType
OptionDescription o-- ProgramInterfaceModel
OptionExpressionVariant o-- OptionScheme
OptionExpressionVariant o-- OptionDescription
OptionExpressionVariant o--o TokenType
OptionScheme o-- ProgramInterfaceModel
TokenType   "1" *-- "*" Token
TokenPositionalModel *-- TokenType
OptionDescription "?" o-- "*" Token
CallExpression o-- Parser
Token o-- CallExpressionMetadata
CallExpressionMetadata o-- Parser
ProgramInterfaceModel o-- Parser
Program o-- ProgramInterfaceModel

@enduml

4.4. Option parsing algorithm

This section offers an in-depth look at tokenization (B) step from Fig. 4.1. The parser will hold in memory a list of tokens (Fig. 4.2). Each of these starts with a context-free type. The parser’s job is considered done when all tokens hold a semantic type. To get there, it will proceed with the following steps :

  1. Initiate the token list with the result of mapping arguments to context-free token generation.

  2. Fetch the utility interface model (UIM) if it exists.

  3. Provide the list and the UIM as arguments of the parse function (Fig. 4.3). Such function will do the following:

    1. Check for the existence of an POSIX_END_OF_OPTIONS typed token (Fig. 4.4) and convert to operands all remaining tokens to the right.

    2. Repeat the following operation until the last two operations didn’t turn out to at least one context-free to semantic conversion:

      For each non-semantic token, inferRight (Fig. 4.5) and inferLeft (Fig. 4.6). Those functions will try to infer the semantic type by checking its siblings’. For example, if the left sibling token type is X2LKT_IMPLICIT_ASSIGNEMNT_LEFT_SIDE, the only possible type for this token would be X2LKT_IMPLICIT_ASSIGNMENT_VALUE. If the token type is “option part”, use the option descriptions from the UIM to try an exact match (Fig. 4.8). For example, the token is --reverse, and the utility interface model contains an option description that exactly match --reverse. If no exact match is found, check for a pattern match with the option scheme (Fig. 4.9). For example, if the token -pq is encountered, and the program option scheme is “Linux-Standard-Explicit” (see Table 2.2), the only possible mapping for ONE_DASH_WORD will be POSIX_GROUPED_SHORT_FLAGS. Finally, increment conversions if the token type “is semantic”.

  4. Until all tokens are of “semantic” type, prompt the user for a token type annotation and loop back at 3.2.

../../_images/parse.png

Fig. 4.3 Parse function

../../_images/checkEndOfOptions.png

Fig. 4.4 CheckEndOfOptions function

../../_images/inferRight.png

Fig. 4.5 InferRight function

../../_images/inferLeft.png

Fig. 4.6 InferLeft function

../../_images/convertToSemantic.png

Fig. 4.7 ConvertToSemantic function

../../_images/matchOptionDescription.png

Fig. 4.8 MatchOptionDescription function

../../_images/reduceCandidatesWithScheme.png

Fig. 4.9 ReduceCandidatesWithScheme function

4.5. Edge cases and extension perspectives

Some argument constructs must be anticipated, so here is a list of problematic examples to open to further enhancements:

  • How to model restricted operands such as in dd(1)? Although they look like headless options, dd operands are “typed”.
  • How to model commands which operands can be another command, such as find -exec <command> {} ; ?

[1]Although HEADLESS_OPTION is an option, it is very rare and should only be matched when defined in a utility interface model, or reviewed by the user. So, by default we assume a WORD is not an option.