Lecture 3: Grammars, Derivations, Parse Trees, Scanning, Introduction to Oz
Syntactic Analysis of Programs
How are programs processed?
- The initial input is linear: it is a sequence of symbols from the alphabet of characters.
- A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
- A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program: a syntax tree (parse tree).
- The syntax tree is further processed, e.g., by an interpreter or by a compiler.
We have seen some of these steps implemented in the mdc interpreter.
Program: if X == 1 then ...
Input: 'i' 'f' ' ' 'X' ' ' '=' '=' ' ' '1' ' ' 't' 'h' 'e' 'n' ...
Lexemization: 'if' 'X' '==' '1' 'then' ...
Tokenization: key('if') var('X') op('==') int(1) key('then') ...
Parsing: program(ifthenelse(eq(var('X')
                               int(1))
                            ...
                            ...)
                 ...)
Interpretation: execution according to language semantics
Compilation: code generation according to language semantics
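The scanning step in the example above can be sketched in Python. This is only an illustration, not the mdc interpreter's scanner: the keyword set, the operator set, and the rule that every non-keyword word is a variable are assumptions made for this sketch. The token kinds key, var, op and int mirror the example.

```python
import re

# Assumed keyword and operator sets, chosen only for this illustration.
KEYWORDS = {"if", "then", "else"}
TOKEN_RE = re.compile(r"(\d+)|([A-Za-z_]\w*)|(==|[-+*/=<>()])")

def tokenize(text):
    """Turn a string of characters into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while True:
        # Whitespace separates lexemes but produces no token.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return tokens
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        number, word, operator = match.groups()
        if number is not None:
            tokens.append(("int", int(number)))
        elif word in KEYWORDS:
            tokens.append(("key", word))
        elif word is not None:
            # Simplification: any word that is not a keyword is a variable.
            tokens.append(("var", word))
        else:
            tokens.append(("op", operator))
        pos = match.end()

print(tokenize("if X == 1 then"))
# [('key', 'if'), ('var', 'X'), ('op', '=='), ('int', 1), ('key', 'then')]
```

Lexemization and tokenization happen in one pass here: the regular expression finds the next lexeme, and the branch that matched decides its token kind.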
Derivations
Following the recipe for using a grammar explained earlier, we can derive sentences in the language specified by a grammar in a sequence of steps.
- In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
- The first sentential form is the start variable v_s alone.
- The last sentential form is a valid sentence, composed only of terminals.
Rightmost and leftmost derivations
A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.
- A derivation such that in each step it is the leftmost non-terminal that is replaced is called a 'leftmost derivation'.
- A derivation such that in each step it is the rightmost non-terminal that is replaced is called a 'rightmost derivation'.
- There can be derivations that are neither leftmost nor rightmost.
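To make the difference concrete, here is a minimal Python sketch that derives the same sentence in both ways. The toy grammar (S ::= A B, A ::= a, B ::= b) and the function name derive are assumptions introduced for this illustration only.

```python
# Hypothetical toy grammar: the dictionary keys are the non-terminals,
# every other symbol is a terminal.
GRAMMAR = {
    "S": [["A", "B"]],
    "A": [["a"]],
    "B": [["b"]],
}

def derive(start, rule_choices, leftmost=True):
    """Return the sentential forms obtained by applying rule_choices in order.

    Each entry of rule_choices is the index of the rule alternative used
    for the non-terminal that is selected in that step.
    """
    form = [start]
    steps = [form]
    for choice in rule_choices:
        positions = [i for i, symbol in enumerate(form) if symbol in GRAMMAR]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        # Select the leftmost or the rightmost non-terminal.
        i = positions[0] if leftmost else positions[-1]
        right_hand_side = GRAMMAR[form[i]][choice]
        form = form[:i] + right_hand_side + form[i + 1:]
        steps.append(form)
    return steps

def show(steps):
    return " => ".join(" ".join(form) for form in steps)

print(show(derive("S", [0, 0, 0], leftmost=True)))   # S => A B => a B => a b
print(show(derive("S", [0, 0, 0], leftmost=False)))  # S => A B => A b => a b
```

Both derivations end in the same sentence; they differ only in the order in which the non-terminals are replaced.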
Syntax Trees
A parse tree (a syntax tree) is a structured representation of a program.
- Parse trees are generated in the process of parsing programs.
- A parser is a function (a program) that takes as input a sequence of tokens and returns a nested data structure corresponding to a parse tree.
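As an illustration of a parser as a function from tokens to a nested data structure, here is a small recursive-descent sketch in Python. The grammar (expr ::= term { '+' term }, term ::= int | var), the (kind, value) token format, and the tuple-based tree shape are all assumptions made for this example.

```python
def parse_term(tokens):
    """term ::= int | var. Returns (tree, remaining tokens)."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[0]
    if kind in ("int", "var"):
        return (kind, value), tokens[1:]
    raise SyntaxError(f"unexpected token {tokens[0]!r}")

def parse_expr(tokens):
    """expr ::= term { '+' term }. Returns (tree, remaining tokens)."""
    tree, rest = parse_term(tokens)
    while rest and rest[0] == ("op", "+"):
        right, rest = parse_term(rest[1:])
        # Nesting to the left makes '+' left-associative.
        tree = ("add", tree, right)
    return tree, rest

def parse(tokens):
    """Parse a complete token list into a nested tuple."""
    tree, rest = parse_expr(tokens)
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree

tokens = [("int", 1), ("op", "+"), ("var", "X"), ("op", "+"), ("int", 2)]
print(parse(tokens))
# ('add', ('add', ('int', 1), ('var', 'X')), ('int', 2))
```

The input is a flat list, while the result is nested: the structure of the tuple records how the sentence was derived.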
The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:
- interpret the program (in interpreted languages);
- generate target code (in compiled languages);
- optimize the intermediate code (in both interpreted and compiled languages);
- analyze the intermediate code, e.g., perform static analysis or compute code metrics (in both interpreted and compiled languages).
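Two of these uses can be sketched on a small tree in Python. The tree format (nested tuples with the node kinds add, int and var) and the function names evaluate and count_nodes are assumptions for this illustration, not a fixed representation.

```python
def evaluate(tree, env):
    """Interpret the tree: compute its value given variable bindings in env."""
    kind = tree[0]
    if kind == "int":
        return tree[1]
    if kind == "var":
        return env[tree[1]]
    if kind == "add":
        return evaluate(tree[1], env) + evaluate(tree[2], env)
    raise ValueError(f"unknown node kind {kind!r}")

def count_nodes(tree):
    """A simple code metric: the number of nodes in the tree."""
    if tree[0] in ("int", "var"):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])

# Hand-written tree for the expression 1 + X + 2.
tree = ("add", ("add", ("int", 1), ("var", "X")), ("int", 2))
print(evaluate(tree, {"X": 4}))  # 7
print(count_nodes(tree))         # 5
```

Both functions follow the same pattern: inspect the kind of the root node, then recurse into the subtrees.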
Ambiguity
A grammar is ambiguous if a sentence can be parsed in more than one way, that is, if the program has more than one parse tree.
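A small Python sketch can make this visible by enumerating every parse tree of one sentence. The grammar used here (E ::= E '+' E | E '*' E | number) is a common textbook example of an ambiguous grammar, chosen as an assumption for this illustration; the brute-force enumeration is not how a practical parser works.

```python
def all_trees(tokens):
    """Return every parse tree of tokens under the ambiguous grammar
         E ::= E '+' E | E '*' E | number
    Tokens are integers and the operator strings '+' and '*'.
    """
    if len(tokens) == 1 and isinstance(tokens[0], int):
        return [tokens[0]]
    trees = []
    for i, token in enumerate(tokens):
        if token in ("+", "*"):
            # Try this operator as the root and combine every tree of the
            # left part with every tree of the right part.
            for left in all_trees(tokens[:i]):
                for right in all_trees(tokens[i + 1:]):
                    trees.append((token, left, right))
    return trees

for tree in all_trees([1, "+", 2, "*", 3]):
    print(tree)
# ('+', 1, ('*', 2, 3))
# ('*', ('+', 1, 2), 3)
```

The sentence 1 + 2 * 3 has two parse trees, so the grammar is ambiguous. The two trees also suggest different values (7 and 9), which is why ambiguity matters for the meaning of a program.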