In the early days of programming languages, both the syntax and semantics of a language were described by lengthy English explanations
Noam Chomsky developed the idea of context-free grammars, and John Backus, with contributions by Peter Naur, developed a notational system for describing these grammars that was used for the first time to describe the syntax of Algol60
1950s
Lexical structure of a programming language
The structure of its tokens, or words
Typical token categories
Reserved words/keywords
Literals/constants
Special symbols
Identifiers
Reserved words
An identifier cannot have the same character string as a reserved word
Predefined identifiers
Identifiers that have been given an initial meaning for all programs in the language but that are capable of redefinition
Free-format language
Format has no effect on the program structure
Fixed format language
All tokens must occur in pre-specified locations on the page
Regular expressions
Describe tokens formally
Context-free grammar
Consists of a series of grammar rules, with a left-hand side that is a single phrase structure name, followed by the metasymbol "→", followed by a right-hand side consisting of a sequence of items that can be symbols or other phrase structure names
Nonterminals
Names for phrase structures, broken down into further phrase structures
Terminals
Words or token symbols
Grammar rules
Also called productions, "produce" the strings of the language using derivations
Backus-Naur Form (BNF)
Productions are in this form if they are as given using only the metasymbols "→" and "|"
Context-free grammar
Defines a language, called the language of the grammar
Grammar rule examples
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<nat> → <digit>
<nat> → <digit> <nat>
Parse tree
Graphical depiction of the replacement process in a derivation, labeled by nonterminals at interior nodes and terminals at leaves
Abstract syntax tree
Abstracts the essentialstructure of the parse tree, may do away with redundant terminals
Ambiguity
Twodifferentderivations can lead to the same parse tree or syntax tree
Left-recursive rule
Causesleft-association
Right-recursive rule
Causesright-association
Lexical conventions
Specifydetails of formatting, such as white-space conventions, separate from the grammar
Reserved words
Fixed strings of characters that are tokens themselves and cannot be used as identifiers
Predefined identifiers
Fixed strings that are given a predefined meaning in a language, but this meaning can be changed by redefining it within a program
Syntax and semantics can become interdependent in languages when semantic information must be used to distinguish ambiguous parsing situations
During its scanning phase, a translator collects sequences of character from the input program and forms them into tokens.
During its parsing phase, the translator then processes these tokens, thereby determining the program's syntactic structure
Reserved words
sometimes called keywords (if and while)
Literals or constants
such as 42 (numeric literal) or "hello" ( a string litral)
Special symbols
; , <=, +
Identifiers
such as x24, monthly_balance, or putchar
double if;
This variable declaration is illegal in C because 'if' is a reserved word.
Examples of predefined identifiers are all standard data types in Ada, such as integer and boolean, and in Python, such as round and print.
The format of a program can affect the way tokens are recognized
The principle of longest substring requires that certain tokens can be separated by "token delimiters" or white space
Most modern languages are free format
Tokens in a programming language are often described in English, but they can also be described formally by regular expressions.