Friday 11 October 2013

LR parsing Implementation


Topics

LR parsing introduction
Simple LR parsing
Canonical LR parsing
LALR parsing
The Yacc parser generator
--------------------------------------------------------------------------------------------
Bottom-up (shift-reduce) Parsers
Consider the grammar:
S -> aABe
A -> Abc | b
B -> d
The sentence abbcde can be reduced to S, thus:
abbcde
aAbcde
aAde
aABe
S
The technique used here is to scan the string looking for a substring which matches the right side of a production, and which is replaced by the left side of that production.
The preceding reduction steps trace the following rightmost derivation, in reverse:
S -> aABe -> aAde -> aAbcde -> abbcde
Note in the derivation, the choice of productions:
abbcde
aAbcde
aAde ...etc
After the first b was replaced by A, there were 3 substrings which matched the right sides of productions (Abc, b and d). The substring Abc is a handle, defined as the leftmost simple phrase of a sentential form and is always chosen in this technique.
Terminology and Concepts
A shift-reduce parser operates by scanning the input symbols and either shifting them onto a stack or, if the top of stack holds the right end of a handle of the grammar, reducing the handle to a simpler sentential form which becomes the new top of stack item.
A conflict occurs when the parser cannot decide whether to either:
shift or reduce the top of stack (a shift/reduce conflict), or
reduce the top of stack using one of two possible productions (a reduce/reduce conflict)
The grammars which can be parsed in this way are called LR, and are more general than LL grammars.
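To see these moves concretely, here is the shift-reduce parse of abbcde for the grammar above, shown as a stack/input trace ($ marks the stack bottom and the end of input):

Stack     Input      Action
$         abbcde$    shift a
$a        bbcde$     shift b
$ab       bcde$      reduce by A -> b
$aA       bcde$      shift b
$aAb      cde$       shift c
$aAbc     de$        reduce by A -> Abc
$aA       de$        shift d
$aAd      e$         reduce by B -> d
$aAB      e$         shift e
$aABe     $          reduce by S -> aABe
$S        $          accept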

--------------------------------------------------------------------------------------------------------------

LR parsing introduction
The "L" is for left-to-right scanning of the input and the "R" is for constructing a rightmost derivation in reverse
Advantages of LR parsing:
LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing method known, yet it can be implemented as efficiently as other shift-reduce methods.
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive (LL) parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
The disadvantage is that it takes too much work to construct an LR parser by hand for a typical programming-language grammar. But there are lots of LR parser generators available to make this task easy.
The LR parsing algorithm
The schematic form of an LR parser is shown below. 




The program uses a stack to store a string of the form s0X1s1X2...Xmsm, where sm is on top. Each Xi is a grammar symbol and each si is a symbol representing a state. Each state symbol summarizes the information contained in the stack below it. The combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table and determine the shift-reduce parsing decision.
The parsing table consists of two parts: a parsing action function action and a goto function goto. The program driving the LR parser behaves as follows: it determines sm, the state currently on top of the stack, and ai, the current input symbol. It then consults action[sm, ai], which can have one of four values:
i.   shift s, where s is a state
ii.  reduce by a grammar production A -> b
iii. accept
iv.  error
The function goto takes a state and grammar symbol as arguments and produces a state. For a parsing table constructed for a grammar G, the goto table is the transition function of a deterministic finite automaton that recognizes the viable prefixes of G. Recall that the viable prefixes of G are those prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser because they do not extend past the rightmost handle.
A configuration of an LR parser is a pair whose first component is the stack contents and whose second component is the unexpended input:
(s0 X1 s1 X2 s2... Xm sm, ai ai+1... an$)
This configuration represents the right-sentential form
X1 X2 ... Xm ai ai+1 ... an
in essentially the same way a shift-reduce parser would; only the presence of the states on the stack is new. Recall the sample parse we did (see Example 1: Sample bottom-up parse) in which we assembled the right-sentential form by concatenating the remainder of the input buffer to the top of the stack. The next move of the parser is determined by reading ai and sm, and consulting the parsing action table entry action[sm, ai]. Note that we consult only the state on top of the stack, not any symbol below it. We'll see how this actually works later.
The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move entering the configuration
(s0 X1 s1 X2 s2... Xm sm ai s, ai+1... an$)
Here the parser has shifted both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol.
If action[sm, ai] = reduce A -> b, then the parser executes a reduce move, entering the configuration,
(s0 X1 s1 X2 s2... Xm-r sm-r A s, ai ai+1... an$)
where s = goto[sm-r, A] and r is the length of b, the right side of the production. The parser first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state sm-r. The parser then pushed both A, the left side of the production, and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.
The output of an LR parser is generated after a reduce move by executing the semantic action associated with the reducing production. For example, we might just print out the production reduced.
If action[sm, ai] = accept, parsing is completed.
If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.


LR parsing algorithm
Input: Input string w and an LR parsing table with functions action and goto for a grammar G.
Output: If w is in L(G), a bottom-up parse for w. Otherwise, an error indication.
Method: Initially the parser has s0, the initial state, on its stack, and w$ in the input buffer.
set ip to point to the first symbol of w$;
repeat forever begin
    let s be the state on top of the stack
    and a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a, then push s' on top of the stack; // <symbol, state> pair
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A -> b then begin
        pop 2 * |b| symbols off the stack;
        let s' be the state now on top of the stack;
        push A, then push goto[s', A] on top of the stack;
        output the production A -> b // for example
    end
    else if action[s, a] = accept then
        return
    else error()
end
Let's work an example to get a feel for what is going on.
An Example
--------------------------------------------------------------------
(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1
The Action and Goto Table
The two LR(0) parsing tables for this grammar look as follows:


The action table is indexed by a state of the parser and a terminal (including the special symbol $ that indicates the end of the input stream) and contains three types of actions: a shift, written 'sn', indicating that the next state is n; a reduce, written 'rm', indicating that a reduction with grammar rule m should be performed; and an accept, written 'acc', indicating that the parser accepts the string in the input stream.
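As a concrete illustration, the following minimal C sketch encodes one possible LR(0) action/goto table for this grammar and runs the driver loop from the previous section on the input 1 + 1 * 0. The state numbering and the table encoding (positive entries shift, negative entries reduce, ACC accepts, 0 is an error) are assumptions for illustration, not the only valid construction.

/* lr_driver.c -- a minimal sketch of the LR parsing algorithm in C for
 * the grammar (1) E -> E * B  (2) E -> E + B  (3) E -> B
 *             (4) B -> 0      (5) B -> 1                             */
#include <stdio.h>

enum { T_STAR, T_PLUS, T_ZERO, T_ONE, T_EOF };   /* terminals: * + 0 1 $ */
enum { NT_E, NT_B };                             /* nonterminals: E B    */

#define S(n) ((n) + 1)   /* shift and go to state n */
#define R(n) (-(n))      /* reduce by rule n        */
#define ACC  999         /* accept                  */

static const int action[9][5] = {
/* state     *      +      0      1      $   */
/* 0 */ {    0,     0,  S(1),  S(2),     0 },
/* 1 */ { R(4),  R(4),  R(4),  R(4),  R(4) },   /* B -> 0 .       */
/* 2 */ { R(5),  R(5),  R(5),  R(5),  R(5) },   /* B -> 1 .       */
/* 3 */ { S(5),  S(6),     0,     0,   ACC },   /* accept state   */
/* 4 */ { R(3),  R(3),  R(3),  R(3),  R(3) },   /* E -> B .       */
/* 5 */ {    0,     0,  S(1),  S(2),     0 },   /* E -> E * . B   */
/* 6 */ {    0,     0,  S(1),  S(2),     0 },   /* E -> E + . B   */
/* 7 */ { R(1),  R(1),  R(1),  R(1),  R(1) },   /* E -> E * B .   */
/* 8 */ { R(2),  R(2),  R(2),  R(2),  R(2) },   /* E -> E + B .   */
};
static const int go[9][2] = {    /* goto table, columns E and B */
    {3, 4}, {0, 0}, {0, 0}, {0, 0}, {0, 0}, {0, 7}, {0, 8}, {0, 0}, {0, 0},
};
static const int rule_len[] = { 0, 3, 3, 1, 1, 1 };   /* length of right side */
static const int rule_lhs[] = { 0, NT_E, NT_E, NT_E, NT_B, NT_B };
static const char *rule_str[] = { "", "E -> E * B", "E -> E + B",
                                  "E -> B", "B -> 0", "B -> 1" };

int main(void)
{
    int input[] = { T_ONE, T_PLUS, T_ONE, T_STAR, T_ZERO, T_EOF }; /* 1+1*0$ */
    int stack[100], top = 0, ip = 0;
    stack[0] = 0;                           /* initial state s0 */
    for (;;) {
        int act = action[stack[top]][input[ip]];
        if (act == ACC) { printf("accept\n"); return 0; }
        if (act > 0) {                      /* shift: push the new state */
            stack[++top] = act - 1;
            ip++;
        } else if (act < 0) {               /* reduce by rule -act */
            int r = -act;
            top -= rule_len[r];             /* pop |right side| states */
            stack[top + 1] = go[stack[top]][rule_lhs[r]];
            top++;
            printf("%s\n", rule_str[r]);    /* output the production */
        } else {
            printf("error\n"); return 1;
        }
    }
}

For brevity the stack holds only states (a common simplification, since each state uniquely determines the grammar symbol that led to it). Running the sketch should print the reductions B -> 1, E -> B, B -> 1, E -> E + B, B -> 0, E -> E * B and then accept, i.e. a rightmost derivation in reverse.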
Simple LR parsing
An LR(0) item (or just item) of a grammar G is a production of G with a dot at some position of the right side indicating how much of a production we have seen up to a given point. For example, for the production E -> E + T we would have the following items:
[E -> .E + T]
[E -> E. + T]
[E -> E +. T]
[E -> E + T.]
We call them LR(0) items because they contain no explicit reference to lookahead. More on this later when we look at canonical LR parsing.
The central idea of the SLR method is first to construct from the grammar a deterministic finite automaton to recognize viable prefixes. With this in mind, we can easily see the following:
the symbols to the left of the dot in an item are on the stack and the symbols to the right are still to come (where are they?).
up until the time when we have the dot to the right of the last symbol of the production, we have a viable prefix (why is this true?).
when the dot reaches the right side of the last symbol of the production, we have a handle for the production and can do a reduction (the text calls this a completed item; similarly it calls [E -> .E + T] an initial item).
an item is a summary of the recent history of a parse (how so?)
items correspond to the states of an NFA (why an NFA and not a DFA?).
Now, if items correspond to states, then there must be transitions between items (paralleling transitions between the states of an NFA). Some of these are fairly obvious. For example, consider the transition from [E -> .(E)] to [E -> (.E)] which occurs when a "(" is shifted onto the stack. In an NFA this would correspond to following the arc labelled "(" from the state corresponding to [E -> .(E)] to the state corresponding to [E -> (.E)]. Similarly, we have [T -> .F] and [T -> F.] which occurs when F is produced as the result of a reduction and pushed onto the stack. Other transitions can occur on e-transitions.
The insight that items correspond to states leads us to the explanation for why we need e-transitions.
Consider a transition on symbol X from [A -> a.Xg] to [A -> aX.g]. In a transition diagram this looks like:

If X is a terminal symbol, this transition corresponds to shifting X from the input buffer to the top of the stack.
Things are more complicated if X is a nonterminal because nonterminals cannot appear in the input and be shifted onto the stack as we do with terminals. Rather, nonterminals only appear on the stack as the result of a reduction by some production X -> b.
How does such a reduction take place? We would have to have the handle b on top of the stack, which means we must have already recognized b.
*** The process of recognizing b begins with the initial item [X -> .b]. So, for every item [A -> a.Xg] we add an e-transition for each production choice X -> b of X, which says that X can be produced by recognizing any of the right-hand sides of its production choices. In the transition diagram it looks like this:

To complete our understanding of the creation of an NFA from the items, we need to decide on the choices for start state and final states.
We'll consider final states first. Recall that the purpose of the NFA is not to recognize strings, but to keep track of the current state of the parse, thus it is the parser that must decide when to do an accept and the NFA need not contain that information.
For the start state, consider the initial configuration of the parser: the stack is empty and we want to recognize S, the start symbol of the grammar. But there may be many initial items [S -> .a] from which to choose.
To solve the problem, we augment our grammar with a new production S' -> S, where S' is the new start symbol and [S' -> .S] becomes the start state for the NFA. What will happen is that when doing the reduction for this production, the parser will know to do an accept.
The following example makes the need for e-transitions and an augmented grammar more concrete. Consider the following augmented grammar:
E' -> E
E -> E + T
E -> T
T -> T * F
T -> F
F -> (E)
F -> id
A quick examination of the grammar reveals that any legal string must begin with either ( or id, resulting in one or the other being pushed onto the stack.
So we would have either the state transition [F -> .(E)] to [F -> (.E)] or the transition from [F -> .id] to [F -> id.].
But clearly to make either of these transitions we must already be in the corresponding state ([F -> .(E)] or [F -> .id]).
Recall, though, that we always begin with our start state [E' -> .E] and note that there is no transition from the start state to either [F -> .(E)] or [F -> .id].
To get from the start state to one of these two states without consuming anything from the input we must have e-transitions.
The example from the book makes this a little clearer. We want to parse "(id)".
Item and e-transitions
Stack     State          Comments
(empty)   [E' -> .E]     can't go anywhere from here, so we follow e-transitions
(empty)   [F -> .(E)]    now we can shift the (
(         [F -> (.E)]    building the handle (E); this state says: "I have ( on the stack and expect the input to give me tokens that can eventually be reduced to give me the rest of the handle, E)."
Now let's try to bring all this together with a further example. Let's build the NFA for the following (very simple) grammars:
S -> ( S ) S | e
S -> ( S ) | a
Here's how:
Construct the list of LR(0) items.
Build a first version of the NFA without worrying about e-transitions.
Add the e-transitions using the rule: every item with a dot before S has an e-transition to every initial item of S. (Can you generalize this rule?)
We make a distinction between kernel items and non-kernel items (also known as closure items).
Which items are kernel items? The kernel items are S' -> .S and all items whose dots are not at the left end (these are the items that originate a state, as the targets of non-e-transitions).
Which items are closure items? The closure items are those items, other than S' -> .S, which have their dots at the left end (these are the items added to a state during the e-closure step).
constructing the LR parsing table
To construct the parser table we must convert our NFA into a DFA.
*** The states in the LR table will be the e-closures of the states corresponding to the items!! SO...the process of creating the LR state table parallels the process of constructing an equivalent DFA from a machine with e-transitions. Been there, done that - this is essentially the subset construction algorithm so we are in familiar territory here! We need two operations: closure() and goto().
closure()
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by the two rules:
Initially every item in I is added to closure(I)
If A -> a.Bb is in closure(I), and B -> g is a production, then add the initial item [B -> .g] to closure(I), if it is not already there. Apply this rule until no more new items can be added to closure(I).
From our grammar above, if I is the set of one item {[E'-> .E]}, then closure(I) contains:
I0: E' -> .E
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto()
goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure of the set of all items [A -> aX.b] such that [A -> a.Xb] is in I.
The idea here is fairly intuitive: if I is the set of items that are valid for some viable prefix g, then goto(I, X) is the set of items that are valid for the viable prefix gX.
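To make these two operations concrete, here is a small self-contained C sketch for the augmented expression grammar above. An item is represented as a pair (production number, dot position); the representation and all names are illustrative assumptions, not a prescribed implementation.

/* items.c -- a sketch of closure() and goto() over LR(0) items for the
 * augmented expression grammar in the text. */
#include <stdio.h>

enum Sym { EP, E, T, F, PLUS, STAR, LP, RP, ID };   /* EP stands for E' */
#define IS_NONTERM(s) ((s) <= F)
#define MAXITEMS 32

struct Prod { enum Sym lhs; int len; enum Sym rhs[3]; };
static const struct Prod prods[] = {
    { EP, 1, { E } },            /* 0: E' -> E     */
    { E,  3, { E, PLUS, T } },   /* 1: E  -> E + T */
    { E,  1, { T } },            /* 2: E  -> T     */
    { T,  3, { T, STAR, F } },   /* 3: T  -> T * F */
    { T,  1, { F } },            /* 4: T  -> F     */
    { F,  3, { LP, E, RP } },    /* 5: F  -> ( E ) */
    { F,  1, { ID } },           /* 6: F  -> id    */
};
#define NPRODS 7
static const char *names[] = { "E'", "E", "T", "F", "+", "*", "(", ")", "id" };

struct Item { int prod, dot; };

/* closure(I): for every item with the dot before a nonterminal B, add the
 * initial item [B -> .g] for each production B -> g, until no change. */
static int closure(struct Item set[], int n)
{
    for (int i = 0; i < n; i++) {                /* newly added items are scanned too */
        const struct Prod *p = &prods[set[i].prod];
        if (set[i].dot >= p->len) continue;      /* completed item */
        enum Sym b = p->rhs[set[i].dot];
        if (!IS_NONTERM(b)) continue;
        for (int j = 0; j < NPRODS; j++) {
            if (prods[j].lhs != b) continue;
            int present = 0;
            for (int k = 0; k < n; k++)
                if (set[k].prod == j && set[k].dot == 0) present = 1;
            if (!present && n < MAXITEMS) { set[n].prod = j; set[n].dot = 0; n++; }
        }
    }
    return n;                                    /* new size of the item set */
}

/* goto(I, X): advance the dot over X in every item [A -> a.Xb] of I,
 * then take the closure of the result. */
static int goto_set(const struct Item in[], int n, enum Sym x, struct Item out[])
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        const struct Prod *p = &prods[in[i].prod];
        if (in[i].dot < p->len && p->rhs[in[i].dot] == x) {
            out[m].prod = in[i].prod;
            out[m].dot  = in[i].dot + 1;
            m++;
        }
    }
    return closure(out, m);
}

static void print_items(const struct Item set[], int n)
{
    for (int i = 0; i < n; i++) {
        const struct Prod *p = &prods[set[i].prod];
        printf("  %s ->", names[p->lhs]);
        for (int j = 0; j < p->len; j++)
            printf("%s%s", set[i].dot == j ? " ." : " ", names[p->rhs[j]]);
        printf("%s\n", set[i].dot == p->len ? " ." : "");
    }
}

int main(void)
{
    struct Item I0[MAXITEMS] = { { 0, 0 } };     /* start from [E' -> .E] */
    int n0 = closure(I0, 1);
    printf("I0 = closure({[E' -> .E]}):\n");
    print_items(I0, n0);

    struct Item IE[MAXITEMS];
    int nE = goto_set(I0, n0, E, IE);            /* goto(I0, E) */
    printf("goto(I0, E):\n");
    print_items(IE, nE);
    return 0;
}

Running it reproduces the seven items of I0 listed earlier, and goto(I0, E) yields { [E' -> E.], [E -> E. + T] }.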
Building a DFA from the LR(0) items
Now we have the tools we need to construct the canonical collection of sets of LR(0) items for an augmented grammar G'.
Sets-of-Items-Construction: to construct the canonical collection of sets of LR(0) items for augmented grammar G'.
procedure items(G')
begin
C := {closure({[S' -> .S]})};
repeat
for each set of items I in C and each grammar symbol X
such that goto(I, X) is not empty and not in C do
add goto(I, X) to C;
until no more sets of items can be added to C
end;
algorithm for constructing an SLR parsing table
Input: augmented grammar G'
Output: SLR parsing table functions action and goto for G'
Method:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. Its parsing actions are determined as follows:
if [A -> a.ab] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a terminal.
if [A -> a.] is in Ii, then set action[i, a] to "reduce A -> a" for all a in FOLLOW(A). Here A may not be S'.
if [S' -> S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by these rules, the grammar is not SLR(1) and the algorithm fails to produce a parser.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules 2 and 3 are made "error".
5. The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
Example: Build the canonical LR(0) collections and DFAs for the following grammars:
Ex 1:
S -> ( S ) S | e
Ex 2:
S -> ( S ) | a
Ex 3:
E' -> E
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> id
Here is what the corresponding DFA looks like:





Parsing Table (the canonical LR(1) table for the augmented grammar S' -> S, S -> CC, C -> cC | d, with rules numbered (1) S -> CC, (2) C -> cC, (3) C -> d; this example is reused in the LALR section below; blank cells are error entries):

state   c     d     $       S   C
0       s3    s4            1   2
1                   acc
2       s6    s7                5
3       s3    s4                8
4       r3    r3
5                   r1
6       s6    s7                9
7                   r3
8       r2    r2
9                   r2

algorithm for construction of the canonical LR parsing table
Input: augmented grammar G'
Output: canonical LR parsing table functions action and goto for G'
Method:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'.
2. State i is constructed from Ii. Its parsing actions are determined as follows:
if [A -> a.ab, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a terminal and b is the lookahead.
if [A -> a., a] is in Ii, then set action[i, a] to "reduce A -> a". Here A may not be S'. Note that, unlike SLR, the reduction is made only on the lookahead a carried in the item, not on every symbol in FOLLOW(A).
if [S' -> S., $] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by these rules, the grammar is not LR(1) and the algorithm fails to produce a parser.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules 2 and 3 are made "error".
5. The initial state of the parser is the one constructed from the set of items containing [S' -> .S, $].
Example: Let's rework the following grammar:
A -> ( A ) | a
Every SLR(1) grammar is an LR(1) grammar. The problem with canonical LR parsing is that it generates a lot of states. This happens because the closure operation has to take the lookahead sets into account as well as the core items.
The next parser combines the simplicity of SLR with the power of LR(1).



LALR parsing
We begin with two observations. First, some of the states generated for LR(1) parsing have the same set of core (or first) components and differ only in their second component, the lookahead symbol. Our intuition is that we should be able to merge these states and reduce the number of states we have, getting close to the number of states that would be generated for LR(0) parsing.
This observation suggests a hybrid approach: We can construct the canonical LR(1) sets of items and then look for sets of items having the same core. We merge these sets with common cores into one set of items. The merging of states with common cores can never produce a shift/reduce conflict that was not present in one of the original states because shift actions depend only on the core, not the lookahead. But it is possible for the merger to produce a reduce/reduce conflict.
Our second observation is that we are really only interested in the lookahead symbol in places where there is a problem. So our next thought is to take the LR(0) set of items and add lookaheads only where they are needed. This leads to a more efficient, but much more complicated method.

We'll look at the simple method. :-)
Algorithm for easy construction of an LALR table

Input: augmented grammar G'
Output: LALR parsing table functions action and goto for G'.
Method:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'.
2. For each core present among the sets of LR(1) items, find all sets having that core and replace these sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions for state i are constructed from Ji in the same manner as in the construction of the canonical LR parsing table. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.
4. The goto table is constructed as follows: If J is the union of one or more sets of LR(1) items, that is, J = I0 U I1 U ... U Ik, then the cores of goto(I0, X), goto(I1, X), ..., goto(Ik, X) are the same, since I0, I1, ..., Ik all have the same core. Let K be the union of all sets of items having the same core as goto(I0, X). Then goto(J, X) = K.
Consider the above example. I3 and I6 can be replaced by their union:
I36: C -> c.C, c/d/$
C -> .cC, c/d/$
C -> .d, c/d/$
Similarly I4 and I7, and I8 and I9, merge:
I47: C -> d., c/d/$
I89: C -> cC., c/d/$
Parsing Table (blank cells are error entries):

state   c      d      $       S   C
0       s36    s47            1   2
1                     acc
2       s36    s47                5
36      s36    s47                89
47      r3     r3     r3
5                     r1
89      r2     r2     r2

handling errors
The LALR parser may continue to do reductions after the LR parser would have spotted an error, but the LALR parser will never do a shift after the point the LR parser would have discovered the error and will eventually find the error.




The YACC Parser Generator - Introduction
There are several varieties of LR parsers (LR(0), LR(1), SLR and LALR), differing in the amount of lookahead used and in how the parse table is constructed, as discussed above.
It is possible, however, to automatically generate LALR parsers (which are considered very powerful) using the YACC parser generator provided on Unix (and other systems).
YACC operates in an analogous manner to lex, in that a yacc source file (of similar format) is compiled into C code which implements the parser.

YACC Input Format
The yacc source program format is very similar to that of lex:
declarations
%%
translation rules
%%
supporting C functions
The production (translation) rules:
S -> A | B | C
would be written in yacc source as:
S : A { action 1 }
| B { action 2 }
| C { action 3 }
;

YACC Example
A yacc source file for a simple grammar (a very simple desk calculator) looks like:
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression
{printf("= %d\n", $1;}
;
expression: expression '+' NUMBER
{ $$ = $1 + $3; }
| expression '-' NUMBER
{ $$ = $1 - $3; }
| NUMBER
{ $$ = $1; }
;
%%

Notes on the Example
The declarations section (before the first %%) defines tokens (see later). Single-quoted characters such as '+' are automatically allowable as tokens without specifying them here.
Every symbol has a value, which is of a type appropriate to the type of the symbol (eg: for a number, the value would be the number itself). Yacc obtains this from yylval. In this parser, the type of the value of a NUMBER will be an int, and the value of a NAME will be a pointer to a symbol table entry.
Whenever a yacc parser reduces a production, it can execute an action rule associated with it. The action rule can refer to the values of the right side as $1, $2, etc, and the value of the left side as $$. Note that the last action code specified is redundant, since $$ = $1 is the default action if none is specified.
Theoretical note -- observe that yacc handles left recursion in the grammar: in fact, it works best if left recursion is present! Left factoring is also unnecessary. This is rather more general than what recursive descent (LL(1)) parsers can handle.

A Simple lexer For The Example yacc Program
This simple lexer doesn't (yet) define the NAME token:
%{
#include "y.tab.h"
%}



%%
[0-9]+ {yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore whitespace */
\n return 0; /* logical EOF */
. return yytext[0];
%%
The commands to compile the combined yacc/lex program are:
% yacc -d calc.y
% lex calc.l
% cc -o calc y.tab.c lex.yy.c -ly -ll
% calc
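With the lexer above, a line such as 2+3-1 followed by return should make calc print = 4, since the newline acts as the logical end of input and the expression rule's action prints the accumulated value.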
Some interesting topics:
LEX-the Lexical Analyzer
The Lex utility generates C code which is essentially a yylex() function that can be used as an interface to YACC. A good amount of detail on Lex can be obtained from the man pages themselves. A practical approach to certain fundamentals is given here.
The General Format of a Lex File consists of three sections:
1. Definitions
2. Rules
3. User Subroutines
Definitions consist of any external 'C' definitions used in the lex actions or subroutines, e.g. all preprocessor directives like #include, #define macros etc. These are simply copied to the lex.yy.c file. The other type of definitions are Lex definitions, which are essentially the lex substitution strings, lex start states and lex table size declarations. The Rules section is the basic part which specifies the regular expressions and their corresponding actions. The User Subroutines section contains the definitions of the functions used in the lex actions.
Things to remember:
1. If there is no regular expression matching the input, it is copied to the standard output.
2. Lex resolves ambiguity in matching by choosing the longest match first; if two rules match the same length of input, the rule given first is chosen.
3. The currently matched text is contained in yytext, whose length is yyleng.
When shall we use the Lex Substitution Strings ?
Lex substitution strings are of importance when the regular expressions become very large and unreadable. In that case it is better to break them into smaller substitution strings and then use them in the Rules section. Another use is when a particular regular expression appears quite often in a number of rules.
When shall one use the Start states ?
Start states help in resolving reduce/reduce conflicts in the parser. They are particularly important when one wants to return different tokens for the same regular expression depending on the previously scanned token.
E.g. suppose we have two tokens/keywords, CustName and ItemName, each followed by a string. If the BNF is such that the string reduces to two different nonterminals, then there is a reduce/reduce conflict. To avoid this it is better to return two different tokens for the string, depending on whether CustName or ItemName was scanned previously, as in the sketch below.
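A minimal lex sketch of this idea (the token names CUSTNAME, ITEMNAME, CUST_STRING and ITEM_STRING and the state names are hypothetical):

%{
#include "y.tab.h"  /* assumed to define CUSTNAME, ITEMNAME, CUST_STRING, ITEM_STRING */
%}
%s CUST ITEM
%%
CustName            { BEGIN CUST; return CUSTNAME; }
ItemName            { BEGIN ITEM; return ITEMNAME; }
<CUST>[A-Za-z]+     { BEGIN 0; return CUST_STRING; /* same R.E ...        */ }
<ITEM>[A-Za-z]+     { BEGIN 0; return ITEM_STRING; /* ... different token */ }
[ \t\n]             ; /* skip whitespace */
%%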
How can one determine correct Table Size Declarations?
There is indeed no hard and fast rule for the above. The best method is not to give any sizes initially; the default table sizes are sufficient for small applications. However, if the limits are crossed then Lex itself will generate error messages giving information on the size violations. One can then experiment by incrementally increasing the corresponding sizes.
When does it become important to give 'C' code to identify a token instead of a R.E?
Sometimes one might notice that after adding one more regular expression to the lex rules, the number of lex states increases tremendously. This happens when the particular regular expression has a subexpression which clashes with other regular expressions. A huge number of states implies more case statements and hence reduces the execution speed of the yylex function. Thus it is more economical to identify an initial yet deterministic part of the token and then use the input() and unput() functions to identify the remaining part of the token.
One nice example is that of a 'C' comment (/* comment */). After identifying the '/*' one can write a simple 'C' action to identify the remaining token.
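For example, a sketch of such an action (assuming the classic lex input(), which returns 0 at end of file; flex returns EOF instead):

"/*"    {
            /* consume characters until the closing star-slash */
            int c1 = 0, c2 = input();
            while (c2 != 0 && !(c1 == '*' && c2 == '/')) {
                c1 = c2;
                c2 = input();
            }
        }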
How do we redirect stdin to a particular FILE * ?
Normally lex takes its input from standard input through a macro definition:

#define lex_input() (((yytchar=yysptr>yysbuf?U(*--yysptr):getc(yyin))==10? (yylineno++,yytchar):yytchar)==EOF?0:yytchar)
To redirect yyin away from stdin, just do the following:
FILE *fp = fopen("Filename", "r");
yyin = fp;
Another legal but bad workaround is to redefine this macro in the definitions section, replacing yyin with fp. However, this will always give a warning during compilation.
What is the significance of the yywrap() function ?
The yywrap() function is executed when the lexical analyzer reaches the end of the input file. It is generally used to print a summary of lexical analysis or to open another file for reading. The yywrap() function should return 0 if it has arranged for additional input, 1 if the end of the input has been reached.
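A minimal sketch (stdio is already included by the generated scanner):

int yywrap(void)
{
    /* no more input to arrange: report and let yylex() stop */
    fprintf(stderr, "lexical analysis complete\n");
    return 1;
}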

YACC-Yet Another Compiler-Compiler
Yacc is the utility which generates the function yyparse(), which is indeed the parser. Yacc accepts a context-free LALR(1) grammar and generates a bottom-up (shift-reduce) parser for it. The general format of the YACC file is very similar to that of the Lex file:
1. Declarations
2. Grammar Rules
3. Subroutines
In Declarations, apart from the legal 'C' declarations, there are a few Yacc-specific declarations which begin with a % sign.

1. %union: defines the stack type for the parser; it is a union of the various data types/structures/objects used.

2. %token: the terminals returned by the yylex function to yacc. A token can also have a type associated with it, for good type checking and syntax-directed translation. The type of a token can be specified as %token <stack member> tokenName.

3. %type: the type of a nonterminal symbol in the grammar rules can be specified with this. The format is %type <stack member> nonterminal.

4. %nonassoc: specifies that there is no associativity of a terminal symbol.

5. %left: specifies the left associativity of a terminal symbol.

6. %right: specifies the right associativity of a terminal symbol.

7. %start: specifies the L.H.S nonterminal symbol of a production rule which should be taken as the starting point of the grammar rules.

8. %prec: changes the precedence level associated with a particular rule to that of the following token name or literal.
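Putting several of these together, a declarations section might look like the following sketch (all names are illustrative):

%union {
    int   ival;          /* value of a NUMBER      */
    char *sval;          /* name of an identifier  */
}
%token <ival> NUMBER
%token <sval> NAME
%type  <ival> expression
%left  '+' '-'
%left  '*' '/'           /* binds tighter than + and -      */
%nonassoc UMINUS         /* used with %prec for unary minus */
%start statement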
The grammar rules are specified as follows:
Context-free grammar production:
p -> A b C
Yacc rule:
p : A b C { /* 'C' actions */ }
The general style for coding the rules is to have all terminals in upper-case and all nonterminals in lower-case.
To facilitate proper syntax-directed translation, Yacc provides pseudo-variables which form a bridge between the values of terminals/nonterminals and the actions. These pseudo-variables are $$, $1, $2, $3, ... $$ is the L.H.S value of the rule, $1 is the first R.H.S value, $2 the second, and so on. The default type for pseudo-variables is integer unless specified otherwise with %type, %token <type> etc.
How are Recursions handled in the grammar rule ?
Recursions are of two types left or right recursions. The left recursion is of form
list : item { /* first item */ }
| list item { /* rest of the items */ }
The right recursion is of form
list : item { /* first item */ }
| item list { /* rest of items */ }
With right recursion the parser is a bit bigger than with left recursion, and it uses more stack space, since all the items must be shifted before the first reduction can take place; the items are then matched from right to left.
How are symbol table or data structures built in the actions ?
For proper syntax-directed translation it is important to make full use of the pseudo-variables. One should define structures/classes for all the productions so as to form a proper abstraction. The parser's value stack should then be a union of pointers to all such structures/classes. The reason why pointers should be used instead of structures is to save space and to avoid copying structures when a rule is reduced. Since the stack is always updated (i.e. on reduction the stack elements are popped and replaced by the L.H.S value), any data referred to only from the stack would otherwise get lost.

LR parser

In computer science, an LR parser is a parser that reads input from Left to right (as it would appear if visually displayed) and produces a Rightmost derivation. The term LR(k) parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions. Usually k is 1 and the term LR parser is often intended to refer to this case.
The syntax of many programming languages can be defined by a grammar that is LR(1), or close to being so, and for this reason LR parsers are often used by compilers to perform syntax analysis of source code.
An LR parser can be created from a context-free grammar by a program called a parser generator or hand written by a programmer. A context-free grammar is classified as LR(k) if there exists an LR(k) parser for it, as determined by the parser generator.
An LR parser is said to perform bottom-up parsing because it attempts to deduce the top level grammar productions by building up from the leaves.
A deterministic context-free language is a language for which some LR(k) grammar exists. Every LR(k) grammar for k > 1 can be mechanically transformed into an LR(1) grammar for the same language, while an LR(0) grammar for the same language may not exist; the LR(0) languages are a proper subset of the deterministic ones.
An LR parser is based on an algorithm which is driven by a parser table, a data structure which contains the syntax of the computer language being parsed. So the term LR parser actually refers to a class of parser that can process almost any programming language, as long as the parser table for a programming language is supplied. The parser table is created by a program called a parser generator.
LR parsing can be generalized to arbitrary context-free language parsing without a performance penalty, even for LR(k) grammars; and since most programming languages can be expressed with LR(k) grammars, where k is a small constant (usually 1), the cost of this generality is rarely paid in practice. Note that parsing of non-LR(k) grammars is an order of magnitude slower (cubic instead of quadratic in relation to the input length).
LR parsing can handle a larger range of languages than LL parsing, and is also better at error reporting, i.e. it detects syntactic errors when the input does not conform to the grammar as soon as possible. This is in contrast to an LL(k) (or even worse, an LL(*) parser) which may defer error detection to a different branch of the grammar due to backtracking, often making errors harder to localize across disjunctions with long common prefixes.
LR parsers are difficult to produce by hand and they are usually constructed by a parser generator or a compiler-compiler. Depending on how the parsing table is generated, these parsers can be called simple LR parsers (SLR), look-ahead LR parsers (LALR), or canonical LR parsers. LALR parsers have more language recognition power than SLR parsers. Canonical LR parsers have more recognition power than LALR parsers.
