Skulpt is a JavaScript implementation of Python. At Pythy, we use it to run students' code in the browser. I've been going through the code to try to understand the translation process. Before I could jump into the JavaScript (Skulpt is mostly written in JavaScript), I realized that I had to figure out how the parse tables are generated.
From the names of the variables I guessed that deterministic finite automata (DFAs) were being used, but I couldn't make sense of all the numbers. So, I decided to look at the Python code that parses the Python grammar itself. [The grammar, I think, is a Parsing Expression Grammar.]
There are two phases in the generation of the parse table -
- Tokenizing - breaking the grammar into tokens.
- Generating the set of DFAs that accept the grammar.
These two phases are interleaved, that is, a new token is retrieved whenever the current token has been consumed towards the construction of a DFA.
The entry point to the generation process is the main.py file, which writes the parse tables to a file whose path must be provided on the command line. I understood the process by stepping through the code using the Python debugger module pdb, and setting breakpoints with pdb.set_trace() in the code.
python -m pdb main.py test_parse_table.js
Tokenizing
The grammar file is read line by line. Each line is matched against regular expressions until the end of the line is reached; each match is a valid token. [Main part of the code]
Depending on the first character of the token, a type from token.py is assigned to it. The resulting token is a 5-tuple:
- The assigned token type [NAME, NUMBER, etc.]
- The actual token string from the grammar ['single_input', 'simple_stmt', 'compound_stmt', etc.]
- A tuple indicating start position of the token string in the text
- row at which the token string begins
- column at which the token string begins
- A tuple indicating end position of the token string in the text
- row at which the token string ends
- column at which the token string ends
- The line string. ["single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE"]
Example of Tokenizing
Line : "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE"
Tokens :
- (tokenize.NAME, "single_input", (30, 0), (30, 12), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.OP, ":", (30, 12), (30, 13), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.NAME, "NEWLINE", (30, 14), (30, 21), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.OP, "|", (30, 22), (30, 23), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.NAME, "simple_stmt", (30, 24), (30, 35), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.OP, '|', (30, 36), (30, 37), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.NAME, "compound_stmt", (30, 38), (30, 51), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.NAME, "NEWLINE", (30, 52), (30, 59), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
- (tokenize.NEWLINE, '\n', (30, 59), (30, 60), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
Here the token "NEWLINE" is a NAME token, not a NEWLINE token. An actual newline in the text, "\n", is a NEWLINE token.
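The tokenizer mirrors Python's own tokenize module, so 5-tuples like the ones above can be reproduced with the standard library. A minimal sketch (the row comes out as 1 rather than 30 here, because we feed in just this one line instead of the whole grammar file):

```python
import io
import token
import tokenize

line = "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n"

# generate_tokens yields 5-tuples: (type, string, start, end, line)
tokens = list(tokenize.generate_tokens(io.StringIO(line).readline))

for tok in tokens:
    print(token.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

The first token printed is NAME 'single_input' spanning columns 0 to 12, and the "NEWLINE" words come out as NAME tokens while the trailing '\n' is the actual NEWLINE token, matching the example above.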
DFA Generation
Most of the work is done in the constructor of the ParserGenerator [creating a token generator and the dfas].
For each production rule, an NFA is created. Next, this NFA is converted to a DFA. Finally, the DFA is simplified by combining equivalent states.
Example of DFA generation
Production rule
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
1. Creation of NFA for 'single_input'
Create the start and finish states (say Start and Finish).
Each alternative in the right hand side of the production rule is a path that connects the start state to the final state, and can be represented as a set of state transitions.
State transition for NEWLINE -
State transition for compound_stmt NEWLINE has to be broken down into two parts as there can only be one token per transition.
These two transitions can now be combined using epsilon/null/none transitions. Finally, each of the three paths is joined to the start and finish states using none transitions to get the NFA for 'single_input'.
The transitions are called 'arcs', the states are instances of 'NFAState' and each transition is represented by a tuple (label, tail state). This tuple is then added to the head state. For example, the transition from state 3 to state 4 is represented as ('simple_stmt', state 4) and added to the arcs of state 3.
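As a hedged sketch (the real pgen code carries more machinery), the NFA for 'single_input' could be built like this, with each arc stored as a (label, tail state) tuple on its head state, and None standing in for a none transition:

```python
class NFAState:
    """A state in the NFA; each arc is a (label, tail_state) tuple.
    A label of None denotes a none (epsilon) transition."""

    def __init__(self):
        self.arcs = []  # list of (label or None, NFAState)

    def addarc(self, next_state, label=None):
        self.arcs.append((label, next_state))


# Build the NFA for: single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
start, finish = NFAState(), NFAState()

# Path 1: NEWLINE
s1, s2 = NFAState(), NFAState()
s1.addarc(s2, "NEWLINE")

# Path 2: simple_stmt
s3, s4 = NFAState(), NFAState()
s3.addarc(s4, "simple_stmt")

# Path 3: compound_stmt NEWLINE - two transitions, one token each
s5, s6, s7, s8 = NFAState(), NFAState(), NFAState(), NFAState()
s5.addarc(s6, "compound_stmt")
s6.addarc(s7)               # none transition linking the two parts
s7.addarc(s8, "NEWLINE")

# Join the three paths to the start and finish states with none transitions
for head in (s1, s3, s5):
    start.addarc(head)
for tail in (s2, s4, s8):
    tail.addarc(finish)
```

For instance, the arc ('simple_stmt', s4) ends up in s3.arcs, exactly as described above.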
2. Conversion of NFA to DFA
I found a good explanation of the conversion process here. The basis for conversion is that all NFA states reachable from a DFA state through a transition are grouped into one DFA state. This group is called a closure. One NFA state can belong to multiple DFA states.
As there is no initial DFA state, the first DFA state is obtained by grouping all NFA states reachable from the start NFA state through a none transition.
In this case the initial DFA state will be
DFA_start = {Start, State 1, State 3, State 5}
NFA states reachable from DFA_start through transition "NEWLINE" =
DFA_state_NEWLINE1 = {State 2, Finish}
NFA states reachable from DFA_start through transition "simple_stmt" =
DFA_state_simple_stmt = {State 4, Finish}
NFA states reachable from DFA_start through transition "compound_stmt" =
DFA_state_compound_stmt = {State 6, State 7}
NFA states reachable from DFA_state_compound_stmt through transition "NEWLINE" =
DFA_state_NEWLINE2 = {State 8, Finish}
There are no more transitions in the NFA. As DFA_state_NEWLINE1, DFA_state_simple_stmt, and DFA_state_NEWLINE2 contain the NFA finish state, they are all finish states in the DFA.
The DFA looks like this -
The red bordered states are the final states of the DFA.
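The closure computation and grouping described above can be sketched as a standard subset construction over a hand-coded copy of this NFA (state names follow the example; this is not the actual pgen code):

```python
# NFA as adjacency lists: state -> list of (label, tail_state); None = epsilon.
nfa = {
    "Start": [(None, "S1"), (None, "S3"), (None, "S5")],
    "S1": [("NEWLINE", "S2")],
    "S2": [(None, "Finish")],
    "S3": [("simple_stmt", "S4")],
    "S4": [(None, "Finish")],
    "S5": [("compound_stmt", "S6")],
    "S6": [(None, "S7")],
    "S7": [("NEWLINE", "S8")],
    "S8": [(None, "Finish")],
    "Finish": [],
}

def closure(states):
    """All NFA states reachable from `states` through none transitions."""
    result = set(states)
    stack = list(states)
    while stack:
        for label, tail in nfa[stack.pop()]:
            if label is None and tail not in result:
                result.add(tail)
                stack.append(tail)
    return frozenset(result)

def subset_construction(start):
    """Group reachable NFA states into DFA states, label by label."""
    dfa_start = closure({start})
    dfa = {}                      # DFA state -> {label: DFA state}
    pending = [dfa_start]
    while pending:
        state = pending.pop()
        if state in dfa:
            continue
        moves = {}
        for s in state:
            for label, tail in nfa[s]:
                if label is not None:
                    moves.setdefault(label, set()).add(tail)
        dfa[state] = {label: closure(t) for label, t in moves.items()}
        pending.extend(dfa[state].values())
    return dfa_start, dfa

start_state, dfa = subset_construction("Start")
finals = [s for s in dfa if "Finish" in s]  # DFA finish states
```

Running this reproduces the five DFA states worked out above, three of which contain the NFA Finish state.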
3. Simplifying the DFA
The DFA is simplified by combining states that have the same outward transitions and are all either final states or non-final states.
In this DFA, DFA_state_NEWLINE1, DFA_state_NEWLINE2 and DFA_state_simple_stmt are all final states and have no outward transitions. Hence, they can all be combined into one state - DFA_state_combined.
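A minimal sketch of this simplification step, assuming the DFA is held as (is_final, arcs) pairs indexed by position (not the actual pgen representation):

```python
# DFA states for single_input before simplification.
# 0: DFA_start  1: NEWLINE1  2: simple_stmt  3: compound_stmt  4: NEWLINE2
states = [
    (False, {"NEWLINE": 1, "simple_stmt": 2, "compound_stmt": 3}),
    (True, {}),
    (True, {}),
    (False, {"NEWLINE": 4}),
    (True, {}),
]

def simplify(states):
    """Repeatedly merge pairs of states with the same finality and the
    same outward transitions, redirecting arcs into the merged state."""
    states = list(states)
    changed = True
    while changed:
        changed = False
        for i in range(len(states)):
            for j in range(i + 1, len(states)):
                if states[i] == states[j]:
                    del states[j]  # merge j into i
                    # Redirect arcs: j becomes i; states past j shift down.
                    for k, (final, arcs) in enumerate(states):
                        states[k] = (final, {
                            label: (i if t == j else t - 1 if t > j else t)
                            for label, t in arcs.items()
                        })
                    changed = True
                    break
            if changed:
                break
    return states

simplified = simplify(states)
```

The three final states with no outward transitions collapse into one, leaving the three-state DFA described next.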
The simplified DFA looks like this -
Renaming the states to reflect the code in the parse table we get -
The arcs for each state can now be written as
state 3 - [NEWLINE, 1], [simple_stmt, 1], [compound_stmt, 2]
state 1 - []
state 2 - [NEWLINE, 1]
As the DFA is a list of states, each of which is a list of arcs, the DFA can be written as
[[['NEWLINE', 1], ['simple_stmt', 1], ['compound_stmt', 2]], [], [['NEWLINE', 1]]]
simple_stmt and compound_stmt are symbols in the grammar and have a number associated with them. simple_stmt is 319 and compound_stmt is 269.
NEWLINE on the other hand is a token with the number 4 associated with it.
So, we can rewrite the DFA with numbers instead of strings.
[[[4, 1], [319, 1], [269, 2]], [], [[4, 1]]]
Instead of an empty list for the final state, we will write a transition on 0 to itself.
[[[4, 1], [319, 1], [269, 2]], [[0, 1]], [[4, 1]]]
In the code, the DFA is represented as
256: [[[[1, 1], [2, 1], [3, 2]], [[0, 1]], [[2, 1]]], ...]
Here 256 is the number corresponding to single_input.
The second number of each tuple, the tail state, is the same but the first number differs. This is because the first number is an index into the labels array. If we look at positions 1, 2 and 3 in the labels array, we find that they correspond to 319, 4, and 269, which are the numbers that represent 'simple_stmt', 'NEWLINE' and 'compound_stmt' respectively. I'm not sure why the extra level of indirection exists.
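The indirection can be illustrated with a toy slice of the labels array. Only indices 1-3 come from the text above (319, 4 and 269); index 0 holding the accepting label 0 is an assumption, and the real table stores richer label entries than a bare number:

```python
# Hypothetical slice of the labels array: index -> symbol/token number.
# Indices 1-3 are from the text: 319 = simple_stmt, 4 = NEWLINE,
# 269 = compound_stmt. Index 0 (the accepting label) is a guess.
labels = [0, 319, 4, 269]

# Arcs of the start state of single_input as they appear in the table:
# first number = index into labels, second number = tail state.
start_arcs = [[1, 1], [2, 1], [3, 2]]

# Resolve each arc's label index to the number it stands for.
resolved = [[labels[idx], tail] for idx, tail in start_arcs]
```

Resolving the indices recovers the arcs on 319 (simple_stmt), 4 (NEWLINE) and 269 (compound_stmt) that we wrote out by hand earlier.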
This is how each production rule in the grammar is converted to a DFA. The parse table is a set of DFAs. Each DFA is a list of states, and each state is a list of arcs of the form (label, tail state). The keys in the dictionary that follows this list of states in the DFA definition are the symbols that are acceptable starting points for the DFA.