Monday, August 18, 2014

Skulpt - Concrete Syntax Tree Generation

This is the first stage in the compilation process of transforming Python code to JavaScript in Skulpt.

As usual, the explanation will be heavily annotated with links to the source code on Github.


The entry point to the CST generation is the call to parse. Each line of the source file is fed to the parser. The parser function is just a wrapper over the tokenization process. It also keeps track of the status of parsing and returns the root node at the end of a successful parse.


Parsing and token generation are interleaved just like in the generation of the parse table. The only difference is that tokens are used for the construction of DFAs in the case of parse table generation, whereas in this case, tokens are used to construct the CST.


One method generates the tokens and the other method consumes them. The generation part is the same as in the generation of the parse table. In fact, the tokenizer code is a port of the Python tokenizer code that was used during generation of parse tables.


The consumer method is just a wrapper around the addToken method which does most of the work of generating the CST.
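The interleaving is easy to see with CPython's own tokenize module, since Skulpt's tokenizer is a port of it: the tokenizer is a generator, so each token is produced only when the consumer asks for the next one. A minimal Python sketch, with add_token as a made-up stand-in for Skulpt's addToken -

import io
import tokenize

def add_token(tok):
    # Stand-in consumer: the real addToken matches the token against the
    # DFA at the top of the parser's stack, as described below.
    print(tokenize.tok_name[tok.type], repr(tok.string))

# generate_tokens is lazy, so generation and consumption are interleaved.
readline = io.StringIO('x = 1\n').readline
for tok in tokenize.generate_tokens(readline):
    add_token(tok)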


Consider the simple Python program -

print "hello world"

We shall assume that this program is stored in a file that is fed to the compiler. 

The tokens generated for this program are -


  1. (T_NAME, 'print', [1, 0], [1, 5], 'print "hello world"\n')
  2. (T_STRING, 'hello world', [1, 6], [1, 19], 'print "hello world"\n')
  3. (T_NEWLINE, '\n', [1, 19], [1, 20], 'print "hello world"\n')
  4. (T_ENDMARKER, null, [2, 0], [2, 0], null)
For each of these tokens, the addToken method is called. The addToken method operates on a stack that is initially populated with the stack entry for the start symbol. The start symbol in this case is file_input (287), as we provided a file as the input to the compiler.
So, initially the stack contains a single entry, for file_input, at state 0 of its DFA.
The file_input DFA is here and the file_input CST node has the following form -
{
   type     : 287,
   value    : null,
   context  : null, 
   children : []
}

[Note: Here, context is the same as the line number and column offset]
The addToken method looks at the top of the stack and goes through the states of its DFA till it finds one that has a label matching the current token.
The states of the file_input DFA are
[[[2, 0], [98, 1], [105, 0]], [[0, 1]]]
We only consider state 0 as the stack entry at the top is at state 0.
The first token is the keyword 'print', which is associated with the number 12. None of the arcs in state 0 of the file_input DFA carries the label 12. This means that there must be at least one non-terminal between file_input and the terminal 'print'. Indexing into the labels array with the labels 2, 98, and 105, we get [4, null], [0, null], and [322, null].
The first non-terminal label we encounter is 322, or 'stmt'. As 12 is a valid start symbol for the DFA of 'stmt', we push a stack entry corresponding to 322 onto the stack. The stack now holds file_input (still at state 0, as the arc [105, 0] leads back to state 0) and, on top, stmt (at state 0).
The stmt CST node is of the form -
{
   type       : 322,
   value      : null,
   lineno     : 1,
   col_offset : 0,
   children   : []
}

This is the general form of the CST nodes.
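For illustration, such a node could be built by a small helper. This is a Python sketch (make_node is a made-up name; Skulpt builds the equivalent JavaScript object) -

def make_node(type_, lineno, col_offset, value=None):
    # A CST node in the general form shown above.
    return {
        "type": type_,             # symbol/token number, e.g. 322 for stmt
        "value": value,            # token string for terminals, None otherwise
        "lineno": lineno,
        "col_offset": col_offset,
        "children": [],            # child CST nodes
    }

stmt_node = make_node(322, lineno=1, col_offset=0)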

The same process is repeated for the new stack entry on top. In this case, state 0 of the 'stmt' DFA is [[1, 1], [3, 1]]. As neither label in this state matches 12, we index into the labels array to find [319, null] and [269, null]. As 319 ('simple_stmt') is the first non-terminal encountered, and as 12 is a valid start symbol for its DFA, we push a stack entry for 319 and update the state of 'stmt' to 1. The stack now holds file_input (state 0), stmt (state 1) and, on top, simple_stmt (state 0).
The entries pushed next, in the same way, are 'small_stmt' and then 'print_stmt'. The 4 labels before 'print_stmt' (flow_stmt, assert_stmt, expr_stmt and pass_stmt) are skipped as they do not have 12 in their start symbol dictionary. As 12 is one of the labels in state 0 of the DFA of 'print_stmt', we add the CST node for the 'print' keyword as a child of the 'print_stmt' node.
The next set of operations on the stack is straightforward: consuming the string token 'hello world' pushes the chain of expression non-terminals below 'print_stmt', ending with 'atom'. As the DFA for 'atom' is then in state 1, which is an accepting state, 'atom' can be popped from the stack. The same goes for all stack entries up to 'small_stmt'. After all the popping, the stack holds file_input (state 0), stmt (state 1) and simple_stmt (state 1).
The last two tokens to be processed are the NEWLINE and ENDMARKER tokens. State 1 of 'simple_stmt' is not an accepting state, but the NEWLINE token takes the 'simple_stmt' DFA from state 1 to state 2, which is accepting, so 'simple_stmt' is popped. State 1 of 'stmt' is an accepting state, so 'stmt' is popped as well. Finally, the ENDMARKER token takes the 'file_input' DFA from state 0 to state 1, which is an accepting state. This empties the stack and results in the final CST.
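To tie the walkthrough together, here is a self-contained Python sketch of the push/shift/pop cycle that addToken performs. This is not Skulpt's code: it uses a made-up two-rule grammar (S: T and T: 'a' 'b'), arcs labelled directly with symbol/token numbers instead of indices into a labels array, and the convention that an arc on label 0 marks an accepting state -

A, B = 1, 2        # toy terminal numbers (tokens are < 256)
S, T = 256, 257    # toy non-terminal numbers (symbols are >= 256)

# Each DFA is (states, first set); each state is a list of (label, next) arcs.
dfas = {
    S: ([[(T, 1)], [(0, 1)]], {A}),
    T: ([[(A, 1)], [(B, 2)], [(0, 2)]], {A}),
}

def make_node(type_, value=None):
    return {"type": type_, "value": value, "children": []}

def accepting(states, state):
    return any(label == 0 for label, _ in states[state])

def add_token(stack, tok, value):
    """Consume one token; stack entries are [symbol, dfa_state, node]."""
    while True:
        sym, state, node = stack[-1]
        states = dfas[sym][0]
        for label, nxt in states[state]:
            if label == tok:
                # Terminal matches: shift it as a child of the current node.
                node["children"].append(make_node(tok, value))
                stack[-1][1] = nxt
                # Pop every DFA that is now accepting with no other way out.
                while stack:
                    top_sym, top_state, _ = stack[-1]
                    top_states = dfas[top_sym][0]
                    if not (accepting(top_states, top_state)
                            and len(top_states[top_state]) == 1):
                        break
                    stack.pop()
                return not stack        # True once the root has been popped
            if label >= 256 and tok in dfas[label][1]:
                # The token is a valid start symbol for this non-terminal:
                # push its DFA and retry the same token against it.
                stack[-1][1] = nxt      # advance past the non-terminal arc
                child = make_node(label)
                node["children"].append(child)
                stack.append([label, 0, child])
                break
        else:
            raise SyntaxError("unexpected token %d" % tok)

root = make_node(S)
stack = [[S, 0, root]]
for tok, value in [(A, 'a'), (B, 'b')]:
    finished = add_token(stack, tok, value)
print(finished, root)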
[Figure: the final Concrete Syntax Tree]
Saturday, August 16, 2014

Skulpt - Compilation Process

[Note: To know more about Skulpt, see this post.]

The compilation of Python to JavaScript in Skulpt is done in 3 stages

  1. The Python source code is parsed into a concrete syntax tree [CST]. [Explanation here]
  2. The concrete syntax tree is then converted into an abstract syntax tree [AST].
  3. The AST is traversed and JavaScript code is generated for each construct/node in the tree.
My aim is to write blog posts that document the internals of each of the three stages. I will post the links to each post as and when I'm done with them.

Wednesday, July 2, 2014

Docker Tech Talk

I recently gave a talk on Docker during my internship at Rackspace. Thanks to Dave King for organizing the tech talks and to Andrew Mussey for compiling the video.




Sunday, June 22, 2014

Skulpt - Generation of Parse Tables

Skulpt is a JavaScript implementation of Python. At Pythy, we use it to run students' code in the browser. I've been going through the code to try and understand the translation process. Before I could jump into the JavaScript part (Skulpt is written mostly in JavaScript), I realized that I had to figure out how the parse tables are generated.
From the names of the variables I guessed that deterministic finite automata (DFAs) were being used, but I couldn't make sense of all the numbers. So, I decided to look at the Python code that parses the Python grammar itself. [The grammar is written in an EBNF-like notation.]

There are two phases in the generation of the parse table -
  1. Tokenizing - Breaking the grammar into tokens.
  2. Generating the set of DFAs that accept the grammar.
These two phases are interleaved; that is, a new token is retrieved whenever the current token has been consumed towards the construction of a DFA.

The entry point to the generation process is the main.py file, which writes the parse tables to a file whose path must be provided on the command line. I understood the process by stepping through the code with the Python debugger module pdb, setting breakpoints using pdb.set_trace() in the code.

python -m pdb main.py test_parse_table.js


Tokenizing


The grammar file is read line by line. Each line is matched against regular expressions till the end of the line is reached. Each match is a valid token. [Main part of the code]
Depending on the first character of the token, a type from token.py is assigned to it. The resulting token is a 5-tuple -
  1. The assigned token type [NAME, NUMBER, etc.]
  2. The actual token string from the grammar ['single_input', 'simple_stmt', 'compound_stmt', etc.]
  3. A tuple indicating start position of the token string in the text
    • row at which the token string begins
    • column at which the token string begins
  4. A tuple indicating end position of the token string in the text
    • row at which the token string ends
    • column at which the token string ends
  5. The line string. ["single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE"]

Example of Tokenizing


Line : "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE"

Tokens :
  1. (tokenize.NAME, "single_input", (30, 0), (30, 12), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  2. (tokenize.OP, ":", (30, 12), (30, 13), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  3. (tokenize.NAME, "NEWLINE", (30, 14), (30, 21), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  4. (tokenize.OP, "|", (30, 22), (30, 23), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  5. (tokenize.NAME, "simple_stmt", (30, 24), (30, 35), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  6. (tokenize.OP, '|', (30, 36), (30, 37), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  7. (tokenize.NAME, "compound_stmt", (30, 38), (30, 51), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
  8. (tokenize.NAME, "NEWLINE", (30, 52), (30, 59), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n") 
  9. (tokenize.NEWLINE, '\n', (30, 59), (30, 60), "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n")
Here the token "NEWLINE" is a NAME token, not a NEWLINE token. An actual newline in the text, "\n", is a NEWLINE token.
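A toy version of that regex-driven loop (illustrative only; pgen's real tokenizer handles many more cases and uses the numeric types from token.py rather than strings) reproduces these tuples -

import re

# Match a name, an operator, or the line-ending newline; spaces are
# skipped because finditer simply passes over unmatched characters.
TOKEN_RE = re.compile(r"(\w+)|([:|()\[\]*+=])|(\n)")
KINDS = ("NAME", "OP", "NEWLINE")

def tokenize_line(line, row):
    for m in TOKEN_RE.finditer(line):
        kind = KINDS[m.lastindex - 1]     # which group matched
        yield (kind, m.group(), (row, m.start()), (row, m.end()), line)

line = "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n"
for tok in tokenize_line(line, 30):
    print(tok)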

DFA Generation

Most of the work is done in the constructor of the ParserGenerator [creating a token generator and the DFAs].

For each production rule, an NFA is created. Next, this NFA is converted to a DFA. Finally, the DFA is simplified by merging equivalent states.

Example of DFA generation


Production rule
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE

1. Creation of NFA for 'single_input'

Create the start and final states (say start and finish).
Each alternative in the right-hand side of the production rule is a path connecting the start state to the finish state, and can be represented as a set of state transitions.

State transition for NEWLINE -

(1) --NEWLINE--> (2)
State transition for simple_stmt -

(3) --simple_stmt--> (4)
The state transition for compound_stmt NEWLINE has to be broken down into two parts, as there can only be one token per transition -

(5) --compound_stmt--> (6)
(7) --NEWLINE--> (8)

These two transitions can now be combined using an epsilon/null/none transition -

(5) --compound_stmt--> (6) --ε--> (7) --NEWLINE--> (8)

Finally, each of these three paths is joined to the start and finish states using none transitions to get the NFA for 'single_input' -

(start) --ε--> (1) --NEWLINE--> (2) --ε--> (finish)
(start) --ε--> (3) --simple_stmt--> (4) --ε--> (finish)
(start) --ε--> (5) --compound_stmt--> (6) --ε--> (7) --NEWLINE--> (8) --ε--> (finish)

The transitions are called 'arcs', the states are instances of 'NFAState' and each transition is represented by a tuple (label, tail state). This tuple is then added to the head state. For example, the transition from state 3 to state 4 is represented as ('simple_stmt', state 4) and added to the arcs of state 3.
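A sketch of that representation (NFAState here is a simplified stand-in for pgen's class, using the state numbers from the transitions above) -

class NFAState:
    def __init__(self, name):
        self.name = name
        self.arcs = []                 # (label, tail state) tuples

    def add_arc(self, label, tail):    # label None is a none transition
        self.arcs.append((label, tail))

states = {n: NFAState(n) for n in ["start", 1, 2, 3, 4, 5, 6, 7, 8, "finish"]}

# The three alternatives of single_input:
states[1].add_arc("NEWLINE", states[2])
states[3].add_arc("simple_stmt", states[4])
states[5].add_arc("compound_stmt", states[6])
states[6].add_arc(None, states[7])     # joins the two-part alternative
states[7].add_arc("NEWLINE", states[8])

# Join each path to the start and finish states with none transitions.
for n in (1, 3, 5):
    states["start"].add_arc(None, states[n])
for n in (2, 4, 8):
    states[n].add_arc(None, states["finish"])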

2. Conversion of NFA to DFA

I found a good explanation of the conversion process here. The basis for conversion is that all NFA states reachable from a DFA state through a transition are grouped into one DFA state. This group is called a closure. One NFA state can belong to multiple DFA states.
As there is no initial DFA state, the first DFA state is obtained by grouping all NFA states reachable from the start NFA state through a none transition.
In this case the initial DFA state will be
DFA_start = {Start, State 1, State 3, State 5}

NFA states reachable from DFA_start through transition "NEWLINE" =
DFA_state_NEWLINE1 = {State 2, Finish}

NFA states reachable from DFA_start through transition "simple_stmt" =
DFA_state_simple_stmt = {State 4, Finish}

NFA states reachable from DFA_start through transition "compound_stmt" =
DFA_state_compound_stmt = {State 6, State 7}

NFA states reachable from DFA_state_compound_stmt through transition "NEWLINE" =
DFA_state_NEWLINE2 = {State 8, Finish}

There are no more transitions in the NFA. As DFA_state_NEWLINE1, DFA_state_simple_stmt, and DFA_state_NEWLINE2 contain the NFA finish state, they are all final states in the DFA.
The DFA looks like this, with final states marked by an asterisk -

(DFA_start) --NEWLINE--> (DFA_state_NEWLINE1)*
(DFA_start) --simple_stmt--> (DFA_state_simple_stmt)*
(DFA_start) --compound_stmt--> (DFA_state_compound_stmt) --NEWLINE--> (DFA_state_NEWLINE2)*
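The grouping can be sketched as a generic subset construction over this NFA (illustrative, not pgen's exact code); None labels are the none transitions -

nfa = {
    "start": [(None, 1), (None, 3), (None, 5)],
    1: [("NEWLINE", 2)], 2: [(None, "finish")],
    3: [("simple_stmt", 4)], 4: [(None, "finish")],
    5: [("compound_stmt", 6)], 6: [(None, 7)],
    7: [("NEWLINE", 8)], 8: [(None, "finish")],
    "finish": [],
}

def closure(states):
    # All NFA states reachable from `states` through none transitions alone.
    result, todo = set(states), list(states)
    while todo:
        for label, tail in nfa[todo.pop()]:
            if label is None and tail not in result:
                result.add(tail)
                todo.append(tail)
    return frozenset(result)

def nfa_to_dfa(start):
    dfa_start = closure({start})
    dfa, todo = {}, [dfa_start]        # DFA state -> {label: DFA state}
    while todo:
        group = todo.pop()
        if group in dfa:
            continue
        moves = {}
        for s in group:
            for label, tail in nfa[s]:
                if label is not None:
                    moves.setdefault(label, set()).add(tail)
        dfa[group] = {label: closure(tails) for label, tails in moves.items()}
        todo.extend(dfa[group].values())
    return dfa_start, dfa

start, table = nfa_to_dfa("start")
print(sorted(map(str, start)))         # the DFA_start group: start, 1, 3, 5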

3. Simplifying the DFA

The DFA is simplified by combining states that have the same outward transitions and are all either final states or non-final states.

In this DFA, DFA_state_NEWLINE1, DFA_state_NEWLINE2 and DFA_state_simple_stmt are all final states and have no outward transitions. Hence, they can all be combined into one state - DFA_state_combined.
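That merge can be sketched as follows (illustrative; pgen repeats the process until nothing more merges, though a single pass suffices here) -

dfa = {
    "DFA_start": {"NEWLINE": "NEWLINE1", "simple_stmt": "simple_stmt",
                  "compound_stmt": "compound_stmt"},
    "NEWLINE1": {},
    "simple_stmt": {},
    "compound_stmt": {"NEWLINE": "NEWLINE2"},
    "NEWLINE2": {},
}
final = {"NEWLINE1", "simple_stmt", "NEWLINE2"}

# Group states by (finality, outward transitions); each group collapses
# into a single representative state.
groups = {}
for state, arcs in dfa.items():
    key = (state in final, tuple(sorted(arcs.items())))
    groups.setdefault(key, []).append(state)
rename = {s: members[0] for members in groups.values() for s in members}

simplified = {rename[s]: {label: rename[t] for label, t in arcs.items()}
              for s, arcs in dfa.items()}
print(simplified)                      # the three final states become one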

The simplified DFA looks like this -

(DFA_start) --NEWLINE--> (DFA_state_combined)*
(DFA_start) --simple_stmt--> (DFA_state_combined)*
(DFA_start) --compound_stmt--> (DFA_state_compound_stmt) --NEWLINE--> (DFA_state_combined)*
Renaming the states to reflect the code in the parse table, we get -

(state 0) --NEWLINE--> (state 1)*
(state 0) --simple_stmt--> (state 1)*
(state 0) --compound_stmt--> (state 2) --NEWLINE--> (state 1)*
The arcs for each state can now be written as -
state 0 - [NEWLINE, 1], [simple_stmt, 1], [compound_stmt, 2]
state 1 - []
state 2 - [NEWLINE, 1]
As the DFA is a list of states, indexed by state number, the whole DFA can be written as

[[['NEWLINE', 1], ['simple_stmt', 1], ['compound_stmt', 2]], [], [['NEWLINE', 1]]]

simple_stmt and compound_stmt are symbols in the grammar and have a number associated with them. simple_stmt is 319 and compound_stmt is 269.
NEWLINE on the other hand is a token with the number 4 associated with it.
So, we can rewrite the DFA with numbers instead of strings.

[[[4, 1], [319, 1], [269, 2]], [], [[4, 1]]]

Instead of an empty array for the final state, we will write a transition on 0 to itself.

[[[4, 1], [319, 1], [269, 2]], [[0, 1]], [[4, 1]]]

In the code, the DFA is represented as
256: [[[1, 1], [2, 1], [3, 2]], [[0, 1]], [[2, 1]]]

Here 256 is the number corresponding to single_input.
The second number of each tuple, the tail state, is the same but the first number differs. This is because the first number is an index into the labels array. If we look at positions 1, 2 and 3 in the labels array, we find that they correspond to 319, 4, and 269, which are the numbers that represent 'simple_stmt', 'NEWLINE' and 'compound_stmt' respectively. I'm not sure why the extra level of indirection exists.
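In other words, with just the three label positions used by this example -

# An arc stores an index into the labels array; the label itself holds the
# symbol/token number (the second element of each label is elided as None).
labels = {1: (319, None), 2: (4, None), 3: (269, None)}

state_0 = [[1, 1], [2, 1], [3, 2]]     # arcs as (label index, tail state)
resolved = [[labels[i][0], tail] for i, tail in state_0]
print(resolved)                        # [[319, 1], [4, 1], [269, 2]]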

This is how each production rule in the grammar is converted to a DFA. The parse table is a set of DFAs. Each DFA is a list of states, and each state is a set of arcs of the form (label, tail state). The keys in the dictionary that follows the list of states in each DFA definition are the symbols that are acceptable starting points for the DFA.
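That dictionary is what the CST walkthrough relied on when checking whether 12 ('print') is a valid start symbol for 'stmt' (322). A sketch, with the dictionary trimmed to just that entry (only key membership matters; the values are ignored) -

first_sets = {322: {12: 1}}            # stmt can start with label 12

def can_push(symbol, token_label):
    # A non-terminal's DFA may be pushed only if the current token can
    # start it.
    return token_label in first_sets[symbol]

print(can_push(322, 12))               # True - push the DFA for stmt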