ANLP(4): Parsing — CYK Algorithm
An Algorithmic Approach to NLP
Consider the ambiguous sentence “I shot an elephant in my pajamas”. It can be interpreted in two ways: either the elephant was in the pajamas, or the person who shot the elephant was wearing pajamas. This is called structural ambiguity. Machines need to disambiguate such sentences in order to make proper sense of them, which is an essential part of NLP. Fortunately, the Cocke-Kasami-Younger (CKY/CYK) algorithm described below handles structural ambiguity efficiently.
We can represent the above example using two parse trees as shown below:
Here, the parse trees are constructed according to a set of rules called a grammar. The grammar used here is:
The CYK algorithm returns an efficient representation of the set of parse trees for a sentence; this task is called syntactic parsing. It is a parsing algorithm for context-free grammars, a particular class of grammars.
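The original post shows the grammar and the two parse trees as images. As a rough, runnable illustration (not necessarily the post’s exact grammar), a small context-free grammar along the lines of the well-known NLTK “groucho” example produces both readings:

```python
import nltk

# A small CFG that licenses both readings of the sentence.
# This is an assumed stand-in modelled on NLTK's classic example grammar,
# not necessarily the exact grammar shown in the post.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    PP  -> P NP
    NP  -> Det N | Det N PP | 'I'
    VP  -> V NP | VP PP
    Det -> 'an' | 'my'
    N   -> 'elephant' | 'pajamas'
    V   -> 'shot'
    P   -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I shot an elephant in my pajamas".split()):
    tree.pretty_print()   # prints both parse trees
```

One tree attaches “in my pajamas” to the elephant NP, the other to the shooting VP, which is exactly the structural ambiguity described above.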
CYK Algorithm — A Dynamic Programming approach
Similar to the minimum edit distance problem, the CYK algorithm uses dynamic programming. Dynamic programming works exceptionally well here because the grammar rules are context-free: once a constituent has been derived, that derivation can be memoized and reused in further derivations, which happens very often with context-free grammars. This saves a lot of work, since subtrees can be looked up in a table instead of being re-derived multiple times.
Conversion to Chomsky Normal Form:
The first step of the CYK algorithm is to convert the context-free grammar to Chomsky Normal Form (CNF). Grammars in CNF are restricted to rules of the form A -> B C or A -> w: the right-hand side of every rule must expand to either exactly two non-terminals or a single terminal. Any context-free grammar can be converted to a CNF grammar that accepts exactly the same sentences, so nothing is lost in the conversion.
We do this by following these steps:
- Copy all conforming rules to the new grammar unchanged.
- Convert terminals within rules to dummy non-terminals.
- Convert unit productions.
- Make all rules binary and add them to the new grammar.
The picture given below shows a context-free grammar converted to CNF.
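As a sketch of the binarization step only, assuming rules are stored as (LHS, RHS-tuple) pairs (that encoding is my assumption, not the post’s), rules with more than two symbols on the right-hand side can be broken up by introducing fresh non-terminals X1, X2, …:

```python
from itertools import count

def binarize(rules):
    """Replace rules whose RHS has more than two symbols with chains of
    binary rules, introducing fresh non-terminals X1, X2, ...
    (Terminals inside mixed rules and unit productions would be handled
    in earlier steps of the conversion.)"""
    fresh = (f"X{i}" for i in count(1))
    out = []
    for lhs, rhs in rules:
        rhs = tuple(rhs)
        while len(rhs) > 2:
            new = next(fresh)
            out.append((new, rhs[:2]))       # e.g. X1 -> Aux NP
            rhs = (new,) + rhs[2:]           # e.g. S  -> X1 VP ...
        out.append((lhs, rhs))
    return out

# S -> Aux NP VP becomes X1 -> Aux NP together with S -> X1 VP
print(binarize([("S", ("Aux", "NP", "VP"))]))
```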
CYK Recognition
Since our grammar is now in CNF, every non-terminal node above the pre-terminal (part-of-speech) level of a parse tree has exactly two daughters. We can therefore store the parse in a 2D matrix. If the input sentence is of length n, we work with the upper-triangular portion of an (n+1)x(n+1) matrix. Each cell [i, j] in this matrix contains the set of non-terminals representing all the constituents that span positions i through j of the input. This implies that the cell representing the entire input is the one in position [0, n].
Because each non-terminal entry in our table has two daughters in the parse, there must be a position k in the input at which the constituent represented by an entry [i, j] can be split into two parts, with i < k < j. Given such a position k, the first constituent [i, k] must lie to the left of entry [i, j], somewhere along row i, and the second constituent [k, j] must lie below it, somewhere along column j.
Consider the sentence “Book the flight through Houston.” We place each word w_i of this sentence in the superdiagonal cell [i-1, i].
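In code, using the fencepost indexing described above (a minimal sketch):

```python
words = "Book the flight through Houston".split()
n = len(words)                        # n = 5, so the full parse lives in cell [0, 5]
for i, w in enumerate(words, start=1):
    print(f"[{i-1}, {i}]  {w}")       # word i sits in the superdiagonal cell [i-1, i]
```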
The parse table for this would be:
We fill the table one column at a time, from left to right, and fill each column from bottom to top, applying the grammar rules as we go. This amounts to processing the input one word at a time. It is dynamic programming: we solve smaller versions of the problem (shorter spans) and reuse their results to solve larger ones.
The pseudocode can be given as:
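The original post shows the pseudocode as an image. The following Python sketch implements the same recognizer, assuming the CNF grammar is encoded as two dictionaries (that encoding is my assumption, not part of the post):

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognition for a grammar in CNF.

    lexical: dict mapping a terminal w   -> set of non-terminals A with A -> w
    binary : dict mapping a pair (B, C)  -> set of non-terminals A with A -> B C
    Returns True iff the start symbol spans the whole input, i.e. is in cell [0, n].
    """
    n = len(words)
    table = defaultdict(set)                    # cell [i, j] -> set of non-terminals
    for j in range(1, n + 1):                   # columns, left to right
        table[j - 1, j] |= lexical.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):          # rows of each column, bottom to top
            for k in range(i + 1, j):           # every possible split point k
                for B in table[i, k]:
                    for C in table[k, j]:
                        table[i, j] |= binary.get((B, C), set())
    return start in table[0, n]
```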
At the end, we simply check whether S is in cell [0, n]; if it is, we accept the parse.
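For example, with the recognizer sketched above and a tiny, hypothetical CNF fragment (just enough to cover “book the flight”, shortened here to keep the toy grammar small):

```python
# Hypothetical toy CNF grammar fragment, loosely in the style of the textbook example.
lexical = {
    "book":   {"S", "VP", "Verb", "Nominal", "Noun"},
    "the":    {"Det"},
    "flight": {"Nominal", "Noun"},
}
binary = {
    ("Verb", "NP"):     {"S", "VP"},
    ("Det", "Nominal"): {"NP"},
}

print(cky_recognize("book the flight".split(), lexical, binary))  # True: S is in [0, 3]
```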
Since we have three nested loops over positions, plus a loop that checks whether a rule belongs to the grammar, the time complexity of this algorithm is O(n³ · |G|), where |G| is the number of rules in the given grammar and n is the length of the input.
Using the CYK algorithm, we get a set of parse trees. However, we still need a model (for example, a neural one) to choose the best parse tree among them, which we will not be covering in this blog.