Tuesday, May 10, 2016

Syntax: Colorless green ideas sleep furiously

In 1957, Noam Chomsky, often called the father of modern linguistics, came up with the sentence "colorless green ideas sleep furiously" as an example of a sentence that is syntactically correct but semantically nonsensical. He founded a branch of linguistics referred to as "Generative Grammar", based on the idea of production rules. Here I talk about the basics, X-Bar theory and Generalized Phrase Structure Grammar.

The basics of Context Free Grammar 

A context-free grammar is specified by a system of production rules. It consists of a set of terminal symbols, a set of non-terminal symbols and a distinct start symbol. Each production rule says how a non-terminal can be rewritten as a sequence of symbols, and a valid sentence is any string of terminals that can be derived from the start symbol.
Given a sentence, identifying the sequence of rules that generates it is called parsing. The raw text is first "tokenized" by a lexer, and a syntax tree is then built during parsing.
  • The grammar is generally specified in a form similar to Backus-Naur form (BNF).
  • There are two ways to parse: top-down or bottom-up.
  • LL parsers look ahead k symbols and deterministically choose the next production rule to apply. (Often a simple recursive descent parser can be written "by hand", making generous use of switch statements.)
  • LALR parsers, on the other hand, use "shift/reduce" parsing, where a state transition table is used to decide whether to "shift" the current token onto the stack or to "reduce" the stack by applying a grammar rule. This usually requires the help of a parser generator like yacc.
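To make the lexer and top-down parsing steps concrete, here is a minimal hand-written recursive descent sketch in Python. The arithmetic grammar, the tuple-based tree and the names tokenize and Parser are my own illustrative choices, not something taken from a particular tool. Each non-terminal becomes one method, and one token of lookahead is enough to pick the next rule.

```python
import re

# Illustrative grammar (an assumption made for this sketch):
#   Expr   ::= Term   (("+" | "-") Term)*
#   Term   ::= Factor (("*" | "/") Factor)*
#   Factor ::= NUMBER | "(" Expr ")"

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Lexer: split raw text into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Recursive descent parser: one method per non-terminal."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead; None marks the end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")


tree = Parser(tokenize("1 + 2 * 3")).parse()
print(tree)  # ('+', 1, ('*', 2, 3))
```

The nesting of the tuples is the syntax tree: multiplication binds tighter than addition only because Term sits below Expr in the grammar.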

The "Language Instinct" and X-Bar Theory

As linguists started applying formal grammar theory to natural languages, they realized there is an underlying structure common to all languages.
  • The sentence is made up of phrases.
  • The sentence always has a verb.
  • The order of Subject (S), Object (O) and Verb (V) is fixed for a given language. E.g. English is SVO, Indic languages are SOV, while Semitic languages tend to be VSO.
  • The order of Adjectives (A), Nouns (N) and Specifiers (Sp) is fixed.
  • This ordering is similar to the way individual phrases themselves are structured internally.

X-Bar theory

The theory is formally called "X-bar" theory, and it posits that each phrase is structured as a binary tree. Each phrase must have a distinct element called the "head", and the phrase structure is specified by the following rules. The structure applies to Noun Phrases, Verb Phrases and Prepositional Phrases. (Even the sentence itself!)
  • Phrase ::= Spec X'' 
  • X'' ::= X' Adjunct*  
  • X' ::= X Complement*
  • X ::= Head
By changing the order of the symbols in each rule we get the rules for your favorite language. E.g. in English and Indic languages the adjective comes before the noun, whereas in Spanish (and other Romance languages) it comes after the noun.
  • the blue sky
  • el cielo azul
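The idea that one rule set yields different languages by flipping the order of symbols can be sketched in a few lines of Python. This is only a toy under stated assumptions: the Phrase and WordOrder classes, the three boolean parameters, and the treatment of the adjective as an adjunct of the noun are illustrative choices of mine, not part of any formal statement of the theory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phrase:
    head: str
    spec: Optional[str] = None
    adjuncts: List[str] = field(default_factory=list)
    complements: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class WordOrder:
    """Hypothetical per-language parameters: which side of each rule comes first."""
    spec_first: bool = True
    adjunct_first: bool = False
    complement_first: bool = False


def linearize(phrase: Phrase, order: WordOrder) -> str:
    # X ::= Head
    words = [phrase.head]
    # X' ::= X Complement*   (or Complement* X)
    if order.complement_first:
        words = phrase.complements + words
    else:
        words = words + phrase.complements
    # X'' ::= X' Adjunct*    (or Adjunct* X')
    if order.adjunct_first:
        words = phrase.adjuncts + words
    else:
        words = words + phrase.adjuncts
    # Phrase ::= Spec X''    (or X'' Spec)
    spec = [phrase.spec] if phrase.spec else []
    words = spec + words if order.spec_first else words + spec
    return " ".join(words)


ENGLISH = WordOrder(adjunct_first=True)   # adjective before the noun
SPANISH = WordOrder(adjunct_first=False)  # adjective after the noun

print(linearize(Phrase(head="sky", spec="the", adjuncts=["blue"]), ENGLISH))
print(linearize(Phrase(head="cielo", spec="el", adjuncts=["azul"]), SPANISH))
```

The same tree shape is used for both languages; only the WordOrder value differs.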

Language Instinct

Even the languages spontaneously generated by kids show the same "inherent structure", leading to the hypothesis that "parsing ability" is part of the cognitive circuitry that every human is born with. Cognitive psychologist Steven Pinker explains this very well in his book "The Language Instinct".
Essentially, for each language we learn the parameters that fix the order of symbols in X-Bar theory.

Generalized Phrase Structure Grammar 

In the BNF form of a grammar, production rules are given in the following manner:
  • A ::= B C D
where A is a parent that has three child nodes in that order.
However, Generalized Phrase Structure Grammar separates immediate dominance from linear precedence.
Instead of specifying a single production rule, the rule is broken down into three separate parts:
  • Immediate Dominance: In simple terms it says A is the parent of B, C and D. It does not specify the order in which B, C and D should occur.
E.g. the rule A --> B, C, D specifies that A can have three children.
  • Linear Precedence: Specifies the order in which the children occur. The order can be partially specified.
E.g. the rule B < C specifies that under A, B must occur before C.
  • Head Identifying Rule: Each such production has exactly one head element, and the head must occur in every expansion of the production.
E.g. ConditionalBlock ~ IfClause

if {} - is valid
if {} else {} - is valid
else {} - is not valid, because the head is missing.
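The three kinds of rule can be checked independently, which a small Python sketch may make clearer. The function check_expansion and the ElseClause name are hypothetical and introduced only for illustration. ConditionalBlock and IfClause come from the example above, and the precedence of if before else is my assumption about what such a grammar would declare.

```python
def check_expansion(children, allowed, head, precedence):
    """Return a list of rule violations; an empty list means the expansion is valid.

    children   -- the child categories, in the order they appear
    allowed    -- immediate dominance: the categories the parent may dominate
    head       -- head identifying rule: the category that must be present
    precedence -- linear precedence: (before, after) pairs, possibly partial
    """
    errors = []
    # Immediate dominance: every child must be licensed by the parent.
    for child in children:
        if child not in allowed:
            errors.append(f"{child} is not a child of this rule")
    # Head rule: the head may never be omitted.
    if head not in children:
        errors.append(f"head {head} is missing")
    # Linear precedence: only checked when both categories are present.
    for before, after in precedence:
        if before in children and after in children:
            if children.index(before) > children.index(after):
                errors.append(f"{before} must precede {after}")
    return errors


# Hypothetical rules for ConditionalBlock (ElseClause is an assumed name).
ALLOWED = {"IfClause", "ElseClause"}
HEAD = "IfClause"
PRECEDENCE = [("IfClause", "ElseClause")]

print(check_expansion(["IfClause"], ALLOWED, HEAD, PRECEDENCE))
print(check_expansion(["IfClause", "ElseClause"], ALLOWED, HEAD, PRECEDENCE))
print(check_expansion(["ElseClause"], ALLOWED, HEAD, PRECEDENCE))
```

The first two calls return an empty list, while the third reports the missing head, mirroring the three cases listed above.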

An example immediate dominance rule is
QueryExpression --> SelectClause, WhereClause, FromClause.

With the separation of dominance and precedence, all of the following are valid productions of the rule.

select * from tbl where col == 10;
from tbl where col == 10 select *;
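As a rough sketch of how a partial ordering leaves several surface forms open, the following Python snippet enumerates every ordering of the QueryExpression children that satisfies a set of linear precedence pairs. The helper valid_orders is hypothetical, and the single precedence rule FromClause < WhereClause is an assumption of mine for illustration; the rule set above does not state one.

```python
from itertools import permutations


def valid_orders(children, precedence):
    """Yield each ordering of children that respects all (before, after) pairs."""
    for order in permutations(children):
        if all(order.index(a) < order.index(b) for a, b in precedence):
            yield order


# Immediate dominance rule from the example above.
CHILDREN = ("SelectClause", "WhereClause", "FromClause")
# Assumed linear precedence rule: FromClause < WhereClause.
PRECEDENCE = [("FromClause", "WhereClause")]

for order in valid_orders(CHILDREN, PRECEDENCE):
    print(" ".join(order))
```

Under that one assumed constraint, three of the six possible orderings survive, including the two query shapes shown above. With an empty precedence list all six would be accepted.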