IITK / CSE / Students / Ankit Soni
Indian Institute of Technology,
Department of Computer Science and Engineering

Research

Details of my thesis will be added soon.
I am in the process of choosing my thesis supervisor and will try to update this page in a few days.
POS Projection across Parallel Corpora as a cue to Detecting Complex Predicates in Hindi

Complex Predicates are a cross-linguistically general phenomenon, but they are more pervasive in
South Asian languages. In Hindi, Noun+Verb (N+V) and Adjective+Verb (Adj+V) complex predicates, in which the entire complex acts as a verb, are very common. In this study we
word-align a parallel corpus of Hindi and English using traditional models and then project the tags
directly from the English sentences to the Hindi sentences.
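The projection step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `project_tags`, the toy sentence pair, and the alignment links are all assumptions made for the example, and a real alignment would come from a word-alignment model.

```python
from typing import List, Optional, Tuple


def project_tags(
    english_tags: List[str],
    alignment: List[Tuple[int, int]],
    hindi_length: int,
) -> List[Optional[str]]:
    """Copy each English tag onto the Hindi position it is aligned to.

    `alignment` holds (english_index, hindi_index) links. Hindi words
    without any link keep the tag None.
    """
    projected: List[Optional[str]] = [None] * hindi_length
    for en_idx, hi_idx in alignment:
        # Keep the first tag that reaches a Hindi word; later links are ignored.
        if projected[hi_idx] is None:
            projected[hi_idx] = english_tags[en_idx]
    return projected


# Toy pair, invented for illustration:
# "Ram helped Sita" / "raam ne siitaa kii madad kii"
english_tags = ["NNP", "VBD", "NNP"]
hindi_words = ["raam", "ne", "siitaa", "kii", "madad", "kii"]
alignment = [(0, 0), (2, 2), (1, 4), (1, 5)]  # hypothetical aligner output

projected = project_tags(english_tags, alignment, len(hindi_words))
print(list(zip(hindi_words, projected)))
```

In this toy pair the English verb "helped" is linked to both "madad" and "kii", so both Hindi words receive a verb tag, while the unaligned postpositions stay untagged.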

The Brill tagger, which uses the Penn Treebank tag set, is used to tag the English sentences. The Hindi tagging
largely preserves the same tag set, although some of the finer-grained distinctions have
been merged.
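Merging finer-grained distinctions might look like the sketch below. The mapping is hypothetical: the tags on the left are standard Penn Treebank tags, but which distinctions the study actually merged is not listed on this page, so the coarse classes chosen here are only an assumption.

```python
# Hypothetical coarse mapping; the study's actual merged tag set is not given here.
COARSE_TAG = {
    "NN": "NN", "NNS": "NN", "NNP": "NN", "NNPS": "NN",
    "VB": "VB", "VBD": "VB", "VBG": "VB", "VBN": "VB", "VBP": "VB", "VBZ": "VB",
    "JJ": "JJ", "JJR": "JJ", "JJS": "JJ",
}


def merge_tag(penn_tag: str) -> str:
    """Map a Penn Treebank tag to its coarse class; unlisted tags pass through."""
    return COARSE_TAG.get(penn_tag, penn_tag)


merged = [merge_tag(tag) for tag in ["NNP", "VBD", "NNS", "IN"]]
print(merged)
```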

One area where we observe a systematic difference in the part-of-speech tags of aligned
English and Hindi words is Complex Predicates; such mismatches therefore give a strong cue for the existence
of N+V or Adj+V complex predicates. Along with a possible approach to detecting Complex Predicates,
we also produce a POS-tagged corpus of Hindi sentences with over 47% accuracy without using any language input. If some language-specific heuristics are applied, this accuracy increases to about 66%.
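One way to read the mismatch cue is sketched below. It assumes that a Hindi-side reference tag is available for each word (for example from a lexicon) and flags a noun or adjective that received a projected verb tag and is immediately followed by a verb. The function name, the coarse tags, and the toy data are assumptions for illustration; the study's actual detection rule may differ.

```python
from typing import List, Optional, Tuple

NOMINAL = {"NN", "JJ"}  # coarse tags for the noun or adjective part of the predicate


def complex_predicate_candidates(
    words: List[str],
    projected: List[Optional[str]],
    reference: List[Optional[str]],
) -> List[Tuple[str, str]]:
    """Return (host, verb) pairs where a verb tag was projected onto a
    noun or adjective that is directly followed by a verb."""
    candidates = []
    for i in range(len(words) - 1):
        mismatch = projected[i] == "VB" and reference[i] in NOMINAL
        if mismatch and reference[i + 1] == "VB":
            candidates.append((words[i], words[i + 1]))
    return candidates


# Toy sentence, invented for illustration: "raam ne siitaa kii madad kii"
words = ["raam", "ne", "siitaa", "kii", "madad", "kii"]
projected = ["NN", None, "NN", None, "VB", "VB"]  # coarse tags projected from English
reference = ["NN", "IN", "NN", "IN", "NN", "VB"]  # hypothetical Hindi-side tags

candidates = complex_predicate_candidates(words, projected, reference)
print(candidates)
```

Here "madad" is a noun on the Hindi side but carries a projected verb tag, so the pair ("madad", "kii") is reported as an N+V candidate.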

The accuracy could be increased further if more data were added to the study, since the data
currently available is very limited. This corpus can be used to train a statistical tagger. The only language input used
is the knowledge of the tag set to be used for the language and a bilingual lexicon. Since the language
input is minimal, the approach can also be applied to other languages for which parallel corpora are available.
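As a rough illustration of learning a tagger from the projected corpus, the sketch below trains a most-frequent-tag (unigram) baseline. This is a deliberately simple stand-in chosen for the example, not the tagger proposed in the study, and the tiny corpus is invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Sentence = List[Tuple[str, Optional[str]]]


def train_unigram_tagger(corpus: List[Sentence]) -> Dict[str, str]:
    """Learn, for every word, the tag it received most often in the corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            if tag is not None:  # skip words that received no projected tag
                counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Invented projected corpus: noisy tags for "madad" are outvoted by the majority.
corpus = [
    [("raam", "NN"), ("ne", None), ("madad", "VB"), ("kii", "VB")],
    [("madad", "NN"), ("kii", "VB")],
    [("madad", "NN"), ("chaahiye", "VB")],
]

model = train_unigram_tagger(corpus)
print(model)
```

Even this baseline shows why more data helps: a single noisy projection on a word is outweighed once the word has been seen with its usual tag a few more times.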

 

