Hindi Corpus Analysis


What is a corpus?


A corpus is a large collection of texts in a language. The texts are usually drawn from a diverse set of fields so that the corpus is representative of the language. We took our corpus from CFILT, IIT Bombay.


Corpus Size


No. of lines - 221528

No. of words - 2849514

No. of characters - 38413350

So, the corpus contains roughly 2.8 million words.
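
The page does not say how these three figures were obtained. As a rough sketch, assuming words are whitespace-delimited tokens and characters are Unicode code points (a byte count, as reported by `wc -c`, would be larger for Devanagari text), they could be computed as follows. The function name `corpus_size` is hypothetical.

```python
def corpus_size(lines):
    """Return (line_count, word_count, char_count) for an iterable of text lines.

    Assumptions (not stated in the source): a word is any run of
    non-whitespace characters, and a character is one Unicode code point.
    """
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())  # whitespace-delimited tokens
        n_chars += len(line)          # code points, including the newline
    return n_lines, n_words, n_chars
```

For a corpus stored as a UTF-8 text file, the function can be applied directly to the open file object, for example `corpus_size(open(path, encoding="utf-8"))`, where `path` is wherever the corpus was saved.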


What is a syllable?


According to the Oxford dictionary, a syllable is 'a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word'. For example, the Hindi word आवाज़ (aavaaza, sound) has three syllables: आ (aa), वा (vaa) and ज़ (za). The first syllable, आ, is a vowel on its own. The second combines the consonant व् with the vowel आ. The third combines the consonant ज़् with the vowel अ.
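
The splitting used for this analysis is not described on the page. One common approximation, sketched below, splits a word into orthographic syllables (aksharas): either an independent vowel, or a consonant cluster followed by an optional vowel sign, each optionally followed by a sign such as anusvara or visarga. The names `AKSHARA` and `split_syllables` are hypothetical, and the sketch works on spelling only, so it does not model pronunciation effects such as dropping the final inherent vowel.

```python
import re

# Building blocks from the Devanagari Unicode block.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                  # halant, joins consonants
MATRA = "[\u093E-\u094C]"                          # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                       # candrabindu, anusvara, visarga
INDEP_VOWEL = "[\u0904-\u0914]"                    # independent vowels

# One akshara: an independent vowel, or a consonant cluster with an
# optional vowel sign (or a trailing virama), plus optional modifiers.
AKSHARA = re.compile(
    f"(?:{INDEP_VOWEL}"
    f"|(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?)"
    f"{MODIFIER}*"
)

def split_syllables(word):
    """Split a Devanagari word into orthographic syllables.

    Characters outside the patterns above (digits, punctuation,
    Latin letters) are silently skipped.
    """
    return AKSHARA.findall(word)
```

Applied to the example word above, `split_syllables` returns the three pieces आ, वा and ज़, matching the breakdown in the text.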



Most frequent words in Hindi


Most frequent syllable-grams (2,3,4,5,6) in Hindi

Most frequent 2-grams in Hindi

Most frequent 3-grams in Hindi

Most frequent 4-grams in Hindi

Most frequent 5-grams in Hindi

Most frequent 6-grams in Hindi
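
How the syllable n-grams were counted is not spelled out here. A minimal sketch, assuming each word has already been split into a list of syllables and that n-grams do not cross word boundaries (both assumptions, not confirmed by the page), could look like this. The function name `syllable_ngrams` is hypothetical.

```python
from collections import Counter

def syllable_ngrams(syllabified_words, n):
    """Count syllable n-grams within each word.

    syllabified_words: iterable of words, each a list of syllable strings.
    n: n-gram length (2 to 6 for the lists on this page).
    Words with fewer than n syllables contribute nothing.
    """
    counts = Counter()
    for syllables in syllabified_words:
        for i in range(len(syllables) - n + 1):
            counts[tuple(syllables[i:i + n])] += 1
    return counts
```

The most frequent entries can then be read off with `syllable_ngrams(words, 2).most_common(20)`, where `words` is the syllabified corpus and the cutoff of 20 is arbitrary.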


Text files for download


For any technical glitches, discrepancies, or queries, email me at prasant@iitk.ac.in