A corpus is a large collection of texts in a language. The texts are usually drawn from a diverse set of fields so that the corpus is representative of the language. We obtained our corpus from CFILT, IIT Bombay.
No. of lines - 221528
No. of words - 2849514
No. of characters - 38413350
So, the corpus contains roughly 2.8 million words.
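Counts of this kind can be reproduced with a short script. The following is a minimal sketch, not the procedure actually used for the figures above: the function name `corpus_stats` is hypothetical, words are assumed to be whitespace-separated tokens, and because the text does not say whether "characters" means Unicode code points or bytes, the sketch reports both.

```python
def corpus_stats(lines):
    """Return (lines, words, code points, UTF-8 bytes) for an iterable of text lines.

    Assumption: a "word" is any whitespace-separated token.
    """
    n_lines = n_words = n_chars = n_bytes = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())            # whitespace tokenisation
        n_chars += len(line)                    # Unicode code points, newline included
        n_bytes += len(line.encode("utf-8"))    # size as stored in a UTF-8 file
    return n_lines, n_words, n_chars, n_bytes


if __name__ == "__main__":
    # Tiny in-memory stand-in for a corpus file; a real file could be passed as
    # open(path, encoding="utf-8"), where path is whatever the corpus is saved as.
    sample = ["ab cd\n", "e\n"]
    print(corpus_stats(sample))
```

For Devanagari text the byte count is roughly three times the code-point count, so the two definitions of "character" give very different totals.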
According to the Oxford dictionary, a syllable is 'a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word'. For example, the Hindi word आवाज़ (aavaaza, sound) has three syllables: आ (aa), वा (vaa) and ज़ (za). The first syllable, आ, is a vowel on its own. The second combines the consonant व् with the vowel आ. The third combines the consonant ज़् with the inherent vowel अ.
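The decomposition in this example can be sketched with a regular expression over Devanagari code points. This is an illustrative approximation, not the method used in this work: the names `AKSHARA` and `split_syllables` are hypothetical, the pattern treats each written unit (an optional consonant cluster plus one vowel sign or the inherent vowel) as a syllable, and it ignores pronunciation effects such as schwa deletion.

```python
import re
import unicodedata

# One written syllable, assuming standard Devanagari encoding:
#   either  (consonant [nukta] virama)* consonant [nukta] [vowel sign | virama] modifiers*
#   or      independent vowel modifiers*
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939]\u093C?\u094D)*"   # leading consonants joined by virama (conjunct)
    r"[\u0915-\u0939]\u093C?"              # base consonant, optionally with nukta
    r"(?:[\u093E-\u094C]|\u094D)?"         # dependent vowel sign, or a final virama
    r"[\u0901-\u0903]*"                    # candrabindu, anusvara, visarga
    r"|[\u0904-\u0914][\u0901-\u0903]*"    # independent vowel with optional modifiers
)


def split_syllables(word):
    """Split a Devanagari word into written syllables (approximation)."""
    # Normalisation turns precomposed nukta letters (e.g. U+095B) into
    # base consonant + nukta, so one pattern covers both encodings.
    word = unicodedata.normalize("NFC", word)
    return AKSHARA.findall(word)


if __name__ == "__main__":
    print(split_syllables("आवाज़"))  # ['आ', 'वा', 'ज़']
```

A consonant with no vowel sign is returned on its own, which matches reading it with the inherent vowel अ, as in the third syllable of the example.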