Literature Review

Considerable past work has addressed reasoning about the typographical features of a document. Of these efforts, the most studied areas have been font recognition and character identification. Comparatively little work, however, has been done on understanding document structure at a higher level. High-level document recognition has significant applications in formalizing specific document structures and verifying them, as pointed out by Nenad Marovac.
In the past, many formalisms focused on discretised models of document recognition. These models work in a numeric framework and do not preserve topological information such as tangency. Other techniques, analogous to compiler design, have been used to build a context-free grammar encompassing a large class of documents.

Nenad Marovac
May 30, 1991
Document Recognition: Concepts and Implementations

This paper develops a two-pass strategy for document recognition. The first pass parses the document using a recognition table, which stores the structure recognition rules used to identify logical constructs within the document. The second pass is interactive editing of the document's logical structure, covering tasks such as optimal manipulation of the document, reformatting and fast display generation. The emphasis is on developing a rule language that drives recognition, together with its compiler, and on generalizing both to an organisation-wide database of documents. The methodology recognizes documents through heuristics rather than through absolute measurements of the document.

Amitabha Mukerjee and Hiroko Fujihara
Qualitative Reasoning about Document Structures

This paper describes a qualitative approach to obtaining context from the spatial layout of a document, which extends to recognizing generic classes of documents independently of formatting and scaling. The paper relates its representations to the Office Document Architecture (ODA) standard. Interval algebra, combined with a set of grammar rules, is used to identify the logical structure. RLSA (run-length smoothing algorithm) and RXYC (recursive X-Y cuts) are used to segment the page into blocks. The paper provides a programmer-independent vocabulary for representing documents which is:

Independent of any specific document
Apt for describing all inter-block constraints
Independent of imaging scale and global positions

It serves as a tool for obtaining simple contextual information from generic classes of block-structured images.
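The independence from imaging scale and global position can be illustrated with a minimal sketch. It assumes that the qualitative relations are the thirteen relations of Allen's interval algebra, applied separately to the horizontal and vertical extents of two blocks. The function names and the (left, top, right, bottom) block tuples are hypothetical and are not taken from the paper.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Intervals are (start, end) pairs with start < end (an assumption of
    this sketch; degenerate intervals are not handled).
    """
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"          # tangency: a ends exactly where b starts
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if a0 < b0 else "overlapped-by"


def block_relation(p, q):
    """Qualitative relation of block p to block q as an (x, y) pair.

    Blocks are hypothetical (left, top, right, bottom) tuples.
    """
    x_rel = allen_relation((p[0], p[2]), (q[0], q[2]))
    y_rel = allen_relation((p[1], p[3]), (q[1], q[3]))
    return (x_rel, y_rel)


# A title block sitting directly on top of a body block of equal width.
title = (0, 0, 10, 2)
body = (0, 2, 10, 8)
print(block_relation(title, body))
```

Because only the ordering of endpoints is compared, scaling or translating both blocks by the same amount leaves the pair of relations unchanged, and a relation such as "meets" records tangency that a purely numeric description would not single out.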

K. Y. Wong, R. G. Casey and F. M. Wahl
1982, IBM Journal of Research and Development
Document Analysis System

This paper describes, among other things, the RLSA algorithm for document image segmentation. It presents the design of a complete document analysis system; however, the authors show an implementation of only the document layout analysis part. The first step in the analysis procedure is to segment the document and classify its regions as text or image. A nonlinear smoothing algorithm is used for this purpose. Text blocks are then discriminated from other blocks using the regular features of text lines. Next, an adaptive approach to recognizing hundreds of fonts and character sizes is presented. Finally, experimental results for some prototypes are reported.
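The smoothing step can be sketched as follows, under the usual description of RLSA: in a binary sequence, a run of white pixels (0) is turned black (1) when its length does not exceed a threshold C; the rule is applied along rows and along columns, and the two results are combined with a logical AND. The function names and threshold values are hypothetical, and filling short white runs that touch the image border is an assumption of this sketch, not a detail taken from the paper.

```python
def rlsa_1d(row, c):
    """Fill every run of 0s whose length is <= c with 1s."""
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1                     # j ends one past the white run
            if j - i <= c:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def rlsa_2d(img, c_h, c_v):
    """Smooth rows with c_h and columns with c_v, then AND the results."""
    horiz = [rlsa_1d(r, c_h) for r in img]
    smoothed_cols = [rlsa_1d(list(col), c_v) for col in zip(*img)]
    vert = [list(r) for r in zip(*smoothed_cols)]   # transpose back to rows
    return [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


# Two characters separated by a small gap merge; the wide gap survives.
print(rlsa_1d([1, 0, 0, 1, 0, 0, 0, 0, 1], 2))
```

With suitable thresholds, characters merge into words and lines, and lines merge into blocks, which is what makes the result usable for separating text regions from other regions.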

Differences Between the Present Approach and Past Work:

In our project, we propose to concentrate on generating the document structure and reconciling it with the available grammar. We further wish to extend the scope of the problem by implementing a strategy to unify the grammar for a generic class of documents. We use a blend of approaches from past work:

The present approach uses both RXYC and RLSA for image segmentation.
Some typesetting heuristics are also used in the image processing phase.
Block ordering, from top-left to bottom-right in the block-structured document image, is imposed as a simplification.
Interval algebra is used to characterize the blocks within the document, qualitatively resolving the block structure into well-defined grammar clauses.
The context-free grammar for document description is transitive and hierarchical in nature.

CONCEPT OF INTERVAL ALGEBRA