How does an OCR algorithm work?

Optical Character Recognition (OCR) is an image processing technology that analyses printed or handwritten documents and converts the embedded text into a machine-readable, machine-editable format that a computer can process further.

Published on 19 MAY 2022 | 2 mins read

Introduction


Have you ever had to go through the tedious process of transferring information from a paper document into a digital file, word by word? With towering stacks of paper piling up, imagine the long hours spent on the monotonous task of retyping printed or written text into digital form. Add to that the human error that inevitably creeps into such a huge yet menial task.


For a long time, people have been looking for a way to transfer the physical into the virtual, as physical, handwritten documents lose their relevance while technology steadily extends into every area of human life. The solution devised for this problem is Optical Character Recognition, or OCR.


What is an Optical Character Recognition algorithm?


Optical Character Recognition is an image processing technology that analyses printed or handwritten documents and converts the embedded text into a machine-readable, machine-editable format that a computer can process further.


This makes storage straightforward and opens up access to data that was previously remote or inaccessible. Just imagine doing away with the endless rows of brown-stained documents filling every nook and cranny of a government basement or storage room, simply by using this method.

 

How does Optical Character Recognition work?


Optical Character Recognition usually pre-processes images to make them more legible for the machine, which improves the chances of recognizing each character correctly. Several techniques are used to pre-process the images. Let us look at them in more detail:


Pre-processing


De-skew

If the document was misaligned during scanning, the scanned text slopes upwards or downwards. De-skewing rotates the image a few degrees clockwise or anti-clockwise so that the lines of text are exactly horizontal or vertical.
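Before the image can be straightened, the skew angle has to be estimated. The following is a minimal sketch of one common approach, the projection profile, and not necessarily what any particular OCR engine does. It assumes the page is a list of rows with 1 for ink and 0 for background; the function name `estimate_skew` and its parameters are made up for illustration. A vertical shear stands in for a true rotation, which is a reasonable approximation only for small angles.

```python
import math

def estimate_skew(img, max_angle=10.0, step=0.5):
    """Return the angle (degrees) at which the rows of ink line up most sharply."""
    ink = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]
    if not ink:
        return 0.0
    best_angle, best_score = 0.0, -1.0
    steps = int(max_angle / step)
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        counts = {}
        for r, c in ink:
            # Shear each pixel vertically, a small-angle stand-in for rotation.
            row = round(r + c * slope)
            counts[row] = counts.get(row, 0) + 1
        # Aligned text packs the ink into few rows, so the sum of squares peaks.
        score = sum(n * n for n in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: a single "text line" climbing to the right at 5 degrees.
page = [[0] * 100 for _ in range(30)]
for c in range(100):
    page[round(20 - c * math.tan(math.radians(5.0)))][c] = 1

angle = estimate_skew(page)
```

The returned angle is the correction that makes the text line level again; an actual de-skew step would then rotate the whole image by that amount.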


De-speckle

De-speckling removes the unwanted positive and negative noise that the scanning process can leave behind. Image noise takes the form of small dots or stray pixels that appear on the page as a result of image capture. Images cleaned by de-speckling are easier to decipher and generally lead to more accurate recognition and extraction of text.
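As a rough illustration, the sketch below treats a lone dark pixel with no dark neighbours as a speck and a lone white pixel completely surrounded by ink as a pinhole. This is a deliberately simple rule chosen for the example; real de-specklers usually rely on median filters or connected-component sizes. The image is assumed to be a list of rows with 1 for ink and 0 for background, and `despeckle` is a hypothetical name.

```python
def despeckle(img):
    """Remove isolated ink pixels and fill isolated holes in a binary image."""
    h, w = len(img), len(img[0])

    def ink_neighbours(r, c):
        # Count ink among the (up to) eight surrounding pixels.
        return sum(
            img[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, h))
            for cc in range(max(c - 1, 0), min(c + 2, w))
            if (rr, cc) != (r, c)
        )

    out = [row[:] for row in img]
    for r in range(h):
        for c in range(w):
            n = ink_neighbours(r, c)
            if img[r][c] == 1 and n == 0:
                out[r][c] = 0  # stray dark speck
            elif img[r][c] == 0 and n == 8:
                out[r][c] = 1  # pinhole inside a stroke
    return out

noisy = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],  # pinhole in the middle of the blob
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],  # stray speck in the corner
]
clean = despeckle(noisy)
```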


Binarization

Binarization transforms a document image from multitone color or grayscale into a bi-level (binary) image comprising only two colors, black and white. A binary image makes it easier to distinguish the text in the foreground from the background.
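One widely used way to pick the black/white cut-off automatically is Otsu's method, which chooses the gray level that best separates the dark pixels from the light ones. The article does not say which method is used, so treat this as an illustrative sketch. It assumes an 8-bit grayscale image given as a list of rows of integers from 0 to 255; the function names are made up for the example.

```python
def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def binarize(gray):
    """Map pixels at or below the threshold to 1 (ink), the rest to 0."""
    t = otsu_threshold(gray)
    return [[1 if p <= t else 0 for p in row] for row in gray]

gray = [
    [30, 40, 200],
    [220, 35, 210],
]
binary = binarize(gray)
```

Here the three dark pixels (30, 35, 40) end up as ink and the three light ones as background.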


Line removal

Line removal strips out vertical and horizontal printed lines that overlap handwritten text and other elements, especially signatures.
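A simple way to picture this is to look for unusually long horizontal runs of ink and erase them, while keeping the pixels where a handwritten stroke crosses the line. The sketch below does exactly that under simplifying assumptions: the line is one pixel thick, the image is a list of rows with 1 for ink, and `remove_horizontal_lines` and `min_length` are names invented for the example.

```python
def remove_horizontal_lines(img, min_length):
    """Erase horizontal ink runs of at least min_length, keeping crossing strokes."""
    out = [row[:] for row in img]
    height = len(img)
    for r, row in enumerate(img):
        c = 0
        while c < len(row):
            if row[c] != 1:
                c += 1
                continue
            start = c
            while c < len(row) and row[c] == 1:
                c += 1
            if c - start < min_length:
                continue  # short run: probably part of a character
            for k in range(start, c):
                above = r > 0 and img[r - 1][k] == 1
                below = r + 1 < height and img[r + 1][k] == 1
                if not (above and below):
                    out[r][k] = 0  # plain line pixel, no stroke passing through
    return out

form = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # printed rule crossed by a vertical pen stroke
    [0, 0, 1, 0, 0, 0],
]
cleaned = remove_horizontal_lines(form, min_length=5)
```

Vertical lines could be handled the same way by running the function on the transposed image.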


Layout Analysis or Zoning 

A document image comprises text, diagrams, graphics, half-tones, etc. The goal of Optical Character Recognition is to distill the text from the document image. This requires separating text zones from non-textual ones, such as symbols, tables, and graphs, and identifying their appropriate reading order.


It soon became evident, however, that a general document reading system also needs to subdivide the text according to the role it plays in the document image, such as footnotes, captions, and paragraphs. Separating text zones from non-textual ones and segmenting the text into its various parts requires decomposing the document image into homogeneous zones, each containing elements of a single data type. This process is known as Zoning.
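Full layout analysis is involved, but its first step can be sketched simply: cut the page into horizontal bands wherever enough blank rows separate two regions of ink. The example below shows only that step and does not classify zones as text or graphics. It assumes a binary image as a list of rows (1 for ink); `split_into_zones` and `min_gap` are hypothetical names.

```python
def split_into_zones(img, min_gap=2):
    """Return (top, bottom) row ranges separated by at least min_gap blank rows."""
    zones = []
    start = last_ink = None
    for r, row in enumerate(img):
        if not any(row):
            continue  # blank row
        if start is None:
            start = r
        elif r - last_ink > min_gap:
            zones.append((start, last_ink))  # gap is wide enough: close the zone
            start = r
        last_ink = r
    if start is not None:
        zones.append((start, last_ink))
    return zones

page = [
    [1, 1, 0],  # row 0
    [1, 0, 1],  # row 1
    [0, 0, 0],  # row 2: a single blank row, too narrow to split on
    [0, 1, 1],  # row 3
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],  # row 7
    [1, 0, 0],  # row 8
]
zones = split_into_zones(page, min_gap=2)
```

Because the zones come back top to bottom, the list order doubles as a simple reading order for a single-column page.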


Feature extraction

Feature extraction is a form of dimensionality reduction in which a large volume of data is condensed into smaller, more manageable subsets for processing. Large data sets demand too many computing resources, which makes processing them directly impractical and time-consuming. Here the input data is cut down to a reduced set of features, represented as feature vectors.


One typical example of feature extraction that we can all relate to is spam-detection software. If we have a massive number of emails and specific keywords recur in them, feature extraction can find correlations between these keywords. For example, "Narendra Modi" and "election" are keywords generally associated with each other. The set of emails can now be described using a very small number of words, which makes it easier to tell whether an email is about the Indian election or someone promising you a revolutionary diet plan.
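In OCR the same idea applies to character images. One simple, commonly taught feature is zone density: split the glyph into a small grid and record the fraction of ink in each cell, so that a 4x4 image of 16 pixels shrinks to a feature vector of 4 numbers. The sketch below is illustrative only; `zone_densities` is a made-up name and the glyph is assumed to be a list of rows with 1 for ink.

```python
def zone_densities(glyph, zones=2):
    """Return the fraction of ink in each cell of a zones x zones grid."""
    h, w = len(glyph), len(glyph[0])
    features = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * h // zones, (zr + 1) * h // zones
            c0, c1 = zc * w // zones, (zc + 1) * w // zones
            cell = [glyph[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cell) / len(cell))
    return features

glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
vector = zone_densities(glyph)  # [top-left, top-right, bottom-left, bottom-right]
```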


There are two approaches to feature extraction in OCR:


Pattern Recognition

In this method, the character is recognized in its entirety. If everyone wrote the letter “P” in the same way, the computer would merely have to compare the letter “P” in any document image with the standardized “P” stored in the system. But how do you ensure that everybody writes the same “P”? In the 1960s, a special font called OCR-A was developed for use across the board, especially on bank cheques. Every letter was designed with the same width, and the strokes were carefully constructed to distinguish each letter from all the others. OCR systems were then designed to recognize this one font. Standardizing on a single font made the work of Optical Character Recognition easy and uncomplicated.
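Comparing a glyph with stored reference shapes can be sketched as template matching: count the pixels that differ from each template and pick the template with the fewest mismatches. The tiny 3x3 templates below are invented for the example and bear no relation to the real OCR-A letterforms.

```python
# Hypothetical 3x3 reference shapes; "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}

def mismatches(a, b):
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def match_template(glyph, templates=TEMPLATES):
    """Return the name of the template closest to the glyph."""
    return min(templates, key=lambda name: mismatches(glyph, templates[name]))

scanned = ["111", "010", "011"]  # a "T" with one stray pixel
letter = match_template(scanned)
```

The stray pixel costs the "T" template a single mismatch, which is still far fewer than for "I" or "L". This only works when the glyph and the templates share the same size and font, which is exactly the limitation a standardized font works around.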


Feature Detection

This is a much more recent and sophisticated method of feature extraction, and it solves the problem that pattern recognition leaves open. As mentioned, pattern recognition would be ideal if everyone wrote the letter “A” the same way, but this is not the case. Feature detection therefore identifies characters by evaluating their individual lines, convex and concave strokes, endpoints, branches, junctions, and holes. A capital “A”, for instance, can be described as two angled lines that meet at a point at the top, with a horizontal line joining them halfway down. This description covers practically every conceivable capital “A”.
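To make one of these structural features concrete, the sketch below counts holes, meaning background regions fully enclosed by ink. A capital A has one hole and a capital B has two, however the letter is drawn. This is a minimal illustration of a single feature, not a complete recognizer; the helper names and the tiny glyphs are assumptions made for the example.

```python
def to_grid(rows):
    """Convert strings such as '.#.' into rows of 0/1, where '#' is ink."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def count_holes(glyph):
    """Count background regions that never reach the image border."""
    h, w = len(glyph), len(glyph[0])
    seen = [[False] * w for _ in range(h)]

    def touches_border(r, c):
        # Flood-fill one background region using 4-connectivity.
        stack, on_border = [(r, c)], False
        seen[r][c] = True
        while stack:
            y, x = stack.pop()
            if y in (0, h - 1) or x in (0, w - 1):
                on_border = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and glyph[ny][nx] == 0:
                    seen[ny][nx] = True
                    stack.append((ny, nx))
        return on_border

    holes = 0
    for r in range(h):
        for c in range(w):
            if glyph[r][c] == 0 and not seen[r][c] and not touches_border(r, c):
                holes += 1
    return holes

letter_a = to_grid(["..#..", ".#.#.", "#####", "#...#", "#...#"])
letter_b = to_grid(["###.", "#.#.", "###.", "#.#.", "###."])
holes_a = count_holes(letter_a)
holes_b = count_holes(letter_b)
```

A fuller feature detector would combine several such measurements, for example holes, endpoints, and junctions, before deciding which character it is looking at.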


Post-processing 

Post-processing is happening even as I write this article: the spelling of each word is checked, and words with errors are explicitly highlighted. As the content writer of this article, I will probably use Grammarly, a treasure trove for a grammar stickler like me, which highlights grammar errors and suggests corrections.


Grammarly also nudges me about my heavy-handed English, offering simpler alternatives for the complex words I use to show off my command of the language. All of this falls under the broad umbrella of post-processing, in which the output is cleaned of errors and discrepancies.


Optical Character Recognition accuracy can be improved if the output is constrained by a lexicon, a list of words permitted in the document. For example, this could be all the words in the English language. However, this approach works poorly where proper nouns are used, since they usually fall outside the lexicon.
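A minimal sketch of lexicon-constrained correction, using `difflib.get_close_matches` from Python's standard library: a word that is not in the lexicon is replaced by its closest lexicon entry, provided the two are similar enough. The three-word lexicon and the `cutoff` of 0.8 are assumptions chosen for the example.

```python
import difflib

# Stand-in for a full word list.
LEXICON = ["character", "optical", "recognition"]

def correct_with_lexicon(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word, or the input if nothing is close enough."""
    if word.lower() in lexicon:
        return word
    close = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return close[0] if close else word

fixed = correct_with_lexicon("rec0gnition")    # zero misread for the letter "o"
unchanged = correct_with_lexicon("Washington") # proper noun, not in the lexicon
```

The proper noun passes through untouched because nothing in the lexicon resembles it. A lower cutoff would start "correcting" such words into unrelated dictionary entries, which is the weakness described above.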


Some words in the English language are almost always seen together, like the eternal friends Jai and Veeru. Near-neighbor analysis uses the frequency or probability of co-occurrence to correct words that do not fit together, such as "Washington D.O.C", which should read "Washington D.C."
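The idea can be sketched with a table of how often one word follows another: among the candidate readings of a word, pick the one seen most often after its left-hand neighbor. The counts below are toy numbers invented for this example, and the candidate list is assumed to come from the recognizer; a real system would gather its statistics from a large text corpus.

```python
# Toy co-occurrence counts, invented for illustration only.
PAIR_COUNTS = {
    ("washington", "d.c."): 950,
    ("washington", "post"): 400,
}

def pick_by_neighbor(previous_word, candidates, pair_counts=PAIR_COUNTS):
    """Choose the candidate that most often follows previous_word."""
    prev = previous_word.lower()
    return max(candidates, key=lambda w: pair_counts.get((prev, w.lower()), 0))

best = pick_by_neighbor("Washington", ["D.O.C", "D.C."])
```

Pairs that never occur in the table count as zero, so the unseen "D.O.C" loses to "D.C.".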


Conclusion

As the world becomes increasingly digital, Optical Character Recognition has a pivotal role to play. OCR has made inroads into every sphere of human life, yet its development still falls short of what is possible. OCR is slowly moving beyond mere seeing and matching: it is now used not merely to replicate analog records in digital format but to extract their meaning through deep learning. The future is filled with opportunities and challenges for OCR. And we can’t wait to get started!
