10:51:33.114 hazelworker Processing folder Skrivebord
10:51:35.275 hazelworker Tmp: Rule Old files to temp matched.
10:50:20.947 hazelworker example.pdf: Rule OCR pdf matched.

Optical character recognition (OCR) is the process of automatically recognizing characters in scanned documents for editing, indexing, searching, and reducing storage space. The text produced by OCR usually does not match the text in the original document. OCR post-processing approaches can be used to minimize the number of incorrect words in the output. Correcting OCR errors is more complicated for Arabic because of the complexity of the script: letters are connected, different letters may share the same shape, and the same letter may take different forms. This paper provides a statistical Arabic language model and post-processing techniques based on hybridizing the error model approach with the context approach. The proposed model is language independent and is not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. To train the proposed model, we build an Arabic OCR context database containing 9000 images of Arabic text. The evaluation of the post-processing results is automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that the rule-based system improves the word error rate from 24.02% to 20.26% using a training data set of 1000 images. After this training, we apply the rule-based system to 500 images as a testing dataset, and the word error rate improves from 14.95% to 14.53%. The proposed hybrid OCR post-processing system, using 1000 training images, improves the word error rate from 24.02% to 18.96%. After training the hybrid system, we used 500 images for testing, and the word error rate improved from 14.95% to 14.42%. The obtained results show that the proposed hybrid system outperforms the rule-based system.

Optical character recognition, known as OCR, has been widely used owing to high demand from different technologies. Currently, most existing OCR systems focus on Latin languages. In recent studies, OCR systems for non-Latin texts in cursive style have also been introduced, although these pose some challenges. In this study, the authors propose an OCR system based on long short-term memory neural networks for the Persian language. The authors also investigate the effects of varying the parameters involved in this approach. The proposed OCR system solves false recognition of sub-word ‘LA’ and ‘LA’. Moreover, the authors present a preprocessing algorithm to remove ‘justification’ using image processing. A new comprehensive collated data set is introduced, comprising five million images with eight popular Persian fonts in ten font sizes. The evaluations show that the accuracy of the proposed OCR is 2% higher than that of the existing Persian OCR system. The experimental results indicate that the proposed system has an average accuracy of 99.69% at the letter level. The proposed system has an accuracy of 98.1% for ‘zero-width non-breaking space’ and 98.64% for ‘LA’ at the word level.
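
For readers unfamiliar with the metric, the word error rates quoted above are conventionally computed as the word-level edit distance between the reference text and the OCR output, divided by the number of reference words. The sketch below illustrates that standard definition with a plain Levenshtein distance. It is not the fast automatic hashing text alignment technique named in the abstract, whose details are not given here, and the function names (edit_distance, word_error_rate, relative_reduction) are hypothetical names chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its old value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy strings, not taken from either paper's data set.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative WER reduction from the quoted figures (1000 training images).
    print(relative_reduction(24.02, 18.96))  # hybrid system
    print(relative_reduction(24.02, 20.26))  # rule-based system
```

Applied to the quoted figures, the hybrid system's drop from 24.02% to 18.96% is a relative reduction of about 21%, against about 16% for the rule-based system's drop to 20.26%. Passing strings instead of word lists to edit_distance gives a character-level distance, the usual basis for letter-level accuracy figures.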