Title
RETRIEVAL FROM REAL-LIFE AMHARIC DOCUMENT IMAGES
Full text
http://etd.aau.edu.et/dspace/handle/123456789/4541
Date
2012
Author(s)
BINIAM, ASNAKE
Contributor(s)
Dr. Million Meshesha (PhD)
Abstract
A Thesis Submitted to the School of Graduate Studies of Addis Ababa University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Information Science.

The bulk of real-life documents contain vital information and knowledge about history, culture, economy, politics, religion and science, available in written form in Ethiopic script. This knowledge ought to be shared, and advances in Information Retrieval (IR), Artificial Intelligence (AI) and related fields bring the need to digitize such documents and make them available for public use. The two major approaches to retrieving information from document images are recognition-based (optical character recognition, OCR) and recognition-free (document image retrieval without explicit recognition, DIR). The first approach is a lengthy, error-prone process that performs poorly on noisy documents, whereas recognition-free document image retrieval is the preferred one. A few studies have developed recognition-free document image retrieval systems that extract information from document images relying on image features only. Such systems are highly affected by the noise in real-life documents that results from paper aging, folding, scanning and printing errors. In this study, an attempt is made to integrate effective noise-reduction and thresholding techniques to enhance the effectiveness of the system in searching within real-life document images. The study also improves the system's online searching process by accepting multiple query terms and retrieving documents in a recall-oriented manner, achieving a 77.33% F-measure. Combinations of three noise-reduction techniques (median, adaptive median and Wiener filters) and three thresholding techniques (Otsu's, Niblack's and Sauvola's) are evaluated on printed real-life documents plagued by low, medium, high and very high levels of noise.

Performance analysis shows that the best-performing combination of denoising and thresholding is Wiener filtering followed by Otsu thresholding. Finally, the performance of the system is evaluated before and after the integration of the selected preprocessing techniques, registering an average overall performance of 82.37% F-measure across documents with low, medium, high and very high levels of noise. The major remaining challenge is segmentation error: because of noise, the current system sometimes merges multiple separate words into one; conversely, once the noise is removed, a single word may be split into multiple words when the space between its characters exceeds the segmentation threshold used by the segmentation algorithm.
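The winning preprocessing combination reported in the abstract (Wiener denoising followed by Otsu binarization) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: it uses SciPy's `wiener` filter and a NumPy implementation of Otsu's method, and the window size and function names are assumptions.

```python
import numpy as np
from scipy.signal import wiener


def otsu_threshold(gray):
    """Otsu's method: pick the gray level that maximizes between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    probs = hist / gray.size
    cum_p = np.cumsum(probs)                       # class-0 probability up to t
    cum_mean = np.cumsum(probs * np.arange(256))   # cumulative weighted mean
    global_mean = cum_mean[-1]
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = cum_p[t - 1], 1.0 - cum_p[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t - 1] / w0
        mu1 = (global_mean - cum_mean[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2           # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def preprocess(gray):
    """Wiener denoising, then Otsu binarization (0 = ink, 255 = paper)."""
    denoised = wiener(gray.astype(float), mysize=5)  # 5x5 window is an assumption
    denoised = np.clip(denoised, 0, 255)
    t = otsu_threshold(denoised.astype(np.uint8))
    return np.where(denoised > t, 255, 0).astype(np.uint8)
```

In a retrieval pipeline of this kind, the binarized page would then be passed to word segmentation and feature extraction; the thesis's segmentation-threshold issue discussed above arises downstream of exactly this step.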
Subject(s)
Information science
Language
en
Publisher
aau
Type of publication
Thesis
Repository
Addis Ababa - Addis Ababa University
Added to C-A: 2014-01-30 07:18:08