Detecting and Correcting Real-word Errors in Tamil Sentences
Abstract
Spell checkers deal with two types of errors: non-word errors and real-word errors. Non-word errors fall into two categories: either the word itself is invalid, or the word is valid but absent from the lexicon in use. A real-word error is a valid word that is inappropriate in the context of the sentence. This paper proposes an approach to detecting and correcting real-word errors in the Tamil language. A bigram probability model, built from a 3 GB corpus of Tamil text, is used to determine whether a valid word is appropriate in the context of the sentence. If it is not, the word is marked as a real-word error, the minimum edit distance technique is used to find lexically similar words, and the appropriateness of each such word is measured by a word-level n-gram language probability model. A hash table keyed by word length speeds up the search for lexically similar candidates: only words of length m-1 to m+1 are considered, where m is the length of the word found to be inappropriate. Test results show that the suggestions generated by the system are more than 98% accurate, as judged by a Tamil scholar.
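The pipeline summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy English corpus, the add-one smoothing, the edit-distance cut-off of 2, the rule "flag a word when a lexically similar candidate scores higher in context", and all function and class names are assumptions made for illustration. Word length is measured here with Python's `len()`, which counts Unicode code points; a Tamil implementation would have to decide which unit (code point or written letter) defines m.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def build_length_table(lexicon):
    """Hash table keyed by word length, as described in the abstract."""
    table = defaultdict(set)
    for word in lexicon:
        table[len(word)].add(word)
    return table


def candidates(word, table, max_dist=2):
    """Lexically similar words drawn only from buckets m-1, m and m+1."""
    m = len(word)
    pool = set().union(*(table.get(n, set()) for n in (m - 1, m, m + 1)))
    return sorted(c for c in pool
                  if c != word and edit_distance(word, c) <= max_dist)


class BigramModel:
    """Word-level bigram model with add-one smoothing (an assumption)."""

    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        return ((self.bigrams[(prev, word)] + 1)
                / (self.unigrams[prev] + self.vocab_size))

    def context_score(self, tokens, i):
        """P(w_i | w_{i-1}) * P(w_{i+1} | w_i), padding sentence edges."""
        padded = ["<s>"] + tokens + ["</s>"]
        j = i + 1
        return (self.prob(padded[j - 1], padded[j])
                * self.prob(padded[j], padded[j + 1]))


def suggest(tokens, i, model, table, max_dist=2):
    """Candidates that fit the context better than tokens[i], best first.

    An empty list means the word is treated as appropriate.
    """
    base = model.context_score(tokens, i)
    scored = []
    for cand in candidates(tokens[i], table, max_dist):
        trial = tokens[:i] + [cand] + tokens[i + 1:]
        score = model.context_score(trial, i)
        if score > base:
            scored.append((score, cand))
    return [cand for _, cand in sorted(scored, reverse=True)]


# Toy corpus standing in for the 3 GB Tamil corpus (illustrative only).
corpus = [
    "i want a piece of cake",
    "she ate a piece of bread",
    "we hope for peace of mind",
    "peace and quiet at last",
]
model = BigramModel(corpus)
table = build_length_table({w for s in corpus for w in s.split()})

print(suggest("i want a peace of cake".split(), 3, model, table))
print(suggest("we hope for peace of mind".split(), 3, model, table))
```

In this sketch the length-keyed table limits each lookup to three buckets, so the edit-distance computation runs only against words whose length differs from m by at most one, rather than against the whole lexicon.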
License
From Volume 7 (2016) onwards, all articles published in Ruhuna Journal of Science are Open Access articles published under the Creative Commons CC BY-NC 4.0 International License. This License permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
Copyright on any research article published in RJS is retained by the respective author(s).