Developing Assamese Information Retrieval System Considering NLP Techniques: an attempt for a low resourced language

Anup Kumar Barman, Jumi Sarmah, Shikhar Kumar Sarma

Abstract


This paper describes the activities involved in developing a monolingual Information Retrieval (IR) system for Assamese, an Indo-Aryan language. In a multilingual country like India, where 23 official languages exist, the task of digitizing local-language content is growing rapidly. To meet each individual's information needs, monolingual information retrieval in one's own language is essential. The work aims to develop a search engine that retrieves relevant information for a submitted query in the user's own language. Linguists and researchers collaborated on the work, contributing valuable information and developing important resources. Many informative resources, language resources, tools, and technologies were researched, analyzed, developed, and applied in implementing the overall pipeline. The search engine is built on the open-source search platforms Solr and Nutch, with NLP applications embedded in it. Computational linguistics, or Natural Language Processing (NLP), is used to enhance the performance of the IR system. Each phase of the system is described step by step in this paper. The work is a contribution to Assamese language technology and an important application of NLP.
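As a rough illustration of how an NLP step can sit inside a retrieval pipeline, the sketch below applies the same normalization to documents at indexing time and to the query at search time. It is a toy in-memory inverted index, not the Solr and Nutch setup the paper describes; the suffix list, the function names, and the sample documents are hypothetical and use English placeholders rather than real Assamese stemming rules.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; not the authors' stemmer rules.
TOY_SUFFIXES = ("ing", "s")


def normalize(token):
    """Lowercase a token and strip at most one toy suffix."""
    token = token.lower()
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters of the stem after stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every normalized query term."""
    terms = [normalize(t) for t in query.split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Sample documents (assumed data, for demonstration only).
docs = {
    1: "searching indexed documents",
    2: "document search engines",
    3: "language resources",
}
index = build_index(docs)
print(search(index, "documents search"))
```

Because the query and the documents pass through the same `normalize` function, the query "documents search" matches both document 1 and document 2 even though neither contains those exact word forms together. A production system would replace the toy suffix stripping with a language-specific stemmer and delegate indexing and ranking to the search platform.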





------------------------------------------------------------------------------------------------------------------------

The “ADBU Journal of Engineering Technology (AJET)”, ISSN: 2348-7305

This journal is published under the terms of the Creative Commons Attribution (CC-BY) (http://creativecommons.org/licenses/)
