Unigram language identifications using adaptive neutral network
| Main Authors: | , |
|---|---|
| Format: | Book Section |
| Published: | Institute of Electrical and Electronics Engineers, 2008 |
| Subjects: | |
| Online Access: | http://eprints.utm.my/12790/ |
| Summary: | A web page may contain several scripts, and each script can be used to write several languages. Identifying the languages of a document is required before many search and information retrieval techniques can be applied effectively. In this work, we propose a hybrid-grams feature selection method that integrates unigrams and bigrams. The method uses local statistical information within a document to determine its language. In our experiments, hybrid-grams outperformed both unigrams and bigrams in identifying languages written in Cyrillic and Indic scripts. |
|---|
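
The summary says that hybrid-grams combine unigram and bigram statistics gathered from within a single document, but this record does not say how the features are computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes character-level n-grams, relative frequencies normalised separately for each n-gram order, and whitespace removed before counting. The names `char_ngrams` and `hybrid_gram_features` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping character n-gram of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hybrid_gram_features(text):
    """Merge unigram and bigram relative frequencies into one feature dict.

    Assumptions (not taken from the paper): n-grams are character-level,
    text is lowercased, whitespace is dropped, and each n-gram order is
    normalised by its own total count.
    """
    # Lowercase and drop whitespace, so bigrams may span word boundaries.
    cleaned = "".join(c for c in text.lower() if not c.isspace())
    features = {}
    for n in (1, 2):
        counts = Counter(char_ngrams(cleaned, n))
        total = sum(counts.values())
        for gram, count in counts.items():
            features[gram] = count / total
    return features
```

Calling `hybrid_gram_features("привет мир")` returns a single dictionary whose keys are the one- and two-character sequences found in the text. A classifier such as the neural network named in the title would take a vector built from these values as input; that step is not sketched here because the record gives no detail about it.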