General description: This project is a Java applet that detects the language of a given sentence.
Contributors: Kazem Jahanbakhsh
Implementation period: May 2008
I implemented this program as an assignment for my "Artificial Intelligence" course in Summer 2008. The code takes a sentence (or a list of words) in one of three languages (English, Spanish, or French) as input and outputs the detected language. To do this, I implemented and trained a Naive Bayes classifier.
The program's output does not rely on the accents used in Spanish or French words. This means that the user can replace the French character "â" with "a" or the Spanish character "ó" with "o" when providing the input. The program uses a small corpus of important words for each language (collected by a separate program), together with the likelihood of each word occurring in each language. The code uses Bayes' theorem to compute the probability that the given input is written in each of the three languages.
While parsing a small number of articles in different languages to build my training data, I found that a small set of important words plays an essential role in human languages, including English, Spanish, and French. In particular, I found the following categories of words to be the main features for detecting the language of an input text:
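The approach described above can be illustrated with a minimal sketch. This is not the applet's actual source: the class name `LanguageGuesser`, the tiny word tables, the likelihood values, the uniform prior, and the smoothing constant `UNSEEN` are all assumptions made up for illustration. It strips accents from the input, looks each word up in a per-language `Hashtable`, and picks the language with the highest log-score under the naive independence assumption.

```java
import java.text.Normalizer;
import java.util.Hashtable;
import java.util.Locale;
import java.util.Map;

final class LanguageGuesser {
    // Hypothetical toy likelihoods P(word | language); NOT the project's trained corpus.
    private static final Map<String, Map<String, Double>> LIKELIHOODS = new Hashtable<>();
    // Assumed likelihood for words that are missing from a language's table.
    private static final double UNSEEN = 1e-4;

    static {
        LIKELIHOODS.put("English", table("the", 0.06, "is", 0.02, "and", 0.03, "of", 0.03));
        LIKELIHOODS.put("French", table("le", 0.04, "est", 0.02, "et", 0.03, "de", 0.05));
        LIKELIHOODS.put("Spanish", table("el", 0.04, "es", 0.02, "y", 0.03, "de", 0.05));
    }

    // Builds a word -> likelihood table from alternating (String, Double) arguments.
    private static Map<String, Double> table(Object... pairs) {
        Map<String, Double> t = new Hashtable<>();
        for (int i = 0; i < pairs.length; i += 2) {
            t.put((String) pairs[i], (Double) pairs[i + 1]);
        }
        return t;
    }

    // Lowercases the text and removes accents, so "ó" becomes "o" and "â" becomes "a".
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Returns the language maximizing log P(language) + sum of log P(word | language).
    static String detect(String sentence) {
        String[] words = stripAccents(sentence).split("[^a-z]+");
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> lang : LIKELIHOODS.entrySet()) {
            // Assumed uniform prior over the three languages.
            double score = Math.log(1.0 / LIKELIHOODS.size());
            for (String w : words) {
                if (w.isEmpty()) {
                    continue; // split() can yield an empty leading token
                }
                score += Math.log(lang.getValue().getOrDefault(w, UNSEEN));
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("le chat est noir"));
        System.out.println(detect("el niño es pequeño"));
    }
}
```

Summing logarithms rather than multiplying raw likelihoods avoids numeric underflow on longer inputs. The normalizing denominator of Bayes' theorem is the same for every language, so it can be left out when only the most likely language is needed.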
|Category|Examples (in English)|
The following table also shows a list of interesting words I found for French and Spanish:
Click language recognition to test the program!
*** You can download the language recognition source code from its GitHub repository: language recognition.
Used technology: Java Applet
Number of lines: 1276
Used data structures: Java Hashtables
You should follow me on Twitter: @kjahanbakhsh.