Monday, June 18, 2012

Language Identification: the road ahead and the way behind

whatlanguageis.com started as a proof of concept for a simple idea: using stopwords as a means to identify a language. Stopwords are the glue of everyday speech, and are usually among the most common words. As you may guess if you think about it, in English "the" is a stopword (and a frequent word, too!)

Python's Natural Language Toolkit (NLTK) offers sets of stopwords for several languages, and I use these sets in my language identification proof of concept, which is written in Python. The method is easy so far: get the most common words in the text and compare them with the known stopword lists, which act as a kind of virtual language fingerprint. The language with the most hits is the winner! Problem: stopwords are not a good fingerprint.
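
To make the idea concrete, here is a minimal sketch of the stopword-hit approach. It is not the code running on the site: the tiny `STOPWORDS` sets are hand-picked for illustration (with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` and friends would give fuller lists), and the function names and the cut-off of 20 words are assumptions of mine.

```python
import re
from collections import Counter

# Tiny hand-picked stopword sets, for illustration only (an assumption);
# NLTK's stopwords corpus offers much fuller lists per language.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "it", "that"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}


def most_common_words(text, n=20):
    """Return the n most frequent words of the text, lowercased."""
    # Runs of letters only (Unicode-aware, so accented letters count).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]


def score_languages(text, n=20):
    """Count, per language, how many frequent words are known stopwords."""
    top = set(most_common_words(text, n))
    return {lang: len(top & known) for lang, known in STOPWORDS.items()}


def identify(text, n=20):
    """Pick the language with the most stopword hits (winner takes all)."""
    scores = score_languages(text, n)
    return max(scores, key=scores.get)
```

Calling `identify("The cat is in the garden and it is the owner of the house")` picks `"english"` here, but only because the sentence happens to be full of stopwords; with shorter or less typical text the scores quickly become too small to trust.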

This approach is just a dirty hack: it usually works, but it needs quite a lot of text to do so. That didn't bother me at all the day I wrote it in about 30 lines of Python, but then... One day I woke up, decided I wanted to learn Python web development and started reading the Django book. This led to a hunt for a simple project to check that I understood the basic concepts behind the MVC model that Django offers, and I ended up writing whatlanguageis.com as a test. Look, it works!

Now that the site is up and running though, I want to improve the language detection backend. How?

  • Stop using stopwords and use custom-made corpora instead. A corpus (plural: corpora) is, for my purposes, a set of words extracted from a document. I have to choose adequate documents for this to work.
  • Since I will be using custom-made corpora I will be able to add many more languages... Hopefully I will also add programming languages.
  • Improve the printing of results, adding a range when needed. The identifier assigns a score to each language, and currently the highest score takes all. For example, if English scores 5 and German and Dutch score 4, we get English as the language, even though German and Dutch are also likely choices. In this case I want to output something like "This is quite likely an English text, but may also be German or Dutch."
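
The ranged output from the last point could be sketched roughly as follows. This is only an illustration under my own assumptions: the `margin` parameter (how far below the top score a language may fall and still be mentioned), the function names and the exact wording of the messages are all hypothetical.

```python
def likely_languages(scores, margin=1):
    """Return languages whose score is within `margin` of the best one.

    `scores` maps language names to hit counts; `margin` is a
    hypothetical tolerance, not something the site defines.
    """
    if not scores or max(scores.values()) <= 0:
        return []
    best = max(scores.values())
    # Best score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [lang for lang, score in ranked if score > 0 and best - score <= margin]


def join_with_or(names):
    """Join names as 'A', 'A or B' or 'A, B or C'."""
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " or " + names[-1]


def describe(scores, margin=1):
    """Build a human-readable verdict from the per-language scores."""
    candidates = likely_languages(scores, margin)
    if not candidates:
        return "Could not identify the language."
    first, others = candidates[0], candidates[1:]
    article = "an" if first[0].upper() in "AEIOU" else "a"
    message = f"This is quite likely {article} {first} text"
    if others:
        message += ", but may also be " + join_with_or(others)
    return message + "."
```

With the scores from the example, `describe({"English": 5, "German": 4, "Dutch": 4, "Spanish": 1})` returns "This is quite likely an English text, but may also be Dutch or German." A fixed margin is the simplest choice; a relative threshold (say, a percentage of the best score) may behave better once the scores grow with longer texts.
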
Although it is not exactly a bump in the road, I also want to tidy up the source I'm currently using (which is just a single Python file) and convert it into a module I can reuse, with its own unit tests and anything else I need to make sure it works like a charm.

As you may imagine, this is quite a lot of work, and it also clashes with some of my other projects... But one day or another that beta sign in the upper-left corner of whatlanguageis.com will be gone.