By Nirav Shah and Sumit Goswami
Machine translation is a growing research area, owing to its application in
providing fast and meaningful translation of text and speech from one language
to another. It can be done either with rule-based systems, where hand-written
rules convert one language into another, or through statistical machine
translation (SMT), which uses statistical methods and a parallel corpus to learn
translations. SMT can work at the level of words (word-based translation) or of
phrases (phrase-based translation). The success of a machine translation system
depends on how well the words of one language are aligned with the words of the
other. A statistical machine translation system allows a translation model to be
trained for any language pair; the main requirement is a bilingual parallel
corpus. Various translation methods are in use, such as factored, beam-search
and phrase-based models.
Direct Hit!
Applies To: Language Translators, Computational Linguists, NLP Researchers
Price: Free (GNU GPL)
USP: Create your own language translator
Primary Link: www.statmt.org/moses
Google Keywords: Statistical Machine Translation, Moses
Various machine translation tools are available, such as Apertium (GNU
license), OpenLogos (the open source version of the Logos Machine Translation
System), SYSTRAN (one of the oldest machine translation companies) and Moses
(GNU General Public License).
Moses is a phrase-based machine translation tool for converting text from one
language to another. Technical details are available on the Moses website,
www.statmt.org/moses. In this article, we give a short step-by-step process for
translating text from one language to another, in our example from English to
Hindi, using the Moses machine translation tool.
Parallel Corpus
Prepare a parallel corpus for the source (English) and target (Hindi)
languages; this will be used for training the translation and language models.
The corpus can be prepared from your existing translated data, or it can be
obtained from the Internet, free of cost or for a price: the EMILLE corpus, for
example, has a free version available for research purposes. Similarly, smaller
parallel corpora are required for tuning and for testing the model.
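To sketch what a sentence-aligned parallel corpus looks like (the file names
and the WX-transliterated Hindi below are our own toy illustration, not from any
real corpus), each file holds one sentence per line, and line n of the source
file must be a translation of line n of the target file:

```shell
# Create a toy sentence-aligned parallel corpus: one sentence per line,
# with line n of corpus.en corresponding to line n of corpus.hi.
cat > corpus.en <<'EOF'
this is a small house .
the house is big .
EOF

# Hindi side in WX transliteration (illustrative only).
cat > corpus.hi <<'EOF'
yaha eka CotA Gara hE .
Gara badA hE .
EOF

# The two files must have exactly the same number of lines.
wc -l corpus.en corpus.hi

# View the aligned pairs side by side.
paste -d '|' corpus.en corpus.hi
```

If the line counts differ, the alignment is broken and training will fail or
produce garbage, so this check is worth doing on any corpus you prepare.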
Data preparation
The parallel corpus is converted to a format suitable for GIZA++, an open
source tool that implements the IBM alignment models and is used for word
alignment. Before training Moses, the following software should be downloaded:
- SRILM <http://tinyurl.com/dx8m5m> - A toolkit from SRI International
(formerly Stanford Research Institute) for building statistical language
models.
- GIZA++ <http://tinyurl.com/cdem45> or <http://giza-pp.googlecode.com/> -
Developed by Franz Josef Och, this tool implements alignment models such as the
IBM models and HMM, and performs word alignment.
- MKCLS <http://tinyurl.com/c83mpx> or <http://www.fjoch.com/mkcls.html> -
Also developed by Franz Josef Och, this tool trains the word classes used in
the SMT model. MKCLS and GIZA++ require a recent GNU compiler.
- Moses <http://sourceforge.net/projects/mosesdecoder>
- Additional scripts <http://tinyurl.com/cp8xz7> - These are the additional
scripts for Moses training and tuning.
For this article we are keeping /usr/home/PCQ/demo as the root directory for
installation. The steps given below are relative to this root directory.
Getting started
Create a directory 'srilm' in the root directory, move the downloaded SRILM
tar file there, extract it and run make.
Then move the GIZA++ tar file to the root directory and extract it; this
creates a directory GIZA++-v2. Run make inside GIZA++-v2, and thereafter run
make again with the target snt2cooc.out from the same directory. This produces
the binaries GIZA++ and snt2cooc.out. Create a directory 'bin' in the root
directory and copy these two files into it.
Now move the mkcls tar file to the root directory and extract it. This creates
the mkcls-v2 directory, in which you should run make. This produces the mkcls
binary, which should also be copied to 'bin'.
Create a directory named 'moses' under the root directory, copy the Moses tar
file there and extract it. Now change to the moses directory and execute
regenerate-makefiles.sh, then run the configure script as './configure
--with-srilm=/usr/home/PCQ/demo/srilm' and finally run 'make -j 4'.
Now move to the 'bin' directory under the root and create a 'moses-scripts'
directory in it. Thereafter move to the moses/scripts directory under the root
and run 'make release'. Finally, move the additional-scripts tar file to the
root directory and extract it. This completes the set-up of Moses.
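Putting the steps above together, the whole build sequence can be sketched as
a shell script. The tarball names are our assumptions (yours may differ
depending on the versions downloaded), so the script only prints each command
through a run() helper instead of executing it; remove the helper from each
line to actually build:

```shell
#!/bin/sh
# Dry-run sketch of the build sequence described above. Tarball names
# are assumed; adjust them to match your downloads. run() prints each
# command instead of executing it.
run() { echo "+ $*"; }

ROOT=/usr/home/PCQ/demo          # root directory used in this article

run mkdir -p $ROOT/bin $ROOT/srilm $ROOT/moses

# SRILM: extract into its own directory and build.
run tar xzf srilm.tar.gz -C $ROOT/srilm
run make -C $ROOT/srilm

# GIZA++: build, then build snt2cooc.out, and copy both binaries to bin.
run tar xzf GIZA++-v2.tar.gz -C $ROOT
run make -C $ROOT/GIZA++-v2
run make -C $ROOT/GIZA++-v2 snt2cooc.out
run cp $ROOT/GIZA++-v2/GIZA++ $ROOT/GIZA++-v2/snt2cooc.out $ROOT/bin

# mkcls: build and copy the binary to bin.
run tar xzf mkcls-v2.tar.gz -C $ROOT
run make -C $ROOT/mkcls-v2
run cp $ROOT/mkcls-v2/mkcls $ROOT/bin

# Moses: regenerate makefiles, point configure at SRILM, and build.
run tar xzf moses.tar.gz -C $ROOT/moses
run sh $ROOT/moses/regenerate-makefiles.sh
run sh $ROOT/moses/configure --with-srilm=$ROOT/srilm
run make -C $ROOT/moses -j 4

# Training and tuning scripts.
run mkdir -p $ROOT/bin/moses-scripts
run make -C $ROOT/moses/scripts release
```

Keeping the sequence in a script makes it easy to redo the set-up on another
machine or after a failed build step.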
Training the Translator
Now we can process our parallel corpus. Create the directories
working-dir/corpus under the root directory and copy the English and Hindi
corpus files into them. Filter out long sentences and lowercase the English
data; the Hindi data need not be lowercased because it is in WX format. Next,
create a directory 'lm' inside 'working-dir' to build the language model, again
using the lowercased English data. These steps produce the English language
model data, from which we can build the language model using SRILM. Once the
language model is ready, we can train the translation model. After training,
the model can also be tuned for better performance; however, tuning is not
mandatory.
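The clean-up step can be illustrated with standard tools (Moses also ships
helper scripts for this; the 40-word sentence limit and the file names below
are our own choices for the sketch):

```shell
# Toy English corpus (stand-in for working-dir/corpus data).
cat > corpus.en <<'EOF'
This is a small House .
The QUICK brown fox jumps over the lazy dog .
EOF

# Filter out overly long sentences (here: more than 40 words) and
# lowercase what remains, as done before LM building and training.
# Note: on a real parallel corpus, a filtered sentence must be dropped
# from both the English and the Hindi file to keep the alignment intact.
awk 'NF <= 40' corpus.en | tr '[:upper:]' '[:lower:]' > corpus.lowercased.en

cat corpus.lowercased.en
```

The language model would then be built from the lowercased file with SRILM,
and the filtered, lowercased corpus pair fed to the Moses training scripts.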
For translating, an English sentence can be given as input:

echo 'This is a small House.' | /usr/home/PCQ/demo/moses/moses-cmd/src/moses -f moses.ini > out.txt

The translated Hindi sentence (in WX notation) can then be seen with 'cat
out.txt', which will contain this: yaha eka CotA AvAsagqha Hai.
Conclusion
If the corpus is large, building the translation model requires a lot of
memory, at least 2 GB. Besides, a few steps in the process can take anywhere
from a few minutes to a few hours, depending on the processing power, the
memory available and the size of the training corpus. The model described here
is a baseline model, and research is ongoing to improve translation results. In
general, the larger the training corpus, the higher the translation accuracy of
the trained model.