Why doesn’t machine translation help with multilingual deep learning?

Machine translation is improving rapidly. But is it at the level we expect for all tasks?

I recently participated in my first Kaggle competition and, wanting it to be something related to natural language processing, I opted for Contradictory, My Dear Watson, a natural language inference (NLI) competition where the goal is to establish whether one sentence entails another, contradicts it, or is unrelated to it.

I hadn’t realized going into the competition that I’d be tackling something I hadn’t dealt with in the past couple of years: a multi-language classification task. As soon as I realized this, I was super-excited to harness the power of Google Translate’s new and improved translation algorithm.

The logic behind my approach was the following:

  • multilingual tasks are hard, and pre-trained models like XLM-RoBERTa or m-BERT do not cover all the languages this task features
  • the competition dataset is imbalanced — nearly 57% of examples are in English while other languages have a share of under 3% each
  • there are other natural language inference datasets that I could use for fine-tuning, but only XNLI is multilingual and the others are in English
  • having one language would reduce the dimensionality of data and the feature-space the model would have to deal with
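
The imbalance in the second point is easy to check before committing to an approach. Here is a minimal sketch, assuming each example is a dict with a `language` field (an assumption about how the data is loaded); the toy rows are made up for illustration and are not the competition data.

```python
from collections import Counter

def language_shares(rows, lang_key="language"):
    """Return {language: share of rows}, ordered from most to least frequent."""
    counts = Counter(row[lang_key] for row in rows)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Toy rows standing in for the real training set (made up for illustration).
toy_rows = (
    [{"language": "English"}] * 6
    + [{"language": "Swahili"}] * 2
    + [{"language": "Urdu"}] * 2
)

shares = language_shares(toy_rows)
for lang, share in shares.items():
    print(f"{lang}: {share:.0%}")
```

On the real data, the same function would show English dominating and every other language sitting at a small share.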

Spoiler alert: The translation approach sucked

My baseline accuracy for this task was 65%. Translating the competition dataset to English reduced the accuracy by 6%.

So, not only did translated data fail to improve the classifier’s performance, I also had to move the data into a Google Sheet to translate it, because most Python-based translation libraries either a) did not work due to a recent update on Google’s end, b) were rate limited, or c) were too slow to use in real time in my code (possibly due to b).
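
For anyone who wants to stay in Python, rate limits can be softened by translating each distinct sentence only once and backing off between retries. This is a sketch of that idea, not what I actually did: `translate_fn` is a hypothetical callable you would supply yourself (it wraps whichever translation client works for you), and the retry counts and delays are arbitrary.

```python
import time

def translate_all(texts, translate_fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Translate each distinct text once, retrying failures with exponential backoff.

    translate_fn is a hypothetical, caller-supplied function: str -> str.
    """
    cache = {}
    for text in texts:
        if text in cache:
            continue  # duplicate sentence, no extra request needed
        for attempt in range(max_retries + 1):
            try:
                cache[text] = translate_fn(text)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last retry
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return [cache[t] for t in texts]

# Stand-in "translator" so the sketch runs without any network access.
print(translate_all(["hola", "adios", "hola"], str.upper))
```

Caching matters here because NLI datasets tend to repeat premises, so duplicates would otherwise burn through the quota.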

In the denial stage of the grieving process, I thought that the drop in accuracy might be due to the specific nature of the task at hand. Future me discovered that others have seen a similar effect in sentiment analysis and in chatbots.

So, why is translation (still) a bad idea?

A distinction should be made here: machine and human translation could perform differently in this case. There are papers showing that machine translation can be distinguished from human translation by automatic classification systems at pretty decent accuracy levels (~75%), and a study on the topic of using multilingual versus translated data in sentiment analysis suggests that translation approaches have worse results due to the low quality of machine translation available for most languages.

It seems that poor translation introduces noise into the data that negatively affects model performance. My intuition about the state of Google Translate seems to have been wrong. Interestingly, a week later, in an unrelated discussion, a colleague expressed the same expectation — he thought machine translation should be “good enough” by now.

One way to verify this would be to compare classification performance on human translated versus machine translated data.
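
A minimal sketch of what that comparison could look like, assuming the same classifier is run on the human-translated and the machine-translated version of the same examples. The exact McNemar test used here is my choice of paired test, not something from the competition, and the labels and predictions are toy values made up for illustration.

```python
from math import comb

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mcnemar_exact(preds_a, preds_b, labels):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples only A got right
    and c counts examples only B got right.
    """
    b = sum(pa == y and pb != y for pa, pb, y in zip(preds_a, preds_b, labels))
    c = sum(pa != y and pb == y for pa, pb, y in zip(preds_a, preds_b, labels))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two runs never disagree on correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data: gold labels plus predictions on each translated version.
labels = [0, 1, 2, 0, 1, 2, 0, 1]
preds_human = [0, 1, 2, 0, 1, 2, 0, 0]
preds_machine = [0, 1, 2, 0, 2, 1, 1, 0]

print("human-translated accuracy:", accuracy(preds_human, labels))
print("machine-translated accuracy:", accuracy(preds_machine, labels))
print("b, c, p-value:", mcnemar_exact(preds_human, preds_machine, labels))
```

A paired test fits because both runs score the same examples, so only the cases where exactly one version is classified correctly carry information about the difference.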

Guess what my next research topic is! 😊

P. S. I was amazed by the creativity of other participants in this competition on Kaggle who used Google Translate to augment the competition dataset by generating translations of the existing examples into languages with fewer data samples! Somehow, ideologically, I prefer this idea to translating everything into English, but unfortunately, I imagine it encounters the same problem I describe above, perhaps even more so due to the known variability in translation quality depending on the language pair and translation direction.

P. P. S. While (machine) translation may not be the best option for tasks that have robust multilingual models that support working with a range of languages other than English, another study showed some interesting variation in the success of translation approaches in cases where the quality of tools available for a given language is not up to par with tools available for English.
