At Qwant, we have been working on these approaches for several years now, and we propose in this article to deepen the subject and the technique of understanding.
In AI, one of the main challenges is to allow computers to understand human speech through, for example, a voice assistant. These oral dialogue systems are interfaces between users and services, such as chatbots or voice assistants, that allow users to order objects, manage their calendar, find their way, answer questions, book trains, flights, a restaurant, etc. Dialogue systems are composed of several modules. Key modules include automatic speech recognition (ASR), speech understanding (SLU), dialogue management, and speech synthesis (Figure 1). Here, we will focus on speech understanding, the component of dialogue systems that aims to understand human needs.
Today, speech comprehension (SLU) is an integral part of many applications, but very few companies, outside of GAFAM or BATX, provide this service and have expertise in such a difficult to manage field.
Coming from a long history of theoretical linguistics, the SLU has several variations and can have several aspects. One of these aspects is the ability to relate a word or sequence of words to a meaning, or concept. That is where our work is focused.
*COLING2020, the 28th International Conference on Computational Linguistics, is a major conference in Natural Language Processing
At QWANT, we have been working on these approaches for several years now, and we propose in this article to deepen the subject and the technique of understanding. We will also explain how we can adapt this work to our industrial needs and how we compare them, which allowed us to publish in one of the main conferences on Natural Language Processing this year: COLING 2020.
Semantic interpretation made by a computer is a process of conceptualization of the world using a tool to create a structure for the representation of meaning, from different symbols and their characteristics present in words or sentences. Speech understanding is the interpretation of these symbols included in speech.
This representation aims to translate human natural language into a form that can be interpreted by the computer, which allows it, for example, to query a database. This step is crucial in a voice assistant and dialogue system, is a part of the intelligence of the system, which makes it possible to extract the very meaning of a sentence.
Several systems and approaches have been around for years and are continually evaluated against traditional comprehension tasks. One of these comprehension tasks is the MEDIA Task task, described in the next section.
The French MEDIA Task aims to evaluate SLU through a spoken dialogue system dedicated to tourist information and hotel booking. More than 1,250 dialogues were recorded, transcribed and annotated manually to prepare this assessment. Created in 2005 for a French evaluation campaign entitled TECHNOLANGUE/EVALDA/MEDIA, many systems have been evaluated for years, which has made it possible to become an essential reference in the world of speech comprehension.
Figure 2 shows an example of labeling applied to a sentence in the MEDIA corpus. Recently identified as one of the most difficult tasks to understand, we study and adjust our models to achieve the best results. Our models of understanding will meet this challenge.
We decided to take up the challenge of the MEDIA task, using the state-of-the-art approaches that exist in Natural Language Processing. Most of these approaches have been successfully applied to English corpora, but few have been tested on a morphologically richer language such as French. Most of these models have been studied and applied for years at Qwant, they use recent advances in Artificial Intelligence models: a Neural Network. The first uses ordered sequences of words and characters as input to label ordered sequences of words (also known as the sequence-to-sequence approach).
This input is given to a specific neural network model (a convolutional layer), which extracts the features and gives them as input to another (a Bi-LSTM), which encodes them and offers its output to a decoder and a decision layer (a CRF). This model was originally proposed by Ma and Hovy , which we adapted and modified according to our needs.
The second model used is the famous BERT model and more particularly its French version CamenBERT, proposed by a French research team from INRIA (ALMAnaCH) . This last model is a pre-trained model, which used more than 190GB of training data. This model was downloaded and adapted to our task (or “fine-tuned”). Both models will be evaluated on the MEDIA task.
 Bi-Directional Long-Short-Term-Memory model
 CRF for Conditional Random Fields
To evaluate our models on the comprehension task, we need to use two measures, the first is a conceptual error rate (CER), which indicates the error rate between a hypothesis and a reference (the lower the better). The second is the F-measurement, which is an average between precision and recall measurements. The precision indicates how many selected solutions are relevant, the reminder is the number of relevant elements selected (see Figure 4). If we were fishing, accuracy would be associated with angling, and recall with net fishing. For accuracy, recall and F-measurement: the higher the measurement, the better.
Table 1 shows the results obtained on the MEDIA task according to the F-measure and the CER. SOTA indicates the current state of performance in the scientific literature (State of the Art), BiLSTM-CNN-CRF refers to the first model described in the Model section. Finally, the best approach is the Fine-Tuned CamemBERT, which makes it possible to create a real breakthrough in this task.
While the classic Qwant model is a little below the state of the art, the French version of BERT allows us to improve by nearly 2 points of CER and more than 1.4 points of F-Mesure.
Today, even with the latest generation servers, scaling BERT models is a complex task. Especially when inference time is crucial. Indeed, we have the challenge of processing 1,000 requests per second (1ms to answer the request). The production version of our BiLSTM-CNN-CRF can reach 20 ms per request, without any optimization. In parallel, we are working on BERT models optimized for use in production.
Scaling up is one of the main challenges to the use of artificial intelligence in business. This problem has been addressed by many researchers and engineers, most of the time the solution lies in reducing the complexity of the model, together with a quantification operation. This last operation aims to reduce the numerical representation of the number in a model (from 32 bits to 8 bits, most of the time).
At Qwant, we are working on these optimizations so that these models are up to our challenges. We believe they can improve our search engine results and user experience.
This work was co-written by Sahar Ghannay, Sophie Rosset (Université Paris-Saclay, CNRS, LIMSI) and Christophe Servan (Qwant). This work was funded in part thanks to the LIHLITH project (ANR-17-CHR2-0001-03), with the support of the ERA-Net CHIST-ERA, that of the “Agence Nationale pour la Recherche” (ANR, France).
 Hélène Bonneau-Maynard, Sophie Rosset, Christelle Ayache, Anne Kuhn and Djamel Mostefa. “Semantic annotation of the corpus of French media dialogue”. In the Proceedings of the European Conference on Voice Communication and Technology (Eurospeech), Lisbon, Portugal, 2005.
 Frédéric Béchet and Christian Raymond. “Benchmarking benchmarks: introducing new automatic indicators for benchmarking spoken language understanding corpora”. In Interspeech, Graz, Austria, 2019
 Xuezhe Ma, and Eduard H. Hovy. “End-to-end sequence labeling via bidirectional LSTM-CNNs-CRF”. In the Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
 Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, Benoît Sagot. “CamemBERT: a tasty model of the French language”. In the Proceedings of the 58th Annual Meeting of the Association of Computational Linguistics, 2020.