Thursday, October 3, 2019

How does machine reading comprehension go "beyond humans"? Microsoft Research Asia explains its technology


On January 3, 2018, Microsoft Research Asia's R-NET took the lead on SQuAD with an EM score of 82.650, the first time a system surpassed the human score of 82.304, set in 2016, on the Exact Match metric. Xinzhiyuan interviewed Zhou Ming's team at MSRA, who explained in detail what EM and F1 are, what "surpassing humans" specifically means, which core problems are hardest for NLP to break through, and the current state and future prospects of natural language processing technology in China.

What are the EM and F1 values? What is an ensemble? What is the difference between an ensemble model and a single model?

The SQuAD competition uses two evaluation metrics, EM and F1.
EM (Exact Match) requires the answer given by the system to match the human-annotated answer exactly (after removing punctuation and the articles a, an, the). An exact match scores 1 point; anything else scores 0.
F1 computes a score between 0 and 1 from the overlap between the system's answer and the human-annotated answer, namely the harmonic mean of word-level precision and recall.
For example, suppose the annotated answer to a question is "Denver Broncos". Only if the system's output matches the annotated answer exactly (i.e. "Denver Broncos") does it score 1 on EM; otherwise it scores 0.
With F1, an answer that is not identical to the annotated answer can still earn partial credit. If the system outputs "Broncos", for instance, the EM score is 0, but the answer still receives an F1 score of 0.67.
EM is the more demanding metric, and it is the one on which MSRA exceeded the human score on SQuAD for the first time.
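The two metrics can be sketched in a few lines of Python. This is a minimal illustration of the definitions above, not the official SQuAD evaluation script; the function names (normalize, exact_match, f1_score) are chosen here for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and the articles a/an/the, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Broncos", "Denver Broncos"))         # 0.0
print(round(f1_score("Broncos", "Denver Broncos"), 2))  # 0.67
```

For "Broncos" against "Denver Broncos", precision is 1 (the one predicted word is correct) and recall is 0.5 (one of two reference words is found), which gives the 0.67 quoted above.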
Model ensembling is a common way to improve system performance. Because the initialization and training of a neural network are randomized, the same algorithm trained several times on the same data yields different models.
An ensemble trains multiple single models and then combines their outputs to produce the final result.
Ensemble models generally perform better than single models, but at the cost of response speed and computing resources. In practice, a balance has to be struck between model quality and model efficiency (better versus faster).
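As a rough sketch of what "combining the outputs" can look like: the interview does not say how R-NET combines its single models, so the averaging rule below is only one common choice, and the three probability lists are made-up numbers standing in for single-model outputs.

```python
def average_distributions(distributions):
    """Average several probability distributions over the same positions."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical start-position probabilities from three single models
# trained with different random seeds (illustrative values only).
model_a = [0.1, 0.6, 0.2, 0.1]
model_b = [0.5, 0.3, 0.1, 0.1]
model_c = [0.2, 0.5, 0.2, 0.1]

ensemble = average_distributions([model_a, model_b, model_c])
print(argmax(model_b))   # 0: this single model disagrees with the others
print(argmax(ensemble))  # 1: the averaged prediction
```

The trade-off mentioned above is visible here: every query now needs three forward passes instead of one.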

What exactly does "surpassing humans" mean?

Each question in the SQuAD test set has at least three answers, each given by a different annotator. SQuAD treats the second answer as the human prediction and the remaining answers as the reference answers.
For EM, the prediction scores if it is identical to any one of the reference answers. For F1, the highest score over all reference answers is taken. This yields the human EM score (82.304) and F1 score (91.221).
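The scoring rule for the human baseline can be sketched as follows. This is an illustration of the procedure just described, with simplified tokenization (lowercasing and whitespace splitting only); the helper names and the three example answers are assumptions made for the sketch.

```python
from collections import Counter


def exact(prediction, reference):
    """1.0 if both answers contain the same words in the same order."""
    return float(prediction.lower().split() == reference.lower().split())


def token_f1(prediction, reference):
    """Word-level F1 between two answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score the prediction against every reference and keep the best."""
    return max(metric(prediction, ref) for ref in references)


def human_scores(answers):
    """answers: the annotators' answers to one question (at least three)."""
    prediction = answers[1]                # second answer = human prediction
    references = answers[:1] + answers[2:]  # the rest = reference answers
    em = best_over_references(exact, prediction, references)
    f1 = best_over_references(token_f1, prediction, references)
    return em, f1


em, f1 = human_scores(["Denver Broncos", "Broncos", "the Denver Broncos"])
print(em, round(f1, 2))  # 0.0 0.67
```

Averaging these per-question values over the whole test set gives dataset-level scores of the kind quoted above.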
Since the competition began in 2016, we at Microsoft Research Asia have submitted almost every one of our models. At the end of 2017, our score of 82.136 was already very close to the human level, only 0.17 points short. This time our model's EM reached 82.650, surpassing the human score by about 0.3 points on exact answers. Put simply, those 0.3 points mean that on this set of questions our system answers about 30 more questions correctly than the human does.
This by no means shows that computers have surpassed humans at reading comprehension. The result holds only under preconditions, such as a fixed question set and a fixed test time, and the human baseline reflects only the average comprehension level of adults.
"Surpassing humans" should not be turned into a media headline. While we see technical progress, we should think calmly about how to keep improving the models and how to apply the technology. This is an ecosystem: it requires all participants to compete healthily and to overcome the difficulties of the current stage, rather than lingering over the joy of taking first place.

What is the core technical problem that is hardest for NLP to break through?

At present, the top-ranked systems on the SQuAD leaderboard use end-to-end deep neural networks. They generally contain the following parts:

Embedding Layer: This layer generally uses word vectors pre-trained on large-scale external data (such as GloVe), together with character-to-word representations built with recurrent or convolutional neural networks, to obtain a context-independent representation of each word in the question and in the passage. Some models also extract additional features and feed them to the network together with the word vectors. This corresponds to the vocabulary knowledge a human reader brings to the text.

Encoding Layer: A multi-layer recurrent neural network is generally used to obtain a context-dependent representation of each word in the question and in the passage. This corresponds to reading the question and the passage.

Matching Layer: This layer establishes the correspondence (or matching) between the words in the question and the words in the passage. It is basically implemented with an attention mechanism; common variants are based on Match-LSTM and Co-attention. The result is a question-aware representation of each word in the passage. This corresponds to reading the passage with the question in mind.

Self-Matching Layer: Starting from the question-aware word representations, a self-attention mechanism further refines the representation of the words in the passage. This corresponds to reading the passage again; as the saying goes, read a book a hundred times and its meaning becomes clear.

Answer Pointer Layer: This layer predicts, for each word in the passage, the probability that it is the start of the answer and the probability that it is the end of the answer, and outputs the passage substring with the highest answer probability as the answer. It is generally implemented with Pointer Networks. This corresponds to a person combining all the clues and knowledge to locate the answer in the passage.
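The final step of the Answer Pointer Layer, choosing the most probable substring from the start and end probabilities, can be sketched as follows. This is a minimal illustration rather than R-NET's actual decoding code: the span score is taken as the product of the two probabilities, the max_len cap is an assumption added here, and the probabilities are invented for the example.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Return ((start, end), score) for the most probable answer span.

    A span (i, j) with i <= j is scored as start_probs[i] * end_probs[j].
    max_len limits the span length in tokens (an assumption of this sketch).
    """
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


tokens = "The Denver Broncos won Super Bowl 50".split()
# Hypothetical per-token probabilities (illustrative values only).
start = [0.05, 0.70, 0.10, 0.05, 0.04, 0.03, 0.03]
end = [0.02, 0.08, 0.75, 0.05, 0.04, 0.03, 0.03]

(i, j), score = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Denver Broncos
```

Requiring i <= j is what keeps the output a valid substring of the passage; taking the best start and the best end independently could produce an end position before the start.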
In fact, the top-ranked systems on SQuAD are broadly similar in their models and algorithms.
This is the result of more than a year of joint effort, mutual learning, and improvement by the entire reading comprehension research community (colleagues from different universities, companies, and research institutions) since the SQuAD competition began.
The best models available today generally combine the following algorithms or components on top of the early base models.
Examples include innovations in the attention mechanism that build on Match-LSTM (Singapore Management University) and BiDAF (Allen Institute for Artificial Intelligence), such as Salesforce's Coattention mechanism and the Gated-Attention mechanism in R-NET; the Self-Matching (or Self-Attention) mechanism in R-NET; and pre-trained contextualized vectors that improve model performance, including sentence encoders obtained by training neural machine translation (Salesforce) and bidirectional language models trained on large-scale external text data (Allen Institute for Artificial Intelligence).
There are of course also improvements and innovations in network design and parameter tuning. It is fair to say that the current result comes from the sustained effort and cooperation of the entire reading comprehension community over the past year.

Is Chinese reading comprehension more difficult than English?
Judging from current research, I have not seen a paper saying that Chinese reading comprehension is necessarily harder than English. I feel that each has its own difficulties. For example, Chinese idioms and English proverbs are both hard, and the two languages also handle reference differently. Specific scenarios need to be analyzed and the models adjusted continuously.
There is now a Chinese reading comprehension competition in China. It is jointly sponsored by the Chinese Information Processing Society of China (CIPS) and the China Computer Federation (CCF), and organized by Baidu, the CIPS Evaluation Working Committee, and the CCF Technical Committee on Chinese Information Technology. Registration officially opens on March 1, 2018. The winning teams will share a total of 100,000 RMB in prizes, and technical exchanges and an awards ceremony will be held at the 3rd Language and Intelligence Summit.
This is a very good thing. The competition data set contains 300,000 real questions from Baidu search; each question comes with 5 candidate documents as well as high-quality, manually written answers.
The task in such a competition is usually defined as having the machine read a text and then answer questions about its content. Reading comprehension involves complex techniques such as language understanding, knowledge reasoning, and summary generation, which makes it extremely challenging.
Research on these tasks matters greatly for artificial intelligence applications such as intelligent search, intelligent recommendation, and intelligent interaction. It is an important frontier topic in natural language processing and artificial intelligence.

Putting machine reading comprehension technology into practice

Machine reading comprehension technology has a wide range of application scenarios.
In search engines, it can be used to give smarter answers to user searches, especially question-type queries. R-NET technology has already been applied successfully in Microsoft's Bing search engine: by reading and understanding documents from across the Internet, we provide accurate answers directly to users.
The same technology applies directly to personal assistants in mobile scenarios, such as Microsoft's Cortana.
Machine reading comprehension is also widely useful in the commercial field. In intelligent customer service, for example, the machine can read text documents (such as user manuals and product descriptions) to answer user questions automatically or to assist customer service staff in answering them.
In the office field, the technology also has good prospects. For example, it can process personal emails or documents so that relevant information can then be retrieved with natural language queries.
It also has very broad prospects in vertical fields: in education, to assist with setting questions; in law, to understand legal provisions and assist lawyers or judges in deciding cases; and in finance, to extract finance-related information from unstructured text such as news.
We believe that reading comprehension is one of the most critical abilities in human intelligence. Machine reading comprehension can be offered to third parties as a general-purpose capability for building more applications.

Machine reading comprehension technology 2019 and beyond

Technically, there is still a lot of room for improvement in algorithms and models based on deep learning. Whether we can propose networks that effectively model complex reasoning and make effective use of common sense and external knowledge (such as knowledge bases) is a very interesting research topic.
In addition, current deep learning models for machine reading comprehension are black boxes, and it is difficult to show intuitively the process and results of the machine's reading comprehension. Interpretable deep learning models will therefore also be an interesting research direction.
In the current SQuAD task definition, the answer is a sub-segment of the original text. In practice, people may need to carry out more complicated reasoning after reading an article and compose the answer in new words. The MARCO dataset released by Microsoft is a step in this direction.
Furthermore, the current SQuAD data set assumes that the answer to every question can be found in the corresponding passage. This constraint is reasonable and effective for competition and research, but it means that existing models output the most likely passage fragment even when they are not certain of it.
This assumption, and the resulting model output, is not reasonable in practical applications. Humans have a very important ability when answering reading comprehension questions: if the text contains no answer, a person recognizes this and declines to answer.
This is a very important topic, both for research and for practical applications. We are already doing research in this area and have made some good progress.
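To make the idea of declining to answer concrete, here is a deliberately simple sketch. It is not the team's method, which the interview does not describe; it only assumes that the model assigns each candidate answer a confidence score, and the threshold value and function name are hypothetical.

```python
def answer_or_abstain(candidates, threshold=0.5):
    """Return the best candidate answer, or None to decline to answer.

    candidates: list of (answer_text, confidence) pairs, with confidence
    in [0, 1]. The threshold is an illustrative assumption; in practice it
    would have to be tuned on held-out data.
    """
    if not candidates:
        return None
    text, confidence = max(candidates, key=lambda pair: pair[1])
    return text if confidence >= threshold else None


print(answer_or_abstain([("Denver Broncos", 0.8), ("Carolina Panthers", 0.1)]))
print(answer_or_abstain([("Denver Broncos", 0.2), ("Carolina Panthers", 0.1)]))
```

The first call returns the top candidate; the second returns None because no candidate is confident enough, which is the behavior a model forced to always output a span cannot provide.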
