On January 3, 2018, Microsoft Research Asia's R-NET took the lead on SQuAD with an EM score of 82.650, surpassing for the first time, on the Exact Match metric, the human benchmark of 82.304 set in 2016. Xinzhiyuan promptly interviewed Zhou Ming's team at MSRA, who explain in detail for readers what EM and F1 are, what "surpassing humans" actually means, the core problems that are hardest for NLP to break through, and the current state and future prospects of natural language processing technology in China.
What are the EM and F1 scores? What is an ensemble model, and how does it differ from a single model?
The SQuAD competition uses two evaluation metrics: EM and F1.
EM (Exact Match) requires the system's answer to match the human-annotated answer exactly (after removing punctuation and the articles a, an, the); an exact match scores 1 point, anything else scores 0.
F1 assigns a score between 0 and 1 based on the word-level overlap between the system's answer and the human-annotated answer; it is the harmonic mean of word-level precision and recall.
For example, suppose the annotated answer to a question is "Denver Broncos". Only an output that exactly matches the annotated answer (i.e. "Denver Broncos") earns the EM point; anything else scores nothing.
For F1, the system earns partial credit even when its output is not identical to the human answer. If it outputs "Broncos", for instance, the EM score is 0, but the F1 metric still awards a partial score (0.67).
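A minimal sketch of how these two metrics can be computed, modeled on the normalization used by the official SQuAD evaluation script (lower-casing, stripping punctuation, removing the articles a/an/the, and collapsing whitespace); details may differ slightly from the official code.

```python
# Minimal sketch of SQuAD-style EM and F1 scoring.
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lower-case, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """EM: 1 if the normalized strings match exactly, otherwise 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Word-level F1: harmonic mean of precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The "Denver Broncos" example from the text:
print(exact_match("Broncos", "Denver Broncos"))            # 0
print(round(f1_score("Broncos", "Denver Broncos"), 2))     # 0.67
```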
EM is the more demanding metric, and it is on EM that MSRA has, for the first time, exceeded the human result on SQuAD.
Model ensembling is a common way to improve system performance. Because a neural network's initialization and training process are stochastic, training the same algorithm on the same data several times yields different models.
Ensembling trains multiple single models and then combines their outputs to produce the final result.
Ensembled models generally perform better than single models, but at the cost of system responsiveness and computing resources. In practice, a balance must be struck between model quality and model efficiency (better versus faster).
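As a hedged illustration of span-level ensembling (the interview does not describe R-NET's exact combination rule, so the averaging scheme below is only one common choice): several independently trained single models each produce start/end probabilities over passage positions, the distributions are averaged, and the best span is re-selected from the averaged distributions.

```python
# Hedged sketch of span-level ensembling for extractive QA. The max_len cap
# and the averaging rule are illustrative assumptions.
import numpy as np

def best_span(start_probs, end_probs, max_len=15):
    """Pick the (start, end) pair with the highest joint probability."""
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

def ensemble_predict(per_model_start, per_model_end):
    """Combine single models by averaging their start/end distributions."""
    start_probs = np.mean(per_model_start, axis=0)
    end_probs = np.mean(per_model_end, axis=0)
    return best_span(start_probs, end_probs)
```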
What exactly does "surpassing humans" mean?
How does machine reading comprehension go "beyond humans"? Microsoft Research Asia explains the technology.
Each question in the SQuAD test set has at least three annotated answers (at least three people answered every question). To measure human performance, SQuAD takes the second answer as the human "prediction" and treats the remaining answers as the reference answers.
For EM, the prediction counts as correct if it matches any reference answer; for F1, the highest score against any reference answer is taken. This yields the human EM score (82.304) and F1 score (91.221).
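A small sketch of how that human baseline can be computed, reusing the exact_match and f1_score helpers from the earlier sketch; the input layout (a list of at least three annotated answers per question) is assumed here for illustration.

```python
# Sketch of the human baseline described above: the second annotator's answer
# plays the role of the "prediction"; the other annotations are references.
def human_scores(answers_per_question):
    em_total, f1_total = 0.0, 0.0
    for answers in answers_per_question:
        prediction = answers[1]                 # second annotator's answer
        references = answers[:1] + answers[2:]  # everyone else's answers
        em_total += max(exact_match(prediction, ref) for ref in references)
        f1_total += max(f1_score(prediction, ref) for ref in references)
    n = len(answers_per_question)
    return 100.0 * em_total / n, 100.0 * f1_total / n
```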
Since the competition started in 2016, Microsoft Research Asia has submitted a model in almost every round. At the end of 2017 our score of 82.136 was already very close to the human level, only about 0.17 points away. This time our model's EM reached 82.650, surpassing the human score by roughly 0.3 points on exact answers. Put simply, those 0.3 points mean that on this question set our system answers roughly 30 more questions exactly right than a person working through the same questions.
This by no means implies that computers have surpassed human-level reading comprehension, because the result holds only under certain preconditions: a fixed question bank and fixed test conditions, and a comparison against only the average comprehension level of adults.
"Surpassing humans" should not be turned into a media headline. While we are seeing technological progress, we should think calmly about how to keep improving the models and putting the technology into practice. This is an ecosystem: it needs every player to compete healthily and to overcome the difficulties of the current stage, rather than lingering in the initial excitement of winning the competition.
What are the core technical problems that are hardest for NLP to break through?
At present, the top-ranked systems on the SQuAD leaderboard all use end-to-end deep neural networks, which generally contain the following components (a simplified code skeleton follows the list):
Embedding Layer: typically uses word vectors pre-trained on large external corpora (such as GloVe), together with character-to-word representations built with recurrent or convolutional neural networks, to obtain a context-independent representation of every word in the question and the passage. Some models also extract additional features and feed them into the network alongside the word vectors. This corresponds to a human reader's knowledge of vocabulary.
Encoding Layer: a multi-layer recurrent neural network is generally used to obtain a context-dependent representation of every word in the question and the passage. This corresponds to reading the question and the passage.
Matching Layer: establishes the correspondence (or matching) between words in the question and words in the passage, essentially implemented with an attention mechanism; common variants are based on Match-LSTM and co-attention. It produces a question-aware representation of every word in the passage, which corresponds to reading the passage with the question in mind.
Self-Matching Layer: starting from the question-aware word representations, a self-attention mechanism further refines the representation of every word in the passage. This corresponds to reading the passage again; as the saying goes, read a text a hundred times and its meaning reveals itself.
Answer Pointer Layer: predicts, for each word in the passage, the probability that it is the start or the end of the answer, and outputs the passage substring with the highest answer probability. This is generally implemented with Pointer Networks, and corresponds to a person locating the answer in the passage using all the clues and knowledge gathered above.
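To make the layered structure above concrete, here is a deliberately simplified PyTorch skeleton that follows the five components just described. It is only an illustrative sketch under our own assumptions: the attention formulations, layer sizes, and hyper-parameters are placeholders, not the published configuration of R-NET or any other leaderboard system.

```python
# Simplified skeleton of an extractive reading comprehension model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReaderSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        # Embedding layer: in practice initialized from pre-trained vectors
        # such as GloVe, often combined with character-level representations.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoding layer: context-dependent representations of each word.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Matching layer: question-aware passage representation via attention.
        self.match_attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                                batch_first=True)
        # Self-matching layer: the passage attends over itself.
        self.self_attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                               batch_first=True)
        # Answer pointer layer: start/end logits for every passage position.
        self.start_out = nn.Linear(2 * hidden, 1)
        self.end_out = nn.Linear(2 * hidden, 1)

    def forward(self, question_ids, passage_ids):
        q, _ = self.encoder(self.embed(question_ids))
        p, _ = self.encoder(self.embed(passage_ids))
        matched, _ = self.match_attn(p, q, q)       # passage attends to question
        refined, _ = self.self_attn(matched, matched, matched)
        start_logits = self.start_out(refined).squeeze(-1)
        end_logits = self.end_out(refined).squeeze(-1)
        return start_logits, end_logits

def predict_span(start_logits, end_logits):
    """Return the (start, end) indices of the most probable answer span."""
    start_p = F.softmax(start_logits, dim=-1)           # (batch, L)
    end_p = F.softmax(end_logits, dim=-1)               # (batch, L)
    scores = start_p.unsqueeze(2) * end_p.unsqueeze(1)  # (batch, L, L)
    scores = torch.triu(scores)                         # enforce start <= end
    flat = scores.flatten(1).argmax(dim=1)
    length = start_p.size(1)
    return flat // length, flat % length
```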
In fact, the systems currently at the top of the SQuAD leaderboard are broadly similar in their models and algorithms.
This is the result of more than a year of joint effort, mutual learning, and improvement across the entire reading comprehension research community (colleagues from different universities, companies, and research institutions) since the SQuAD competition began.
The best models available today generally combine the following algorithms and components, including the early baseline models.
For example: Match-LSTM (Singapore Management University) and BiDAF (Allen Institute for Artificial Intelligence); innovations in the attention mechanism, such as Salesforce's co-attention mechanism and the gated-attention mechanism in R-NET; the self-matching (or self-attention) mechanism in R-NET; and pre-trained contextualized vectors that boost model performance, including sentence encoders obtained from neural machine translation training (Salesforce) and a bidirectional language model trained on large-scale external text data (Allen Institute for Artificial Intelligence).
Of course, there have also been improvements and innovations in network architecture design and parameter-tuning methods. It is fair to say that the current result is the fruit of the whole reading comprehension community's continuous effort and cooperation over the past year.
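As one concrete illustration, here is a minimal sketch of the gated-attention idea mentioned above: each passage word attends over the question, and a sigmoid gate controls how much of the combined representation is passed on. In the full R-NET formulation this gated vector feeds a recurrent network; the dimensions here are illustrative assumptions.

```python
# Minimal sketch of gated attention between passage and question.
import torch
import torch.nn as nn

class GatedAttentionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2 * dim)

    def forward(self, passage, question):
        # passage: (batch, Lp, dim); question: (batch, Lq, dim)
        context, _ = self.attn(passage, question, question)
        combined = torch.cat([passage, context], dim=-1)  # (batch, Lp, 2*dim)
        g = torch.sigmoid(self.gate(combined))            # element-wise gate
        return g * combined
```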
Is reading comprehension in Chinese more difficult than in English?
Judging from research so far, I have not seen a paper showing that Chinese reading comprehension must be harder than English. I feel each language has its own difficulties: Chinese idioms and English proverbs are both hard, for example, and the two languages also differ in how references work. One has to analyze the specific scenario and keep adjusting the model accordingly.
There is now a Chinese machine reading comprehension competition in China, jointly sponsored by the Chinese Information Processing Society of China (CIPS) and the China Computer Federation (CCF), and organized by Baidu, the CIPS Evaluation Working Committee, and the CCF Chinese Information Technology Committee. Registration officially opens on March 1, 2018. The winning teams will share a total prize of 100,000 RMB, and technical exchanges and the award ceremony will be held at the 3rd Language and Intelligence Summit.
This is a very good development. The competition dataset contains 300,000 real questions from Baidu search; each question comes with five candidate documents as well as high-quality human-written answers. The task is usually defined as having the machine read the text and then answer questions about its content. Reading comprehension involves complex techniques such as language understanding, knowledge reasoning, and abstractive generation, and is extremely challenging.
Research on these tasks is of great significance for artificial intelligence applications such as intelligent search, recommendation, and interaction, and is an important frontier topic in natural language processing and artificial intelligence.
Applications of machine reading comprehension technology
Machine reading comprehension technology has a wide
range of application scenarios.
In search engines, machine reading comprehension can provide smarter answers to user searches (especially question-style queries). R-NET technology has already been applied successfully in Microsoft's Bing search engine, where we provide accurate answers directly to users by reading and understanding documents across the entire web.
It also has direct applications in personal assistants on mobile devices, such as Microsoft's Cortana.
Machine reading comprehension is also widely used commercially. In intelligent customer service, for example, the machine can read text documents (such as user manuals and product descriptions) to answer user questions automatically or to assist human agents.
In office scenarios, the technology likewise has promising prospects: for example, it can process personal emails or documents so that relevant information can be retrieved with natural language queries.
Machine reading comprehension also has very broad prospects in vertical domains: in education, to assist with answering exercise questions; in law, to understand statutes and help lawyers or judges decide cases; and in finance, to extract finance-related information from unstructured text such as news.
We believe reading comprehension is one of the most critical abilities in human intelligence. Machine reading comprehension can become a general-purpose capability offered to third parties for building more applications.
Machine reading comprehension technology in 2019 and beyond
Technically, there is still plenty of room for improvement in deep-learning-based algorithms and models. Can we model complex reasoning effectively, and make effective use of common sense and external knowledge (such as knowledge bases)? These are very interesting research topics.
In addition, current deep-learning-based reading comprehension models are black boxes; it is hard to visualize or explain the process by which the machine arrives at its answers, so interpretable deep learning models will also be an interesting research direction.
In the current SQuAD task definition, the answer is a contiguous span of the original text, whereas in practice a reader may need to perform more complex reasoning and compose new wording to express the answer after reading the article. The MARCO dataset released by Microsoft is a step in this direction.
In addition, the current SQuAD dataset assumes that the answer to every question can be found in the corresponding document paragraph. This constraint is reasonable and useful for the competition and for research, but it means existing models will output the most likely document fragment even when they are not sure an answer exists at all.
In practical applications, neither this assumption nor that behavior is reasonable. Humans have a very important ability in reading comprehension: recognizing when the text they have read contains no answer, and refusing to answer.
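As a simple illustration of that refusal behavior (our own assumption, not the team's published method), a system can compare the best span's probability against a tuned no-answer threshold and abstain when it is not confident the passage contains an answer:

```python
# Hedged illustration: abstain whenever the best extracted span is not
# confident enough. The threshold would normally be tuned on held-out data.
def answer_or_refuse(best_span_text, best_span_prob, threshold=0.5):
    """Return the extracted span only if its probability clears the
    no-answer threshold; otherwise refuse to answer (return None)."""
    return best_span_text if best_span_prob >= threshold else None
```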
This is a very important topic, both for research and for practical applications. We are already working in this area and have made some encouraging progress.