The F1 score in the NLP span-based Question Answering task
In the context of span-based Question Answering, we are going to look at what the F1 score means.
Let us first give a few definitions.
Span-based QA is a task where you are given two texts: one called the context and another called the question. The goal is to extract the answer to the question from the context, if it exists.
For instance:
Context:
“Today is going to be a rainy day and tomorrow it will be snowy.”
Question:
“What is the weather like today?”
Gold answer:
“rainy”
Extracted answer (by our QA algorithm):
“rainy day”
The formal definition of the F1 score (the harmonic mean of precision and recall) is the following:
F1 = 2 * precision * recall / (precision + recall)
And, if we further break down that formula:
precision = tp / (tp + fp)
recall = tp / (tp + fn)
where tp stands for true positive, fp for false positive and fn for false negative.
The definition of an F1 score is not trivial in the case of Natural Language Processing (NLP): it is not obvious what a true positive, a false positive or a false negative should be when the prediction is a span of text.
Below, I make the following hypotheses about what these terms mean in the context of our NLP span-based QA task:
tp: number of tokens* that are shared between the correct answer and the prediction.
fp: number of tokens that are in the prediction but not in the correct answer.
fn: number of tokens that are in the correct answer but not in the prediction.
*a token is a unit of text (typically a word or sub-word) that is used as input to our NLP models.
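If we apply these hypotheses to the example above, the gold answer “rainy” and the prediction “rainy day” share one token (“rainy”), so tp = 1; the prediction contains one extra token (“day”), so fp = 1; and no gold token is missing from the prediction, so fn = 0. This gives precision = 1/2, recall = 1, and F1 = 2 * 0.5 * 1 / (0.5 + 1) = 2/3 ≈ 0.67.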
Because I could not verify these hypotheses on the internet, I changed my approach: instead of looking for a definition, I went to look at the code actually used to evaluate span-based QA.
After looking for a while, I finally found what I wanted on the SQuAD* website, where they let us download the script they use for evaluation.
*“Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.”
I extracted below, from that script, the function that is of interest to us: compute_f1.
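I reproduce it here, lightly reformatted, together with the two small helpers it relies on in the same script (get_tokens and normalize_answer) so that the snippet is self-contained; the comments are mine.

import collections
import re
import string

def normalize_answer(s):
    """Lower the text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
    # Normalize the answer string, then split it into tokens on whitespace
    if not s:
        return []
    return normalize_answer(s).split()

def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    # num_same = number of tokens shared between the gold answer and the prediction
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1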
We see that:
precision = 1.0 * num_same / len(pred_toks) = tp / (tp + fp)
recall = 1.0 * num_same / len(gold_toks) = tp / (tp + fn)
My hypotheses are indeed in line with the above, and we have:
tp = num_same = the number of tokens that are shared between the correct answer and the prediction.
fp = len(pred_toks) - num_same = the number of tokens that are in the prediction but not in the correct answer.
fn = len(gold_toks) - num_same = the number of tokens that are in the correct answer but not in the prediction.
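As a sanity check, calling the function on the example from the beginning of the post gives the same value as the hand computation above:

compute_f1("rainy", "rainy day")
# returns 0.6666666666666666, i.e. precision = 1/2, recall = 1, F1 = 2/3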
That’s all! I hope this cleared up any misunderstanding you had about F1 scoring in the context of span-based QA :)