# METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a widely used metric in natural language processing. Originally developed for machine translation, it measures the quality of candidate texts against reference texts and is now applied to many other text generation workflows. Though it is an n-gram based metric, it goes beyond simple overlap counting by combining precision, recall, and word order into a single measure of text quality. For an in-depth justification of METEOR's design choices, see the original paper.

## Implementation Details

We define METEOR as the product of two components: the harmonic mean of unigram precision and recall (FMean), and a word order penalty term. That is,

$\text{METEOR} = \underbrace{\text{FMean}}_{\text{Harmonic Mean of Unigram Precision/Recall}} \times \underbrace{(1 - \text{Penalty})}_{\text{Word Order Penalty}}$

To understand the formula, let's break each component down into its parts.

FMean: Harmonic Mean of the Unigram Precision / Recall

This is defined as

$\text{FMean} = \frac{10PR}{R + 9P}$

where P is the unigram precision (the number of matched unigrams divided by the number of unigrams in the candidate) and R is the unigram recall (the number of matched unigrams divided by the number of unigrams in the reference). Notice that most of the weight is placed on the recall component by design; this allows METEOR to prioritize the coverage of essential keywords in the candidate text.
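The FMean computation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` helper, the function names, and the choice of exact, case-insensitive matching are all assumptions made for the example, and the sentence pair is only there to show how the recall weighting behaves.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_precision_recall(candidate, reference):
    """Return (precision, recall) using exact, clipped unigram matches."""
    cand_tokens = tokenize(candidate)
    ref_tokens = tokenize(reference)
    # Counter intersection keeps the minimum count per word, so a word
    # repeated in the candidate cannot match more often than it appears
    # in the reference.
    matched = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = matched / len(cand_tokens) if cand_tokens else 0.0
    recall = matched / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


def fmean(precision, recall):
    """Recall-weighted harmonic mean: 10PR / (R + 9P)."""
    denominator = recall + 9 * precision
    return 10 * precision * recall / denominator if denominator else 0.0


p, r = unigram_precision_recall(
    "the president spoke to the audience",
    "the president then spoke to the audience",
)
print(p, r, fmean(p, r))
```

Here the precision is 6/6 and the recall is 6/7, and the resulting FMean (60/69, about 0.87) sits much closer to the recall than to the precision, which is the effect of the 9:1 weighting.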

Penalty: Word Order Penalty

Because FMean is computed from unigrams alone, it ignores word order and longer matching sequences. METEOR addresses this weakness with a penalty factor that rewards candidates whose matched words appear in the same order as in the reference.

First, the unigrams in the candidate text that are mapped to unigrams in the reference text are grouped into the fewest possible chunks, where each chunk consists of unigrams that are adjacent in both the candidate and the reference. Our penalty factor is then defined as:

$\text{Penalty} = 0.5 \times \frac{\text{number of chunks}}{\text{number of unigrams matched}}$

For example, if our candidate sentence was "the president spoke to the audience" and our reference sentence was "the president then spoke to the audience", there would be two chunks ("the president" and "spoke to the audience") and 6 unigrams matched, giving a penalty of $$0.5 \times \frac{2}{6} \approx 0.167$$. Notice that as the number of chunks decreases, so does the penalty, which results in a higher METEOR score. This is intuitive: fewer chunks mean the matched words appear in longer runs that follow the order of the reference text.
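The chunk count and penalty for this example can be checked with a short sketch. It assumes the alignment is already known and is passed in as hypothetical `(candidate_position, reference_position)` pairs written out by hand; finding the alignment with the fewest chunks is a separate problem that this snippet does not solve. The penalty follows the formula exactly as written above, using the chunk ratio directly.

```python
def count_chunks(alignment):
    """Count runs of matches that are adjacent in both candidate and reference.

    `alignment` is a list of (candidate_position, reference_position) pairs,
    one per matched unigram (an assumed input format for this sketch).
    """
    chunks = 0
    previous = None
    for cand_pos, ref_pos in sorted(alignment):
        contiguous = (
            previous is not None
            and cand_pos == previous[0] + 1
            and ref_pos == previous[1] + 1
        )
        if not contiguous:
            chunks += 1  # a gap in either sentence starts a new chunk
        previous = (cand_pos, ref_pos)
    return chunks


def word_order_penalty(alignment):
    """Penalty = 0.5 * (number of chunks / number of unigrams matched)."""
    if not alignment:
        return 0.0
    return 0.5 * count_chunks(alignment) / len(alignment)


# Candidate: "the president spoke to the audience"
# Reference: "the president then spoke to the audience"
alignment = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (5, 6)]
print(count_chunks(alignment))                  # 2
print(round(word_order_penalty(alignment), 3))  # 0.167
```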

## Examples

| Candidate | Reference |
| --- | --- |
| Under the starry night, we danced with glee. | We danced with joy under the starry night. |

Step 1: Calculate FMean

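Seven of the eight candidate words appear in the reference (all but "glee"), and seven of the eight reference words appear in the candidate (all but "joy"). Our precision is therefore $$\frac{7}{8} = 0.875$$ and our recall is also $$\frac{7}{8} = 0.875$$. As a result, our FMean is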

$\text{FMean} = \frac{10 \times 0.875 \times 0.875}{0.875 + 9 \times 0.875} = 0.875$

Step 2: Calculate Word Order Penalty

We can break up our candidate sentence into two chunks to map it to our reference sentence.

Candidate: $$\underbrace{\text{Under the starry night}}_{\text{Chunk 2}} \space \underbrace{\text{we danced with}}_{\text{Chunk 1}} \space\text{glee}$$

Reference: $$\underbrace{\text{We danced with}}_{\text{Chunk 1}} \space\text{joy}\space \underbrace{\text{under the starry night}}_{\text{Chunk 2}}$$

Between the two chunks, we have matched 7 unigrams. This gives us a penalty score of $$0.5 \times \frac{2}{7} \approx 0.143$$.

Step 3: Calculate METEOR

With our Penalty and FMean calculated, we can proceed with calculating the METEOR score.

$\text{METEOR} = 0.875 \times (1 - 0.143) \approx 0.750.$

Not bad! We have a pretty high score for two sentences that are semantically very similar but have different orders.

Let's try the same reference sentence with a slightly different candidate.

| Candidate | Reference |
| --- | --- |
| Danced we with under joy the night starry. | We danced with joy under the starry night. |

Step 1: Calculate FMean

Our first step is trivial. Since both sentences contain the same words, our unigram precision and recall are both 1.0. As a result, our FMean is $$\frac{10 \times 1.0 \times 1.0}{1.0 + 9 \times 1.0} = 1$$

Step 2: Calculate Word Order Penalty

Our penalty is different from the first example because of the jumbled word order. We split our candidate sentence into 8 chunks, since no two adjacent candidate words appear next to each other, in the same order, in the reference sentence.

Candidate: $$\underbrace{\text{Danced}}_\text{Chunk 2}\space\underbrace{\text{we}}_\text{Chunk 1}\space\underbrace{\text{with}}_\text{Chunk 3}\space\underbrace{\text{under}}_\text{Chunk 5}\space\underbrace{\text{joy}}_\text{Chunk 4}\space\underbrace{\text{the}}_\text{Chunk 6}\space\underbrace{\text{night}}_\text{Chunk 8}\space\underbrace{\text{starry}}_\text{Chunk 7}\space$$

Reference: $$\underbrace{\text{We}}_\text{Chunk 1}\space\underbrace{\text{danced}}_\text{Chunk 2}\space\underbrace{\text{with}}_\text{Chunk 3}\space\underbrace{\text{joy}}_\text{Chunk 4}\space\underbrace{\text{under}}_\text{Chunk 5}\space\underbrace{\text{the}}_\text{Chunk 6}\space\underbrace{\text{starry}}_\text{Chunk 7}\space\underbrace{\text{night}}_\text{Chunk 8}\space$$

Between the eight chunks, we have matched 8 unigrams. This gives us a penalty score of $$0.5 \times \frac{8}{8} = 0.5$$.

Step 3: Calculate METEOR

With our Penalty and FMean calculated, we can proceed with calculating the METEOR score.

$\text{METEOR} = 1 \times (1 - 0.5) = 0.5.$

Despite containing every keyword of the reference sentence, our candidate had the wrong order and meaning, and METEOR halved its score accordingly. This is a clear advantage over a purely unigram-based metric such as ROUGE-1, which does not consider word order and would have given this candidate a perfect score of 1.0.
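Both worked examples can be reproduced with a small end-to-end sketch. It is only an approximation under stated assumptions: the `tokenize`, `align`, and `meteor` names are hypothetical, matching is exact and case-insensitive, and the alignment is a greedy heuristic that tries to extend the current chunk before falling back to the first unused occurrence of the word. That heuristic is enough for these sentences but is not guaranteed to find the fewest chunks when words repeat.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def align(cand_tokens, ref_tokens):
    """Greedy exact-match alignment as (candidate_pos, reference_pos) pairs."""
    used = set()
    pairs = []
    for cand_pos, token in enumerate(cand_tokens):
        ref_pos = None
        # Prefer the reference word right after the previous match, but only
        # if the previous match was the immediately preceding candidate word.
        if pairs and pairs[-1][0] == cand_pos - 1:
            follow = pairs[-1][1] + 1
            if (
                follow < len(ref_tokens)
                and follow not in used
                and ref_tokens[follow] == token
            ):
                ref_pos = follow
        # Otherwise fall back to the first unused occurrence of the word.
        if ref_pos is None:
            ref_pos = next(
                (i for i, ref in enumerate(ref_tokens)
                 if ref == token and i not in used),
                None,
            )
        if ref_pos is not None:
            used.add(ref_pos)
            pairs.append((cand_pos, ref_pos))
    return pairs


def meteor(candidate, reference):
    """METEOR = FMean * (1 - Penalty), following the formulas on this page."""
    cand_tokens, ref_tokens = tokenize(candidate), tokenize(reference)
    pairs = align(cand_tokens, ref_tokens)
    matched = len(pairs)
    if matched == 0:
        return 0.0
    precision = matched / len(cand_tokens)
    recall = matched / len(ref_tokens)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A new chunk starts whenever a match does not directly continue the
    # previous one in both the candidate and the reference.
    chunks = sum(
        1
        for i, (c, r) in enumerate(pairs)
        if i == 0 or (c, r) != (pairs[i - 1][0] + 1, pairs[i - 1][1] + 1)
    )
    penalty = 0.5 * chunks / matched
    return f_mean * (1 - penalty)


reference = "We danced with joy under the starry night."
first = meteor("Under the starry night, we danced with glee.", reference)
second = meteor("Danced we with under joy the night starry.", reference)
print(round(first, 3), round(second, 3))
```

The two printed values correspond to the scores derived by hand above, 0.75 for the reordered sentence and 0.5 for the jumbled one.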