
# Consistency Score

The consistency score is the numeric output of a simple sampling-based technique that assumes a Large Language Model (LLM) has inherent knowledge of facts: if prompted multiple times, the LLM should be able to produce similar, consistent responses. The technique measures the factual consistency of an LLM by generating n answers to the same prompt and then comparing the first answer with each subsequent answer for consistency. The more consistent the answers are, the less likely the model is to be hallucinating.

The score is computed by prompting a judging LLM to assess the consistency of each pair of responses. It ranges from 0 to 1, with 1 indicating that the model consistently returned facts in every sampled response. As the score approaches 0, the model is increasingly uncertain about its response and more likely to be hallucinating.

## Implementation Details

To check an LLM's responses for consistency, the LLM is prompted n times with the same prompt. The temperature parameter is set to a slightly higher value to allow some variance in the responses; a minimal sampling sketch follows the notes below.

**Temperature**

The temperature parameter controls the "creativity" or randomness of the text generated by a GPT model. A higher temperature (e.g., 0.7) results in more diverse and creative output, while a lower temperature (e.g., 0.2) makes the output more deterministic and focused.

**Rule of Thumb**

Based on our experience, we recommend setting n to 5 and temperature to 1.0 for a question-answering workflow. However, it is important to try different values to find what works best for your specific use case.
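
As a concrete reference, here is a minimal sketch of the sampling step using the OpenAI Python library; the question string is a made-up placeholder, and the `n` parameter requests all samples in a single API call:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

QUESTION = "Did the duck cross the road?"  # hypothetical user prompt

# Sample n responses to the same prompt. temperature=1.0 allows enough
# variance between samples to surface inconsistencies.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": QUESTION}],
    temperature=1.0,
    n=5,
)
responses = [choice.message.content for choice in completion.choices]
```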

Given the n responses, a strong LLM serving as a judge, like OpenAI's gpt-3.5-turbo or gpt-4, is prompted once for each pair of responses (the first response paired with each subsequent one). The judge is asked whether the two responses contradict or support each other. The consistency score is then computed by dividing the number of consistent pairs by the total number of response pairs.

\[ \text{consistency score} = \frac{\text{number of consistent pairs}}{\text{number of total pairs}} \]
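
Expressed in code, the score is a simple ratio; the following sketch assumes one boolean verdict per pair from the judging model (the function name is ours):

```python
def consistency_score(judgements: list[bool]) -> float:
    """Fraction of response pairs the judging LLM found consistent."""
    if not judgements:
        raise ValueError("at least one pair of responses is required")
    return sum(judgements) / len(judgements)
```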

## Example

Let's compute the consistency score using OpenAI's gpt-3.5-turbo model as the judging model, given the following three sampled responses:

| # | Response |
|---|----------|
| 1 | The duck crossed the road. |
| 2 | The duck did not cross the road. |
| 3 | The animal crossed the road. |

To make API requests to OpenAI GPT models, you need to install the OpenAI Python library and set up your API key. If you don't have a secret key yet, you can create one on OpenAI's API key page.

```shell
pip install openai
export OPENAI_API_KEY="your-api-key-here"
```

After setting up an API key, prompt the judging model of your choice for each pair of responses with the following prompt:

```
Given a pair of texts, where the first one is context, the second one is a
sentence, answer "yes" if the sentence is supported by the context.
Otherwise, answer "no".
```
**How to Prompt gpt-3.5-turbo**

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

PROMPT = """\
Given a pair of texts, where the first one is context, \
the second one is a sentence, answer "yes" if the sentence is \
supported by the context. Otherwise, answer "no".
"""


def judge_pair(answer_1: str, answer_2: str) -> str:
    """Ask the judging model whether answer_2 is supported by answer_1."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": PROMPT},
            {
                "role": "assistant",
                "content": "Certainly! Please provide me with the texts for evaluation.",
            },
            {
                "role": "user",
                "content": f"Context: {answer_1}\n\nSentence: {answer_2}",
            },
        ],
        temperature=0.5,
        max_tokens=50,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return str(response.choices[0].message.content)
```
Prompting the judging model for each pair yields the following verdicts:

| Pair # | Response 1 | Response 2 | Judging Model's Response |
|---|---|---|---|
| 1 | The duck crossed the road. | The duck did not cross the road. | no |
| 2 | The duck crossed the road. | The animal crossed the road. | yes |

Using the formula above, the consistency score can be computed:

\[ \begin{aligned} \text{consistency score} &= \frac{\text{number of consistent pairs}}{\text{number of total pairs}} \\ &= \frac{1}{2} \\ &= 0.5 \end{aligned} \]

The consistency score for these responses is 0.5.
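
Putting the pieces together, here is a minimal end-to-end sketch that reuses the `judge_pair` and `consistency_score` helpers defined above (both names are ours, not part of the OpenAI API):

```python
def score_responses(responses: list[str]) -> float:
    """Judge the first response against each subsequent one and average."""
    judgements = [
        judge_pair(responses[0], other).strip().lower().startswith("yes")
        for other in responses[1:]
    ]
    return consistency_score(judgements)


sampled = [
    "The duck crossed the road.",
    "The duck did not cross the road.",
    "The animal crossed the road.",
]
print(score_responses(sampled))  # 0.5
```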

## Limitations and Biases

The consistency score is a powerful black-box evaluation metric with the advantage of measuring factual consistency without requiring labeled data. However, it cannot catch responses that are consistently incorrect: if the model repeats the same wrong fact in every sample, the pairs agree and the score is high. In addition, there are some disadvantages to using an LLM to evaluate hallucinations.

  1. Cost - Running a large judging model entails significant expense. The cost of an API model such as GPT-4 is determined by the number of tokens used. If you employ your own model as the judge, you pay for hardware, computation, and maintenance instead of tokens. Monetary expense is not the sole consideration: you should also account for the judging model's inference time.

  2. Privacy and Security - To achieve desirable results, you need access to a sufficiently performant LLM. However, sending data to external models such as GPT-4 can become a privacy and security concern when datasets are meant to be kept private.