Human feedback evaluation for RAG pipeline. (2024)

SuperKnowa framework for collecting human feedback for Retrieval-Augmented Generation (RAG) pipeline applications.

Himadri Talukder

Generative AI is a branch of artificial intelligence that focuses on generating new content, such as text, images, music, or other forms of data. However, ensuring the accuracy and reliability of these models remains a critical challenge. Collecting input from users can play a crucial role in assessing a model’s responses and improving its performance in a Retrieval-Augmented Generation (RAG) pipeline. In this blog, we explore how to collect user feedback on the responses of different LLMs and compare the effectiveness of ratings, rankings, comments, and Q&A questionnaires.

To implement the feedback mechanism, we focused on fine-grained human feedback based on the following criteria:

Output that is irrelevant, repetitive, or inconsistent.
Generated output containing inaccurate or unverifiable information.
Generated response with missing or partial information.

To compare the effectiveness of these feedback mechanisms, we ran an experiment with three text generation models (flan-xxl, sandstone-coga-3.0, and coga-3.8) on the IBM watsonx.ai platform. We asked users to provide feedback using ratings, rankings, comments, and Q&A questionnaires.

Below, you’ll find explanations and intentions behind the various types of feedback we used.

1. Rating: Rating feedback involves assigning a numerical value to a model’s response based on its perceived quality or relevance. For instance, a user might rate a generated response on a scale of 1 to 3 to indicate their satisfaction with the answer. Ratings are quick and quantifiable, which makes them suitable for assessing overall quality.

In our analysis, we used a 3-star rating scale:

Incorrect/hallucination
Partial answer
Acceptable as is
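
As a rough sketch of how such ratings could be summarized per model: the mapping of star values 1 to 3 onto the three labels is an assumption based on the order of the list above, and the `summarize_ratings` helper is hypothetical. The entries use the `star` and `model_id` fields from the feedback record shown later in this post.

```python
from collections import defaultdict
from statistics import mean

# Assumption: stars 1..3 map to the labels in the order they are listed above.
STAR_LABELS = {
    1: "Incorrect/hallucination",
    2: "Partial answer",
    3: "Acceptable as is",
}


def summarize_ratings(ratings):
    """Return {model_id: {"mean_star": ..., "label_counts": {...}}}."""
    stars_by_model = defaultdict(list)
    for entry in ratings:
        star = entry["star"]
        if star not in STAR_LABELS:
            raise ValueError(f"unexpected star value: {star!r}")
        stars_by_model[entry["model_id"]].append(star)

    summary = {}
    for model_id, stars in stars_by_model.items():
        counts = {label: 0 for label in STAR_LABELS.values()}
        for star in stars:
            counts[STAR_LABELS[star]] += 1
        summary[model_id] = {"mean_star": mean(stars), "label_counts": counts}
    return summary


# Rating entries in the shape used by the feedback record.
ratings = [
    {"star": 3, "model_id": "flan-xxl"},
    {"star": 2, "model_id": "sandstone-coga-3.0"},
    {"star": 2, "model_id": "coga-3.8"},
]
rating_summary = summarize_ratings(ratings)
```

With more than one rating per model, the label counts show whether a low average comes from hallucinations or from partial answers, which a bare mean would hide.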

2. Ranking: Ranking feedback requires users to compare and rank different model responses according to their preference. This feedback mechanism provides a finer level of discrimination between responses and allows the model to learn the relative quality of outputs.

We use a drag-and-drop mechanism to rank the models based on their responses.
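
One simple way to combine rankings from several feedback submissions is to average each model's rank. The sketch below assumes that rank 0 is the most preferred position (as the feedback record later in this post suggests). The `mean_ranks` helper and the second ranking in the sample data are hypothetical.

```python
from collections import defaultdict


def mean_ranks(rankings):
    """Average each model's rank over several rankings.

    `rankings` is a list of rankings, each a list of
    {"rank": int, "model_id": str} items. Returns (model_id, mean_rank)
    tuples sorted best first, assuming a lower rank is better.
    """
    positions = defaultdict(list)
    for ranking in rankings:
        for item in ranking:
            positions[item["model_id"]].append(item["rank"])
    averaged = {m: sum(r) / len(r) for m, r in positions.items()}
    return sorted(averaged.items(), key=lambda pair: pair[1])


rankings = [
    # Ranking taken from the feedback record.
    [
        {"rank": 0, "model_id": "flan-xxl"},
        {"rank": 1, "model_id": "sandstone-coga-3.0"},
        {"rank": 2, "model_id": "coga-3.8"},
    ],
    # Hypothetical second submission, for illustration only.
    [
        {"rank": 1, "model_id": "flan-xxl"},
        {"rank": 2, "model_id": "sandstone-coga-3.0"},
        {"rank": 0, "model_id": "coga-3.8"},
    ],
]
leaderboard = mean_ranks(rankings)
```

Mean rank is only one possible aggregation. It treats the distance between adjacent positions as equal, which may not reflect how strongly users actually prefer one response over another.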

3. Q&A Questionnaires: Yes/no questionnaires present users with specific questions about the response. They provide targeted feedback on specific aspects, such as relevance to the topic, and can help gather focused feedback on the accuracy of the response.

We asked three questions for every response, ensuring that the questions align logically with the user-provided answers.

Is the response relevant and coherent?
Was this a useful response with an appropriate amount of information?
Is the response factual and accurate, based on the document?
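
The yes/no answers lend themselves to a per-question "yes rate" for each model. The following is a minimal sketch, assuming the answers are stored under the keys `relevent`, `appropriate`, and `factual_accurate` (spelled as in the feedback record later in this post). The `yes_rates` helper and the last sample entry are hypothetical.

```python
from collections import defaultdict

# Keys as they appear in the feedback record, including the "relevent" spelling.
QUESTION_KEYS = ("relevent", "appropriate", "factual_accurate")


def yes_rates(entries):
    """Return {model_id: {question_key: fraction of "yes" answers}}."""
    # For every model and question, keep a [yes_count, total_count] tally.
    tallies = defaultdict(lambda: {key: [0, 0] for key in QUESTION_KEYS})
    for entry in entries:
        for key in QUESTION_KEYS:
            answer = entry["feedback"].get(key)
            if answer is None:
                continue  # question left unanswered
            tally = tallies[entry["model_id"]][key]
            tally[1] += 1
            if answer.strip().lower() == "yes":
                tally[0] += 1
    return {
        model_id: {
            key: (yes / total if total else None)
            for key, (yes, total) in per_question.items()
        }
        for model_id, per_question in tallies.items()
    }


entries = [
    {"model_id": "flan-xxl",
     "feedback": {"relevent": "yes", "appropriate": "yes", "factual_accurate": "yes"}},
    {"model_id": "sandstone-coga-3.0",
     "feedback": {"relevent": "yes", "appropriate": "no", "factual_accurate": "yes"}},
    # Hypothetical extra entry, for illustration only.
    {"model_id": "flan-xxl",
     "feedback": {"relevent": "yes", "appropriate": "no", "factual_accurate": "yes"}},
]
rates = yes_rates(entries)
```

Keeping the three questions separate, rather than collapsing them into one score, preserves the distinction between a response that is relevant but incomplete and one that is inaccurate.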

4. Comments: Comment-based feedback lets users provide textual explanations for their preferences or suggestions for improvement. These comments offer valuable insights into the strengths and weaknesses of model responses.

We collect comments for each response as additional feedback.

Trade-offs of these feedback types:

Rating: Quick and easy to collect, but might lack detailed feedback.
Ranking: Provides fine-grained discrimination, but can be more complex for participants.
Comments: Offers qualitative insights, but requires more effort to analyze.
QA Questionnaires: Provides focused feedback on specific aspects, but may not capture the depth of qualitative insights.

While each of these types has strengths and limitations, a combination of these feedback mechanisms can provide a comprehensive understanding of response quality and user preferences. Together, ratings, rankings, comments, and yes/no questionnaires give a richer signal for improving model responses. This hybrid approach leverages the strengths of each mechanism and offsets their respective limitations.

Here is the combined feedback we collected for one question, which we can use for deeper analysis.

{
 "additional_feedback": [
 {
 "model_id": "flan-xxl",
 "feedbackBy": "htalukder@ibm.com",
 "answer": "You can perform the following tasks if you have the business role purchaser: Search and view sources of supply based on different criteria, such as material, supplier, material group, plant, purchasing organization, purchasing document category, quota arrangement, and validity period. View the relevance of different sources of supply Create or edit sources of supplyNote When you click create and choose a source of supply, you will be navigated to the respective app to provide the details for the creation of the purchasing document. Check the default source assignment of a purchase requisition for a specific material, plant, and date combination Choose Simulate SoS Assignment Enter the relevant material, plant, and date Choose Go The system will display the default sources of supply. If multiple sources of supply exist and if the system cannot determine a single default source of supply, then all sources of supply will be shown.",
 "feedback": {
 "relevent": "yes",
 "appropriate": "yes",
 "factual_accurate": "yes"
 }
 },
 {
 "additional_feedback": [
 {
 "model_id": "sandstone-coga-3.0",
 "feedbackBy": "htalukder@ibm.com",
 "answer": "With this app you can view and manage existing sources of supply for a material for a defined validity period.",
 "feedback": {
 "relevent": "yes",
 "appropriate": "no",
 "factual_accurate": "yes",
 "comment": "Answer was correct but not enough information"
 }
 }
 ]
 },
 {
 "additional_feedback": [
 {
 "model_id": "coga-3.8",
 "feedbackBy": "htalukder@ibm.com",
 "answer": "With this app you can view and manage existing sources of supply for a material for a defined validity period.",
 "feedback": {
 "relevent": "yes",
 "appropriate": "no",
 "factual_accurate": "yes",
 "comment": "partially correct."
 }
 }
 ]
 }
 ],
 "answers": [
 {
 "model_id": "flan-xxl",
 "answer": "You can perform the following tasks if you have the business role purchaser: Search and view sources of supply based on different criteria, such as material, supplier, material group, plant, purchasing organization, purchasing document category, quota arrangement, and validity period. View the relevance of different sources of supply Create or edit sources of supplyNote When you click create and choose a source of supply, you will be navigated to the respective app to provide the details for the creation of the purchasing document. Check the default source assignment of a purchase requisition for a specific material, plant, and date combination Choose Simulate SoS Assignment Enter the relevant material, plant, and date Choose Go The system will display the default sources of supply. If multiple sources of supply exist and if the system cannot determine a single default source of supply, then all sources of supply will be shown."
 },
 {
 "model_id": "sandstone-coga-3.0",
 "answer": "With this app you can view and manage existing sources of supply for a material for a defined validity period."
 },
 {
 "model_id": "coga-3.8",
 "answer": "With this app you can view and manage existing sources of supply for a material for a defined validity period."
 }
 ],
 "feedbackBy": "htalukder@ibm.com",
 "feedbackDate": "2023-08-04T16:45:25.921Z",
 "question": "How can I manage sources of supply?",
 "rankig": [
 {
 "rank": 0,
 "model_id": "flan-xxl"
 },
 {
 "rank": 1,
 "model_id": "sandstone-coga-3.0"
 },
 {
 "rank": 2,
 "model_id": "coga-3.8"
 }
 ],
 "rating": [
 {
 "star": 3,
 "model_id": "flan-xxl"
 },
 {
 "star": 2,
 "model_id": "sandstone-coga-3.0"
 },
 {
 "star": 2,
 "model_id": "coga-3.8"
 }
 ]
}
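
For analysis, it can help to flatten a record like this into one row per model. The sketch below is one possible approach, not the framework's own code: `iter_feedback_entries` and `flatten_record` are hypothetical helpers, and the sample is a trimmed copy of the record above with the `answer` texts left out. Note that some entries in the record sit inside an extra `additional_feedback` wrapper, so the helper unwraps those, and that the keys `rankig` and `relevent` are used exactly as spelled in the record.

```python
def iter_feedback_entries(items):
    """Yield per-model entries, unwrapping nested "additional_feedback" lists."""
    for item in items:
        if "additional_feedback" in item:
            yield from iter_feedback_entries(item["additional_feedback"])
        else:
            yield item


def flatten_record(record):
    """Merge Q&A answers, comment, rank, and star rating into one row per model."""
    rows = {}

    def row_for(model_id):
        return rows.setdefault(
            model_id, {"model_id": model_id, "question": record.get("question")}
        )

    for entry in iter_feedback_entries(record.get("additional_feedback", [])):
        row_for(entry["model_id"]).update(entry.get("feedback", {}))
    for item in record.get("rankig", []):  # key spelled as in the record
        row_for(item["model_id"])["rank"] = item["rank"]
    for item in record.get("rating", []):
        row_for(item["model_id"])["star"] = item["star"]
    return list(rows.values())


# Trimmed copy of the record above ("answer" fields omitted for brevity).
record = {
    "additional_feedback": [
        {"model_id": "flan-xxl",
         "feedback": {"relevent": "yes", "appropriate": "yes", "factual_accurate": "yes"}},
        {"additional_feedback": [
            {"model_id": "sandstone-coga-3.0",
             "feedback": {"relevent": "yes", "appropriate": "no", "factual_accurate": "yes",
                          "comment": "Answer was correct but not enough information"}}]},
        {"additional_feedback": [
            {"model_id": "coga-3.8",
             "feedback": {"relevent": "yes", "appropriate": "no", "factual_accurate": "yes",
                          "comment": "partially correct."}}]},
    ],
    "question": "How can I manage sources of supply?",
    "rankig": [
        {"rank": 0, "model_id": "flan-xxl"},
        {"rank": 1, "model_id": "sandstone-coga-3.0"},
        {"rank": 2, "model_id": "coga-3.8"},
    ],
    "rating": [
        {"star": 3, "model_id": "flan-xxl"},
        {"star": 2, "model_id": "sandstone-coga-3.0"},
        {"star": 2, "model_id": "coga-3.8"},
    ],
}
rows = flatten_record(record)
```

Each resulting row carries all four feedback types for one model and one question, which makes it straightforward to load into a table or a dataframe.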

Using the collected feedback, we created an admin dashboard that visualizes model performance to support better decisions.

Conclusion:

Collecting feedback through ratings, rankings, comments, and Q&A questionnaires enables models to learn from human preferences and insights. While each feedback mechanism has its merits, a combination of all four can provide a holistic view of response quality and user preferences. This feedback can be used to improve a RAG pipeline, to train a reward model for RLHF, or for statistical analysis, all toward building more accurate, relevant, and user-friendly language models.
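
As an illustration of the reward-model use case, a ranking can be turned into pairwise preferences. This is a sketch under stated assumptions rather than the pipeline's actual code: the `preference_pairs` helper is hypothetical, the `prompt`/`chosen`/`rejected` field names are an assumed convention, rank 0 is assumed to be the most preferred, and the flan-xxl answer is shortened from the record above.

```python
from itertools import combinations


def preference_pairs(question, answers, ranking):
    """Build (chosen, rejected) pairs from a ranking, assuming lower rank is better."""
    text_by_model = {a["model_id"]: a["answer"] for a in answers}
    ordered = sorted(ranking, key=lambda item: item["rank"])
    pairs = []
    # combinations() keeps list order, so `better` always outranks `worse`.
    for better, worse in combinations(ordered, 2):
        chosen = text_by_model[better["model_id"]]
        rejected = text_by_model[worse["model_id"]]
        if chosen == rejected:
            continue  # identical answers carry no preference signal
        pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs


short_answer = ("With this app you can view and manage existing sources of supply "
                "for a material for a defined validity period.")
answers = [
    # Shortened from the record for brevity.
    {"model_id": "flan-xxl",
     "answer": "You can perform the following tasks if you have the business role purchaser"},
    {"model_id": "sandstone-coga-3.0", "answer": short_answer},
    {"model_id": "coga-3.8", "answer": short_answer},
]
ranking = [
    {"rank": 0, "model_id": "flan-xxl"},
    {"rank": 1, "model_id": "sandstone-coga-3.0"},
    {"rank": 2, "model_id": "coga-3.8"},
]
pairs = preference_pairs("How can I manage sources of supply?", answers, ranking)
```

In the record above, two models returned the same answer, so that pair is skipped and only the two comparisons against the flan-xxl answer remain.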

The full implementation details can be found in this GitHub Repo.

Follow Towards Generative AI for more content on the latest advancements in AI.
