QualEval: Qualitative Evaluation for Model Improvement

Princeton University · Allen AI · Georgia Tech
Your personal data scientist. Insights that actually work.

We introduce QualEval, a qualitative evaluation framework for understanding and improving predictions from large language models (LLMs).

How does it work?




QualEval takes as input your evaluation data, i.e., your inputs, your reference answers, and your LLM's predictions. It then uses a combination of NLP and ML techniques to generate a dashboard that helps you understand your model better and improve it.
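As a concrete illustration, the evaluation data is simply a collection of (input, reference, prediction) triples. The field names below are hypothetical and only meant to show the shape of the data, not QualEval's actual schema:

```python
# A minimal sketch of the kind of evaluation data QualEval consumes:
# model inputs, reference answers, and the LLM's predictions.
eval_data = [
    {
        "input": "Write a function that returns the n-th Fibonacci number.",
        "reference": "def fib(n): ...",
        "prediction": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    },
    {
        "input": "Summarize the following dialogue: ...",
        "reference": "Alice and Bob agree to meet at noon.",
        "prediction": "Alice asks Bob to meet for lunch; they settle on 12 pm.",
    },
]
```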


QualEval first uses your evaluation data to discover important attributes such as sub-tasks and domains. It then assigns these attributes to each example in your evaluation data with a linear programming solver, and finally uses those assignments to generate a dashboard that visualizes your model's behavior and surfaces precise, actionable insights for model improvement.
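To make the assignment step concrete, here is a rough sketch of how attribute-to-example assignment could be posed as a linear program. The affinity scores, constraint values, and the use of SciPy's solver are illustrative assumptions, not QualEval's actual implementation:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical affinity scores between 4 evaluation examples and 3 discovered
# attributes (e.g. sub-tasks); higher means the attribute fits the example better.
affinity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.6, 0.5, 0.7],
    [0.3, 0.9, 0.2],
])
n_examples, n_attrs = affinity.shape
per_example = 2    # each example gets exactly 2 attributes (illustrative)
per_attr_cap = 3   # each attribute covers at most 3 examples (illustrative)

# Decision variables x[i, j] in [0, 1]: assign attribute j to example i.
# Maximizing total affinity is the same as minimizing its negation.
c = -affinity.flatten()

# Equality constraints: sum_j x[i, j] == per_example for every example i.
A_eq = np.zeros((n_examples, n_examples * n_attrs))
for i in range(n_examples):
    A_eq[i, i * n_attrs:(i + 1) * n_attrs] = 1.0
b_eq = np.full(n_examples, per_example)

# Inequality constraints: sum_i x[i, j] <= per_attr_cap for every attribute j.
A_ub = np.zeros((n_attrs, n_examples * n_attrs))
for j in range(n_attrs):
    A_ub[j, j::n_attrs] = 1.0
b_ub = np.full(n_attrs, per_attr_cap)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
assignment = res.x.reshape(n_examples, n_attrs).round().astype(int)
print(assignment)  # 1 where attribute j is assigned to example i
```

The balance constraints keep any single attribute from absorbing all examples, so the resulting per-attribute slices of the evaluation data remain large enough to visualize and compare in the dashboard.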

When can I use QualEval?


In short, always :).

QualEval is general and applicable to any task and any language model. We demonstrate QualEval on a wide variety of generative and classification tasks, including code generation, question answering, and dialogue summarization. Importantly, we show how insights from QualEval can be used to improve model performance.

Below we list some example dashboards for different models and tasks. QualEval generates high-quality attributes and faithfully presents interpretable and actionable insights.

QualEval dashboard for code generation with davinci-2




QualEval dashboard for code generation with davinci-3




QualEval dashboard for dialog summarization with davinci-2




QualEval dashboard for dialog summarization with davinci-3




QualEval dashboard for clinical knowledge-based QA with curie




QualEval dashboard for clinical knowledge-based QA with davinci-2