Scores & Evaluation
Evaluation is a critical aspect of developing and deploying LLM applications. Teams typically combine multiple evaluation methods, depending on the use case and the stage of the development process. Langfuse provides a flexible scoring system to capture these evaluations and make them actionable.
Common evaluation methods
Langfuse supports all forms of evaluation due to its open architecture and API; learn more about the score object in the next section.
Depending on your needs in the development process, you can use one or multiple of the following evaluation methods. Each method varies in frequency, cost, and quality. Usually, only a combination of multiple methods provides a comprehensive evaluation of an LLM application.
Source | Examples |
---|---|
Manual Annotation (in UI) | Establish an evaluation baseline by manually annotating data in the UI |
User feedback | Explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output, human-in-the-loop) |
Model-based evaluation | Langfuse-managed evals or external library: OpenAI Evals, Whylabs Langkit, Langchain Evaluators (cookbook), RAGAS for RAG pipelines (cookbook), custom model outputs |
Custom via SDKs/API | Run-time quality checks (e.g. valid structured output format, see the sketch below), custom links to annotation tools, external evaluation pipelines that sync with Langfuse |
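As an illustration of the last row, a run-time quality check can be recorded as a custom score via the SDK. A minimal sketch in Python, assuming a v2-style `langfuse.score()` method; the score name `valid_json` is a hypothetical example:

```python
import json

from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

def score_structured_output(trace_id: str, model_output: str) -> None:
    """Attach a 0/1 score indicating whether the model output is valid JSON."""
    try:
        json.loads(model_output)
        valid = 1
    except json.JSONDecodeError:
        valid = 0
    langfuse.score(
        trace_id=trace_id,
        name="valid_json",  # hypothetical score name
        value=valid,
        comment="Run-time structured-output check",
    )
```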
Score object in Langfuse
Scores serve as objects to store evaluation metrics in Langfuse. They always relate to a trace and can optionally be attached to specific observations within that trace.
Attribute | Type | Description |
---|---|---|
name | string | Name of the score, e.g. user_feedback, hallucination_eval |
value | number | Value of the score |
traceId | string | Id of the trace the score relates to |
observationId | string | Optional: Observation (e.g. LLM call) the score relates to |
comment | string | Optional: Evaluation comment, commonly used for user feedback, eval output or internal notes |
id | string | Unique identifier of the score. Auto-generated by the SDKs. Can optionally be used as an idempotency key to update scores. |
source | string | Automatically set based on the source of the score. Can be either API, EVAL, or ANNOTATION. |
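As a sketch, ingesting a user-feedback score with these attributes could look like this via the Python SDK (assuming a v2-style `langfuse.score()` method; the trace and observation ids are hypothetical). Note that `source` is set by Langfuse automatically (`API` in this case) and is not passed by the client:

```python
from langfuse import Langfuse

langfuse = Langfuse()

langfuse.score(
    id="feedback-trace123",   # optional; reusing the same id upserts the score (idempotency key)
    trace_id="trace123",      # hypothetical trace id
    observation_id="obs456",  # optional: attach to a specific LLM call within the trace
    name="user_feedback",
    value=1,                  # e.g. thumbs up = 1, thumbs down = 0
    comment="User accepted the generated answer",
)
```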
Using scores across Langfuse
Scores can be used in multiple ways across Langfuse:
- Displayed on traces to provide a quick overview
- Segmentation: filter all execution traces by score, e.g. to find all traces with a low quality score
- Analytics: Detailed score reporting with drill downs into use cases and user segments
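For custom reporting, scores can also be pulled via the public API. A sketch using Python's `requests`, assuming the `GET /api/public/scores` endpoint on Langfuse Cloud with basic auth (public/secret key); the score name `quality` and the 0.5 threshold are hypothetical:

```python
import requests

# Adjust the host for self-hosted deployments.
resp = requests.get(
    "https://cloud.langfuse.com/api/public/scores",
    auth=("pk-lf-...", "sk-lf-..."),           # basic auth: public key / secret key
    params={"name": "quality", "limit": 50},   # "quality" is a hypothetical score name
)
resp.raise_for_status()

# Collect traces whose quality score falls below a threshold.
low_quality_traces = [s["traceId"] for s in resp.json()["data"] if s["value"] < 0.5]
print(low_quality_traces)
```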
Frequently used scores
Scores in Langfuse are adaptable (it is just a `name`) and designed to cater to the unique requirements of specific LLM applications. They typically serve to measure the following aspects:
- Quality
- Factual accuracy
- Completeness of the information provided
- Verification against hallucinations
- Style
- Sentiment portrayed
- Tonality of the content
- Potential toxicity
- Security
- Similarity to prevalent prompt injections
- Instances of model refusals (e.g., "as a language model, ...")
This flexible scoring system allows for a comprehensive evaluation of various elements integral to the function and performance of the LLM application.
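For instance, a model-based hallucination eval could write one of these scores back to a trace. A sketch assuming a v2-style Python SDK; `judge_hallucination()` is a hypothetical stand-in for a real LLM-as-judge call (e.g. via RAGAS or a Langfuse-managed eval):

```python
from langfuse import Langfuse

langfuse = Langfuse()

def judge_hallucination(answer: str, context: str) -> float:
    """Hypothetical stand-in for an LLM-as-judge call.

    Toy heuristic only: fraction of answer tokens that also appear
    in the retrieved context. Replace with a real eval model.
    """
    tokens = answer.lower().split()
    return sum(t in context.lower() for t in tokens) / max(len(tokens), 1)

def evaluate_trace(trace_id: str, answer: str, context: str) -> None:
    # Attach the eval result as a score; the name mirrors the example above.
    langfuse.score(
        trace_id=trace_id,
        name="hallucination_eval",
        value=judge_hallucination(answer, context),
        comment="Model-based eval; higher = better grounded in context",
    )
```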