Case Study: How Manifest Alignment Boosted Tool Use Performance from 70% to 83% for Cooking MCP
Large language models don’t just rely on their training—they also depend on how tools are described and exposed through manifests. Even small wording choices in a manifest can make the difference between a model using the right tool confidently or skipping it altogether.
The Challenge
When we first tested the recipe server, results were mixed. The manifest exposed two tools:
- Recipe Generation: for creating new recipes from natural language.
- Recipe Adaptation: for transforming an existing recipe (make vegan, adjust servings, reduce sodium, etc.).
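As a rough sketch, a two-tool manifest of this shape might look like the following (tool names, field names, and wording here are illustrative assumptions, not the server's actual schema):

```python
# Hypothetical MCP-style tool manifest for the recipe server (baseline shape).
# Tool names, fields, and descriptions are illustrative, not the real manifest.
manifest = {
    "tools": [
        {
            "name": "generate_recipe",
            "description": "Create a new recipe from a natural language request.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "request": {
                        "type": "string",
                        "description": "What to cook, in plain language.",
                    },
                },
                "required": ["request"],
            },
        },
        {
            "name": "transform_recipe",
            "description": "Adapt an existing recipe (make vegan, adjust servings, reduce sodium, etc.).",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "recipe": {
                        "type": "string",
                        "description": "The existing recipe text.",
                    },
                    # In the baseline this field was called "instructions",
                    # which collides with a recipe's own cooking instructions.
                    "instructions": {
                        "type": "string",
                        "description": "The requested change.",
                    },
                },
                "required": ["recipe", "instructions"],
            },
        },
    ]
}
```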
Our baseline run used a new test suite with 67 questions, covering golden paths, embedded prompts, ambiguous phrasing, multilingual queries, and distractors. Three models were tested: gpt-4o, gpt-4o-mini, and gpt-4-turbo.
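Per-category accuracy can be computed with a minimal scoring sketch like the one below (the category labels and result format are illustrative, not the actual harness):

```python
# Minimal sketch of per-category scoring for the test suite.
# Category names and the (category, passed) result format are assumptions.
from collections import defaultdict

def score(results):
    """Given (category, passed) pairs, return per-category accuracy in percent."""
    totals, passed = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: round(100 * passed[c] / totals[c], 1) for c in totals}
```

For example, `score([("golden", True), ("ambiguous", False), ("ambiguous", True)])` yields 100.0 for golden prompts and 50.0 for ambiguous ones.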
The baseline accuracy was 70.7%:
- Golden prompts (direct, clear requests) scored well at 83%.
- Embedded prompts (requests buried in text) hit 95%.
- Ambiguous prompts were a disaster at 38%.
- transform_recipe was especially underused — even when a recipe was provided.
- Distractors (off-topic uses of the word “recipe”) occasionally slipped through; accuracy there was 80%.
This reflected a common problem: the LLM understood recipes but didn’t reliably map vague or adaptive requests to the correct tool.
The Intervention
We revised the manifest over several versions with three key strategies:
Richer Synonyms & Guidance
- Expanded tool descriptions to include natural phrasing like “make better, tweak, adjust, fix, alter.”
- Reinforced each tool’s use cases with more detailed descriptions of when to invoke it.
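A before/after sketch of the description enrichment (the exact wording below is an illustrative assumption, not the production manifest):

```python
# Before: terse description that only matches explicit "adapt" phrasing.
before = "Adapt an existing recipe."

# After: the same tool, but the description enumerates natural synonyms
# so vague requests like "make it better" still route here.
# Wording is an illustrative assumption.
after = (
    "Transform or adapt an existing recipe. Use this when the user wants to "
    "make it better, tweak, adjust, fix, or alter a recipe, change servings, "
    "or meet dietary needs (e.g., make it vegan, reduce sodium)."
)
```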
Optionality Emphasis
- Clarified that only the natural language fields are required.
- Other structured fields (dietary restrictions, ingredients, servings) are optional, lowering the “barrier” for tool invocation.
Cleaner Schema
- Renamed confusing fields (e.g., instructions → modification) to avoid conflicts between recipe steps and change requests.
- Removed non-standard "optional" arrays in favor of JSON Schema standards.
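The schema cleanup can be sketched as a before/after (the field sets are illustrative reconstructions, not the exact production schema):

```python
# Before: a non-standard "optional" array, plus an "instructions" field that
# collides with a recipe's own cooking steps (illustrative reconstruction).
before_schema = {
    "type": "object",
    "properties": {
        "recipe": {"type": "string"},
        "instructions": {"type": "string"},  # ambiguous: steps or change request?
        "servings": {"type": "integer"},
    },
    "required": ["recipe", "instructions"],
    "optional": ["servings"],  # not part of the JSON Schema standard
}

# After: "modification" is unambiguous, only the natural-language fields are
# required, and optionality is expressed the standard way, by omitting a
# property from "required".
after_schema = {
    "type": "object",
    "properties": {
        "recipe": {"type": "string"},
        "modification": {
            "type": "string",
            "description": "The requested change, in plain language.",
        },
        "servings": {"type": "integer"},
    },
    "required": ["recipe", "modification"],
}
```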
The Results
After these manifest changes, we reran the full 67-question suite across the same three models.
Accuracy jumped to 82.7% — a 12-point improvement:
- Golden Prompts: up to 89% (+5.6).
- Embedded Prompts: a perfect 100% (+5).
- Ambiguous Prompts: 61.7% (+23.4!) — the biggest gain.
- Multilingual: steady at 92%.
- Distractors: dipped slightly from 80% → 73% (a tuning opportunity).
By model:
- gpt-4-turbo soared to 92.5% overall (+13.4).
- gpt-4o improved to 80.6% (+9).
- gpt-4o-mini gained only marginally (+1.5), underscoring that smaller models struggle more with reference-based prompts.
Why It Worked
The manifest acts as the dictionary between LLMs and server tools. In the baseline, vague requests like “make it better” or “something to eat” often produced no tool call at all.
By enriching tool descriptions with synonyms and clarifying intent, the LLM could confidently route these prompts to the right tool. Simplifying schema requirements further reduced hesitation. The result: ambiguous prompts that once failed now succeeded.
Key Takeaways
- Manifest design directly impacts accuracy. Small changes in wording, synonyms, and schema clarity produced double-digit accuracy gains.
- Ambiguity is solvable. By anticipating natural user phrasing, we turned a 38% weak spot into a 62% success rate.
- Model choice matters. gpt-4-turbo showed the biggest lift, highlighting how stronger LLMs benefit most from clean manifests.
- Trade-offs exist. Distractor performance dipped slightly, reminding us that over-broad descriptions can make models over-eager.
The Bottom Line
We transformed a 70%-accurate baseline into an 83%-accurate system, with massive improvements in ambiguous and embedded prompt handling, just by tuning the manifest. The models and the test suite stayed exactly the same.
For developers, the lesson is clear: your manifest isn’t documentation — it’s part of the product. Thoughtful design can unlock accuracy that no amount of model tuning alone will deliver.
If you want to get more out of your MCP efforts, reach out; we are happy to help.
Published: 9/26/2025