# Writing 84 Tests for a Project With Zero Lines of Code
The llm-wiki project has 3,610 lines across 22 files. Every single one is a markdown file. There is no Python. No JavaScript. No compiled binary. The "source code" is English prose: instructions that Claude reads and follows to build knowledge wikis from web research.
So how do you write tests for a program that is, technically, a document?
I figured it out. 84 structural assertions, 11 intentionally broken wiki fixtures, 5 behavioral evals via Promptfoo, and a GitHub Actions pipeline. As far as I can tell, nobody has used Promptfoo to test a Claude Code plugin before. Here is what I learned.
## The three-layer problem
Traditional testing has a simple contract: given input X, the function returns Y. If it doesn't, the test fails. But when your "function" is an LLM reading markdown instructions, the contract dissolves. The same instruction file, given the same user request, might produce different article titles, different file structures, different cross-references. The output is correct within a range, not at a point.
Anthropic, OpenAI, and GitLab all converge on the same solution: split tests into layers by how much uncertainty you're willing to tolerate.
**Layer 1 is deterministic and free.** No LLM calls. You're checking that the wiki's file system is internally consistent. Does every directory have an `_index.md`? Does every raw source have the six required frontmatter fields? Does the `type: articles` file actually live in `raw/articles/` and not `raw/papers/`? These checks take seconds and cost nothing. I have 84 of them. They run on every push.
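A check of this shape fits in a few lines of bash. What follows is a minimal sketch, not the real `test-structure.sh`: the directory layout and the frontmatter field names are assumptions (only three of the six required fields are shown, and `title`/`url` are guesses).

```shell
# Minimal sketch of a Layer 1 structural check. Directory layout and
# field names are assumptions, not the real test suite.
check_wiki() {
  local wiki="$1" fail=0 dir f field
  # Every directory must contain an _index.md
  while IFS= read -r -d '' dir; do
    [ -f "$dir/_index.md" ] || { echo "FAIL: missing _index.md in $dir"; fail=1; }
  done < <(find "$wiki" -type d -print0)
  # Every raw source must declare each required frontmatter key
  for field in type title url; do
    while IFS= read -r -d '' f; do
      grep -q "^$field:" "$f" || { echo "FAIL: $f missing '$field:'"; fail=1; }
    done < <(find "$wiki/raw" -name '*.md' ! -name '_index.md' -print0 2>/dev/null)
  done
  return "$fail"
}
```

Because every assertion is a file-system predicate, the whole layer runs in seconds with no API calls.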
**Layer 2 is semantic and costs money.** You ask Claude to do something (ingest a URL, compile an article, route a command) and then grade whether it followed the instructions. Promptfoo handles this with three assertion types: trajectory assertions ("did it call WebSearch?"), llm-rubric assertions ("does the output have complete frontmatter?", graded by a judge LLM), and custom JavaScript that checks the file system after the agent runs. Each eval costs about $0.50. I run five of them on PRs.
**Layer 3 is full workflows.** Research-to-article. Ingest-compile-lint. Retract-and-verify-cleanup. These use `claude -p` in headless mode, cost $10-20 per run, and execute weekly. I haven't built these yet. Layers 1 and 2 are live.
## The golden wiki
Every structural test needs something to test against. I built a golden wiki: a minimal but complete fixture with three raw sources, two compiled articles, proper cross-references, bidirectional See Also links, correct index files, and a valid log. Twenty files total. It passes every check.
Then I broke it eleven different ways. One copy per lint rule. `missing-index/` has a deleted `_index.md`. `bad-frontmatter/` has `type: invalid` instead of `type: articles`. `misplaced-file/` puts a concept article inside `wiki/references/`. `retracted-marker/` leaves a `<!--RETRACTED-SOURCE-->` comment that should have been cleaned up. Each broken copy triggers exactly one violation. The test asserts that the defect is present; this is negative testing.
A shell script called `generate-defect-fixtures.sh` creates all eleven from the golden wiki in under a second. Change the golden fixture, regenerate, and every negative test updates automatically.
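The generator idea fits in a few lines of shell. This is a hedged sketch, assuming the fixture keeps `wiki/` and `raw/` subtrees; only two of the eleven defect rules are shown, and the paths are illustrative rather than the real script's.

```shell
# Sketch of the defect-fixture generator: copy the golden wiki once per
# lint rule, then inject exactly one violation. Paths are assumptions.
make_defects() {
  local golden="$1" out="$2"

  # Rule 1: missing-index -- delete an _index.md
  cp -r "$golden" "$out/missing-index"
  rm "$out/missing-index/wiki/_index.md"

  # Rule 2: bad-frontmatter -- corrupt the type field
  cp -r "$golden" "$out/bad-frontmatter"
  sed -i.bak 's/^type: articles$/type: invalid/' \
    "$out/bad-frontmatter/raw/articles/"*.md
  rm -f "$out/bad-frontmatter/raw/articles/"*.bak
}
```

Each negative test then runs the linter on one broken copy and expects exactly one violation.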
## Promptfoo on a Claude plugin
Promptfoo has a provider called `anthropic:claude-agent-sdk` that can load local plugins. Point it at your plugin directory, whitelist the tools, set a budget cap, enable sandbox mode, and it runs your plugin through test cases defined in YAML.
The part that surprised me: the `skill-used` assertion type. You can assert that the agent invoked a specific skill: not just that the output mentions wiki commands, but that Claude actually triggered the wiki skill at the Claude Code level. Combined with trajectory assertions that verify which tools were called, you can check both what happened and how.
I test five behaviors: the fuzzy router dispatching "Research the history of testing" to the research command, a URL to ingest, a question to query, an ambiguous single word triggering clarification (negative control), and the plugin loading without errors. Each runs three times to measure variance.
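Laid out as a Promptfoo config, the router case looks roughly like this. The provider id and assertion types are the ones named above; the remaining field names are assumptions, and the exact schema may differ.

```yaml
# Hedged sketch of a Promptfoo test case for the fuzzy router.
prompts:
  - "{{request}}"

providers:
  - id: anthropic:claude-agent-sdk
    config:
      plugins:
        - ./            # local plugin directory

tests:
  - vars:
      request: "Research the history of testing"
    assert:
      - type: skill-used          # did Claude trigger the wiki skill?
        value: wiki
      - type: llm-rubric          # graded by a judge LLM
        value: Routes the request to the research command, not ingest or query.
```

Repeating each case three times (e.g. `promptfoo eval --repeat 3`) gives a rough read on variance.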
## What I actually learned
The biggest surprise: Layer 1 catches almost everything. The expensive behavioral evals in Layer 2 are for confidence, not coverage. Index corruption, frontmatter drift, misplaced files, broken cross-references: these are the actual failure modes of a wiki management system, and they're all deterministic. You don't need an LLM to verify that a file exists in the right directory.
Anthropic's eval guide says "grade outcomes, not trajectories." For wiki operations, the outcome IS the file system state. Check the files, check the indexes, check the links. If the structure is correct, the agent followed the protocol. The trajectory (which tool calls it made, in what order) is interesting but secondary.
The test suite lives in `tests/` of the [nvk/llm-wiki](https://github.com/nvk/llm-wiki) repo on GitHub. Clone it, run `./tests/test-structure.sh`, and watch 84 green checkmarks validate a project that contains zero lines of code.