Using an AI Assistant to Read Tool Documentation

December 8, 2025 · 1002 words · 5 min

Using new tools on the command line can be frustrating. Even if we are confident that we’ve found the right tool, we might not know how to use it. A typical workflow might look something like the following. Can we improve this flow using LLMs?

Docker provides us with isolated environments to run tools. Instead of requiring that commands be installed, we have created minimal Docker images for each tool so that using the tool does not impact the host system. Leave no trace, so to speak.

Man pages are one of the ways that authors of tools ship content about how to use that tool. This content also comes with a standard retrieval mechanism (the man tool). A tool might also support a command-line option that prints usage information. Let’s start with the idealistic notion that we should be able to retrieve usage information from the tool itself.

In this experiment, we’ve created two entry points for each tool. The first entry point is the obvious one: a set of arguments passed directly to a command-line program. The OpenAI-compatible description that we generate for this entry point is shown below. We are using the same interface for every tool.

The second entry point gives the agent the ability to read the man page and, hopefully, improve its ability to run the first entry point. The second entry point is simpler, because it only does one thing: it asks a tool how to use it.

Let’s start with a simple example. We want to use a tool to generate a QR code for a link. We have used our image generation pipeline to package this tool into a minimal Docker image. We will now pass a prompt to a few different LLMs; we are using LLMs that have been trained for tool calling (e.g., GPT-4, Llama 3.1, and Mistral). Here’s the prompt that we are testing:

Note the optimism in this prompt. Because it’s hard to predict what different LLMs have already seen in their training sets, and many command-line tools use common names for arguments, it’s interesting to see what an LLM will infer before adding the man page to the context.
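To make the two entry points described above more concrete, here is a minimal sketch of how they could be presented to a model in the OpenAI-compatible function-calling format. It is not the description generated by the pipeline in this article: the function names `run_<tool>` and `<tool>_manual`, the single `args` string parameter, and the description text are all assumptions chosen for illustration.

```python
import json


def tool_definitions(tool_name: str) -> list[dict]:
    """Build two OpenAI-compatible function descriptions for one CLI tool.

    All names and descriptions here are hypothetical; only the overall
    shape (type/function/name/description/parameters) follows the
    function-calling format.
    """
    # Entry point 1: pass a set of arguments directly to the program.
    run_tool = {
        "type": "function",
        "function": {
            "name": f"run_{tool_name}",
            "description": f"Run the {tool_name} command with the given arguments.",
            "parameters": {
                "type": "object",
                "properties": {
                    "args": {
                        "type": "string",
                        "description": f"Arguments to pass to {tool_name}.",
                    }
                },
                "required": ["args"],
            },
        },
    }
    # Entry point 2: ask the tool how to use it. It takes no parameters.
    read_manual = {
        "type": "function",
        "function": {
            "name": f"{tool_name}_manual",
            "description": f"Return the usage documentation for {tool_name}.",
            "parameters": {"type": "object", "properties": {}},
        },
    }
    return [run_tool, read_manual]


if __name__ == "__main__":
    # "mytool" is a placeholder name, not a tool from the article.
    print(json.dumps(tool_definitions("mytool"), indent=2))
```

Because both definitions have the same shape for every tool, the model can decide on its own whether to read the documentation before attempting to run the tool.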
The output of the prompt is shown below. Grab your phone and check it out.

When an LLM generates a description of how to run something, it will usually format that output so that it is easy for a user to cut and paste the response into a terminal:

However, if the LLM is generating tool calls, we’ll see output that is instead formatted to be easier to run:

We respond to this by spinning up a Docker container.

Running the tool as part of the conversation loop is useful even when the command fails. In Unix, there are standard ways to communicate failures. For example, we have exit codes, and the standard output and error streams. This is how tools create feedback loops and correct our behavior while we’re iterating at the terminal. This same mechanism can be used in a conversation loop involving an LLM.

To illustrate, here is another simple example. We’ll try running a tool with the following prompt.

In our test, this did fail. However, it also described the apparent issue on the output stream. By including this message in the conversation loop, the assistant can suggest different courses of action. Different LLMs produced different results here. For example, Llama 3.1 gives instructions for how to install the missing font. On the other hand, GPT-4 re-ran the command, but only after having made the “executive” decision to try a different font.

We are very early in understanding how to take advantage of this apparent capacity to try different approaches. But this is another reason why quarantining these tools in Docker containers is useful. It limits their blast radius while we encourage experimentation.

We started by creating a pipeline to produce minimal Docker images for each tool. The set of tools was selected based on whether they have outputs useful for developer-facing workflows. We continue to add new tools as we think of new use cases. The initial set is listed below.

There was a set of initial problems with context extraction. Only about 60% of the tools we selected have man pages. However, even for the tools without one, there are usually other ways to get help content. The following steps show the final procedure we used:

Using this procedure, every tool in the list above eventually produced documentation.

Limited context lengths impacted some of the longer manual pages, so it was still necessary to employ standard RAG techniques to summarize verbose man pages. Our tactic was to focus on descriptions of command-line arguments and sections that had sample usage. These had the largest impact on the quality of the agent’s output. The structure of Unix man pages helped with the chunking, because we were able to rely on standard sections to chunk the content.

For a small set of tools, it was necessary to traverse a tree of help menus. However, these were all relatively popular tools, and the LLMs we deployed already knew about this command structure. It’s easy to check this out for yourself. Ask an LLM, for example: “What are the subcommands of Git?” or “What are the subcommands of Docker?” Maybe only popular tools get big enough that they start to be broken up into subcommands.

We should consider the active role that agents can play when determining how to use a tool. The Unix model has given us standards such as man pages, output streams, and exit codes, and we can take advantage of these conventions when asking an assistant to learn a tool. Beyond distribution, Docker also provides us with process isolation, which is useful when creating environments for safe exploration.

Whether or not an AI can successfully generate tool calls may also become a metric for whether or not a tool has been well documented.

To follow along with this effort, check out the .
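As a closing illustration of the feedback loop built on exit codes and output streams, here is a minimal sketch of how a command's result could be returned to the model as a tool message. It makes several assumptions: it runs a local process with Python's `subprocess` module instead of a Docker container so that it stays self-contained, and the function names, the `call_0` identifier, and the simulated "font not found" failure are hypothetical.

```python
import json
import subprocess
import sys


def run_and_report(argv: list[str], timeout: float = 30.0) -> dict:
    """Run a command and collect its exit code, stdout, and stderr."""
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout
    )
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


def as_tool_message(tool_call_id: str, report: dict) -> dict:
    """Wrap the report as a tool-role message for an OpenAI-compatible chat.

    Failures are reported the same way as successes, so the model can
    read the error text and choose a different course of action.
    """
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(report),
    }


if __name__ == "__main__":
    # Stand-in for a failing tool: writes a message to stderr, exits with 2.
    failing = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('font not found\\n'); sys.exit(2)",
    ]
    print(as_tool_message("call_0", run_and_report(failing)))
```

Appending a message like this to the conversation is what gives the assistant the chance to react to a failure, whether by explaining how to fix it or by retrying with different arguments.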