I’ve been exploring smolagents for a few weeks now. As a “words” person, I prefer that to the visual tools like N8N.

Firstly - it’s frustrating! The documentation is out of date, the notebooks don’t work, models get outdated. I guess that’s a side effect of how fast AI is moving right now - but it also shows that using this stuff in production is high risk! Breaking changes seem the rule, not the exception.

But when you get over that hurdle, it’s fascinating. Tools like Manus show a glimpse of the future - and smolagents shows how you might build such systems. In a team building workshop ages ago, we played the https://leadershipinspirations.com/wp-content/uploads/2018/07/PBJ-Challenge.pdf. In the game, one person must write out, step by step, how to make a sandwich; the other person must follow those steps - but had to pretend to be a robot, with no knowledge of the world (there’s a variant where the other person is blindfolded). The point of this is to build communication - don’t assume the other person has your knowledge, or see things the same way.

I think recent iterations of LLMs were a bit like that game, when played by two players without a strong shared understanding of the world. There was no end of example of simple prompts that would send the LLM a bit loopy. Basic maths, counting the number of occurrences of a letter in a word - they all felt like you’d asked someone to make you a sandwich who’d only a vague understanding of bread.

Newer LLMs are much better - presumably because they’ve had more and better training data, the algorithms have got better at extracting understanding from that data, and the humans building the LLMs have learned from the many hilarious mistakes posted on Twitter. We’re also seeing an increased specialization in LLMs - turns out that an LLM that writes code doesn’t need to know much about skateboards.

What we’re seeing with agentic architectures is multiple components working together, in unscripted ways, to solve problems. This is a rapidly evolving field, but I am exploring the following concepts, based on smolagents, and using the sandwich game to illustrate the concepts.

Planning agent

An agent that can come up with a high level plan to solve the problem. Usually an LLM. Given the “make a sandwich” request, it might come up with the following plan:

Instruction: Make a sandwich

Plan

  • Get two slices of bread
  • Butter the bread
  • Put slice of ham on one of the slices of bread
  • Put some sliced tomato on the ham
  • Put some pickle around the tomato
  • Put the other slice of bread on the tomato and pickle

This planning agent operates at a fairly high level, and only has to come up with a plan. But one key aspect is that it can learn from its failures.

If you executed the plan as is, it would fail - it can’t find the ingredients, it doesn’t know how to use a knife, doesn’t know how to move things around or understand the concept of “around the tomato”.

So then it would come up with other strategies - it might come up with a more detailed plan, or change the sequencing, or change the ingredients. It keeps trying until it makes a sandwich (or, at least, thinks it has!), or until it’s told to stop.

Tools

There are many, many tasks that we already know how to execute, for which we can just provide an implementation for the planning agent to use. This is much faster and cheaper (both in terms of resource usage and money) than have the LLM figure it out. For instance, if you want to add two numbers together, it’s much easier to define an “addition tool” than have an LLM figure it out from scratch. In the “sandwich” example, we might refine it as follows.

Instruction: Make a sandwich. You have access to a tool that can find things in your kitchen, and a knife tool

Plan

  • Get two slices of bread
    • Find two slices of bread
    • Find plate
  • Butter the bread
    • Find butter
    • Use knife to spread butter
  • Put slice of ham on one of the slices of bread
    • Find ham
  • Put some sliced tomato on the ham
    • Find tomato
    • Use knife to slice tomato
  • Put some pickle around the tomato
    • Find pickle
    • Use knife to spread pickle
  • Put the other slice of bread on the tomato and pickle

The agent would try to execute these steps - but you’d still not get a sandwich. Knowing where the bread is doesn’t magically move it to the worktop.

Code-writing agents

Some tasks depend heavily on context or situation, or arise as the planning agent develops it’s plan. Don’t forget, the instruction we’re giving the planning agent is “make a sandwich”; we may have an instinct as to what that requires, but the point of the agent is to figure it out for us. If we knew what the process is, it would be cheaper and faster (in execution time, at least) to just write the software directly in a traditional way.

So, a code writing agent is a component that can take in a requirement, and write code to fulfill that requirement, then execute that code.

So, assuming we have a robot in the kitchen, and we are programming it to make a sandwich:

Instruction: Make a sandwich. You have access to a tool that can find things in your kitchen, and a knife tool. You can write code to program a robot

Plan

  • Get two slices of bread
    • Find two slices of bread
    • Find plate
    • Write code to move plate onto counter
    • Write code to move bread onto plate
  • Butter the bread
    • Find butter
    • Write code to move butter to counter
    • Use knife to spread butter
  • Put slice of ham on one of the slices of bread
    • Find ham
    • Write code to move ham onto bread
  • Put some sliced tomato on the ham
    • Find tomato
    • Use knife to slice tomato
    • Write code to move sliced tomato onto bread
  • Put some pickle around the tomato
    • Find pickle
    • Write code to move pickle to counter
    • Write code to move knife around tomato, not across tomato
    • Use knife to spread pickle
  • Put the other slice of bread on the tomato and pickle
    • Write code to move slice of bread onto tomato and pickle

Tada: A sandwich, generated by AI

Agents all the way down…

The “planning agent” sounds like the conductor running the show. I used that model to explain the concept, but it’s much more complicated - some of the steps may be too complex for code writing agents to solve. Moving stuff with a robot, for instance, is hard. In fact, you might create special planning controllers to handle those tasks - a “plate moving agent”, with access to tools like “limb mover”, “machine vision tool” etc. So the most sophisticated agentic systems are composed of multiple agents, using both pre-built tools and custom code, collaborating to reach a goal.

Why this matters.

Making software is expensive. It’s become cheaper over time - much cheaper! - but economics mean that software caters to “largest common denominator” needs. There are many different types of people who write documents - novelists, lawyers, letter writers, bloggers, students - and they all have different needs. And within those groups, there are different types of people, in different states of mind - a novelist might want to write a quick first draft, or go through feedback from their editor, etc. And yet, the number of word processors on the market is tiny - I know of 4. There are a few dedicated tools aimed at specific groups - novelists have Scrivener, for instance. But no two novelists work the same way.

So, what if we could instruct an agent to assist with writing a document? It might create (or reuse) the user interface for writing in, monitor your writing, notice when you’re stuck and suggest how to move further, package the final work into the file format you need, and send it to its intended recipient. You’re not using a word processor, you’re writing a document.

Agents allow us to create tools for specific situations, and adapt to circumstances as they learn.

For instance, if you are planning to buy a house, you probably have some requirements that don’t neatly fit into the filter criteria on the real estate website. For instance, my ideal home would be less than 5 minutes from a beach or a forest, and walking distance to bookshops, specialist food retailers, and a few nice bars and restaurants. It would have nice views, and a garden that’s got good light. It would have wall space for my bookshelves, and it would be quiet. At least 3 bedrooms, and I’d want a good-sized, well-lit kitchen. I don’t mind if it’s a flat or a house. Without an agent, you’d start by researching locations first - but it would be a bit haphazard. How do you find a place near a beach or forest, with the kind of amenities I like? Once you’ve got a shortlist of locations, you’d go to the real estate listings sites, and use the basic filters - location, number of rooms, price - and start scrolling. You’d look at hundreds of properties, and make a short list.

Instead, imagine having an agent that can search for the things you’re interested in, and send you a daily shortlist; you could tell it what you do and don’t like about each option and it would refine its search.

Yes, you could write traditional software to find that property short list. It would require a fairly skilled developer, and days or weeks of effort - but it would not work for someone who wants a flat within 5 minutes of an underground station on the Central or Victoria lines, with South-facing windows. And as you discover what you do and don’t like, you’d likely have to get that developer to continuously change the code. In short - it’s not economically viable.

It’s still early - there are many challenges with agentic architectures. LLMs are getting better and better at (what looks like!) reasoning, but they’re far from perfect. More complex scenarios quickly run into limitations - for instance, the LLMs context window becomes too small to contain all the information needed. There are questions about how to ensure the code written by the code-writing agents is safe, let alone efficient. Many of the showcases (e.g. on Reddit) are more like smart workflows than true agents.

Nevertheless, progress with LLMs has been accelerating. The tooling for agentic systems is improving dramatically, and there are several great libraries of tools, agents, and LLMs available on Hugging Face. I can’t wait to see what’s next!

Updated: