Recently we have seen a spate of papers (AutoGen, SWE-agent) and tools such as Databutton and Devin that show promising capabilities in autonomous software engineering.
In this post we will unravel the basic primitives of how these autonomous agents work, and consider their impact on software engineering as a profession.
This post is a logical progression from my previous articles on prompt engineering, talking to data, code generation experiments, and agents in LangChain. In the post on code generation, we generated neural nets by iteratively talking to an LLM and nudging it to produce correct code, akin to working with GitHub Copilot. That required skills in both prompt engineering and a good foundation in how neural net architectures work. The agents post was about choosing a tool from a list and using it based on the problem, where agents could even choose multiple tools dynamically. These posts are worth revisiting, as they represent the genesis of this new trend of multi-agent intelligence, which is gathering momentum.
LLM agents are AI-based components that leverage the power of LLMs to understand the context of a project's goals and processes, generate solutions such as code or text, and even collaborate with each other to validate the outputs. This is akin to how a real human team works together in an organization to solve problems.
Although there are accuracy concerns, the productivity boost from having multiple agents decompose complex tasks and collaborate to create the final solution can be huge. And there can always be lean human oversight in the loop to address any gaps. This has the potential to upend the predominant business model of the current day: deploying large swathes of engineers to hand-craft code and software deliverables.
Andrew Ng, founder of deeplearning.ai, has written about the patterns of agent behavior. Here, we will look at some of these patterns and explore how they can be used in the context of software engineering.
Reflection is a pattern typically composed of two agents: one generates a software artefact such as code, and the other reflects on it and checks it for correctness. This can be constructed as a loop in which the generator agent and reflector agent iteratively improve the output, as in the sketch below. We trade some compute cycles here in return for better-quality output, since reflection is a slower, more deliberate thinking process.
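Here is a minimal sketch of that loop in Python. The `llm(prompt)` helper is a hypothetical stand-in for whatever model client you use; everything else is plain control flow.

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: wire this to your LLM provider of choice."""
    raise NotImplementedError

def generate(task: str, feedback: str = "") -> str:
    """Generator agent: produces, or revises, a code draft."""
    prompt = f"Write Python code for this task:\n{task}"
    if feedback:
        prompt += f"\nRevise your previous draft using this feedback:\n{feedback}"
    return llm(prompt)

def reflect(task: str, draft: str) -> str:
    """Reflector agent: critiques the draft for correctness."""
    return llm(
        f"Review this code for the task '{task}'. List bugs and fixes, "
        f"or reply APPROVED if it is correct:\n{draft}"
    )

def reflection_loop(task: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_rounds):  # each round costs extra compute
        feedback = reflect(task, draft)
        if "APPROVED" in feedback:
            break
        draft = generate(task, feedback)
    return draft
```

The `max_rounds` cap is the compute/quality tradeoff made explicit: more rounds of reflection mean better output, but slower and costlier.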
Reflexion is a technique where the actor agent explicitly grounds its criticism in external data. It can call out citations as well as gaps in the generated output.
LATS (Language Agent Tree Search) combines reflection and search. It backpropagates reflective and environment-based feedback to improve the search process: for example, explicitly writing unit tests against generated code and scoring candidates before selecting one, as sketched below.
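The tree search itself is too long for a snippet, but the environment-grounded scoring step can be sketched as follows, reusing the hypothetical `llm` helper from the reflection sketch above: sample several candidate solutions, run the unit tests against each, and keep the highest scorer.

```python
def run_tests(code: str, tests: list[str]) -> int:
    """Score a candidate by how many unit-test snippets pass against it."""
    passed = 0
    for test in tests:
        scope: dict = {}
        try:
            exec(code, scope)   # caution: sandbox untrusted code in real use
            exec(test, scope)   # a failing test raises, e.g. AssertionError
            passed += 1
        except Exception:
            pass
    return passed

def best_candidate(task: str, tests: list[str], n: int = 5) -> str:
    """Sample n solutions and select the one passing the most tests."""
    candidates = [llm(f"Write Python code for: {task}") for _ in range(n)]
    return max(candidates, key=lambda c: run_tests(c, tests))
```

In full LATS these scores would be backpropagated up the search tree to steer which branches get expanded next.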
Tool use has also evolved a lot since we wrote about it some time ago. In my previous post on agents, I showed how the LLM smartly chooses between, say, a web search agent and a math calculator to answer a question. Now these can go much further: suppose you ask a mathematics question but don't provide a math calculator. The LLM may generate Python code to codify the math problem and improve its probability of getting the correct answer. This is surely a good leap in its cognition.
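A sketch of that fallback behaviour, again with the hypothetical `llm` helper: when no calculator tool is registered, ask the model to express the problem as Python and execute its answer.

```python
def solve_math(question: str) -> str:
    """Ask the model for Python code instead of a direct numeric answer."""
    code = llm(
        "Write a Python snippet that computes the answer to the question "
        f"below and stores it in a variable named `answer`:\n{question}"
    )
    scope: dict = {}
    exec(code, scope)  # caution: sandbox untrusted model output
    return str(scope["answer"])
```

Delegating the arithmetic to the Python interpreter sidesteps the model's known weakness at multi-digit calculation.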
LLM agents can now be combined in sequential, parallel, or hierarchical structures, or even a combination of these styles. When designing the LLM interaction, the developer can indicate which agents can delegate to other agents and which agents have to work alone.
Platforms such as crewAI follow a human-teaming metaphor and allow detailed configuration, such as an agent's ability to delegate; a minimal example follows.
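This configuration is based on crewAI's documented interface (parameter names may differ across versions); the roles, goals, and backstories are purely illustrative.

```python
from crewai import Agent, Task, Crew, Process

lead = Agent(
    role="Tech Lead",
    goal="Break the feature down and review the final code",
    backstory="A senior engineer who insists on tested, readable code",
    allow_delegation=True,   # this agent may hand work to others
)
coder = Agent(
    role="Developer",
    goal="Implement the assigned feature in Python",
    backstory="A meticulous programmer who writes unit tests first",
    allow_delegation=False,  # this agent works alone
)

feature = Task(
    description="Implement and review a function that validates email addresses",
    expected_output="Reviewed, tested Python code",
    agent=lead,
)

crew = Crew(agents=[lead, coder], tasks=[feature], process=Process.sequential)
result = crew.kickoff()
```

Note how the human-teaming metaphor shows up directly in the configuration: roles, goals, backstories, and who is allowed to delegate.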
LLMs are known to hallucinate a lot when making API calls, particularly in their use of input arguments. Recent work such as Berkeley's Gorilla project shows that LLMs fed with API documents and signatures can actually make successful API calls. This is huge, as modern software systems are typically composed by stringing together API calls to multiple external systems along with internal logic.
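The core idea can be sketched simply: put the real API documentation in the prompt and constrain the model to it. The `create_ticket` API below is hypothetical, and `llm` is the same stand-in helper as before.

```python
API_DOC = """
create_ticket(project: str, title: str, priority: int) -> str
    Creates an issue in the tracker and returns its ID.
    priority: 1 (highest) to 5 (lowest).
"""  # hypothetical internal API, shown for illustration

def plan_api_call(user_request: str) -> str:
    """Ground the model in the documented signature before it writes a call."""
    return llm(
        "Using ONLY the API documented below, write the single Python call "
        f"that fulfils the request. Do not invent arguments.\n{API_DOC}\n"
        f"Request: {user_request}"
    )

print(plan_api_call("File an urgent bug about the login page"))
```

Grounding the call in the documented signature is what cuts down hallucinated argument names.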
Guardrails against hallucination, such as cross-checking by other agents and human oversight, can be specified by clearly describing roles, goals, agent backstories, and task descriptions that set quality expectations.
In conclusion, agents are getting smarter. They can flexibly communicate and collaborate with each other. This trend will take off, and we can expect adoption of strongly AI-driven software engineering processes. Small language models trained on high-quality, textbook-grade data will also explode, doing niche work within atomic agentic workflows, in contrast to large models trained on internet-grade data.