In the previous parts of our AI Multi-Agent series, we explored why ChatGPT alone does not constitute true AI and how to build a simple “Poor Man’s RAG” agent that retrieves relevant context before responding to you. Now, it’s time to delve into the design of a general AI agent, a concept built upon more than 50 years of AI research. A recommended resource for serious AI designers is the classic book “Artificial Intelligence: A Modern Approach.”
In this article, we will examine the anatomy of a functional and efficient AI multi-agent, broken down into three key subchapters. We will also discuss practical components you can use for each part of the agent.
Since we are trying to design something intelligent, humans serve as the best model. Here is a schematic of a typical human:
Similarly, an AI agent must have analogous components. Can ChatGPT fulfill all these roles? Of course not; it can only simulate conversation. However, it can serve as a crucial part of a multi-agent system modeled after a human:
This is a very general scheme, but it captures the high-level design of pretty much any AI agent we can imagine — self-driving car, autonomous robot, agent on the web, etc.
A well-designed AI agent integrates several critical components to function effectively. It begins with a communication interface akin to a human's ability to speak or listen, facilitating interaction with users through chat or voice interfaces. Sensors, such as cameras and lidars for autonomous vehicles or combinations of LLMs and image-to-text models for web-based agents, enable the agent to perceive and understand its environment. A processing module synthesizes inputs from sensors and users to comprehend tasks at hand. A memory or knowledge base stores and retrieves pertinent information, enhancing decision-making through retrieval augmented generation (RAG). A robust reasoning engine plans, critiques, refines, and formulates execution strategies. Finally, actuators execute actions in the external world, enabling the agent to interact with software systems and physical environments effectively.
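To make this component breakdown concrete, here is a minimal sketch in Python of how the pieces could be wired into a single perceive-retrieve-plan-act loop. Every class and method name here is a hypothetical placeholder invented for illustration, not a real library or our actual implementation.

```python
# Hypothetical skeleton of the agent described above: sensors, knowledge base,
# reasoning engine, and actuators connected in one request-handling loop.

class Sensors:
    def perceive(self) -> dict:
        return {"observation": "example environment data"}

class KnowledgeBase:
    def retrieve(self, query: str) -> list[str]:
        return ["relevant fact retrieved via RAG"]

class Reasoner:
    def plan(self, task: str, context: list[str]) -> list[str]:
        return [f"step 1 for: {task}"]

class Actuators:
    def execute(self, step: str) -> str:
        return f"executed: {step}"

class Agent:
    def __init__(self):
        self.sensors = Sensors()
        self.kb = KnowledgeBase()
        self.reasoner = Reasoner()
        self.actuators = Actuators()

    def handle_request(self, user_request: str) -> list[str]:
        observation = self.sensors.perceive()            # perceive the environment
        context = self.kb.retrieve(user_request)         # consult memory via RAG
        task = f"{user_request} | seen: {observation}"   # process input into a task
        plan = self.reasoner.plan(task, context)         # reason and plan
        return [self.actuators.execute(step) for step in plan]  # act

if __name__ == "__main__":
    print(Agent().handle_request("summarize today's AI news"))
```

Each stub would of course be replaced by the real modules discussed below; the value of the sketch is only in showing how the data flows between them.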
These AI multi-agents and their design have been studied for decades, for example in the book recommended at the beginning of this article. The invention of LLMs and other generative AI models makes it possible to build upon this research at a new level of technology and finally start building AI agents that are useful in a general sense, rather than only for extremely narrow, specialized tasks.
Let’s move from our human analogy to discussing the specific parts needed to build a versatile AI Multi-Agent operating on the internet.
Sensors. For a web-based agent, these need to read web pages and documents, interpret images, and follow and download content from URLs so the agent can perceive its environment.
To build these sensors, we need a combination of LLMs and models that understand images, along with software that can crawl and download data from URLs. Proper prompt engineering will enable these sensors to convert what they “read” into formats suitable for further processing.
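As a toy illustration of the crawl-and-clean part of such a sensor, the sketch below downloads a page and extracts its plain text using only the Python standard library. In a real agent, the extracted text (and any images) would then be passed to an LLM or image-to-text model with a carefully engineered prompt; that step is omitted here.

```python
# A toy "web sensor": fetch a URL and return plain text trimmed to fit
# an LLM context window. Standard library only; no external dependencies.

import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def read_url(url: str, max_chars: int = 2000) -> str:
    """Download a page and return its visible text, truncated for the LLM."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)[:max_chars]

# Example usage: text = read_url("https://example.com")
```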
Human Interaction. This is the part everyone is familiar with by now thanks to ChatGPT: you can type your request or speak it aloud, and it will be passed to another LLM-based module for further reasoning.
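A chat interface can be as simple as a loop that keeps the conversation history and forwards it to a model. In the sketch below, `call_llm` is a hypothetical stand-in for whatever model endpoint you use, not a real library function.

```python
# Bare-bones chat loop: collect user input, keep history, ask the model.

def call_llm(messages: list[dict]) -> str:
    # Stand-in for a real model call; here we just echo for illustration.
    return f"(model reply to: {messages[-1]['content']})"

def chat_loop():
    history = [{"role": "system", "content": "You are the agent's interface."}]
    while True:
        user_text = input("you> ")
        if user_text.lower() in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": user_text})
        reply = call_llm(history)
        history.append({"role": "assistant", "content": reply})
        print("agent>", reply)
```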
Process Input. This module, also LLM-based, takes whatever the Sensors provide together with the current request from the Human and tries to formulate a clear Task Request for our Plan Solution module, arguably the main part of the “brain”. It is also absolutely critical for this module to consult the Knowledge Base via RAG and include the retrieved material in the context when formulating the Task Request.
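One way to implement this step is a prompt template that merges the user request, sensor observations, and retrieved knowledge into a single Task Request. The sketch below assumes the hypothetical `call_llm` helper from above and a hypothetical `retrieve_context` function for the RAG lookup.

```python
# Sketch of the "Process Input" step: fuse request + observations + RAG context
# into one unambiguous Task Request for the planning module.

PROMPT_TEMPLATE = """You are the input-processing module of an AI agent.
User request: {request}
Sensor observations: {observations}
Relevant knowledge: {context}

Rewrite this as a single, unambiguous Task Request for the planning module."""

def process_input(request: str, observations: str, retrieve_context, call_llm) -> str:
    context = "\n".join(retrieve_context(request))
    prompt = PROMPT_TEMPLATE.format(
        request=request, observations=observations, context=context
    )
    return call_llm([{"role": "user", "content": prompt}])
```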
Knowledge Base / RAG. This module stores the data and knowledge that may be relevant to our agent’s operations. This can be all kinds of publicly available data accessed via “regular” internet search, as well as so-called vector databases, which represent unstructured text as numerical vector embeddings. Embeddings enable search “by meaning” rather than simply “by keywords” and are a crucial part of our agent.
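To show the idea of searching “by meaning”, here is a deliberately tiny vector-store sketch. Real systems use a proper embedding model and a vector database (for example FAISS, Pinecone, or pgvector); in this toy version a hashed bag-of-words vector stands in for embeddings so the example stays self-contained and runnable.

```python
# Toy vector store: hashed bag-of-words "embeddings" plus cosine similarity.

import math
from collections import Counter

def embed(text: str, dim: int = 256) -> list[float]:
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class TinyVectorStore:
    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, text: str):
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text)
                  for vec, text in self.items]
        return [text for _, text in sorted(scored, reverse=True)[:k]]

store = TinyVectorStore()
store.add("Our refund policy allows returns within 30 days.")
store.add("The office is closed on public holidays.")
print(store.search("can I return a product?", k=1))
```

A real embedding model captures far richer semantics than word hashing, but the interface (add documents, search by a query vector) is essentially the same.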
Plan Solution. This is normally several LLMs working together. The module takes the Task Request, analyzes what resources (and “hands”) are available to the agent, iteratively plans the execution using critique and step-by-step planning approaches, designs sub-agents that are missing but needed for the task, and finally orchestrates execution using the “hands” or sub-agents available. This is an extremely interesting and fast-developing area of AI research, and we at Integrail devote a lot of resources to building an “AI Brain” that uses “self-learning” and automatic “sub-agent development”. In other, more specialized agents (e.g., in games), approaches such as Reinforcement Learning are quite useful as well.
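As one concrete example of iterative planning, the sketch below implements a simple draft / critique / revise loop. This is a common pattern rather than Integrail’s actual “AI Brain”, and it again assumes the hypothetical `call_llm` helper introduced earlier.

```python
# Simple plan -> critique -> revise loop, capped at a few rounds.

def plan_solution(task_request: str, call_llm, max_rounds: int = 3) -> str:
    plan = call_llm([{"role": "user",
                      "content": f"Draft a step-by-step plan for: {task_request}"}])
    for _ in range(max_rounds):
        critique = call_llm([{"role": "user",
                              "content": f"Critique this plan and list concrete flaws:\n{plan}"}])
        if "no flaws" in critique.lower():
            break  # the critic is satisfied, stop refining
        plan = call_llm([{"role": "user",
                          "content": ("Revise the plan to address this critique:\n"
                                      f"Plan:\n{plan}\nCritique:\n{critique}")}])
    return plan
```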
“Hands” or Execute Actions. All of the above would be completely useless if our agent didn’t have “hands” to do what a human asked of it. These hands are pieces of code that can call external APIs, press buttons on web pages, or otherwise interact with existing software infrastructure.
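A common way to give an agent “hands” is a registry of named tools that the planner can invoke by emitting structured output. The tools below (an HTTP fetch and a note-saving helper) are illustrative stand-ins, not a prescribed tool set.

```python
# Minimal "hands": a registry of named actions plus a dispatcher that
# executes an action described as JSON, e.g.
# {"tool": "save_note", "args": {"text": "done"}}

import json
import urllib.request

def http_get(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def save_note(text: str, path: str = "notes.txt") -> str:
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")
    return f"saved {len(text)} characters to {path}"

TOOLS = {"http_get": http_get, "save_note": save_note}

def execute_action(action_json: str) -> str:
    action = json.loads(action_json)
    tool = TOOLS[action["tool"]]
    return str(tool(**action.get("args", {})))
```

In production you would add validation, permissions, and error handling around each tool, since this is the layer where the agent actually touches the outside world.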
All of the above can be designed and built today, and the timing could not be more exciting. We at Integrail are building not just such agents but also a platform to design and build them easily without any programming knowledge. If you want to participate or follow us on this journey, do subscribe to this blog; we are constantly sharing our progress and giving away GenAI Token Credits to our friends :)