Top 12 Papers on Agentic AI Governance (part 2)
Novel impacts, risks and guardrails | Edition #20
Hey đ
Iâm Oliver Patel, author and creator of Enterprise AI Governance.
This free newsletter delivers practical, actionable and timely insights for AI governance professionals.
My goal is simple: to empower you to understand, implement and master AI governance.
For more frequent updates, follow me on LinkedIn.
Welcome to the 20th edition of Enterprise AI Governance! đ
For those that have been here since edition 1 in December 2024, thank you! And to all who have joined more recently, thank you too! We are building an engaged and inspiring community of nearly 5,000 AI governance professionals, and it has been extremely fulfilling for me to share my work with you all. Looking forward to the next 20 editions and beyond! đ
This weekâs newsletter continues the series covering the top papers on agentic AI governance. Given the breadth of excellent work in this emerging field, this will now be a 3-part series instead of a 2-part series, with the final edition dropping next week. I will then be dropping my Agentic AI Governance Breakdown exclusively on Enterprise AI Governance after that.
This post features papers on AI agent hijacking evaluations, what we can learn from the economic theory of principal-agent theory, and what supporting infrastructure is required for agentic AI governance.
Disclaimer: I am not sponsored by any of the below organisations or authors and I receive nothing in return for promoting these papers. All the credit belongs to the original authors and the sources are provided as hyperlinks đ
Top 12 Papers on Agentic AI Governance (part 2)
6. Governing AI Agents
Noam Kolt - January 2025
Jensen Huang, CEO of Nvidia, famously remarked that âthe IT department of every company is going to become the HR department of AI agentsâ. HR policies, processes, and practices are one of the main ways in which human workers are âgovernedâ, âmanagedâ, and âoverseenâ. Similarly, as increasingly autonomous AI agents are performing tasks and delivering work which previously only humans could do, they will also need to be governed and managed, through mechanisms and interventions like technical safety guardrails, action permissions and restrictions, human oversight interfaces, and orchestration layers. Given the technical nature of this governance, the âIT departmentsâ have a critical role to play and can perhaps even learn from their HR colleagues.
This prescient and unique paper from Noam Kolt provides a detailed and theoretically rich exploration of Huangâs predictive analogy. However, it provides a note of caution. Its core argument is that many of the reasons why human âagentsâ behave and perform in a manner which is aligned with the interests of their principals (i.e., employer organisations) will most likely not apply and extend to AI agents. Therefore, the way in which we govern AI agents must differ significantly from how we govern human agents.
Kolt relies on two distinct but related theoretical frameworks for his analysis: 1) the economic theory of principal-agent problems and 2) the common law doctrine of agency relationships.
Principal agent problems are situations where one party (the human principal) recruits another party (the human agent) to perform work on their behalf, but issues and conflicts can arise if the self-interest of the agent clashes with the interest of the principal. Organisations deal with this in various ways, such as via salaries, bonuses, stock options, and other performance incentives. The common law doctrine of agency establishes legal rights and responsibilities for both the agent and principal, to govern and underpin such relationships.
The thrust of Koltâs argument is that there are serious limits as to how much these theoretical frameworks can help us when it comes to designing and implementing agentic AI governance. He argues that âthe conventional mechanisms for addressing agency problems developed by lawyers and economists might not be effective in governing AI agentsâ, due to the fundamentally different nature of human and AI agents.
For example, AI agents do not care about incentives like salaries, bonuses, or stock options, because they do not have self-interest which leads them to seek or prioritise financial gain. Similarly, AI agents will likely not care about enforcement and sanctions like being penalised, suspended, terminated, or even shut down. These core drivers of human behaviour simply do not apply to agents which lack self-interest or human-like motivations.
Therefore, alternative techniques are required to maintain alignment between human interests and agentic AI behaviour. Kolt begins to sketch this out, emphasising the importance of inclusivity (i.e., agents that are designed to promote a broad set of human interests and values), visibility (i.e.,effective tracking and transparency of AI agents), and liability (i.e., making humans responsible and liable for the harmful actions of AI agents).
The bottom line: although the HR and IT analogy is catchy and thought-provoking, if IT departments are really going to become the HR department for AI agents, they have got their work cut out. This is because Kolt has eloquently demonstrated that, due to the fundamental differences between human agents and AI agents, we cannot rely on the same approach to govern these entities. Therefore, those tasked with agentic AI governance will need to think outside of the box and get creative.
7. Strengthening AI Agent Hijacking Evaluations
U.S. AI Safety Institute technical staff - January 2025
Note: the U.S. AI Safety Institute has recently become the Center for AI Standards and Innovation (CAISI), remaining part of NIST. Its mission is to âserve as industryâs primary point of contact within the U.S. government to facilitate testing and collaborative research related to harnessing and securing the potential of commercial AI systemsâ.
In the first part of my series, two papers from OWASP were analysed, both of which outlined security threats and mitigations for agentic and multi-agentic AI systems. One of these threats is âagent hijackingâ, which refers to actors using deceptive, malicious, or harmful prompts and commands to promote damaging actions and shape agentic behaviour in a disruptive manner.
For example, agent hijacking attacks can lead to agents being manipulated and executing harmful or malicious code, carrying out cyber attacks, sending problematic communications, and exfiltrating data. The more powerful agentic capabilities become, the greater the risk if hijacking attacks succeed.
The NIST team explain that agent hijacking is a type of âindirect prompt injection attackâ. In the context of generative AI, a prompt injection attack is when an attacker manipulates the input prompt (or prompts) provided to an LLM in order to influence the output response that the LLM provides. This technique can be used to extract personal or confidential data from the model, or to circumvent its safety and content moderation guardrails. In the context of agentic AI, a similar logic is adopted, but the goal is to shape and influence the actions that the agent performs.
The below image from NIST highlights why AI agents are vulnerable to hijacking attacks. It is because, for agents to function, they must receive both âtrusted developer instructionsâ, provided and hardcoded by those who built the system, as well as âtask-relevant dataâ, provided by users who are prompting or instructing. Crucially, this data is provided to the system in a âunified inputâ. This means that it can sometimes be challenging for an agent to distinguish between legitimate instructions from its developers and malicious inputs provided by a threat actor posing as a âregularâ user.
Image credit: Strengthening AI Agent Hijacking Evaluations, NIST CAISI
The researchers performed a series of evaluations to assess the risk of AI agent hijacking and to determine how susceptible different agents were to such attacks. By sharing malicious instructions with the agents as a small part of larger legitimate instructions, it was possible to track how often these attacks succeed.
One of the key findings is that the success rate of agent hijacking attacks varies significantly depending on the specific task which the attacker is attempting to manipulate the agent into performing. For example, it is much easier to hijack agents to encourage them to send benign emails or execute malicious scripts than it is to manipulate them into sending phishing emails or exfiltrating large amounts of data. Therefore, AI agent hijacking evaluations need to be granular and task specific.
Moreover, multiple attack attempts should be performed for each task, as the attack success rate was found to increase as the number of attempts increases. This is also a more realistic simulation of real-world attackers.
The bottom line: enterprises worldwide are eager to adopt agentic AI and embed it across their business processes and activities. However, these powerful agentic capabilities, which will undoubtedly be transformative for companies, can also be used against you. Itâs crucial to anticipate how malicious actors could attempt to shape and manipulate agentic behaviour, before you entrust your most critical processes and actions to AI.
8. Infrastructure for AI Agents
Alan Chan et al., Centre for the Governance of AI - January 2025
This paper argues that to deploy and govern agentic AI effectively, a new underpinning infrastructure will need to be built and advanced. The âagent infrastructureâ which the authors advocate for the development of is defined as âtechnical systems and shared protocols external to agents that are designed to mediate and influence their interactions with and impacts on their environmentsâ.
âExternal to agentsâ is the key phrase in this definition. This is not about engineered safety features and guardrails that are part of the agentic or multi-agentic system itself. Rather, agent infrastructure refers to the wider ecosystem of systems, tools, and mechanisms that enable us to oversee, manage, and control the performance and behaviour of AI agents at scale.
Three core categories of agent infrastructure are outlined: 1) attribution, 2) interaction, and 3) response. I provide a summary of each category and the associated infrastructure below.
1) Attribution
Purpose: infrastructure which enables us to accurately determine what agents can and cannot do, as well as which actions agents have and have not performed. This includes:
Agent ID, so that each agent has a unique persistent identifier, which means its actions can be traced back to it.
Agent certificate, to serve as the equivalent of an AI model card for AI agents, outlining key information like core capabilities, tool use, data access, and authorised actions.
Identity binding, referring to mechanisms which link agents with legal entities who are responsible and accountable for their actions, such as individual users or organisations.
Image credit: Infrastructure for AI Agents, Centre for the Governance of AI
2) Interaction
Purpose: infrastructure which enables us to manage how agents interact and collaborate with other agents and human users. This includes:
Oversight layers, so that humans have an effective interface to monitor agents, approve high-risk actions, and intervene if necessary.
Inter-agent communication, referring to the suite of mechanisms and protocols for information sharing and coordination between agents, which can be useful in elevating the collective intelligence of agents and their ability to withstand and mitigate threats.
3) Response
Purpose: infrastructure which enables us to detect, respond to, and mitigate problems, incidents, and damaging actions performed by agents. This includes:
Incident reporting, so that agents collect, collate, and share relevant information about incidents, which is then processed and addressed in a timely manner.
Rollbacks, to ensure that certain actions erroneously performed by agents can be reversed or voided, to prevent real-world damage.
The bottom line: whilst most research to date focuses on the impacts and risks of agentic AI, and how these can be mitigated via system design, safety engineering, technical guardrails, and policies, this paper makes a valuable contribution by extending the analysis to the wider ecosystem of external tools, systems, and mechanisms required for effective governance.
9. Fully Autonomous AI Agents Should Not be Developed
Margaret Mitchell et al., Hugging Face - February 2025
There are no prizes for guessing what this paper argues: that fully autonomous AI agents should not be developed. A key theme throughout the literature on agentic AI governance is that with increased autonomy comes increased risk. This is exacerbated by the fact that increased autonomy also entails greater potential economic benefits, which creates strong incentives to reduce human involvement and oversight in a vast array of business activities.
Although the nature of human oversight will have to changeâwith direct human-in-the-loop no longer being feasible for each agentic actionâthe authors posit that some level of human oversight will always be essential, and that âsemi-autonomousâ agents are preferable to fully autonomous agents.
A broad spectrum of agentic autonomy is presented (see image below), with different levels representing how much an agent can do independently of humans.
Image credit: Fully Autonomous AI Agents Should Not be Developed, Hugging Face
The paper also outlines the core risks of agentic AI, covering themes such as performance, accuracy, safety, privacy, security, misuse, and agent hijacking. It argues that, for the majority of risk themes, increased agentic autonomy means increased risk. And that this risk level becomes unacceptable when agents are âfully autonomousâ.
For example, with respect to accuracy risks, if there is no human oversight or control, severe errors stemming from the compounding effect of cascading hallucinations can result in agentic actions which are fundamentally âunaligned with human goalsâ. Similarly, from a safety perspective, given the unpredictable, proactive, and non-deterministic nature of agentic AI, with full autonomy and no human control, an AI agent might design processes or techniques which enables it to bypass critical safety guardrails in pursuit of its goals.
The bottom line: this paper serves as a valuable reminder that although there will be strong economic incentives to adopt increasingly autonomous AI agents, the less control we humans have, the greater the risks.






