Skip to main content

AI Security Auditing Methodology

Release: Version 1.0


Document​

FieldDescription
NameAI Security Auditing Methodology
CreatorsHacken OU
SubjectAI Security; Generative AI; Large Language Models; Small Language Models; Reinforcement Learning; Chatbots; Agentic AI; Cybersecurity Auditing
DescriptionA comprehensive methodology that outlines the planning, execution, and reporting processes for security audits of AI systems, including Language Models (LMs), generative AI applications, chatbots, Model Context Protocols (MCPs), AI agents, RAG systems, and agentic AI. This document serves as a practical guide for security engineers, developers, and stakeholders, providing direction for ensuring the security, resilience, and regulatory compliance of AI technologies.
ContributorStephen Ajayi | Security Lead
DateOctober 28th, 2025
RightsHacken OU
Copying NoticeCopying or reproducing this document without explicit reference to Hacken OU is forbidden.

Introduction​

AI systems, particularly Large Language Models (LLMs), multi-agent systems, and generative AI platforms, introduce novel attack surfaces and unique security challenges.
Traditional security approaches are insufficient for protecting systems that learn, reason, and adapt dynamically.
New threat categories such as prompt injection, excessive agent autonomy, vector database exploitation, and integration abuse have emerged as core risks in the AI ecosystem.

This methodology provides a structured framework for assessing the security of AI systems through adversarial simulation and ethical red teaming.
It is designed to:

  • Uncover hidden weaknesses across AI pipelines and integrations.
  • Ensure regulatory compliance with frameworks such as NIST AI RMF, EU AI Act, and ISO/IEC 27001.
  • Strengthen organizational AI risk posture through continuous testing and structured remediation.

All auditing activities are conducted within a strictly controlled, ethical environment, adhering to the Rules of Engagement (RoE) defined for each assessment.
Each test is traceable, reproducible, and aligned with Hacken’s internal standards of professional integrity.


Executive Summary​

The AI Security Auditing Methodology guides Hacken’s security engineers, auditors, and stakeholders through a rigorous AI Red Teaming process tailored specifically for AI systems, agents, and LLM integrations.

This framework emphasizes:

  • Ethical and controlled simulations of real-world threats.
  • Evaluation of AI system security, availability, and integrity under adversarial conditions.
  • Comprehensive vulnerability discovery and impact analysis across the AI lifecycle — including training, inference, integration, and deployment.
  • Actionable reporting with clear, prioritized remediation guidance.

The outcome of each audit is a detailed, evidence-based understanding of system resilience, allowing stakeholders to make informed security, governance, and compliance decisions.


Scope​

This methodology applies to the following AI system types and components:

  • Large Language Models (LLMs) (e.g., GPT, Claude, Gemini, LLaMA, Mistral)
  • Small Language Models (SLMs) and lightweight on-device inference systems
  • Retrieval-Augmented Generation (RAG) systems and vector database integrations
  • Autonomous and multi-agent frameworks (e.g., AutoGPT, Agentic AI architectures)
  • Generative AI applications (text, image, audio, code generation)
  • Chatbots, copilots, and conversational AI assistants
  • Reinforcement learning agents and tool-using AI systems

These systems, by design, introduce data-driven, probabilistic, and context-dependent behavior, requiring advanced assessment methodologies beyond traditional application security testing.


Alignment with International Standards​

This methodology aligns with globally recognized standards, ensuring audit results are compliant, reproducible, and aligned with evolving AI governance expectations.

Framework / StandardPurpose
NIST AI Risk Management Framework (RMF)Establishes best practices for identifying, managing, and mitigating AI risks.
EU AI Act (2024)Defines legal obligations for AI safety, transparency, and accountability in the EU.
OWASP LLM Top 10 (2023)Highlights the most critical vulnerabilities affecting LLM-powered applications.
MITRE ATLASCatalogs adversarial threat tactics, techniques, and mitigations specific to AI systems.
ISO/IEC 27001 & 42001International standards for information security and AI management systems.

By aligning with these standards, Hacken ensures every audit engagement upholds global compliance, ethical integrity, and technical excellence.


Glossary​

TermDefinition
AI AuditorA security professional responsible for evaluating the security posture of AI systems through testing, simulation, and risk analysis.
Control TeamThe group within Hacken or the client organization responsible for overseeing the AI audit’s scope, logistics, and compliance adherence.
Blue TeamThe organization’s defensive team tasked with detecting, responding to, and mitigating incidents during or after the audit.
Rules of Engagement (RoE)A formal document outlining the permissible activities, boundaries, and limitations during an AI security audit.
Prompt InjectionA technique where an attacker crafts malicious input to manipulate or subvert the intended behavior of an AI model.
Model InversionA method used by adversaries to extract or infer sensitive information from a trained AI model.
Data PoisoningThe process of injecting malicious or manipulated data into a training set to bias or compromise model behavior.

Importance of AI Security Auditing​

As AI systems play an increasingly critical role across industries — from healthcare and finance to autonomous systems and customer service — ensuring their security has become a top priority. Malicious actors can exploit vulnerabilities in AI models, leading to data breaches, biased outcomes, or even system failures.

Proactive and thorough AI security audits help organizations:

  • Identify AI-Specific Vulnerabilities — Detect weaknesses unique to AI, such as adversarial attacks, model poisoning, data leakage, insecure agents, and prompt injection.
  • Ensure Regulatory Compliance — Align with evolving standards (e.g., EU AI Act, NIST AI RMF, ISO/IEC 27001/42001) to avoid legal and financial penalties.
  • Safeguard Sensitive Data — Protect the integrity and confidentiality of data processed by AI models (training, fine-tuning, inference, telemetry).
  • Build Trust in AI Applications — Demonstrate robust security practices to users, partners, and regulators, increasing confidence and adoption.

By auditing AI systems proactively, organizations not only prevent costly incidents but also foster responsible and ethical AI deployment.


AI Security Auditing Process​

The AI Security Audit operations are organized into three phases:

  1. Planning & Pre-Engagement
  2. Execution of Security Audit
  3. Post-Engagement Reporting & Follow-up

Each phase produces concrete artifacts to ensure traceability, repeatability, and measurable outcomes.


1) Planning & Pre-Engagement​

This phase establishes the audit foundation by defining scope, success criteria, and constraints, and by preparing environments, data, and access.

Key activities

  • System Identification
    Document AI system architecture and components: models, datasets, data pipelines, inference services, RAG/vector DBs, agent frameworks/tools, plugins, MCPs, APIs, and data flows (internal/external).

  • Risk Assessment
    Evaluate potential business and technical impacts across confidentiality, integrity, availability, safety, fairness, compliance, and model/IP theft. Prioritize by likelihood and impact.

  • Audit Planning
    Define objectives, in-scope assets, out-of-scope boundaries, attack classes, test depth, environments (dev/stage), KPIs/OKRs, timelines, and tooling.

  • Legal & Compliance Review
    Validate Rules of Engagement (RoE), privacy requirements, data handling constraints, and alignment with applicable frameworks (e.g., EU AI Act risk class, NIST AI RMF functions, ISO/IEC controls).

Inputs required from client

  • High-level/low-level architecture, data flow diagrams, model cards, model release notes.
  • Access to staging environments, test accounts, seeded vector DBs, and synthetic/test datasets.
  • List of third-party providers (LLM APIs, embeddings, observability, storage).
  • Security/governance policies relevant to AI (prompt handling, PII, retention, telemetry).

Deliverables

  • Engagement Brief & RoE (scope, constraints, contacts, comms plan)
  • Audit Plan (methods, tools, timelines, success criteria)
  • Threat Hypotheses (initial scenarios and attack paths tailored to the system)

2) Execution of Security Audit​

Hands-on assessments validate assumptions, uncover vulnerabilities, and measure resilience under realistic adversarial conditions.

Activities

  • Reconnaissance
    Enumerate inputs/outputs, prompts/system prompts, tools, API surfaces, auth flows, model versions, agents’ permissions, retrieval sources, and data lineage.

  • Vulnerability Assessment
    Test for AI-specific and traditional weaknesses, including:

    • Prompt injection/jailbreaks and instruction-hierarchy bypass
    • Data leakage (PII, secrets) and training data extraction/model inversion
    • RAG/routing failures, retrieval poisoning, vector DB abuse
    • Tool/agent over-privilege, insufficient guardrails, unsafe tool execution
    • Supply-chain risk (models, plugins, datasets), insecure API/authN/authZ
    • Adversarial examples, content policy evasion, output integrity issues
  • Exploitation (Controlled)
    Safely reproduce impactful chains to validate exploitability, quantify risk, and collect forensic evidence (requests/responses, traces, logs, artifacts).

  • Impact Analysis
    Map findings to business impact: data exfiltration, unauthorized actions by agents, reputational harm, compliance exposure, safety risks, operational disruption.

Telemetry & evidence

  • Prompt/response transcripts, sanitized logs, trace IDs, vector queries, retrieved chunks.
  • Configuration snapshots (redacted) of guardrails, gateways, filters, and policies.
  • Proof-of-concept artifacts demonstrating exploit conditions and boundaries.

Deliverables

  • Daily/Interim Updates (running findings, evidence snapshots)
  • Exploit Narratives (what an attacker can achieve and how)
  • Impact Matrix (affected assets, users, and controls)

3) Post-Engagement Reporting & Follow-up​

This phase consolidates evidence, delivers a clear remediation path, and verifies improvements.

Activities

  • Reporting
    Produce comprehensive documentation covering:

    • Finding titles, descriptions, affected components
    • Evidence & reproduction steps (sanitized where required)
    • Severity (risk rating) and regulatory implications
    • CWE/OWASP LLM Top 10/NIST AI RMF/ISO mappings
  • Recommendations
    Prioritized, actionable guidance: prompt/guardrail changes, policy updates, tool/agent permission minimization, RAG hardening, input/output filtering, monitoring, and SDLC integrations.

  • Debriefing
    Present results to technical and executive stakeholders; align on remediation owners, timelines, and verification plan.

  • Follow-Up
    Support remediation validation and optional Verification/Patch Audit to confirm issues are resolved and controls are effective.

Deliverables

  • Final Report (executive summary + technical appendix)
  • Remediation Plan & Tracker (priorities, owners, SLAs)
  • Verification Report (post-fix validation, residual risk)

AI Security Tools and Techniques​

The following tools and techniques are recommended for conducting effective AI security assessments, combining open-source frameworks, proprietary testing utilities, and structured evaluation methodologies.
These tools enable systematic detection of vulnerabilities, simulation of adversarial behavior, and validation of defense mechanisms across AI pipelines.


Vulnerability Scanners and Red Teaming Frameworks​

ToolDescription
GiskardA Python-based security and QA library for detecting performance, bias, and security flaws in AI systems. It automatically identifies misbehavior in both models and pipelines.
PyRITCreated by Microsoft’s AI Red Team, PyRIT automates adversarial testing against LLM applications, assessing robustness against harm categories such as misinformation, leakage, and abuse.
LLMFuzzerA fuzzing framework built specifically for LLM APIs, designed to stress-test integration points and detect vulnerabilities through randomized prompt fuzzing.
VigilA security evaluation library for AI systems that analyzes prompt–response pairs to detect injection attempts, jailbreaks, or unsafe outputs.
Adversarial Robustness Toolbox (ART)Maintained by the Linux AI & Data Foundation, ART evaluates and strengthens models against evasion, poisoning, inference, and extraction attacks.
AgentDojoA testing framework for agentic AI systems that execute tools over untrusted data. It enables simulation of complex agent behaviors, adaptive attacks, and defenses.
Agent Security Bench (ASB)A benchmark framework for formalizing, testing, and evaluating attacks and defenses for LLM-based agents in multi-step, multi-actor scenarios.
GarakDeveloped by NVIDIA, Garak scans LLMs for vulnerabilities such as prompt injection, data leakage, hallucinations, and jailbreaks. It functions like a traditional vulnerability scanner but is tailored for LLMs and AI agents.
PromptmapAn open-source framework that automates prompt injection attacks on GPT-style applications. It supports multiple model architectures and is used to uncover weaknesses in system and developer prompts.

Attack Vectors and Scenarios​

Comprehensive AI security audits must evaluate digital, human, and physical attack surfaces.
Modern AI ecosystems introduce multi-layered, dynamic, and interconnected components — including agents, tool execution environments, and Model Context Protocols (MCPs) — that require targeted testing approaches.


Digital Attack Vectors​

CategoryDescription
Prompt Injection and JailbreaksManipulation of model instructions, system prompts, or embedded directives to override safety mechanisms, execute hidden instructions, or exfiltrate confidential data.
Model InversionExtraction of sensitive or proprietary information from trained model parameters, gradients, or generated outputs.
Data PoisoningInsertion of malicious or manipulated data during training or fine-tuning to bias, corrupt, or subvert model behavior.
Unauthorized API and MCP AccessExploitation of unsecured API keys, tokens, or Model Context Protocols (MCPs) to gain unauthorized control over agent communication or external system integrations.
Adversarial InputsCrafting of malicious data designed to confuse, crash, or bypass AI model logic, resulting in denial of service or output manipulation.
RAG ExploitationAttacks on Retrieval-Augmented Generation (RAG) systems through poisoning, injection, or compromise of vector databases and external knowledge stores.
Tool Abuse (Agent Exploitation)Coercing AI agents to perform unintended or malicious actions — such as file modification, system command execution, or sensitive data retrieval — by abusing agent tool-use APIs or weak validation layers.
Agent-to-Agent ManipulationCross-agent interference in multi-agent systems, where one compromised agent influences others through shared memory, vector stores, or message-passing protocols.
Context Injection via MCPsManipulating MCP session contexts to inject rogue instructions, override context boundaries, or exfiltrate chain-of-thought data from agent orchestration frameworks.
Prompt Leakage via Shared ContextsExtraction of hidden system prompts or internal reasoning data when multiple agents or MCP clients share context memory or session histories.
Supply Chain CompromiseTampering with third-party models, datasets, embeddings, or open-source agent frameworks (e.g., LangChain, AutoGPT, CrewAI) to introduce backdoors or unsafe dependencies.
Session HijackingIntercepting or manipulating long-lived conversational or MCP sessions to impersonate users or persist malicious context state.

Agent-Specific Testing Considerations​

Agentic systems, where AI models autonomously use tools, APIs, or other agents, require dedicated testing methodologies.
The following areas must be examined during Agent Security Assessments:

  • Tool Invocation Validation – Ensure the agent cannot invoke arbitrary commands or system tools without user authorization or contextual verification.
  • Command Injection and Escalation – Attempt to coerce the agent into executing privileged or harmful commands (e.g., file modification, API key exposure).
  • MCP Context Isolation – Verify that Model Context Protocol channels and session boundaries are enforced, preventing cross-context data leakage or unauthorized memory persistence.
  • Delegation Safety – Test multi-agent frameworks for misconfigured delegation (agents granting tools or permissions to others without restriction).
  • Memory and Vector Store Hardening – Validate encryption, retention, and sanitization of stored embeddings and agent memory.
  • Recursive Execution Limits – Ensure recursion and chain-of-thought continuation are bounded to prevent runaway operations or infinite self-calls.
  • Human Oversight and Kill Switches – Confirm agents include deterministic interruption and override mechanisms during unsafe tool execution.

Human Attack Vectors​

CategoryDescription
Social Engineering via AI InterfacesManipulating human operators, developers, or users through AI-generated content or malicious prompt injection delivered in chat or support interfaces.
Phishing through Conversational AIUsing generative models to mimic trusted personnel, brands, or systems to extract credentials, tokens, or sensitive data.
Insider ManipulationAbuse of developer access, prompt logs, monitoring dashboards, or telemetry systems to extract model internals or sensitive client data.
Prompt-Based Psychological ManipulationLeveraging context-aware conversational systems to persuade or coerce human users into bypassing safety protocols.
Human-in-the-Loop (HITL) AbuseExploiting weak review or reinforcement learning feedback loops (RLHF/RLAIF) to manipulate model reinforcement patterns and outputs.

Physical Attack Vectors​

CategoryDescription
Infrastructure AccessGaining unauthorized access to servers, model weights, or vector databases hosting AI applications or MCP endpoints.
Device CompromisePhysical manipulation or tampering with edge devices, local inference hardware (GPUs, TPUs), or AI-enabled IoT components.
Side-Channel AttacksExploiting electromagnetic, power, timing, or cache-based signals to infer model parameters or input data.
Hardware Supply Chain TamperingCompromising embedded AI accelerators, firmware, or chipsets used for model inference.
Environmental InterferenceDisrupting sensors or edge agents that feed multimodal inputs (audio, visual, sensor data) to induce erroneous AI behavior.

Integration Risk Hotspots​

When testing complex AI ecosystems, auditors must pay particular attention to integration and orchestration boundaries, including:

  • Model Context Protocols (MCPs) – Verify authentication, authorization, and encryption of MCP sessions; test for prompt leakage, session persistence abuse, and unbounded memory sharing.
  • Agent Toolchains – Validate that only pre-approved tools and APIs can be executed; inspect sandboxing and scope isolation.
  • Vector Databases – Test for poisoning, embedding manipulation, or malicious content retrieval.
  • API Gateways and Plugins – Ensure strict input validation, authentication, and content filtering between AI applications and third-party services.
  • Cross-Model Messaging – Assess risk of data leakage or trust violations in environments where multiple models (text, vision, code) share communication channels.

Each audit engagement should classify attack scenarios by impact domain (data, system, user, compliance) and vector type to guide prioritization and defense planning.


Issue Severity and Risk Definition​

Each identified issue is rated according to its impact, likelihood, and exploitability.
Severity levels are aligned with CVSS v4.0 scoring principles, extended with AI-specific criteria such as model manipulation, data exposure, or policy evasion.

Severity Levels​

SeverityDescription
CriticalVulnerabilities enabling full system compromise, remote code execution, or immediate data breach. Requires urgent remediation.
HighVulnerabilities that pose significant risk, potentially requiring chained exploits or specific conditions. Should be addressed promptly.
MediumModerate-risk vulnerabilities that may lead to exploitation when combined with other issues. Address within a reasonable timeframe.
LowMinor issues or best-practice deviations. Typically non-exploitable directly but may inform code or design improvements.

Issue Lifecycle States​

Each finding progresses through a structured lifecycle to promote accountability and transparent remediation tracking.

StatusDefinition
NewThe issue has been recently identified and awaits triage or validation.
ReportedThe issue has been formally reported but remains unresolved. Client has been notified of potential impact.
FixedThe issue has been remediated according to auditor recommendations and verified as resolved.
AcknowledgedThe client recognizes the issue but has chosen not to remediate it (accepted risk or design decision).
MitigatedPartial remediation or compensating controls have reduced the impact but not fully eliminated the vulnerability.

Findings and Documentation​

All identified vulnerabilities and weaknesses discovered during the AI Security Audit will be meticulously documented to ensure clarity, reproducibility, and accountability.
Each issue entry must include technical detail sufficient for engineering, compliance, and management audiences.

Finding Structure​

FieldDescription
Issue Title & DescriptionA clear, concise summary of the vulnerability, including where and how it was discovered.
Severity LevelClassification of the vulnerability’s criticality (Critical, High, Medium, Low) as determined by the CVSS v4.0 scoring model and contextual AI risk factors.
Proof of Concept (PoC)Evidence supporting the finding — such as logs, screenshots, data traces, model responses, or reconstructed exploits — demonstrating practical exploitability.
Impact AnalysisExplanation of potential business, operational, or reputational impacts if the vulnerability is exploited.
RecommendationsSpecific, actionable remediation steps to eliminate or mitigate the vulnerability. Should include best practices, configuration examples, or defensive tooling guidance.
Common Weakness Enumeration (CWE)Mapping to relevant CWE entries for standardized classification and knowledge base reference.
ReferencesSupporting documentation, such as NIST, OWASP LLM Top 10, ISO/IEC, or internal policy alignment references.

Each issue is logged and tracked through the Issue Lifecycle described in this methodology (New → Reported → Fixed / Acknowledged / Mitigated).


Limitations​

While this methodology provides a comprehensive framework for AI system auditing, it does not guarantee the identification of all potential vulnerabilities or attack scenarios.

Limitations include:

  • Scope Limitations – Only the systems, environments, and components explicitly in-scope during the engagement are tested.
  • Time Constraints – AI models and integrations evolve rapidly; point-in-time audits may not reflect future system states.
  • Technology Maturity – Many AI security tools and frameworks are still in early development phases and may not detect novel or emerging threats.
  • Emerging Threats – AI is a fast-moving domain, and new exploit methods may arise after the audit’s conclusion.

⚠️ Continuous monitoring, periodic reassessment, and adoption of multi-layered AI security controls are essential to maintain a resilient security posture.


Appendix​

A. List of Tools for AI Security Auditing​

Below is a categorized list of recommended tools supporting AI auditing, red teaming, monitoring, and governance workflows.

1. Reconnaissance and Information Gathering​

ToolPurpose
GPTFuzzPerforms fuzz testing on LLMs to uncover unexpected behaviors and instability.
RebuffAutomates prompt injection and jailbreak testing through multi-layered analysis.
Microsoft PyRITA red-teaming and fuzzing toolkit for LLM-based applications and integrations.
SpiderFootOpen-source OSINT automation for discovering external data and intelligence about AI systems.
TheHarvesterGathers emails, domains, and subdomain data for reconnaissance related to AI system operators and infrastructure.

2. Vulnerability Scanning and Exploitation​

ToolPurpose
LLM Security ScannerScans for known vulnerabilities and misconfigurations in LLM-based applications.
GiskardDetects robustness, bias, and security flaws in AI systems.
TruLensEvaluates LLM outputs, tracing performance, bias, and behavioral deviations.
NmapIdentifies exposed network surfaces and connected infrastructure vulnerabilities.
Burp SuiteAssesses security of APIs interfacing with AI systems, especially model endpoints.

3. Input Validation and Sanitization​

ToolPurpose
PresidioMicrosoft’s open-source framework for detecting, anonymizing, and redacting PII in input data.
CleantextA lightweight library for normalizing and sanitizing textual input before model inference.

4. Output Filtering and Monitoring​

ToolPurpose
LangSmithProvides observability and traceability for LLM-driven applications.
HeliconeEnables real-time logging, auditing, and monitoring of LLM usage.
Weights & BiasesTracks experiments, logs model behavior, and supports continuous auditing of ML pipelines.

5. Security Control and Governance Frameworks​

ToolPurpose
LlamaGuardProvides safety filtering for LLM inputs and outputs to enforce guardrail policies.
Guardrails AIValidates structured model outputs and enforces schema compliance during generation.
NeMo-GuardrailsNVIDIA’s framework for defining safe conversational boundaries and enforcing responsible AI behavior.

6. Social Engineering and Adversarial Simulation​

ToolPurpose
Social Engineer Toolkit (SET)Automates phishing and social engineering simulations targeting human–AI interaction workflows.
GoPhishExecutes controlled phishing campaigns to test awareness and training effectiveness among AI system operators and developers.

B. AI Threat Modeling Framework​

The AI Threat Modeling Process for generative and agentic AI systems draws from industry standards including MITRE ATLAS, NIST AI RMF, and OWASP LLM Top 10.

1. Reconnaissance​

  • Identify all system components, data flows, and dependencies.
  • Document APIs, plugins, external data sources, human interfaces, and operational environments.
  • Map data provenance, model lineage, and hosting infrastructure.

2. Attack Surface Enumeration​

  • Analyze both external and internal interfaces exposed to users, developers, and third-party systems.
  • Identify adversarial input vectors such as prompt injection, model inversion, and vector store manipulation.
  • Classify entry points based on accessibility and privilege level.

3. Threat Scenario Development​

  • Develop threat models reflecting attacker motivations, capabilities, and objectives.
  • Simulate scenarios such as:
    • Data leakage and exfiltration
    • Model extraction and inversion
    • Prompt manipulation and jailbreaks
    • Supply chain compromise (datasets, models, or APIs)
    • Vector DB poisoning and RAG manipulation

4. Risk Analysis​

  • Estimate likelihood and impact for each identified threat.
  • Evaluate business, regulatory, financial, and operational consequences.
  • Prioritize based on combined risk scores and risk appetite thresholds.

5. Security Control Mapping​

  • Identify existing security controls and gaps in AI governance.
  • Map findings and recommendations to:
    • NIST AI RMF (Functions: Govern, Map, Measure, Manage)
    • OWASP LLM Top 10 categories
    • ISO/IEC 27001 and ISO/IEC 42001 controls

6. Validation and Mitigation Testing​

  • Conduct red team simulations and adversarial stress tests against defined threat models.
  • Validate detection, response, and mitigation effectiveness.
  • Perform regression testing after fixes to verify sustained protection.

Stay in Touch​

We’re excited to share our expertise and help you build a safer web3 future. If you have any questions, feel free to contact us.


End of Document
AI Security Auditing Methodology — Hacken OU