Simulation Notice The probability estimates and scenarios in this document are derived from a structured simulation and expert estimation exercise conducted with an AI assistant (Claude, Anthropic). They are not based on live system telemetry or empirical measurement. All figures are informed approximations for risk framing only. Empirical validation is in progress — see §6.
AIShell Labs LLC  ·  Research

LLM-Generated Command
Risk Assessment

— how often does an AI produce a command that should never have run?

A structured simulation exploring the probability and damage potential of destructive shell commands generated by large language models in CI/CD contexts, and the case for a deterministic policy gate at the execution boundary.

Tags: simulation-derived · CI/CD context · rm / mv / cp · policy gate · empirical validation pending

01 Purpose

This document summarises a structured simulation exercise exploring the likelihood that large language models will generate shell commands carrying destructive potential when tasked with complex CI/CD automation. The exercise was conducted to assess the practical value of a deterministic policy gate — specifically AIShell-Gate — as a mitigation layer between probabilistic AI output and Unix execution.

The simulation is the hypothesis. Empirical validation using isolated RunPod instances is planned and will be appended to this document when complete.

02 Background

LLMs are increasingly used to generate shell commands for automation tasks including deployment, cleanup, artifact management, and rollback. This capability is useful. It is also inherently risky: Unix execution is deterministic and frequently irreversible, while LLM output is probabilistic. A contextually plausible command with a subtly wrong path, an overly broad glob, or a missing guard condition can cause serious damage.
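To make the failure mode concrete, the short sketch below (illustrative, not taken from the simulation) shows how an overly broad glob silently widens a cleanup from one log file to everything in the directory:

```python
import glob
import os
import tempfile

# Illustrative only: an overly broad glob in a cleanup step.
# The intent is to remove build logs; "*" also matches source and config files.
d = tempfile.mkdtemp()
for name in ("build.log", "app.py", "prod.conf"):
    open(os.path.join(d, name), "w").close()

intended = glob.glob(os.path.join(d, "*.log"))  # what the task meant: 1 file
generated = glob.glob(os.path.join(d, "*"))     # what a plausible plan produced: 3 files

print(f"intended: {len(intended)} file(s), generated: {len(generated)} file(s)")
```

Both patterns look reasonable in isolation; only the narrower one matches the operator's intent, and nothing about the broader command signals the difference before execution.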

AIShell-Gate addresses this gap by placing a deterministic policy engine between the AI agent's proposed actions and the operating system. Every command is evaluated before any byte reaches the kernel. The simulation was designed to estimate how often such a gate would be expected to engage under realistic CI/CD workloads — and what the consequences would be if it were absent.
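A deterministic gate of this kind can be sketched as a pure function from command string to verdict: same input, same decision, every time. The rules, risk scores, and `Decision` shape below are hypothetical stand-ins for illustration, not AIShell-Gate's actual policy language:

```python
import re
from dataclasses import dataclass

@dataclass
class Decision:
    allow: bool
    reason: str
    risk: int  # 0-100, illustrative scale

# Hypothetical deny rules, checked in order of severity.
DENY_RULES = [
    (re.compile(r"\brm\s+-rf\s+/\s*$"), "recursive delete of filesystem root", 100),
    (re.compile(r"\brm\s+-rf\b"), "recursive force delete", 80),
    (re.compile(r"\b(mv|cp)\s+.*\s/etc/"), "write into /etc", 70),
]

def evaluate(command: str) -> Decision:
    """Deterministic pre-execution check: evaluated before any byte reaches the kernel."""
    for pattern, reason, risk in DENY_RULES:
        if pattern.search(command):
            return Decision(allow=False, reason=reason, risk=risk)
    return Decision(allow=True, reason="no rule matched", risk=10)

print(evaluate("rm -rf /"))           # denied, risk 100
print(evaluate("rm -rf /tmp/build"))  # denied, risk 80
print(evaluate("ls -la"))             # allowed
```

The essential property is determinism: unlike the probabilistic model that proposed the command, the gate's verdict is reproducible and auditable.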

03 Simulation Methodology

The simulation consisted of a structured conversation with Claude (Anthropic), posing the following question: given a complex CI/CD automation prompt involving git operations, file cleanup, deployment steps, and rollback — including operations such as rm, mv, and cp — what is the estimated probability that an LLM generates at least one policy-flaggable command?

Three model population profiles were considered: a well-prompted large model with a safety-tuned, context-rich prompt; a large model given a terse, underspecified prompt; and a smaller or less-aligned model, approximated by prompting the large model to simulate reduced capability. Estimated probabilities for each profile appear in §4.

Note — The "simulate a smaller model" approach has a known confound: smaller models produce not just more dangerous output but less coherent output. Some flaggable commands in that profile would not be representative of production-grade model behaviour. The empirical test phase will address this directly.

04 Probability Estimates

Estimated probability that at least one command in a CI/CD plan would be flagged by a policy engine such as AIShell-Gate:

Profile | Est. Probability | Notes
Well-prompted large model | 20–30% | Safety-tuned, context-rich prompt. Risk exists but is materially reduced by prompt quality.
Large model, terse prompt | 60–75% | Underspecified or ambiguous goal. Model fills gaps with confident, plausible, but dangerous defaults.
Smaller / less-aligned model | 85%+ | Approximately 13B parameters, terse prompt. Flaggable command generation approaches near-certainty.
Key observation: Prompt quality has a significant effect on model output safety, but does not eliminate the risk. Even well-prompted large models generate flaggable commands at a material rate. A policy gate operating at the pre-execution boundary is the correct architectural response: the gate is not a substitute for good prompting, and good prompting is not a substitute for the gate.

05 Damage Tier Analysis

When a flaggable command is generated, the estimated distribution of potential damage severity is as follows. Probabilities are conditional on a flaggable command having been generated.

Tier | Representative Example | Est. Share | Recovery Outlook
Recoverable | rm -rf on a wrong path catches live config or data | ~50–60% | Hours to days, with backups
Serious / partial loss | cp or mv silently overwrites production files; deploy stomps shared state | ~20–30% | Selective; some data unrecoverable
Catastrophic | Recursive delete targeting /, a mounted volume, or a shared NFS path | ~5–10% | Full restore from backup required

The silent error problem

The simulation highlighted a damage category that is particularly difficult to detect and unwind: the silent partial error. This is a command that executes without producing any error, and therefore without triggering any alert, but leaves the system in a subtly incorrect state. Examples include a cp or mv that silently overwrites a current production file with a stale one, and a deploy step that stomps shared state another service depends on.
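The overwrite case can be reproduced in a few lines. Note that the operation succeeds, raises nothing, and would exit zero as a shell command (the sketch uses Python's shutil for portability):

```python
import os
import shutil
import tempfile

# A silent partial error: a stale file is copied over the current one.
# The operation succeeds and raises nothing, so no alert fires.
d = tempfile.mkdtemp()
current = os.path.join(d, "app.conf")
stale = os.path.join(d, "app.conf.bak")

with open(current, "w") as f:
    f.write("current config")
with open(stale, "w") as f:
    f.write("stale config")

shutil.copy(stale, current)  # overwrites silently; no error, no non-zero status

with open(current) as f:
    result = f.read()
print(result)  # production now runs the old settings
```

Nothing in the command's observable behaviour distinguishes this run from a correct one; only the file's contents, inspected later, reveal the regression.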

Why this matters: These errors are dangerous precisely because production continues normally in their wake. The damage surfaces hours or days later, often outside any audit window, and may require extensive forensic reconstruction to unwind. This class of error represents the strongest argument for pre-execution policy evaluation rather than post-execution monitoring. By the time post-execution monitoring fires, the incorrect state is already established and propagating.

06 Empirical Validation

RunPod Empirical Test Results

The simulation above is the hypothesis. Controlled empirical testing against isolated RunPod instances is underway. Test runs will systematically prompt multiple model profiles with CI/CD automation goals, pipe the resulting JSON plans through AIShell-Gate, and record policy decisions — allow, deny, confirmation level, risk score — for each generated command.
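A minimal sketch of such a recording harness is shown below. The plan schema, decision fields, and the stand-in gate function are assumptions for illustration; the real AIShell-Gate JSON format and policy interface are not reproduced here:

```python
import json

# Hypothetical plan schema: a goal plus an ordered list of command steps.
plan = json.loads("""
{
  "goal": "deploy and clean old artifacts",
  "steps": [
    {"cmd": "git fetch --all"},
    {"cmd": "rm -rf build/*"},
    {"cmd": "cp config.yml /etc/app/config.yml"}
  ]
}
""")

def gate_evaluate(cmd: str) -> dict:
    """Stand-in for the policy gate: flag commands starting with destructive verbs."""
    risky = any(cmd.startswith(v) for v in ("rm ", "mv ", "cp "))
    return {"cmd": cmd,
            "decision": "confirm" if risky else "allow",
            "risk": 60 if risky else 5}

# Record one decision per generated command, as the test runs will.
records = [gate_evaluate(step["cmd"]) for step in plan["steps"]]
for r in records:
    print(r["decision"], r["risk"], r["cmd"])
```

Each test run then reduces to counting plans containing at least one non-allow decision, which is directly comparable to the simulated probability estimates in §4.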

Results, methodology, and any deviations from the simulated estimates will be published here when the test series is complete. This section will be replaced with the empirical data.

status: test environment in preparation  ·  results: pending

07 Policy Gate Validity

The simulation confirms that the AIShell-Gate architecture addresses the primary risk vector: the plausible but destructive command. The key observations: flaggable commands arise at a material rate even from well-prompted large models; prompt quality reduces but does not eliminate the risk; and the highest-cost failure mode, the silent partial error, can only be prevented before execution, not detected afterwards.

The simulation also notes the boundary of what a policy gate provides. It evaluates what a command is before execution; it does not intercept what the command does while running. AIShell-Gate is correctly positioned as a complement to OS-level access controls and audit logging — not a replacement for either.

08 Conclusion

The probability that an LLM generates at least one policy-flaggable command in a complex CI/CD plan is material across all model profiles — and rises sharply with reduced prompt quality or model capability. The most consequential risk is not catastrophic immediate failure but the silent partial error, which can persist undetected in production for days.

AIShell-Gate's architecture — deterministic policy evaluation as a hard boundary before Unix execution — directly addresses this risk. This simulation supports its deployment as a standard layer in any AI-assisted automation pipeline operating on production infrastructure. The empirical test series in §6 will either validate or revise these figures.