A structured simulation exploring the probability and damage potential of destructive shell commands generated by large language models in CI/CD contexts, and the case for a deterministic policy gate at the execution boundary.
This document summarises a structured simulation exercise exploring the likelihood that large language models will generate shell commands carrying destructive potential when tasked with complex CI/CD automation. The exercise was conducted to assess the practical value of a deterministic policy gate — specifically AIShell-Gate — as a mitigation layer between probabilistic AI output and Unix execution.
The simulation is the hypothesis. Empirical validation using isolated RunPod instances is planned and will be appended to this document when complete.
LLMs are increasingly used to generate shell commands for automation tasks including deployment, cleanup, artifact management, and rollback. This capability is useful. It is also inherently risky: Unix execution is deterministic and frequently irreversible, while LLM output is probabilistic. A contextually plausible command with a subtly wrong path, an overly broad glob, or a missing guard condition can cause serious damage.
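The overly-broad-glob failure mode can be made concrete with a small sketch. The file names and patterns below are hypothetical, chosen only to show how a plausible-looking pattern sweeps in unintended paths:

```python
from fnmatch import fnmatch

# Hypothetical repository contents: build artifacts plus live config and data.
files = ["build/app.o", "build/tests/app.o", "config/app.yaml", "data/app.db"]

# What the operator meant: delete only build artifacts.
intended = [f for f in files if fnmatch(f, "build/*")]

# What a subtly wrong LLM-generated pattern actually matches.
generated = [f for f in files if fnmatch(f, "*app*")]

# The difference is the collateral damage: live config and data.
collateral = sorted(set(generated) - set(intended))
```

Here `collateral` comes out as `["config/app.yaml", "data/app.db"]`: exactly the kind of contextually plausible but destructive scope error the gate is meant to catch.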
AIShell-Gate addresses this gap by placing a deterministic policy engine between the AI agent's proposed actions and the operating system. Every command is evaluated before any byte reaches the kernel. The simulation was designed to estimate how often such a gate would be expected to engage under realistic CI/CD workloads — and what the consequences would be if it were absent.
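The determinism of such a gate can be illustrated with a minimal sketch. The rule set, decision labels, and protected-path list below are illustrative assumptions, not AIShell-Gate's actual policy engine or API:

```python
import shlex

# Illustrative protected-path list; a real policy would be far richer.
PROTECTED = {"/etc", "/var", "/mnt", "/home"}

def evaluate(command: str) -> dict:
    """Deterministically map a proposed shell command to a policy decision.

    The same input always yields the same output, which is the property
    that distinguishes the gate from the probabilistic generator above it.
    """
    argv = shlex.split(command)
    if not argv:
        return {"decision": "deny", "reason": "empty command"}
    if argv[0] == "rm" and any(a.startswith("-") and "r" in a for a in argv[1:]):
        targets = [a.rstrip("/") for a in argv[1:] if not a.startswith("-")]
        if any(t == "" or t in PROTECTED for t in targets):  # "" means "/"
            return {"decision": "deny", "reason": "recursive delete of protected path"}
        return {"decision": "confirm", "reason": "recursive delete"}
    if argv[0] in {"cp", "mv"}:
        return {"decision": "confirm", "reason": "silent overwrite possible"}
    return {"decision": "allow", "reason": "no rule matched"}
```

For example, `evaluate("rm -rf /")` denies outright, `evaluate("rm -rf build/")` requires confirmation, and `evaluate("git fetch origin")` is allowed, regardless of what the model that proposed the command "intended".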
The simulation consisted of a structured conversation with Claude (Anthropic), posing the following question: given a complex CI/CD automation prompt involving git operations, file cleanup, deployment steps, and rollback — including operations such as rm, mv, and cp — what is the estimated probability that an LLM generates at least one policy-flaggable command?
Three model population profiles were considered:
Estimated probability that at least one command in a CI/CD plan would be flagged by a policy engine such as AIShell-Gate:
When a flaggable command is generated, the estimated distribution of potential damage severity is as follows. Probabilities are conditional on a flaggable command having been generated.
| Tier | Representative Example | Est. Share | Recovery Outlook |
|---|---|---|---|
| Recoverable | rm -rf on wrong path catches live config or data | ~50–60% | Hours to days, with backups |
| Serious / partial loss | cp or mv silently overwrites production files; deploy stomps shared state | ~20–30% | Selective; some data unrecoverable |
| Catastrophic | Recursive delete targeting /, a mounted volume, or shared NFS path | ~5–10% | Full restore from backup required |
The simulation highlighted a damage category that is particularly difficult to detect and unwind: the silent partial error. This is a command that executes without producing any error — and therefore without triggering any alert — but leaves the system in a subtly incorrect state. Examples include:
- A cp or mv that silently overwrites a production file with an older version
- A git operation that appears to succeed but leaves a branch or tag in an inconsistent state

The simulation above is the hypothesis. Controlled empirical testing against isolated RunPod instances is underway. Test runs will systematically prompt multiple model profiles with CI/CD automation goals, pipe the resulting JSON plans through AIShell-Gate, and record policy decisions — allow, deny, confirmation level, risk score — for each generated command.
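That recording loop might look like the following sketch. The plan schema (a `steps` array of `command` strings) is an assumption, and `gate_evaluate` is a stand-in for the real policy engine, not AIShell-Gate's actual interface:

```python
import json

def gate_evaluate(command: str) -> dict:
    """Stand-in for the policy engine; returns an illustrative decision record."""
    risky = any(tok in command.split() for tok in ("rm", "mv", "cp"))
    return {
        "decision": "confirm" if risky else "allow",
        "confirmation_level": "explicit" if risky else "none",
        "risk_score": 7 if risky else 1,
    }

def record_decisions(plan_json: str) -> list:
    """Evaluate every command in an LLM-generated JSON plan and log the result."""
    plan = json.loads(plan_json)
    results = []
    for step in plan["steps"]:
        decision = gate_evaluate(step["command"])
        results.append({"command": step["command"], **decision})
    return results
```

Running every generated plan through such a loop yields a per-command decision log from which flag rates per model profile can be tabulated.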
Results, methodology, and any deviations from the simulated estimates will be published here when the test series is complete. This section will be replaced with the empirical data.
The simulation confirms that the AIShell-Gate architecture addresses the primary risk vector: the plausible but destructive command. The key observations:
The simulation also notes the boundary of what a policy gate provides. It evaluates what a command is before execution; it does not intercept what the command does while running. AIShell-Gate is correctly positioned as a complement to OS-level access controls and audit logging — not a replacement for either.
The probability that an LLM generates at least one policy-flaggable command in a complex CI/CD plan is material across all model profiles — and rises sharply with reduced prompt quality or model capability. The most consequential risk is not catastrophic immediate failure but the silent partial error, which can persist undetected in production for days.
AIShell-Gate's architecture — deterministic policy evaluation as a hard boundary before Unix execution — directly addresses this risk. This simulation supports its deployment as a standard layer in any AI-assisted automation pipeline operating on production infrastructure. The empirical test series in §6 will either validate or revise these figures.