Keeping Sensitive Data Out of Third-Party LLMs

In early 2023, Samsung lifted an internal ban on ChatGPT. Within about three weeks, engineers had pasted confidential material into it on at least three separate occasions, including semiconductor source code and a recorded internal meeting they wanted transcribed. Samsung then banned the tools again, and the reason given in the internal memo was simple: once data sits on an external server, you cannot reliably retrieve or delete it (The Register, 2023). That whole arc, ban to trust to leak to ban, happened inside a sophisticated engineering organization, not a careless one.

That is the shape of the problem for any regulated business. The risk is rarely a dramatic breach. It is ordinary data flowing somewhere you did not intend, through a tool that is genuinely useful, and not being able to take it back. This article is about where that data goes, what the providers actually promise, and what does and does not remove the exposure.

The data you send is the data you cannot take back

The volume is larger than most leaders think. Cyberhaven's 2026 analysis found that 34.8% of the corporate data employees put into AI tools is sensitive, up from 27.4% a year earlier, and that much of it flows through personal accounts that bypass single sign-on, logging, and retention policy (Cyberhaven, 2026). Cisco's 2025 privacy study found that 64% of security professionals worry about sharing sensitive information through generative AI, while nearly half admit to entering employee or non-public data into these tools, and 60% are not confident they can even identify the shadow AI in use (Cisco, 2025). IBM's 2025 breach report put a number on the downside: one in five organizations reported a breach tied to shadow AI, and those breaches cost roughly $670,000 more than average (IBM, 2025).

Steelmanning the cloud, because the promises are real

It would be easy to argue that cloud providers train on your data and leave it at that. It would also be false, and a careful CTO would catch it. The major providers do not train on enterprise or API data by default, and they say so in writing. Azure OpenAI states that prompts and completions are not available to other customers, not used to train models without permission, and stored within your tenant and region (Microsoft). AWS says inputs and outputs to Bedrock are not used to train its or any third-party models and are not shared with model providers (AWS). Anthropic does not train on commercial or API data and offers zero data retention for qualified enterprise customers (Anthropic). OpenAI's API has not trained on submitted data since March 2023 and deletes it after 30 days unless legally required to retain it (OpenAI).

If you configure these services properly, the everyday data-leakage story is much weaker than the headlines suggest. So the honest question is not whether the providers are trustworthy. It is what happens when something stronger than their privacy policy comes along.

The promise is contractual. The subpoena is not.

Notice the five words in OpenAI's policy: "unless legally required to retain." That clause is not hypothetical. In the New York Times litigation, a court ordered OpenAI in 2025 to preserve output logs it would otherwise have deleted, over its own privacy objections, and by early 2026 the matter had grown to cover the production of tens of millions of logs (Bloomberg Law, 2026). The deletion promise is a contract term, and a contract term yields to a court.

There is an honest detail here that strengthens the point rather than weakening it. Zero-retention and enterprise customers were carved out of that preservation order, precisely because their data was never stored in the first place. The lesson is not that the cloud betrayed anyone. It is that the only customers a court could not reach were the ones whose data never existed on the provider's side to begin with. Separately, the US CLOUD Act lets authorities compel a US-based provider to produce data in its control regardless of which country the servers sit in (Congressional Research Service). Storing data in an EU region gives you residency, not immunity.

This is not theoretical for European data either. In June 2025, Microsoft's French legal director told a French Senate inquiry, under oath, that the company could not guarantee French citizens' data held in EU data centers would be shielded from US authorities (The Register, 2025). When the provider itself says it cannot promise that, a data-residency clause stops being reassuring.

What a vendor agreement does and does not buy you

Regulated firms cannot delegate away their obligations by signing up a vendor. Under the FTC Safeguards Rule, financial institutions must select capable service providers, contractually require them to maintain safeguards, and keep assessing them (FTC). For health data, providers sign Business Associate Agreements, but the coverage is narrower than people assume: OpenAI signs a BAA for its zero-retention API and enterprise products, not for the standard API or consumer ChatGPT (OpenAI). A signed agreement makes the vendor a compliant processor. It does not make your overall system compliant, and it does not answer the auditor's question about where the data went.
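
One way to turn that narrow coverage into something enforceable is a deny-by-default egress check that sits in front of every outbound call. The sketch below is illustrative only: the Destination fields, the data classes, and the rule that health data needs both a BAA and zero retention are assumptions made for this example, not a statement of any regulator's or vendor's requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    """Hypothetical record of what a vendor agreement actually covers."""
    name: str
    baa_signed: bool
    zero_retention: bool


def may_send(data_class: str, dest: Destination) -> bool:
    """Decide whether data of a given class may leave for a destination."""
    if data_class == "public":
        return True
    if data_class == "phi":
        # Assumed policy: health data needs both a BAA and zero retention.
        return dest.baa_signed and dest.zero_retention
    # Unknown or unclassified data is denied by default.
    return False


# Hypothetical destinations, named for illustration only.
covered = Destination("vendor_zero_retention_api", baa_signed=True, zero_retention=True)
consumer = Destination("consumer_chat_app", baa_signed=False, zero_retention=False)
```

A check like this does not make the system compliant either, but it does give you a single place to log what was sent where, which is the question the auditor asks.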

The only data a model cannot leak is the data it never sees

Security guidance keeps arriving at the same place. OWASP's guidance on sensitive information disclosure recommends sanitizing and limiting what reaches the model in the first place (OWASP, 2025). NIST's AI Risk Management Framework lists privacy enhancement as a core trait and points to data minimization (NIST). Both point the same way: send the model less. The strongest version of that rule is to send it nothing once you are in production.
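
To make "send the model less" concrete, here is a minimal sketch of redacting obvious identifiers before any text leaves your environment. The patterns and the minimize function are assumptions for illustration. They catch only a few well-formed cases and are no substitute for a vetted PII detection tool.

```python
import re

# Illustrative patterns only. Real data needs a far more thorough detector.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def minimize(text: str) -> str:
    """Replace each match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111 on file."
print(minimize(sample))
# Contact [EMAIL], SSN [SSN], card [CARD] on file.
```

Redaction lowers the exposure but cannot eliminate it, because anything the patterns miss still goes out.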

That is the design Dittah takes to its logical end. AI is used at design time, while you are building a workflow on your own infrastructure. You describe the process in plain English, AI generates Python you can read, and publishing freezes it into versioned code. When that workflow runs in production, it executes the frozen code with no model call at all. There is no inference request to log, retain, train on, or subpoena, because production data never reaches a model. The argument was never really "the cloud will train on you." It is that there is no third party left to trust, and no model in the data path to begin with. For teams that need this on their own hardware, our approach to self-hosting and data control goes into the deployment detail.
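
The freeze-then-run idea can be sketched in a few lines. This is not Dittah's actual implementation: WORKFLOW_SOURCE, publish, and execute are hypothetical names, and the claim-approval rule is invented for the example. What it shows is that the published artifact is plain source text pinned by a hash, and that running it involves no network or model call.

```python
import hashlib

# Hypothetical output of the design-time step: plain, reviewable Python.
WORKFLOW_SOURCE = '''
def run(claim):
    if claim["amount"] <= 1000 and claim["documents_complete"]:
        return "auto_approve"
    return "manual_review"
'''


def publish(source: str) -> dict:
    """Freeze the source. The version is derived from the exact text."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return {"version": digest[:12], "sha256": digest, "source": source}


def execute(artifact: dict, record: dict) -> str:
    """Run the frozen code locally. Nothing here calls a model."""
    actual = hashlib.sha256(artifact["source"].encode("utf-8")).hexdigest()
    if actual != artifact["sha256"]:
        raise ValueError("published workflow was modified after freezing")
    namespace: dict = {}
    exec(artifact["source"], namespace)
    return namespace["run"](record)


artifact = publish(WORKFLOW_SOURCE)
decision = execute(artifact, {"amount": 250, "documents_complete": True})
```

Because the hash is checked on every run, the code that executes in production is the same code that was reviewed at publish time.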

Sovereignty stopped being a European footnote

This used to be a niche worry raised by a few European regulators. It is moving into mainstream procurement. Gartner expects more than 75% of European and Middle Eastern enterprises to move workloads out of global public clouds into sovereign or regional options by 2030, up from less than 5% in 2025, and reported a sharp rise in client inquiries about reducing exposure to global suppliers (Gartner, 2025). I would not overstate this as a stampede to open models running on-premise, because metered cloud APIs still dominate raw usage. But for regulated and sovereignty-driven workloads, the direction is clear, and it favors keeping the data where you control it.

Bottom line

The cloud providers are mostly honest about not training on your data, and for plenty of use cases their controls are enough. The exposure that does not go away is structural: data on someone else's servers can be compelled, retained against the provider's wishes, or reached across borders, no matter what the privacy page says. If your workflows touch regulated or genuinely sensitive data, the cleanest answer is the boring one. Keep the data on your infrastructure, and prefer an architecture where production runs make no model calls at all. You cannot leak what you never sent. Run the free Community edition on your own servers and see what that feels like.

Sources are linked inline and reflect material available as of March 2026. Provider policies and the litigation referenced here change over time; verify current terms before relying on them.

Sukesh Shetty

Founder of Dittah. 20+ years building mission-critical systems for financial services and insurance.