EU AI Act data governance in code

The data a model learns from, governed.

Article 10 requires that the training, validation, and testing data sets behind a high-risk AI system are subject to data-governance practices: examined for relevance, representativeness, gaps, and possible biases. A model is shaped by the data it learns from, and that data is handled in code, in the pipeline that validates and prepares it before training.

Get early access by ISMS Copilot

How it shows up in a diff

The shapes the same control failure takes.

Data governance weakens when a change lets a model train on data that has not been examined. The recurring shapes:

A dataset validation step is removed
A check on the training data (schema, missing values, basic quality) is dropped to speed things up, so the model trains on whatever the data happens to contain.
A bias or representativeness check is skipped
A step that examined the data for gaps or skew across the groups it affects is removed, so a known class of bias goes unexamined.
Provenance is dropped
The record of where training data came from is no longer captured, so the data behind a model can no longer be traced or governed.
A new data source is added ungoverned
A new source is folded into training with none of the governance applied to the existing ones.
Filtering of bad records is removed
Deduplication or removal of corrupt or out-of-scope records is taken out, degrading the quality of what the model learns from.
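
Several of these shapes can be caught by a single gate that every source passes through before it reaches training. The sketch below is illustrative only. `DataSource`, `admitSource`, `dedupe`, and `buildTrainingSet` are hypothetical names, and the provenance fields are an assumption about what a team might record, not a list taken from the Act.

```typescript
// Hypothetical provenance record: the fields are assumptions for illustration.
interface Provenance {
  origin: string;       // where the data came from
  collectedAt: string;  // ISO date of collection
  reviewedBy?: string;  // who examined the source, if anyone
}

interface DataSource<T> {
  name: string;
  records: T[];
  provenance?: Provenance;
}

// Refuse any source that arrives without provenance or without a review,
// so a new source cannot be folded in ungoverned.
function admitSource<T>(source: DataSource<T>): DataSource<T> {
  if (!source.provenance) {
    throw new Error(`source "${source.name}" has no provenance record`);
  }
  if (!source.provenance.reviewedBy) {
    throw new Error(`source "${source.name}" has not been reviewed`);
  }
  return source;
}

// Drop duplicate records, keyed by a caller-supplied function.
function dedupe<T>(records: T[], key: (record: T) => string): T[] {
  const seen = new Set<string>();
  return records.filter((record) => {
    const k = key(record);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

// Build the training set only from admitted sources, then deduplicate.
function buildTrainingSet<T>(
  sources: DataSource<T>[],
  key: (record: T) => string
): T[] {
  const merged: T[] = [];
  for (const source of sources) {
    merged.push(...admitSource(source).records);
  }
  return dedupe(merged, key);
}
```

In a design like this, deleting the `admitSource` call or the `dedupe` call is a one-line change, which is the kind of change that stands out in a diff.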

Worked example

A dataset validation step, dropped before training.

A retraining pipeline runs a validation step (schema, missing values, basic representativeness checks) before it trains. To speed up an experiment the validation is removed, and the model now trains on whatever the dataset happens to contain.

ml/retrain.ts +0 -1

 async function retrain(dataset) {
-  await validateDataset(dataset) // schema, gaps, representativeness
   const model = await train(dataset)
   return model
 }

heygrc: EU AI Act Art. 10

Removing the validation means the high-risk system can train on data that has not been examined for quality, gaps, or bias. Art. 10 (data governance) expects that examination for training, validation, and testing sets. Keep the validation step. Whether a system is high-risk and in scope is for your own assessment.
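
For orientation, here is a minimal sketch of what a step like `validateDataset` might do. The source does not show its implementation, so everything here is an assumption: the `Row` shape, the `examineDataset` helper, the required regions, and both thresholds are hypothetical. The sketch covers missing values and a basic representativeness check, and assumes rows have already been parsed into a typed shape.

```typescript
// Hypothetical row shape, chosen only for illustration.
interface Row {
  age: number | null;
  region: string | null;
  label: 0 | 1 | null;
}

interface ValidationReport {
  rows: number;
  missing: { age: number; region: number; label: number }; // nulls per column
  groupShare: Record<string, number>;                      // share of rows per required region
  problems: string[];
}

const REQUIRED_REGIONS = ["north", "south"]; // assumption: groups the system affects
const MAX_MISSING_RATE = 0.05;               // assumption: tolerated share of nulls
const MIN_GROUP_SHARE = 0.2;                 // assumption: minimum share per group

// Examine the data and report problems without throwing.
function examineDataset(rows: Row[]): ValidationReport {
  const missing = { age: 0, region: 0, label: 0 };
  const groupShare: Record<string, number> = {};
  const problems: string[] = [];

  if (rows.length === 0) {
    return { rows: 0, missing, groupShare, problems: ["dataset is empty"] };
  }

  const regionCounts: Record<string, number> = {};
  for (const row of rows) {
    if (row.age === null) missing.age += 1;
    if (row.label === null) missing.label += 1;
    if (row.region === null) {
      missing.region += 1;
    } else {
      regionCounts[row.region] = (regionCounts[row.region] ?? 0) + 1;
    }
  }

  // Gaps: flag any column with too many missing values.
  for (const column of ["age", "region", "label"] as const) {
    const rate = missing[column] / rows.length;
    if (rate > MAX_MISSING_RATE) {
      problems.push(`${column}: ${(rate * 100).toFixed(1)}% missing`);
    }
  }

  // Representativeness: flag any required group that is under-represented.
  for (const region of REQUIRED_REGIONS) {
    const share = (regionCounts[region] ?? 0) / rows.length;
    groupShare[region] = share;
    if (share < MIN_GROUP_SHARE) {
      problems.push(`region "${region}" is only ${(share * 100).toFixed(1)}% of rows`);
    }
  }

  return { rows: rows.length, missing, groupShare, problems };
}

// The gate the pipeline awaits before training: throws if anything was flagged.
async function validateDataset(rows: Row[]): Promise<ValidationReport> {
  const report = examineDataset(rows);
  if (report.problems.length > 0) {
    throw new Error(`dataset failed validation: ${report.problems.join("; ")}`);
  }
  return report;
}
```

Splitting the examination from the gate is a design choice: the report can be stored as evidence even when the run is blocked.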

What an auditor does with this

Data governance is checked in the pipeline and in the documentation.

Conformity assessment for a high-risk system looks at how its data was governed: that training, validation, and testing data were examined for quality, representativeness, and bias, and that the examination is documented. A change that removed a validation or bias check, or folded in an ungoverned source, is the concrete point where that governance lapsed, and it is visible in the diff to the data pipeline.
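
One way to make that documentation checkable is to keep a small record of each examination next to the training run. The sketch below is a hypothetical format, not one prescribed by the Act or used by any particular tool. `ExaminationRecord`, `outstandingChecks`, `isRunDocumented`, and the list of required checks are all assumptions for illustration.

```typescript
// Hypothetical check names: an assumption about what a team chooses to require.
type CheckName = "schema" | "missing-values" | "representativeness" | "bias";
type Split = "training" | "validation" | "testing";

interface ExaminationRecord {
  datasetId: string;  // hypothetical identifier, for example a content hash
  split: Split;
  checks: { name: CheckName; passed: boolean }[];
  examinedAt: string; // ISO timestamp
}

const REQUIRED_CHECKS: CheckName[] = [
  "schema",
  "missing-values",
  "representativeness",
  "bias",
];

// Return the required checks that are absent from a record or did not pass.
function outstandingChecks(record: ExaminationRecord): CheckName[] {
  return REQUIRED_CHECKS.filter(
    (required) =>
      !record.checks.some((check) => check.name === required && check.passed)
  );
}

// A run counts as documented only if each of the three splits has at least
// one record with nothing outstanding.
function isRunDocumented(records: ExaminationRecord[]): boolean {
  const splits: Split[] = ["training", "validation", "testing"];
  return splits.every((split) =>
    records.some(
      (record) => record.split === split && outstandingChecks(record).length === 0
    )
  );
}
```

A record like this gives a reviewer something to compare against the pipeline: if a check disappears from the code, it also disappears from the next record.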

What this is, and is not

A review, not a data-governance program.

heygrc flags changes that touch Art. 10 and cites the article so the fix happens in the pull request. It does not assess your datasets for bias or run your governance process. It catches the moment a change lets a high-risk system train on ungoverned data, at the diff. heygrc is in early access.
