SYS/01 · DATA TOOLING

forge-prep

Data-readiness toolkit that audits and cleans enterprise corpora before they reach Mistral's Forge fine-tuning pipeline.

Python · CLI · PyPI · GitHub Actions

VERSION: 0.1.0
TESTS: 38 passing
DEPS: stdlib only
LICENSE: Apache-2.0

forge-prep is the pre-flight checklist before an enterprise commits Forge compute budget. It scores a corpus on readiness, tells you exactly what to fix, and outputs a clean directory ready for custom model training.

KEY FACTS

Published Python package (v0.1.0), zero external dependencies, Python 3.10+ stdlib only.
38-test suite with GitHub Actions CI; full packaging via pyproject.toml.
audit command emits a 0–100 Forge Readiness Score across six dimensions: volume, quality, dedup, PII, language, format.
clean command deduplicates, scrubs typed PII, and filters low-quality files into a Forge-ready corpus.

What forge-prep does

The audit command

forge-prep has two commands, `audit` and `clean`. `forge-prep audit <path>` scans a corpus and produces a 0–100 Forge Readiness Score, broken down across six dimensions: volume, quality, deduplication, privacy (PII), language focus, and format consistency.
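To make the idea of a single score built from six dimensions concrete, here is a minimal sketch. It assumes each dimension yields a sub-score between 0 and 1 and that all six are weighted equally; the real weighting in forge-prep is not documented here, and `readiness_score` is a hypothetical name, not the package's API.

```python
from statistics import fmean

# The six dimensions named in the audit breakdown.
DIMENSIONS = ("volume", "quality", "dedup", "pii", "language", "format")


def readiness_score(sub_scores: dict[str, float]) -> int:
    """Combine per-dimension sub-scores (0.0 to 1.0) into a 0-100 score.

    Assumption: equal weights. forge-prep's actual formula may differ.
    """
    missing = [d for d in DIMENSIONS if d not in sub_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    # Clamp each sub-score so one bad input cannot push the total out of range.
    clamped = [min(1.0, max(0.0, sub_scores[d])) for d in DIMENSIONS]
    return round(100 * fmean(clamped))
```

An equal-weight mean keeps the breakdown easy to read: each dimension contributes at most one sixth of the total, so a low overall score can be traced directly to the weakest dimensions.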

The clean command

`forge-prep clean <path>` deduplicates exact-content matches, replaces PII with typed placeholders, filters files below quality thresholds, and writes a Forge-compatible directory.
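Replacing PII with typed placeholders can be sketched with the standard library's `re` module. The two patterns and the `[EMAIL]` / `[IPV4]` placeholder format below are illustrative assumptions, not forge-prep's actual rules, and `scrub_pii` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; real PII detection needs stricter rules
# and covers more types (phone, credit card, SSN, IBAN, NIR).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub_pii(text: str) -> str:
    """Replace each match with a placeholder naming the PII type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders keep the sentence structure intact and tell a downstream reader what kind of value was removed, which a blanket `[REDACTED]` would not.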

How it's built

The core toolkit uses only the Python 3.10+ standard library: no third-party packages to install, no version conflicts, no ML frameworks. PII detection covers email addresses, phone numbers, IP addresses, credit card numbers, SSNs, IBANs, and French NIR numbers. Deduplication compares MD5 hashes of file contents, so it catches exact duplicates only. Every audit emits both a Markdown report and a machine-readable JSON report that feeds CI/CD and an included React dashboard.
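MD5 content-hash deduplication needs nothing beyond `hashlib` and `pathlib`. The sketch below groups files by the digest of their bytes; `find_exact_duplicates` is a hypothetical helper, and the way forge-prep walks a corpus or picks which copy to keep is not shown here.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by MD5 of their contents.

    Returns only the groups with more than one file, keyed by hex digest.
    """
    by_hash: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # MD5 serves as a content fingerprint here, not a security measure.
            digest = hashlib.md5(path.read_bytes(), usedforsecurity=False).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Because the hash is taken over raw bytes, two files that differ by a single trailing newline count as distinct. The sketch also reads each file fully into memory, so very large files would call for chunked hashing.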

Why it exists

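Most enterprises can't use Forge because their data isn't ready: documents are duplicated, PII is scattered throughout, many files are low quality, languages are mixed, and formats are inconsistent. forge-prep bridges that gap by scoring the corpus, pointing to what needs fixing, and writing out a cleaned copy before any compute budget is spent.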