SYS/01 — DATA TOOLING
forge-prep
Data-readiness toolkit that audits and cleans enterprise corpora before they reach Mistral's Forge fine-tuning pipeline.
- VERSION: 0.1.0
- TESTS: 38 passing
- DEPS: stdlib only
- LICENSE: Apache-2.0
forge-prep is the pre-flight checklist before an enterprise commits Forge compute budget. It scores a corpus on readiness, tells you exactly what to fix, and outputs a clean directory ready for custom model training.
FUNCTION
Two commands. `forge-prep audit <path>` scans a corpus and produces a 0–100 Forge Readiness Score with a per-dimension breakdown across volume, quality, deduplication, privacy (PII), language focus, and format consistency. `forge-prep clean <path>` deduplicates exact-content matches, replaces PII with typed placeholders, filters files below quality thresholds, and writes a Forge-compatible directory.
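To make the per-dimension breakdown concrete, here is a minimal sketch of how six dimension scores could roll up into a single 0–100 number. The weights, the dictionary keys, and the `readiness_score` function name are all assumptions for illustration; the description above names the dimensions but does not say how forge-prep combines them.

```python
# Hypothetical integer weights (percent); they sum to 100.
# forge-prep's real weighting is not documented here.
WEIGHTS = {
    "volume": 20,
    "quality": 25,
    "deduplication": 15,
    "privacy": 20,
    "language_focus": 10,
    "format_consistency": 10,
}


def readiness_score(dimension_scores: dict[str, float]) -> int:
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp each dimension to the 0-100 range before weighting.
        clamped = min(max(dimension_scores[name], 0.0), 100.0)
        total += weight * clamped
    return round(total / 100)


example = {
    "volume": 80,
    "quality": 60,
    "deduplication": 100,
    "privacy": 40,
    "language_focus": 90,
    "format_consistency": 70,
}
print(readiness_score(example))  # 70
```

A weighted average keeps the overall score explainable: each dimension's contribution can be read straight off the breakdown.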
ENGINEERING
The core toolkit uses only the Python 3.10+ standard library: no pip-install wall, no version conflicts, no ML frameworks. PII detection covers email, phone, IP, credit card, SSN, IBAN, and French NIR. Deduplication compares MD5 hashes of file content, so it catches exact duplicates only. Every audit emits both a Markdown report and a machine-readable JSON report that feeds CI/CD and an included React dashboard.
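The two core cleaning techniques, content-hash deduplication and typed PII placeholders, can be sketched with the standard library alone. This is an illustrative sketch, not forge-prep's actual code: the regex patterns are deliberately simplified and cover only three of the listed PII types, and the `[EMAIL]`-style placeholder format and the function names `redact` and `dedupe` are assumptions.

```python
import hashlib
import re

# Simplified patterns for illustration only; production-grade PII
# detection needs stricter validation (checksums, ranges, locale rules).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def dedupe(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first document for each distinct MD5 content hash."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for name, content in docs.items():
        # MD5 serves as a fast content fingerprint here, not for security.
        digest = hashlib.md5(
            content.encode("utf-8"), usedforsecurity=False
        ).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept[name] = content
    return kept


corpus = {"a.txt": "hello", "b.txt": "hello", "c.txt": "world"}
print(list(dedupe(corpus)))  # ['a.txt', 'c.txt']
print(redact("Contact jane@example.com from 10.0.0.1"))
# Contact [EMAIL] from [IP]
```

Hashing the raw content means two files that differ by a single byte, even trailing whitespace, count as distinct; catching near-duplicates would need a different technique.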
WHY IT EXISTS
Most enterprises can't use Forge because their data isn't ready: duplicated documents, scattered PII, low-quality files, mixed languages, inconsistent formats. forge-prep is the bridge. It is also a direct outreach artifact, built to demonstrate exactly what I'd bring to a frontier-lab data team.