Solves Data Lineage Blindness by tracking granular preprocessing steps.

ml-audit is a lightweight Python library designed to bring transparency and reproducibility to data preprocessing. Unlike standard experiment trackers that treat preprocessing as a black box, this library records every granular transformation applied to your pandas DataFrame.

Why ML Audit?

It solves “Data Lineage Blindness”.

Most data science teams suffer from a gap in their experiment tracking: model parameters and evaluation metrics are logged carefully, while the preprocessing steps that produced the training data go unrecorded.

Features

Installation

You can install ml-audit via pip:

pip install ml-audit

For SMOTE balancing support, install with the balance extra:

pip install ml-audit[balance]

Interactive Demo

Try out the library instantly in your browser via the companion Colab notebook.

Quick Start

1. Initialize the Recorder

import pandas as pd
from ml_audit import AuditTrailRecorder

# Load your data
df = pd.read_csv("data.csv")

# Initialize the auditor wrapped around your dataframe
auditor = AuditTrailRecorder(df, name="experiment_v1")

2. Apply Preprocessing

Chain methods fluently. Operations are applied immediately to auditor.current_df.

auditor.filter_rows("age", ">=", 18) \
       .impute(["salary", "score"], strategy='median') \
       .scale(["salary", "age"], method='minmax') \
       .encode("gender", method='onehot') \
       .balance_classes("churn", strategy='oversample') # Handles imbalanced data

3. Access Data

processed_df = auditor.current_df
print(processed_df.head())

4. Export & Visualize

Save the audit trail. This generates a JSON file in audit_trails/ and an HTML visualization in visualizations/.

auditor.export_audit_trail("audit.json")
# Output:
# - audit_trails/audit.json
# - visualizations/audit.html

Detailed API Documentation

All methods support method chaining (each returns self).

1. Imputation (impute)

Fill missing values in one or more columns using a statistical strategy (mean, median, constant) or a fill method (e.g., forward fill).

Signature:

auditor.impute(column, strategy='mean', fill_value=None, method=None)

Parameters:

Examples:

# Impute multiple columns with median
auditor.impute(["age", "salary"], strategy='median')

# Fill with a constant value (e.g., 0)
auditor.impute("bonus", strategy='constant', fill_value=0)

# Forward fill for time-series data
auditor.impute("stock_price", method='ffill')
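Under the hood, these strategies map onto standard pandas operations. A rough illustrative sketch of the median and forward-fill cases in plain pandas (not the library's internals):

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [50000.0, None, 70000.0, None],
    "stock_price": [10.0, None, None, 13.0],
})

# strategy='median': replace missing values with the column median
df["salary"] = df["salary"].fillna(df["salary"].median())

# method='ffill': propagate the last valid observation forward
df["stock_price"] = df["stock_price"].ffill()
```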

2. Scaling (scale)

Scale numerical features to a specific range or distribution.

Signature:

auditor.scale(column, method='standard')

Parameters:

Examples:

# Standardize normally distributed features
auditor.scale(["height", "weight"], method='standard')

# Normalize image pixel values
auditor.scale("pixels", method='minmax')
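For reference, the two methods correspond to the usual formulas. A minimal plain-pandas sketch (note that pandas' .std() is the sample standard deviation, which may differ slightly from a given scaler implementation):

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0]})

# method='minmax': rescale linearly onto [0, 1]
mn, mx = df["height"].min(), df["height"].max()
df["height_minmax"] = (df["height"] - mn) / (mx - mn)

# method='standard': zero mean, unit variance
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()
```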

3. Encoding (encode)

Encode categorical features into numeric form.

Signature:

auditor.encode(column, method='onehot', target_col=None)

Parameters:

Examples:

# One-hot encode low-cardinality nominal data
auditor.encode("color", method='onehot') 
# Result: color_red, color_blue, ...

# Label encode ordinal data
auditor.encode("quality", method='label')
# Result: 0, 1, 2...

# Target encode high-cardinality data
auditor.encode("zip_code", method='target', target_col="house_price")
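The three encoding methods reduce to well-known pandas idioms. An illustrative sketch, not the library's implementation (in particular, a real ordinal encoding should specify the category order explicitly rather than rely on the alphabetical default shown here):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "quality": ["low", "high", "medium"],
    "zip_code": ["94103", "94103", "10001"],
    "house_price": [300.0, 500.0, 200.0],
})

# method='onehot': one indicator column per category
onehot = pd.get_dummies(df["color"], prefix="color")

# method='label': integer codes (alphabetical by default)
df["quality_label"] = df["quality"].astype("category").cat.codes

# method='target': each category becomes the mean of the target column
df["zip_code_target"] = df.groupby("zip_code")["house_price"].transform("mean")
```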

4. Transformation (transform)

Apply mathematical transformations to columns.

Signature:

auditor.transform(column, func='log')

Parameters:

Examples:

# Log transform skewed data
auditor.transform("income", func='log')
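Whether the library applies log(x) or the zero-safe log(1 + x) isn't specified above; a minimal sketch using NumPy's log1p, which is the safer default when zeros can occur:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [0.0, 9.0, 99.0]})

# log(1 + x) handles zeros gracefully; plain log would produce -inf at 0
df["income_log"] = np.log1p(df["income"])
```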

5. Binning (bin_numeric)

Discretize continuous variables into bins (buckets).

Signature:

auditor.bin_numeric(column, bins=5, strategy='quantile', labels=None)

Parameters:

Examples:

# Create 4 quartiles for age
auditor.bin_numeric("age", bins=4, strategy='quantile')
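The strategy='quantile' case corresponds to pandas' qcut, which places a roughly equal number of rows in each bin. A small illustration, independent of the library:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 30, 35, 41, 52, 63])

# Four quantile bins -> two values per bin for this series
quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```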

6. Date Extraction (extract_date_features)

Extract features from datetime columns.

Signature:

auditor.extract_date_features(column, features=['year', 'month', 'day', 'weekday'])

Parameters:

Examples:

# Extract year and month from 'joined_date'
auditor.extract_date_features("joined_date", features=['year', 'month'])
# Creates columns: joined_date_year, joined_date_month
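These derived columns are what pandas' .dt accessor produces; a plain-pandas sketch of the same extraction:

```python
import pandas as pd

df = pd.DataFrame({"joined_date": ["2023-01-15", "2024-06-03"]})
dates = pd.to_datetime(df["joined_date"])

df["joined_date_year"] = dates.dt.year
df["joined_date_month"] = dates.dt.month
df["joined_date_weekday"] = dates.dt.weekday  # Monday == 0
```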

7. Balancing (balance_classes)

Balance the dataset based on the target variable.

Signature:

auditor.balance_classes(target, strategy='oversample', random_state=42)

Parameters:

Examples:

# Handle imbalanced dataset using SMOTE
auditor.balance_classes("is_fraud", strategy='smote')
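Random oversampling (strategy='oversample') can be pictured as resampling each minority class with replacement up to the majority count; an illustrative sketch in plain pandas (SMOTE, by contrast, synthesizes new points and requires the balance extra):

```python
import pandas as pd

df = pd.DataFrame({"x": range(6), "is_fraud": [0, 0, 0, 0, 1, 1]})

# Resample every class up to the size of the largest class
counts = df["is_fraud"].value_counts()
parts = [
    df[df["is_fraud"] == cls].sample(counts.max(), replace=True, random_state=42)
    for cls in counts.index
]
balanced = pd.concat(parts, ignore_index=True)
```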

8. Filtering & Dropping

Basic dataframe manipulations.

Signatures:

auditor.filter_rows(column, operator, value)
auditor.drop_columns(columns)

Examples:

# Keep only adults
auditor.filter_rows("age", ">=", 18)

# Remove PII
auditor.drop_columns(["ssn", "email"])

9. Generic Operations (track_pandas)

Track any pandas DataFrame method that isn't natively wrapped by the auditor.

Signature:

auditor.track_pandas(method_name, *args, **kwargs)

Examples:

# Track a rename operation
auditor.track_pandas("rename", columns={"old_name": "new_name"})

# Track dropping NaNs
auditor.track_pandas("dropna", subset=["critical_col"])
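Conceptually, a generic tracker only needs getattr dispatch plus a log entry per call. The helper below is a hypothetical illustration of that idea, not ml-audit's internals (track_pandas_call and the log format are made up here):

```python
import pandas as pd

def track_pandas_call(df, log, method_name, *args, **kwargs):
    # Dispatch to the named DataFrame method and record what happened
    result = getattr(df, method_name)(*args, **kwargs)
    log.append({
        "op": method_name,
        "kwargs": kwargs,
        "shape_before": df.shape,
        "shape_after": result.shape,
    })
    return result

log = []
df = pd.DataFrame({"old_name": [1.0, None, 3.0]})
df = track_pandas_call(df, log, "rename", columns={"old_name": "new_name"})
df = track_pandas_call(df, log, "dropna", subset=["new_name"])
```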

10. Reproducibility

Verify Lineage: You can check whether re-running the recorded operations on the original data produces the exact same result (compared by hash).

if auditor.verify_reproducibility():
    print("Pipeline is scientifically reproducible!")
else:
    print("Pipeline result mismatch!")
# Also available: auditor.replay() returns the re-computed dataframe independently
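The comparison-by-hash idea can be sketched with the standard library. The serialization scheme here (hashing the CSV text) is an assumption for illustration, not necessarily what ml-audit does:

```python
import hashlib
import pandas as pd

def df_hash(df):
    # Fingerprint a dataframe via a canonical text serialization
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

original = pd.DataFrame({"age": [21, 34, 45]})
replayed = original.copy()  # stand-in for re-running the recorded pipeline
same = df_hash(original) == df_hash(replayed)
```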

Visualization

When you run export_audit_trail(), an HTML file is generated in the visualizations/ folder. This interactive timeline shows:

  1. Step Sequence: The order of operations.
  2. Parameters: What strategy/method was used (e.g., strategy='median').
  3. Data Shape: How the row/column count changed.
  4. Schema: How columns were added or removed.

License

MIT License. Free to use for personal and commercial projects.