ML Pipeline Template

ID: ml-pipeline

A reproducible machine learning project template with a structured source layout for data loading, feature engineering, model training, and evaluation. MLflow experiment tracking is preconfigured, and Jupyter Lab is included for exploration.

What's Included

```
my-ml-pipeline/
  requirements.txt              # Python dependencies
  README.md                     # Project-specific setup guide
  .gitignore
  Makefile                      # Convenience commands for common tasks
  setup.py                      # Package installation for src/
  config/
    config.yaml                 # Experiment and path configuration
  data/
    .gitkeep                    # Placeholder; add your datasets here
  models/
    .gitkeep                    # Placeholder; trained models are saved here
  notebooks/
    01_explore.ipynb            # Starter exploration notebook
  src/
    __init__.py
    data/
      loader.py                 # Data loading utilities
    features/
      pipeline.py               # Feature engineering pipeline
    models/
      train.py                  # Model training script
      evaluate.py               # Model evaluation script
  tests/
    __init__.py
    test_pipeline.py            # Pipeline unit tests
```

Stack

| Layer | Technology | Version |
| --- | --- | --- |
| Language | Python | 3.11+ |
| Notebooks | Jupyter Lab | Latest |
| Experiment tracking | MLflow | Latest |
| ML library | scikit-learn | Latest |
| Data manipulation | pandas | Latest |
| Container | Docker | Any |

Usage

Scaffold a new ML pipeline project:

```bash
npx forgekit-cli new my-ml-pipeline --template ml-pipeline
```

Or run the interactive wizard:

```bash
npx forgekit-cli new
```

Setup

```bash
cd my-ml-pipeline
pip install -r requirements.txt
pip install -e .           # Install src/ as a local package
```

Make Commands

The template includes a Makefile with common commands:

| Command | Description |
| --- | --- |
| `make train` | Run the model training script |
| `make evaluate` | Run the model evaluation script |
| `make notebook` | Start Jupyter Lab |
| `make test` | Run the test suite |
| `make clean` | Remove compiled files and cached artifacts |

Example:

```bash
make train
```

MLflow Experiment Tracking

MLflow is configured automatically. Every training run logs parameters, metrics, and artifacts.

Start the MLflow UI to view your experiment history:

```bash
mlflow ui
```

Open http://localhost:5000 in your browser.

Runs are stored in the mlruns/ directory by default. To use a remote MLflow tracking server, set the MLFLOW_TRACKING_URI environment variable:

```bash
export MLFLOW_TRACKING_URI=http://your-mlflow-server:5000
make train
```

Configuration

Edit config/config.yaml to adjust experiment parameters, data paths, and model hyperparameters without touching source code:

```yaml
data:
  path: data/raw/dataset.csv
  test_size: 0.2

model:
  n_estimators: 100
  max_depth: 5
  random_state: 42

mlflow:
  experiment_name: my-ml-pipeline
```
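Loading the file takes a few lines with PyYAML (assuming it is in `requirements.txt`; add `pyyaml` if not). The YAML is inlined here so the sketch is self-contained:

```python
import yaml

# Inlined copy of config/config.yaml so this sketch is self-contained; in
# the project you would read the file: yaml.safe_load(open("config/config.yaml"))
raw = """
data:
  path: data/raw/dataset.csv
  test_size: 0.2
model:
  n_estimators: 100
  max_depth: 5
  random_state: 42
mlflow:
  experiment_name: my-ml-pipeline
"""
config = yaml.safe_load(raw)

# The model section maps cleanly onto estimator keyword arguments,
# e.g. RandomForestClassifier(**config["model"]).
test_size = config["data"]["test_size"]
```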

Customization Tips

Swap the model: Replace the scikit-learn estimator in src/models/train.py with any scikit-learn-compatible estimator. XGBoost, LightGBM, and similar libraries implement the same fit/predict interface, so they work without further changes.

Add deep learning: Add torch or tensorflow to requirements.txt. Create a new trainer class in src/models/ following the same interface as the existing trainer.
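A new trainer might start from a skeleton like this one. The method names below are assumptions for illustration; mirror whatever interface src/models/train.py actually defines.

```python
class TorchTrainer:
    """Sketch of a deep-learning trainer. Build the torch.nn.Module (or
    keras.Model) in __init__ from the config dict, and keep the same
    fit/evaluate surface so the rest of the pipeline is unchanged."""

    def __init__(self, config: dict):
        self.config = config
        self.model = None  # e.g. torch.nn.Sequential(...)

    def fit(self, X, y):
        # Training loop goes here; log params and metrics to MLflow the
        # same way the existing trainer does.
        raise NotImplementedError

    def evaluate(self, X, y) -> float:
        raise NotImplementedError
```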

Connect to cloud storage: Update src/data/loader.py to read from S3 or GCS using boto3 or google-cloud-storage. Store the bucket name in config/config.yaml.
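A minimal S3 reader for loader.py might look like the following sketch. The function name is hypothetical, and the client is injected rather than constructed inside the function (pass `boto3.client("s3")` in production) so it stays easy to unit-test; bucket and key would come from config/config.yaml.

```python
import io

import pandas as pd


def load_csv_from_s3(s3_client, bucket: str, key: str) -> pd.DataFrame:
    """Read a CSV object from S3 into a DataFrame.

    s3_client is a boto3 S3 client, e.g. boto3.client("s3"); injecting it
    keeps this function testable without network access.
    """
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))
```

In tests/test_pipeline.py the client can then be replaced with a stub whose `get_object` returns an in-memory CSV.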

Released under the Apache 2.0 License.