A Comprehensive Guide to scikit-learn Pipelines
scikit-learn has a simple and useful architecture for building complete machine learning pipelines. In this article, we'll go through a step-by-step example of how to use the different features and classes of this architecture.
Why?
There are plenty of reasons why you might want to use a pipeline for machine learning, such as:
- Combine the preprocessing step with the inference step in one object.
- Save the complete pipeline to disk.
- Easily experiment with different techniques of preprocessing.
- Pipeline reuse.
- Easy cloud deployment.
How?
Alright, now let's get down to business. As an example, we'll use a classic and fairly easy problem: a regression task for predicting housing prices.
Download the data and you should have a train.csv file and a test.csv file; we'll load both using pandas.
Loading the data
import pandas as pd
import numpy as np
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
## let's create a validation set (~20% of rows) from the training set
msk = np.random.rand(len(train_df)) < 0.8
val_df = train_df[~msk]
train_df = train_df[msk]
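The boolean-mask split above works, but it isn't reproducible across runs unless you seed NumPy first. As an aside, scikit-learn's own train_test_split is a common alternative; a minimal sketch (the 0.2 test size mirrors the ~20% split above):
from sklearn.model_selection import train_test_split
# equivalent ~80/20 split with a fixed seed for reproducibility
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)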
Feature selection
This data has 163 columns; however, we are not going to use all of them.
After doing a bit of EDA, we chose a set of nominal, ordinal, and numerical columns to work with.
nominal = ["MSZoning", "LotShape", "LandContour", "LotConfig", "Neighborhood",
           "Condition1", "BldgType", "RoofStyle",
           "Foundation", "CentralAir", "SaleType", "SaleCondition"]
ordinal = ["LandSlope", "OverallQual", "OverallCond", "YearRemodAdd",
           "ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure",
           "KitchenQual", "Functional", "GarageCond", "PavedDrive"]
numerical = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtUnfSF",
             "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GrLivArea", "GarageArea",
             "OpenPorchSF"]
train_features = train_df[nominal + ordinal + numerical]
train_label = train_df["SalePrice"]
val_features = val_df[nominal + ordinal + numerical]
val_label = val_df["SalePrice"]
test_features = test_df[nominal + ordinal + numerical]
If you want to see the entire selection process and the EDA fully explained, you can check the notebook here.
Preprocessing
Now let's choose a preprocessing plan. A very straightforward one is the following:
- Ordinal features
  - Impute missing data with the most frequent value
  - Use ordinal encoding
- Nominal features
  - Impute missing data with the most frequent value
  - Use one-hot encoding
- Numerical features
  - Impute missing data with the mean value
  - Use standard scaling
As you can see, each family of features is processed in its own way. Let's create a Pipeline for each family.
We can do so using the sklearn.pipeline.Pipeline object
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
ordinal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # map categories unseen at fit time (e.g. in validation/test data) to -1 instead of raising
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])
nominal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # output is sparse by default; ignore categories unseen at fit time
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])
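Each of these pipelines is itself a regular transformer, so you can sanity-check one in isolation before wiring everything together. A quick sketch using the objects defined above:
# fit and transform only the numerical columns to verify the pipeline works
sample = numerical_pipeline.fit_transform(train_features[numerical])
print(sample.shape)  # one scaled column per numerical feature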
Now let's join all of the above in one pipeline that targets each column with its family's pipeline.
We can do so using the sklearn.compose.ColumnTransformer object
from sklearn.compose import ColumnTransformer
# here we instantiate a ColumnTransformer with a list of tuples,
# each of which holds the name of the preprocessor,
# the transformation pipeline (could be a single transformer),
# and the list of column names we wish to transform
preprocessing_pipeline = ColumnTransformer([
    ("nominal_preprocessor", nominal_pipeline, nominal),
    ("ordinal_preprocessor", ordinal_pipeline, ordinal),
    ("numerical_preprocessor", numerical_pipeline, numerical)
])
## If you want to test this pipeline, run the following code
# preprocessed_features = preprocessing_pipeline.fit_transform(train_features)
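The transformed result is a plain (possibly sparse) matrix, so the original column names are gone. On recent scikit-learn versions (roughly 1.1 and later, an assumption about your install), you can recover the generated names:
## inspect the generated feature names (one-hot columns get expanded)
# feature_names = preprocessing_pipeline.get_feature_names_out()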
Adding the model to the pipeline
Now that we're done creating the preprocessing pipeline, let's add the model at the end.
from sklearn.linear_model import LinearRegression
complete_pipeline = Pipeline([
("preprocessor", preprocessing_pipeline),
("estimator", LinearRegression())
])
If you're waiting for the rest of the code, I'd like to tell you that that's it. Pretty easy, isn't it? If the scikit-learn maintainers asked to take my heart, I'd give it to them for such a great API.
The training and evaluation process is the same as for any normal model:
complete_pipeline.fit(train_features, train_label)
# R^2 score on the held-out validation set
score = complete_pipeline.score(val_features, val_label)
print(score)
predictions = complete_pipeline.predict(test_features)
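A nice side effect of wrapping everything in one Pipeline is that hyperparameter search covers the preprocessing too. Parameters are addressed with the step__param naming convention; here is a minimal sketch, swapping in a Ridge model (an assumption for illustration, not part of the original pipeline) so there is something to tune:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Ridge stands in for LinearRegression purely to have a tunable parameter
search_pipeline = Pipeline([
    ("preprocessor", preprocessing_pipeline),
    ("estimator", Ridge())
])
# step names chain with double underscores, so we can tune
# the numerical imputer strategy and the regularization strength together
param_grid = {
    "preprocessor__numerical_preprocessor__imputer__strategy": ["mean", "median"],
    "estimator__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(train_features, train_label)
print(search.best_params_)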
Saving and Loading Pipelines
Now we want to save the pipeline's entire state, preprocessing parameters and model parameters alike, to disk and load it whenever needed.
We are going to use joblib for this JOB ... get it? ... sorry.
Save the pipeline
We are going to save the pipeline as a pickle (.pkl) file. The code is fairly simple. One caveat: pickled objects are tied to library versions, so load the file with the same scikit-learn version you used to save it.
import joblib
pipeline_filename = "my_pipeline.pkl"
joblib.dump(complete_pipeline, pipeline_filename)
Load the pipeline
Now you're on your Flask server and you wish to load the model to help a user predict the price of a house. You can load the pipeline from disk when you start the server, or whenever a request comes in. That is also fairly simple.
import joblib
pipeline_filename = "path/to/pipeline/file.pkl"
pipeline = joblib.load(pipeline_filename)
## do inference with pipeline.predict
# ...
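To make the Flask scenario concrete, here is a minimal sketch of an endpoint built on the loaded pipeline. The route name, JSON shape, and file path are assumptions for illustration only:
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# load once at startup; "my_pipeline.pkl" is the file saved earlier
pipeline = joblib.load("my_pipeline.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # expects a JSON object whose keys are the feature column names
    features = pd.DataFrame([request.get_json()])
    price = pipeline.predict(features)[0]
    return jsonify({"predicted_price": float(price)})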
Full Code
import pandas as pd
import numpy as np
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
## let's create a validation set (~20% of rows) from the training set
msk = np.random.rand(len(train_df)) < 0.8
val_df = train_df[~msk]
train_df = train_df[msk]
nominal = ["MSZoning", "LotShape", "LandContour", "LotConfig", "Neighborhood",
           "Condition1", "BldgType", "RoofStyle",
           "Foundation", "CentralAir", "SaleType", "SaleCondition"]
ordinal = ["LandSlope", "OverallQual", "OverallCond", "YearRemodAdd",
           "ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure",
           "KitchenQual", "Functional", "GarageCond", "PavedDrive"]
numerical = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtUnfSF",
             "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GrLivArea", "GarageArea",
             "OpenPorchSF"]
train_features = train_df[nominal + ordinal + numerical]
train_label = train_df["SalePrice"]
val_features = val_df[nominal + ordinal + numerical]
val_label = val_df["SalePrice"]
test_features = test_df[nominal + ordinal + numerical]
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
ordinal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # map categories unseen at fit time (e.g. in validation/test data) to -1 instead of raising
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])
nominal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # output is sparse by default; ignore categories unseen at fit time
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])
from sklearn.compose import ColumnTransformer
# here we instantiate a ColumnTransformer with a list of tuples,
# each of which holds the name of the preprocessor,
# the transformation pipeline (could be a single transformer),
# and the list of column names we wish to transform
preprocessing_pipeline = ColumnTransformer([
    ("nominal_preprocessor", nominal_pipeline, nominal),
    ("ordinal_preprocessor", ordinal_pipeline, ordinal),
    ("numerical_preprocessor", numerical_pipeline, numerical)
])
## If you want to test this pipeline, run the following code
# preprocessed_features = preprocessing_pipeline.fit_transform(train_features)
from sklearn.linear_model import LinearRegression
complete_pipeline = Pipeline([
("preprocessor", preprocessing_pipeline),
("estimator", LinearRegression())
])
complete_pipeline.fit(train_features, train_label)
# R^2 score on the held-out validation set
score = complete_pipeline.score(val_features, val_label)
print(score)
predictions = complete_pipeline.predict(test_features)
import joblib
pipeline_filename = "my_pipeline.pkl"
joblib.dump(complete_pipeline, pipeline_filename)
pipeline = joblib.load(pipeline_filename)
predictions = pipeline.predict(test_features)
print(predictions)
That's it! Congratulations, you've just created, saved, and loaded a complete pipeline.
I hope this article was helpful; if not, please tell me how to improve it. I would really appreciate that.
Thank you.