How to Build Custom Transformers in Scikit-Learn

Extend Built-in Functionality with Your Own Pipeline-Compatible Preprocessing Tools

Published in

DataDrivenInvestor

3 min read · Dec 7, 2020

We all know the importance of preprocessing in a machine learning project. It typically makes sense to handle some missing values, scale various features, one-hot encode others, etc., and scikit-learn has prebuilt tools that do a great job of all of these steps right out of the box. But what about adding new features, or applying a custom transformation? Did you know that scikit-learn also makes it easy to build these steps into a standard pipeline workflow? Here’s how!

FunctionTransformer

Let’s start simple with a great tool for on-the-fly transformations: FunctionTransformer. It can be used for everything from applying a predefined function to a feature to selecting specific columns in your feature set. The basic idea is that FunctionTransformer accepts a function (and, optionally, an inverse function) and applies it to the data whenever transform or fit_transform is called. This makes it a great tool for uncomplicated transformations that can be encapsulated in a simple function; you can almost think of it as the “lambda function” of scikit-learn preprocessing. We demonstrate a few use cases below.
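As a minimal sketch of that idea, the snippet below wraps a function and its inverse in a FunctionTransformer. The choice of np.sqrt and np.square and the small array X are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function and its inverse in a transformer object.
# np.sqrt / np.square are illustrative choices, not taken from the article.
sqrt_transformer = FunctionTransformer(func=np.sqrt, inverse_func=np.square)

# Small made-up array standing in for real feature data.
X = np.array([[0.0, 1.0], [4.0, 9.0]])

X_sqrt = sqrt_transformer.fit_transform(X)           # applies np.sqrt element-wise
X_back = sqrt_transformer.inverse_transform(X_sqrt)  # applies np.square element-wise
```

Nothing is learned during fitting here; the transformer simply forwards the data to the wrapped function, which is what lets it slot into a pipeline like any other preprocessing step.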

Selecting Features:

Here we use FunctionTransformer to select two of the thirteen features in the full Boston Housing dataset.
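A column selector along these lines could be sketched as follows. Because load_boston was removed from scikit-learn in version 1.2, the sketch uses a tiny stand-in DataFrame with made-up values; the helper select_columns and the choice of 'RM' as the second column are also assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Stand-in data: made-up values under three Boston Housing column names.
df = pd.DataFrame({
    "CRIM": [0.01, 0.1, 1.0, 10.0],
    "RM": [5.0, 6.0, 7.0, 8.0],
    "AGE": [20.0, 40.0, 60.0, 80.0],
})


def select_columns(X, columns):
    """Hypothetical helper: return only the requested DataFrame columns."""
    return X[columns]


# kw_args forwards extra keyword arguments to the wrapped function.
selector = FunctionTransformer(select_columns, kw_args={"columns": ["CRIM", "RM"]})
selected = selector.fit_transform(df)
```

With the default validate=False, the DataFrame is handed to the function as-is, so column labels can be used for selection.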

A histogram showing the distribution of the ‘CRIM’ feature in the Boston Housing data (output of the above code; image by author)

Simple Transformations:

We were able to select the features we wanted, but perhaps we’d like to scale the values of the ‘CRIM’ (Crime Rate) feature. We could use StandardScaler, but instead we’ll apply a log scale to demonstrate how to use FunctionTransformer to apply simple functions on the fly.
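A log transform of this kind might be sketched as below. The DataFrame holds made-up stand-in values for 'CRIM', and np.log is assumed as the log function; it requires strictly positive inputs, so np.log1p would be a safer choice if zeros can occur.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Made-up, strictly positive stand-in values for the 'CRIM' column.
crim = pd.DataFrame({"CRIM": [0.01, 0.1, 1.0, 10.0]})

# Any element-wise function can be wrapped; here, the natural log.
log_scaler = FunctionTransformer(np.log)
crim_log = log_scaler.fit_transform(crim)
```

The result keeps the same shape and column name, with each value replaced by its natural logarithm.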

A histogram showing the distribution of the ‘CRIM’ feature in the Boston Housing data after log scaling (output of the above code; image by author)

Fully Customized Estimators

What if you’d like to extend this basic functionality with more complex transformations? Engineer some features? Add some hyperparameters to make grid-searching a breeze? Scikit-learn has you covered here too, and we’ve got an example for you below.

Before we get started, a note on documentation: the docs are a good starting place for any package, and scikit-learn’s are known for being exceptionally good. Scikit-learn objects (“estimators,” in sklearn parlance) follow some general conventions, and it’s good practice to follow them so that your own estimators play nicely with pipelines and the other tools built around those conventions. To that end, scikit-learn makes several tools available to easily implement these features in a compatible way, and you can read more about why we’re using them in the code below on this page.

For this section, we’re going to use a dataset that has a few more features to make things a bit more interesting. That dataset is the King County Housing dataset, and you can download it here.

And without further ado, here’s some code:
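A minimal sketch of such a transformer (not the original listing) might look like the following. It assumes the usual approach of subclassing TransformerMixin and BaseEstimator; the class name RatioFeatureAdder, its parameters, the engineered feature, and the tiny stand-in DataFrame with King County style column names are all hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class RatioFeatureAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: optionally append numerator / denominator."""

    def __init__(self, numerator="sqft_living", denominator="bedrooms", add_ratio=True):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.numerator = numerator
        self.denominator = denominator
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self per the estimator convention.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        if self.add_ratio:
            name = f"{self.numerator}_per_{self.denominator}"
            X[name] = X[self.numerator] / X[self.denominator]
        return X


# Made-up stand-in rows; column names are assumed, values are illustrative.
homes = pd.DataFrame({
    "sqft_living": [1000.0, 1800.0, 2400.0, 3000.0],
    "bedrooms": [2, 3, 4, 5],
})
price = pd.Series([300_000.0, 450_000.0, 600_000.0, 720_000.0])

pipe = Pipeline([
    ("ratio", RatioFeatureAdder()),
    ("model", LinearRegression()),
])
pipe.fit(homes, price)

# Inspect the engineered features produced by the custom step.
engineered = pipe.named_steps["ratio"].transform(homes)

# The flag is addressable as "ratio__add_ratio", e.g. in a GridSearchCV param_grid.
pipe.set_params(ratio__add_ratio=False)
```

Inheriting from BaseEstimator supplies get_params and set_params, which is what lets a grid search toggle add_ratio like any other hyperparameter, while TransformerMixin supplies fit_transform on top of the fit and transform methods defined here.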

Written by Jake Miller Brooks
