DataDrivenInvestor

How to Build Custom Transformers in Scikit-Learn

Extend Built-in Functionality with Your Own Pipeline Compatible Preprocessing Tools

Jake Miller Brooks
Published in DataDrivenInvestor
3 min read · Dec 7, 2020

We all know the importance of preprocessing in a machine learning project. It typically makes sense to handle missing values, scale some features, one-hot encode others, and so on, and scikit-learn has prebuilt tools that handle all of these steps right out of the box. But what about adding new features, or applying a custom transformation? Did you know that scikit-learn also makes it easy to build these steps into a standard pipeline workflow? Here’s how!

FunctionTransformer

Let’s start simple with a great tool for on-the-fly transformations: FunctionTransformer. FunctionTransformer can be used for everything from applying a predefined function to a feature to selecting specific columns in your feature set. The basic idea is that FunctionTransformer wraps a function (you can also pass an inverse function) and applies it to the data whenever transform or fit_transform is called; the fit step does not learn any parameters from the data. This makes it a great tool for uncomplicated, stateless transformations that can be encapsulated in a simple function; you can almost think of it as the “lambda function” of scikit-learn preprocessing. We demonstrate a few use cases below.
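
Before the use cases, here is a minimal sketch of the function and inverse-function pairing described above. The toy array is made up for illustration, and the choice of np.log1p with np.expm1 is just one convenient pair of inverses, not something taken from the original article.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy data, made up for illustration: one column of non-negative values.
X = np.array([[0.0], [1.0], [10.0], [100.0]])

# func is applied by transform(); inverse_func is applied by inverse_transform().
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = log_transformer.fit_transform(X)            # fit() learns no parameters here
X_back = log_transformer.inverse_transform(X_log)   # undo the transformation

print(np.allclose(X, X_back))  # True
```

Passing an inverse function is optional; it matters mainly when you want to map transformed values back to their original scale.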

Selecting Features:

Here we use FunctionTransformer to select two of the thirteen features in the full Boston Housing dataset.
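
A sketch of how this selection might look. Two things here are assumptions: load_boston was removed in scikit-learn 1.2, so the DataFrame below is a synthetic stand-in that only borrows a few of the original column names, and the article does not say which feature was selected alongside ‘CRIM’, so ‘RM’ is picked for illustration. The helper select_columns is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-in for the Boston Housing features (random values under
# three of the original column names). Swap in the real data if you have it.
rng = np.random.default_rng(0)
boston_df = pd.DataFrame({
    "CRIM": rng.lognormal(mean=0.0, sigma=1.5, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
    "AGE": rng.uniform(0, 100, size=100),
})


def select_columns(df, columns):
    """Hypothetical helper: return only the requested columns."""
    return df[columns]


# kw_args forwards extra keyword arguments to the wrapped function.
selector = FunctionTransformer(select_columns, kw_args={"columns": ["CRIM", "RM"]})
selected = selector.fit_transform(boston_df)

print(selected.columns.tolist())  # ['CRIM', 'RM']
# selected["CRIM"].hist(bins=30)  # uncomment to plot (requires matplotlib)
```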

Figure: Histogram showing the distribution of the ‘CRIM’ feature in the Boston Housing data (code output; image by author)

Simple Transformations:

We were able to select the features we wanted, but perhaps we’d like to scale the values of the ‘CRIM’ (Crime Rate) feature. We could use StandardScaler, but instead we’ll apply a log scale to demonstrate how to use FunctionTransformer to apply simple functions on the fly.
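
A sketch of the log transformation, again on synthetic stand-in values for ‘CRIM’ rather than the real dataset. It assumes the values are strictly positive, which is why plain np.log is safe here; np.log1p would be the usual alternative if zeros were possible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Synthetic, strictly positive stand-in for the 'CRIM' column.
rng = np.random.default_rng(0)
crim = pd.DataFrame({"CRIM": rng.lognormal(mean=0.0, sigma=1.5, size=100)})

# Wrap NumPy's natural log so it can be used like any other transformer.
log_scaler = FunctionTransformer(np.log)
crim_log = log_scaler.fit_transform(crim)

print(crim_log.describe())
# crim_log["CRIM"].hist(bins=30)  # uncomment to plot (requires matplotlib)
```

Because the result is an ordinary transformer, it can be dropped into a Pipeline or ColumnTransformer alongside the built-in preprocessing steps.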

Figure: Histogram showing the distribution of the ‘CRIM’ feature in the Boston Housing data after log scaling (code output; image by author)

Fully Customized Estimators

What if you’d like to extend this basic functionality with more complex transformations? Engineer some features? Add some hyperparameters to make grid-searching a breeze? Scikit-learn has you covered here too, and we’ve got an example for you below.

Before we get started, a note on documentation: it is a good starting place for any package, and scikit-learn in particular is known for having exceptionally good docs. Scikit-learn objects (“estimators,” in sklearn parlance) follow some general conventions, and it’s good practice to follow them in your own classes so they play nicely with pipelines and other scikit-learn tools. To that end, scikit-learn makes several tools available to easily implement these conventions in a compatible way, and you can read more about why we’re using them in the code below on this page.
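
To make those conventions concrete, here is a small sketch. It assumes the tools in question are the BaseEstimator and TransformerMixin base classes, which is the usual route but is not confirmed by the text above; the Clipper class and its parameters are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Clipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that clips values to [lower, upper]."""

    def __init__(self, lower=0.0, upper=1.0):
        # Convention: __init__ only stores its arguments, unchanged.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Convention: fit returns self so calls can be chained.
        return self

    def transform(self, X):
        return np.clip(X, self.lower, self.upper)


clipper = Clipper(lower=-1.0, upper=2.0)
params = clipper.get_params()   # inherited from BaseEstimator
clipped = clipper.fit_transform(np.array([[-5.0], [0.5], [9.0]]))  # from TransformerMixin
clipper_copy = clone(clipper)   # works because get_params/set_params exist

print(params)           # {'lower': -1.0, 'upper': 2.0}
print(clipped.ravel())  # [-1.   0.5  2. ]
```

Following these conventions is what lets utilities such as clone, Pipeline, and GridSearchCV inspect and rebuild the object without any extra code.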

For this section, we’re going to use a dataset that has a few more features to make things a bit more interesting. That dataset is the King County Housing dataset, and you can download it here.

And without further ado, here’s some code:
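
The sketch below is one way such a fully customized transformer could look; it is not necessarily the code the author used. The class name LivingAreaRatioAdder, the add_ratio hyperparameter, and the engineered sqft_per_bedroom feature are all hypothetical, and the data is a synthetic stand-in that borrows a few King County column names (sqft_living, sqft_lot, bedrooms).

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class LivingAreaRatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: optionally adds living area per bedroom."""

    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio  # hyperparameter exposed to grid search

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        if self.add_ratio:
            # clip(lower=1) avoids dividing by zero for listings with no bedrooms
            X["sqft_per_bedroom"] = X["sqft_living"] / X["bedrooms"].clip(lower=1)
        return X


# Synthetic stand-in for the King County data (random values, real column names).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=n),
    "sqft_lot": rng.uniform(1000, 20000, size=n),
    "bedrooms": rng.integers(0, 6, size=n),
})
y = 150 * X["sqft_living"] + 5 * X["sqft_lot"] + rng.normal(0, 20000, size=n)

pipe = Pipeline([
    ("features", LivingAreaRatioAdder()),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# The step name plus a double underscore addresses the transformer's hyperparameter.
grid = GridSearchCV(pipe, param_grid={"features__add_ratio": [True, False]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Exposing the feature as an __init__ argument is what makes it searchable: GridSearchCV can switch the engineered column on and off like any other hyperparameter.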

Written by Jake Miller Brooks

Data Scientist, lifelong learner, background in Housing Finance, Transportation and Infrastructure.
