Clearing the Pipeline Jungle with FeatureTransforms.jl
2021-07-28, 12:30–12:40 (UTC), Purple

The prevalence of glue code in feature engineering pipelines poses many problems in conducting high-quality, scalable research. In worst-case scenarios, the technical debt racked up by overgrown “pipeline jungles” can preclude further development and grind promising projects to a halt [1]. This talk will show how the FeatureTransforms.jl package can help make feature engineering a more sustainable practice for users without sacrificing the flexibility they desire.

Feature engineering is an essential component in all machine learning and data science workflows. It is often an exploratory activity in which the pipeline for a particular set of features tends to be developed iteratively as new data or insights are incorporated.

As the feature complexity grows over time it is very common for code to devolve into unwieldy “pipeline jungles” [1], which pose multiple problems to developers. They are often brittle, with highly-coupled operations that make it increasingly difficult to make isolated changes. The over-entanglement of such pipelines also means they are difficult to unit test and debug effectively, making them particularly error-prone. Since adding to this complexity is often easier than investing in refactoring it, pipeline jungles tend to be more susceptible to incurring technical debt over time, which can impact the project’s long-term success.

In this talk, we will showcase some of the key features of the FeatureTransforms.jl package, such as the composability, reusability, and performance of common transform operations, that were designed to help mitigate the problems in our own pipeline jungles.

FeatureTransforms.jl is conceptually different from other widely known packages that provide similar utilities for manipulating data, such as DataFramesMeta.jl, DataKnots.jl, and Query.jl. These packages provide methods for composing relational operations to filter, join, or combine structured data. However, neither a query-based syntax nor an API that supports only one data type is well suited to composing the kinds of mathematical transformations, such as one-hot encoding, that underpin most (non-trivial) feature engineering pipelines. FeatureTransforms.jl aims to provide these transformations.

The composability of transforms reflects the practice of piping the output of one operation to the input of another, as well as combining the pipelines of multiple features. Reusability is achieved through native support for the Tables and AbstractArray interfaces, which cover tables such as DataFrames, TypedTables, LibPQ.Result, etc., and arrays such as AxisArrays, KeyedArrays, and NamedDimsArrays. This flexible design allows for performant code that should satisfy the needs of most users while not being restricted to (or by) any one data type.
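The idea of composable transforms can be illustrated with a minimal, language-agnostic sketch. The example below is written in plain Python and does not use or reproduce the FeatureTransforms.jl API; the names `power`, `scale`, `one_hot`, `compose`, and `apply_to_columns` are hypothetical helpers introduced only for illustration. It shows the two aspects described above: piping the output of one operation into the next, and combining separate pipelines for multiple features.

```python
from functools import reduce


def power(exponent):
    """Return a transform that raises every value to `exponent` (hypothetical helper)."""
    return lambda values: [v ** exponent for v in values]


def scale(factor):
    """Return a transform that multiplies every value by `factor` (hypothetical helper)."""
    return lambda values: [v * factor for v in values]


def one_hot(categories):
    """Return a transform mapping each value to a 0/1 indicator row over `categories`."""
    return lambda values: [[1 if v == c else 0 for c in categories] for v in values]


def compose(*transforms):
    """Chain transforms left to right: the output of one is the input of the next."""
    return lambda values: reduce(lambda acc, t: t(acc), transforms, values)


def apply_to_columns(table, pipelines):
    """Apply a per-column pipeline to a dict-of-lists table.

    Columns without a pipeline pass through unchanged.
    """
    return {name: pipelines.get(name, lambda v: v)(col) for name, col in table.items()}


# A toy table with one numeric and one categorical feature (made-up data).
table = {"temperature": [1.0, 2.0, 3.0], "colour": ["red", "blue", "red"]}

# Each feature gets its own small, independently testable pipeline.
pipelines = {
    "temperature": compose(power(2), scale(0.5)),
    "colour": one_hot(["red", "blue"]),
}

features = apply_to_columns(table, pipelines)
print(features)
```

Because each transform is a small standalone function, it can be unit tested in isolation and reused across features, which is the property that tightly coupled glue code tends to lose.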

[1] Sculley, David, et al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems 28 (2015): 2503-2511.

I'm a Research Software Engineer at InveniaLabs, UK.
Interested in the application of Julia for scalable, sustainable research.