Automating feature engineering for supervised learning? Methods, open-source tools and prospects. Thorben Jensen PyConDE & PyDataBerlin 2019 conference

Automating feature engineering for supervised learning? Methods, open-source tools and prospects.

Thorben Jensen

Wednesday 14:35 in Saal 5

Type/Track: Talk / PyData

Currently, Machine Learning, and especially Supervised Learning, is benefiting heavily from increased automation. This shows mainly in the tuning and deployment of models. Automation makes it easier for newcomers to get started, and experienced developers can leverage it to increase their impact.

While automation of many tasks required for supervised learning is becoming well established, this is less the case for feature engineering. Feature engineering is one of the most important steps in supervised learning. This is especially true for ‘classical’ models, such as linear and tree-based ones. But ‘deep’ models, which in principle are capable of learning detailed features from data, also benefit from being provided with the right features. As of today, engineering good features remains part of the ‘secret sauce’ of experienced Machine Learning engineers.

Our talk tackles the question of how well this ‘secret sauce’ can be automated with off-the-shelf tools available today. We will do this in three steps. First, we will draw today’s map of automation in Supervised Learning as we perceive it. Second, we will explain the commonly used approaches to automating feature engineering. Here, the focus will be on extracting features from time series via pre-defined temporal patterns, on relational operations, and on evolutionary algorithms. Finally, we will compare the performance of open-source libraries that implement these approaches, namely tsfresh, featuretools, and TPOT.
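To make the first of these approaches concrete, the following is a minimal sketch of feature extraction via pre-defined temporal patterns. It is a hand-rolled illustration using only the Python standard library and does not call tsfresh; the pattern set (`PATTERNS`), the helper names, the window size, and the toy series are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def count_peaks(window):
    """Count points strictly greater than both of their neighbours."""
    return sum(
        1
        for i in range(1, len(window) - 1)
        if window[i - 1] < window[i] > window[i + 1]
    )


def linear_slope(window):
    """Least-squares slope of the values against their position index."""
    n = len(window)
    x_mean = (n - 1) / 2
    y_mean = mean(window)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(window))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical catalogue of pre-defined temporal patterns (assumption).
PATTERNS = {
    "mean": mean,
    "std": pstdev,
    "min": min,
    "max": max,
    "n_peaks": count_peaks,
    "slope": linear_slope,
}


def extract_features(series, window_size):
    """Return one feature row (a dict) per sliding window of the series."""
    rows = []
    for end in range(window_size, len(series) + 1):
        window = series[end - window_size:end]
        rows.append({name: fn(window) for name, fn in PATTERNS.items()})
    return rows


# Toy data, invented for illustration only.
series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
features = extract_features(series, window_size=4)
```

Each window of the series becomes one row of named features, which is the tabular shape a supervised model expects. Libraries built around this idea typically ship a much larger catalogue of patterns and add a selection step for the relevant ones.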

For this comparison, we applied these libraries to a Time Series Forecasting task. The three libraries are applied in part, compared with one another, or combined with manual feature engineering. The quality of the resulting features is then evaluated by using them with the popular xgboost models.
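The evaluation setup can be sketched as a small scoring harness: turn the series into a supervised table, hold out the most recent rows, and compare candidates by their hold-out error. This is only an assumed outline, not the code of the comparison itself. The two rule-based predictors are stand-ins for a trained model such as xgboost, and the function names, lag count, and toy series are invented for illustration.

```python
def make_supervised(series, n_lags):
    """Turn a series into lag-feature rows X and next-value targets y."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y


def chronological_split(X, y, test_size):
    """Hold out the most recent rows; a time series is never shuffled."""
    cut = len(X) - test_size
    return X[:cut], y[:cut], X[cut:], y[cut:]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Placeholder predictors (assumption). A real model would be fitted on
# X_train and y_train at this point; these rules need no fitting.
def persistence(row):
    """Predict the most recent observed value."""
    return row[-1]


def lag_mean(row):
    """Predict the mean of the lagged values."""
    return sum(row) / len(row)


# Toy data, invented for illustration only.
series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
X, y = make_supervised(series, n_lags=2)
X_train, y_train, X_test, y_test = chronological_split(X, y, test_size=3)

candidates = {"persistence": persistence, "lag_mean": lag_mean}
scores = {
    name: mean_absolute_error(y_test, [predict(row) for row in X_test])
    for name, predict in candidates.items()
}
```

Keeping the split chronological matters here: a random split would let the model see values from the future of the rows it is tested on, which would make any feature set look better than it is.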

Beyond educating and providing blueprints, we also intend to advance the discussion on the role of automation in our work. To provide evidence and a starting point for collaboration, we will publish the code of our comparison. Together with the community, we would like to find out how useful automation is and which types of work in supervised learning could and should be automated. More generally, this connects to the question of whether automation should remain a form of virtual assistance or whether we should expect full automation down the road.

Tags: Artificial Intelligence, Algorithms, Data Science, Machine Learning, Data Engineering

Level: Domain Expertise: some; Python Skill Level: basic

Thorben Jensen

Affiliation: Informationsfabrik GmbH

Thorben Jensen has studied, designed and automated predictive models for many years. After studying in 5 countries, he graduated from the PhD program at Delft University of Technology. His PhD thesis proposes and explores increased use of automation when building models with autonomous agents. On this topic, he has previously spoken at international conferences and published peer-reviewed literature.

visit the speaker at: Twitter • Github • Homepage