Feature Engineering Made Simple | PyData London 2022

Kajanan Sangaralingam and Anindya Datta present:

Feature Engineering Made Simple

Of all the choices made by data scientists in the course of building and operating models, feature selection is one of the most critical. Features have a substantive impact on a model’s quality, including its predictive accuracy and resilience. Therefore, feature engineering is one of the most important components of the Machine Learning workflow.

Unfortunately, as most ML scientists and practitioners are aware, feature engineering is more art than science. It is ad hoc, messy, and error-prone, and it ends up consuming 70-80% of the effort and time spent building models, often resulting in suboptimal feature selection and low-quality models.

While there are a host of tools, mostly open-source, that help with parts of the feature engineering process, particularly exploratory data analysis (EDA), their impact is modest:

1. The biggest problem in feature engineering is task orchestration: methodically performing a set of steps leading up to a set of “good”, model-ready features. Existing tools, such as pandas-based packages, support individual tasks (e.g., outlier detection), but systematic orchestration is left entirely to the modeller, which usually leads to an ad hoc, trial-and-error feature engineering workflow.
2. A few key problems in feature engineering have no packaged solutions at all. One such problem is “cold start”: when starting to select candidate features, what should the modeller do? The entire space of possible features for a given problem is usually very large, so a small subset must be identified for investigation, and a poor choice of candidates is usually very detrimental. This is one of the hardest issues in feature engineering.
3. Finally, virtually every open-source library is scale-challenged, performing its computation in memory on a single thread. When the base data is large, these libraries are simply impractical to use.
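To make the orchestration problem concrete, here is a minimal sketch of what an explicit, ordered feature-preparation pipeline could look like. It is not the presenters' method or tool: the step functions, the `run_pipeline` helper, the `income` column, and the sample values are all hypothetical, and the outlier rule is the common Tukey IQR fence, chosen only for illustration. It uses the Python standard library so that it runs without extra dependencies.

```python
from statistics import quantiles


def drop_missing(rows, column):
    """Remove rows where `column` is missing (None)."""
    return [r for r in rows if r.get(column) is not None]


def drop_iqr_outliers(rows, column, k=1.5):
    """Remove rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]


def run_pipeline(rows, steps):
    """Apply named steps in order, logging the row count after each step."""
    log = []
    for name, step in steps:
        rows = step(rows)
        log.append((name, len(rows)))
    return rows, log


# Hypothetical sample data: one missing value and one extreme value.
rows = [{"income": v} for v in [30, 32, 35, None, 31, 33, 34, 500]]

# The order of steps is stated explicitly instead of living in ad hoc notebook cells.
steps = [
    ("drop_missing", lambda rs: drop_missing(rs, "income")),
    ("drop_iqr_outliers", lambda rs: drop_iqr_outliers(rs, "income")),
]

clean, log = run_pipeline(rows, steps)
for name, remaining in log:
    print(f"{name}: {remaining} rows remain")
```

Declaring the steps as an ordered list with a log of their effects is one simple way to make the workflow repeatable and auditable; it does not address the cold-start or scalability problems raised in the other two points.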

In this tutorial, we will introduce new ways of performing feature engineering, turning it into a systematic, procedural and scalable process, which is substantively more efficient than how it occurs currently.

www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...
