July 25th, 2024

Applied Machine Learning for Tabular Data

Applied Machine Learning for Tabular Data, by Max Kuhn and Kjell Johnson, is a practical guide to predictive modeling that emphasizes community engagement and statistical methods; it is set for release on June 17, 2024.

Applied Machine Learning for Tabular Data is a forthcoming practical guide aimed at helping users develop predictive models from tabular data. Authored by Max Kuhn and Kjell Johnson, the book is set to be published on June 17, 2024. It emphasizes a holistic approach to predictive modeling, integrating feature engineering with machine learning models and addressing post-modeling activities that are often overlooked. The authors aim to create an open resource, encouraging community contributions and discussions, with materials available in a GitHub repository under a Creative Commons license.

The intended audience includes data analysts, statisticians, and anyone interested in predictive modeling, regardless of expertise level. The book will cover essential topics such as data preparation, handling numeric and categorical predictors, and model optimization, while avoiding the term "artificial intelligence" to emphasize the statistical nature of the methods discussed. The authors plan to provide supplementary materials for coding, specifically using R's tidymodels framework, and are open to contributions for Python resources.

Exercises will be included to reinforce concepts, and readers are encouraged to engage through public forums or GitHub for questions and contributions. The book aims to build intuition about the modeling process, addressing common pitfalls and the effectiveness of different models. Overall, it seeks to be a comprehensive resource for understanding and applying machine learning techniques to tabular data.

Related

From the Tensor to Stable Diffusion

The GitHub repository offers a comprehensive machine learning guide covering deep learning, vision-language models, neural networks, CNNs, RNNs, and paper implementations like LeNet, AlexNet, ResNet, GRU, LSTM, CBOW, Skip-Gram, Transformer, and BERT. Ideal for exploring machine learning concepts.

Machine Learning Systems with TinyML

"Machine Learning Systems with TinyML" simplifies AI system development by covering ML pipelines, data collection, model design, optimization, security, and integration. It emphasizes TinyML for accessibility, addressing model architectures, training, inference, and critical considerations. The open-source book encourages collaboration and innovation in AI technology.

Guide to Machine Learning with Geometric, Topological, and Algebraic Structures

The paper discusses the shift in machine learning towards handling non-Euclidean data with complex structures, emphasizing the need to adapt classical methods and proposing a graphical taxonomy to unify recent advancements.

Data Science Tutorials in Julia

This website offers Data Science Tutorials in Julia, focusing on machine learning with the MLJ toolbox. It covers data basics, models, ensembles, statistical learning, and end-to-end projects like Telco Churn and AMES, catering to various skill levels. Recommendations aid users new to Julia and machine learning.

Linear Algebra for Data Science

Professors Kyunghyun Cho and Wanmo Kang have created a linear algebra textbook focused on data science, emphasizing practical concepts like SVD, with a non-traditional structure and positive feedback from KAIST students.

8 comments
By @levocardia - 4 months
My experience with tabular data textbooks is that they introduce many different techniques in a relatively unopinionated way, which doesn't help you build intuition about what your strategy for real-world problems should be. In practice, almost all tabular data problems I've encountered boil down to:

1. Make sure you don't have any impossible nonsense or data leakage in your features or target

2. Split your data in an intelligent way (temporally for time-series, group-wise for hierarchical data)

3. Try a really simple linear regression / logistic regression model to get a "dumb" baseline for your accuracy/error metric and make sure it is reasonable

4. If you need interpretability, consider a GAM; otherwise, throw it into XGBoost and you'll get state-of-the-art results (a rough sketch of steps 3 and 4 follows below)
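
For concreteness, here's that sketch on purely synthetic data (the dataset, metric, and XGBoost settings are illustrative assumptions, not a recipe):

    # Step 3: a "dumb" logistic regression baseline; step 4: XGBoost.
    # Synthetic data here; for real time-series or hierarchical data,
    # replace the random split with a temporal or group-wise one.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("baseline AUC:", roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1]))

    booster = XGBClassifier(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
    print("xgboost AUC:", roc_auc_score(y_te, booster.predict_proba(X_te)[:, 1]))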

By @antipaul - 4 months
Like others say, usually just go for XGBoost, which is increasingly shown (proven?) in the literature, e.g.: https://arxiv.org/abs/2106.03253

You can also start with scikit-learn's `LogisticRegressionCV`, if a linear model is more palatable to your audience.
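
As a sketch of that starting point, on synthetic data (the scoring choice and grid size here are just illustrative):

    # LogisticRegressionCV tunes the regularization strength C by
    # internal cross-validation over a grid of 10 candidate values.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    clf = LogisticRegressionCV(Cs=10, cv=5, scoring="roc_auc", max_iter=1000).fit(X, y)
    print("chosen C:", clf.C_[0])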

The bigger challenge is to reliably estimate how good your model is.

It's not about getting the best performance – it's about getting the "real" performance. How will your model _really_ do on future unseen datasets?

The answer to this challenge is cross-validation. But what are the questions?

There are 2 very different questions for which the answer is cross-validation.

One is: which hyperparameters to use with your model?

The second is, what is the generalization performance of the model?

This requires 2 separate applications (loops) of cross-validation. The authors of this book talk about this in terms of having a "validation set" and a "test set". (Sometimes these terms are switched around, and there is also the "holdout set"; it's critical to know how you and the rest of your team are using these terms in your modeling.)

A robust way to implement these two cross-validations is nested cross-validation, which is readily available in many packages and should also be "fast enough" on modern computers.
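
A sketch of that pattern in scikit-learn, assuming synthetic data and an illustrative hyperparameter grid (not anything from the book):

    # Nested CV: the inner loop answers question 1 (which
    # hyperparameters to use), the outer loop answers question 2
    # (what the generalization performance is).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print("estimated generalization accuracy:", outer_scores.mean())

Note that the outer scores estimate the performance of the whole tuning procedure, not of any single fitted model.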

One exercise that remains is: with nested CV, which model do you pick as your "production" model?

That is also a bit tricky. Reading things like the following can help: https://stats.stackexchange.com/q/65128/207989

EDIT: for those inclined, here is a paper on why you need 2 loops of CV: "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" by Cawley and Talbot: https://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley1...

By @jdeaton - 4 months
Spoiler: just use xgboost and you’re done
By @__mharrison__ - 4 months
I've been teaching tabular data modeling to many clients. My usual model of choice these days is XGBoost, though we can often go back to a strong linear model based on our insights from XGBoost.
By @mjhay - 4 months
I appreciate the section on independent component analysis (ICA). It's not well known and is very underused. In my experience, it usually works great on heterogeneous tabular data, which PCA usually does poorly on.
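
A toy sketch of the difference, with synthetic mixed signals (purely illustrative; real tabular data is never this clean):

    # ICA recovers independent non-Gaussian sources; PCA only
    # decorrelates, so it leaves the sources mixed together.
    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    S = rng.laplace(size=(1000, 2))    # independent non-Gaussian sources
    X = S @ rng.normal(size=(2, 5))    # linearly mixed into 5 observed columns

    sources_ica = FastICA(n_components=2, random_state=0).fit_transform(X)
    scores_pca = PCA(n_components=2).fit_transform(X)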
By @axpy906 - 4 months
Wow, those are some names I've not seen in a while. I wonder how LLMs do at R?
By @revskill - 4 months
Too much text is annoying to read and understand. Poor presentation.
By @_giorgio_ - 4 months
What do you suggest to learn XGBoost?