Applied Machine Learning for Tabular Data
Applied Machine Learning for Tabular Data, by Max Kuhn and Kjell Johnson, is a practical guide for predictive modeling, set for release on June 17, 2024, emphasizing community engagement and statistical methods.
Applied Machine Learning for Tabular Data is a forthcoming practical guide aimed at helping readers develop predictive models from tabular data. Authored by Max Kuhn and Kjell Johnson, the book is set to be published on June 17, 2024. It emphasizes a holistic approach to predictive modeling, integrating feature engineering with machine learning models and addressing post-modeling activities that are often overlooked. The authors aim to create an open resource, encouraging community contributions and discussion, with materials available in a GitHub repository under a Creative Commons license.

The intended audience includes data analysts, statisticians, and anyone interested in predictive modeling, regardless of expertise level. The book covers essential topics such as data preparation, handling numeric and categorical predictors, and model optimization, while deliberately avoiding the term "artificial intelligence" to emphasize the statistical nature of the methods discussed. The authors plan to provide supplementary coding materials using R's tidymodels framework and are open to contributions of Python resources. Exercises will be included to reinforce concepts, and readers are encouraged to engage through public forums or GitHub with questions and contributions. The book aims to build intuition about the modeling process, addressing common pitfalls and the relative effectiveness of different models, and seeks to be a comprehensive resource for understanding and applying machine learning techniques to tabular data.
Related
From the Tensor to Stable Diffusion
The GitHub repository offers a comprehensive machine learning guide covering deep learning, vision-language models, neural networks, CNNs, RNNs, and paper implementations like LeNet, AlexNet, ResNet, GRU, LSTM, CBOW, Skip-Gram, Transformer, and BERT. Ideal for exploring machine learning concepts.
Machine Learning Systems with TinyML
"Machine Learning Systems with TinyML" simplifies AI system development by covering ML pipelines, data collection, model design, optimization, security, and integration. It emphasizes TinyML for accessibility, addressing model architectures, training, inference, and critical considerations. The open-source book encourages collaboration and innovation in AI technology.
Guide to Machine Learning with Geometric, Topological, and Algebraic Structures
The paper discusses the shift in machine learning towards handling non-Euclidean data with complex structures, emphasizing the need to adapt classical methods and proposing a graphical taxonomy to unify recent advancements.
Data Science Tutorials in Julia
This website offers Data Science Tutorials in Julia, focusing on machine learning with the MLJ toolbox. It covers data basics, models, ensembles, statistical learning, and end-to-end projects like Telco Churn and AMES, catering to various skill levels. Recommendations aid users new to Julia and machine learning.
Linear Algebra for Data Science
Professors Kyunghyun Cho and Wanmo Kang have created a linear algebra textbook focused on data science, emphasizing practical concepts like SVD, with a non-traditional structure and positive feedback from KAIST students.
1. Make sure you don't have any impossible nonsense or data leakage in your features or target
2. Split your data in an intelligent way (temporally for time-series, group-wise for hierarchical data)
3. Try a really simple linear regression / logistic regression model to get a "dumb" baseline for your accuracy/error metric and make sure it is reasonable
4. If you need interpretability, consider a GAM; otherwise, throw it into XGBoost and you'll get close to state-of-the-art results
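For step 2, scikit-learn's splitters make the "intelligent split" concrete. A minimal sketch on synthetic data (the shapes and group structure here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Synthetic data for illustration only.
X = np.arange(20).reshape(-1, 1)
y = np.random.rand(20)
groups = np.repeat(np.arange(5), 4)  # e.g. 5 customers, 4 rows each

# Temporal split: every training fold strictly precedes its test fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()

# Group-wise split: no group appears in both train and test,
# which prevents leakage across rows from the same entity.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])
```

The assertions encode the two leakage guarantees: with time series, the model never trains on the future; with grouped data, it never sees test-time entities during training.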
You can also start with scikit-learn's `LogisticRegressionCV`, if a linear model is more palatable to your audience.
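A minimal baseline along those lines, using a bundled scikit-learn dataset as a stand-in for your own (the dataset and split are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# LogisticRegressionCV tunes the regularization strength internally;
# scaling matters for regularized linear models.
baseline = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5))
baseline.fit(X_train, y_train)
print(f"baseline accuracy: {baseline.score(X_test, y_test):.3f}")
```

Whatever this "dumb" baseline scores becomes the number any fancier model has to beat.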
The bigger challenge is to reliably estimate how good your model is.
It's not about getting the best performance – it's about getting the "real" performance. How will your model _really_ do on future unseen datasets?
The answer to this challenge is cross-validation. But what are the questions?
There are 2 very different questions for which the answer is cross-validation.
One is: which hyperparameters should you use with your model?
The second is, what is the generalization performance of the model?
This requires two separate applications (loops) of cross-validation. The authors of this book discuss this in terms of having a "validation set" and a "test set". (Sometimes these terms are switched around, and there is also the "holdout set"; it's critical to know how you, and the rest of your team, are using these terms in your modeling.)
A robust way to implement these two CV loops is nested cross-validation, which is readily available in many packages and should also be "fast enough" on modern computers.
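In scikit-learn, nested CV is just a tuned estimator wrapped in an outer scoring loop. A sketch, using `GradientBoostingClassifier` as a stand-in for XGBoost and a deliberately tiny parameter grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter selection (the "validation" role).
inner = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3]},
    cv=3,
)

# Outer loop: estimates the generalization performance of the
# *entire tuning procedure*, not of one fixed hyperparameter setting.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"estimated accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The key point is that the outer test folds are never touched by the inner search, so the outer mean is an honest answer to "how will this do on unseen data?"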
One exercise that remains is: with nested CV, which model do you pick as your "production" model?
That is also a bit tricky. Reading things like the following can help: https://stats.stackexchange.com/q/65128/207989
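One common resolution from that discussion (a sketch, not the book's prescription): nested CV scores the tuning *procedure*, so for production you rerun that same procedure once on all available data. Dataset and grid below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# The nested-CV estimate describes this procedure; running it once
# on the full dataset yields the candidate production model.
final = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3]},
    cv=3,
).fit(X, y)
print(final.best_params_)
```

You report the nested-CV score as the performance estimate and ship `final.best_estimator_`; you do not report the (optimistically biased) inner-loop score of the winning configuration.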
EDIT: for those inclined, here is a paper on why you need 2 loops of CV: "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" by Cawley and Talbot: https://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley1...