
Preprocessing Pipeline

A complete preprocessing pipeline for a dataset.

Methodology is Everything

Learning data science takes months, if not years; however, one aspect is fundamentally crucial to it: methodology.

That is: how do I, the data practitioner, approach my problem, my research, or my idea without losing track of everything that needs to be done?

Preprocessing addresses the issues that most datasets throw at us: variable levels of quality, including missing values, unconventional column naming, and non-standardized or non-normalized values.
Preprocessing is also about turning all the data into numbers, and encoders and transformers help do exactly that.
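
To make this concrete, here is a minimal sketch of such a pipeline using pandas and scikit-learn. The file name and column names are assumptions for illustration, and the interactive notebook below may organize the steps differently.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("auto-mpg.csv")

# Standardize unconventional column names: lowercase, underscores instead of spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

numeric_cols = ["horsepower", "weight", "displacement", "acceleration"]
categorical_cols = ["origin"]

# Numeric columns: fill missing values with the median, then scale to zero mean and unit variance.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill missing values with the most frequent label, then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

# Everything comes out as numbers, ready for a model.
X = preprocess.fit_transform(df.drop(columns=["mpg"]))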

To What End?

Doing that legwork serves many purposes, but the main payoff is a set of features that (ideally) correlate strongly with the target variable, which is what we want to predict.

In our case, the target is 'mpg', or miles per gallon. The dataset covers about 400 car models and their characteristics, and our goal is to use those characteristics to predict the mpg of any car they describe.
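
As a quick illustration, once the numeric columns are clean you can inspect how strongly each feature correlates with 'mpg' directly in pandas (file and column names are again assumed):

import pandas as pd

df = pd.read_csv("auto-mpg.csv")

# Correlation of every numeric feature with the target, sorted by absolute strength.
# Features such as vehicle weight typically show a strong negative correlation with mpg.
correlations = df.corr(numeric_only=True)["mpg"].drop("mpg")
print(correlations.sort_values(key=abs, ascending=False))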

Check the Work

Right below, you are presented with an interactive notebook where all the preprocessing pipeline work was done. Go through it, execute the code, and gather the insights mentioned above.

Give it a minute to boot first; if you have read all of the above, it should be ready to go by now.
On mobile, you may need to rotate your device to landscape orientation.

What's Next?

What usually comes next is modeling. This big step comprises:

  • choosing a framework

  • training the model

  • tuning hyperparameters

After that comes model evaluation, to identify potential errors and weaknesses and refine the model if needed.
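
To give a rough preview of that next step, here is a minimal sketch with scikit-learn: a train/test split, a small hyperparameter search, and a held-out evaluation. The model choice, parameter grid, and metric are assumptions for illustration, not necessarily what the follow-up will use.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical preprocessed data: numeric features only, rows with missing values dropped.
df = pd.read_csv("auto-mpg.csv")
features = df.select_dtypes("number").dropna()
X, y = features.drop(columns=["mpg"]), features["mpg"]

# Training and hyperparameter tuning: a small grid search with 5-fold cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# Evaluation on held-out data to spot errors and weaknesses before refining the model.
predictions = search.best_estimator_.predict(X_test)
print("Mean absolute error (mpg):", mean_absolute_error(y_test, predictions))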