The Roadmap to Model Development Part 3

In last week’s blog post, we finished up the third step (Data Preprocessing for Prediction) and began creating our first models in the final step (Model Development). Today, we will create four more models, and test each of them to compare their accuracy and runtimes.

Creating the models

Last blog post, we created only the Logistic Regression model and the K-Nearest Neighbors model. We will add the following four models today: Support Vector Machine, Decision Trees, Naive Bayes, and Random Forest. We will use the Sci-kit learn library for all of these.

  • Support Vector Machine

In this step, we create the Support Vector Machine model and name it svc. We then append its full name to our ‘classifier’ array, and the instance to our ‘imported_as’ array. The Support Vector Machine model works by separating the data points into ‘hyperplanes’ that have optimal decision boundaries.

  • Decision Tree

Here is the code implemented to create the Decision Tree model. It also appends the name and instance to the ‘classifier’ and ‘imported_as’ arrays, respectively. The Decision Tree model works by creating a ‘tree’ with multiple branches, with a binary question, leading to more branches.

  • Naive Bayes

In this code block, we create the Naive Bayes model and append its name/instance to the ‘classifier’ and ‘imported_as’ arrays. Naive Bayes is a model that attempts to perform classification by assuming all features are independent of each other. 

  • Random Forest Classifier

This is the code to create the Random Forest Classifier Model. We then append it to the ‘classifier’ array and the ‘imported_as’ array. The Random Forest Classifier works by creating multiple decision trees and weighing all their results before a decision.

Testing and measuring accuracy

We can now create our main class, that we will use to create an object with all the models so we can easily compare their accuracies. 

We create a class called ‘Modeling’ in which we have a few functions, ‘fit’ and ‘results’ these two can be called on the ‘Modeling’ class object to fit the models, and to get the results, respectively. Now, we will make an array that holds all the models we want to test. We could also use the ‘imported_as’ array, but for simplicity we will make a new array and name it ‘models_to_test’.

Finally, we can create an object of the ‘Modeling’ class and use our ‘fit’ and ‘results’ functions on it. We will use our previously made arrays of x and y training data, x and y testing data, and our ‘models_to_test’ array as the arguments. 

If it runs properly, you should be returned with ‘name of model’ has been fit for each of the models, and a pandas dataframe that holds the accuracy and runtime of each model. 

The Random Forest Classifier seems to be the most accurate. If you want to test this multiple times to see if the RFC model is ever beat, you can use a simple for loop to get the top model as many times as you’d like.

You can run it as many times as you like, and you’ll see that the Random Forest Classifier is always the most accurate. However, this comes with a price of a significantly higher runtime than the other models (0.4 seconds vs 0.01 and lower). This is why we trained multiple models, so we could compare which one would deliver enough accuracy while still being appropriate for the hardware it will run on. From these models, you can choose which one is most appropriate for your use case.

What’s next?

We now have a few accurate models to choose from for the purpose of predictive maintenance. However, the data used to train and test them is not our own. This project is meant to be easily adaptable to different sets of data, so next we will collect our own data for predictive maintenance and attempt to employ the model(s) on that dataset.

The Roadmap to Model Development Continued

Last week we took a look at the first two major steps in the process of model development, Data Preprocessing and Exploratory Data Analysis. In today’s article we will continue on the same roadmap, looking at two more steps that are arguably more interesting. These include the final data preprocessing to prep for model development, and developing the first few models.

Data Preprocessing for Prediction

What is the difference between the first data preprocessing step and the second? In the first step, you will remember that all we did was drop unwanted features. This is solely to prepare the data so we can perform Exploratory Data Analysis. In the second preprocessing step, we do all the other necessary actions for model deployment. 

  1. Encoding categorical features

Label encoding refers to converting the labels into a numeric form so as to convert them into the machine-readable form. We will use it on our training data, and using the “.fit” sci-kit method it will figure out the unique values and assign a value to it, returning the encoded labels.

This is what the code looks like:

  1. Splitting Test and Training Data

Every machine learning model needs test and training data to learn and improve. In this step, we separate the test and training data from the dataset consisting of ten thousand values. We use the “train_test_split” module from sci-kit and “.drop” from the pandas library to get four datasets, x_train, y_train, x_test, and y_test.

This is what the code looks like:

  1. Feature Scaling

Feature Scaling is a technique to normalize/standardize the independent features present in the data in a fixed range. It is done to handle highly varying magnitudes or values/units. Without feature scaling, a machine learning algorithm will weigh greater values higher and almost disregard smaller values, not taking into account the units. Note that we only do this to the x datasets, because the output does not need to be normalized.

This is what the code looks like:

Here is an image that shows how the distribution of the data is changed after feature scaling. 

Model Development

One of the final and most important in any model’s development, the actual creation of the model itself. For the best accuracy possible, we will be creating multiple models and comparing their accuracy to choose the best one. In this week’s post we will look at a couple and next week we will look at the rest. We will first create a place to store our model’s names and their actual instances.

Here is the code I used to do this:

  • The Logistic Regression Model

The Logistic Regression model is a supervised learning algorithm that investigates the relationship between a dependent and independent variable and produces results in a binary format (fail or not fail). This is used to predict failure.

Here is the code for the linear regression model, and for appending it to our arrays.

  • K- Nearest Neighbors Model

K-Nearest Neighbors is also a supervised machine learning algorithm which is mostly used to solve classification problems. It stores all the available cases and classifies new cases based on similarity. “K” is the number of nearest neighbors.

This image shows how k-nearest neighbors uses the location of data distribution to predict.

This is the code to implement the KNN model and append it:

In the next week’s blog post, we will finish the final step by creating more models, including Support Vector Machine, Random Forest, Naive Bayes, and Decision Trees. We will then test the accuracy of all of them to choose the most accurate one.