Classification and regression are two common types of machine learning tasks. The main difference between classification and regression is the type of output they produce.
In classification, the output is a categorical variable or label that assigns each input data point to a specific class. The goal of classification is to learn a decision boundary that separates the different classes in the data. For example, classifying whether an email is spam or not spam, or whether a tumor is malignant or benign.
In regression, the output is a continuous numerical variable. The goal of regression is to learn a mathematical function that predicts a numerical value based on the input data. For example, predicting the price of a house based on its features such as location, size, and number of bedrooms.
Another important difference between classification and regression is the type of evaluation metrics used. In classification, common evaluation metrics include accuracy, precision, recall, and F1-score, while in regression, common evaluation metrics include mean squared error, mean absolute error, and R-squared. In summary, classification is used to predict categorical outcomes, while regression is used to predict continuous numerical outcomes.
Classification Predictive Modelling
Classification predictive modelling is a type of machine learning task where the goal is to predict the class or category of a new data point based on its features. This is accomplished by training a classification model on a labelled dataset, where the class labels are known for each data point. It is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
The classification accuracy is the percentage of correctly classified examples out of all predictions made.
For example, if a classification predictive model made 5 predictions and 3 of them were correct and 2 of them were incorrect, then the classification accuracy of the model based on just these predictions would be:
accuracy = correct predictions / total predictions * 100
accuracy = 3 / 5 * 100
accuracy = 60%
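The worked example above can be reproduced in a few lines of Python. The helper function and the spam/ham labels here are purely illustrative, not part of the original text:

```python
def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) * 100

# 5 predictions, of which 3 are correct (positions 1, 3, and 5).
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

print(accuracy(y_true, y_pred))  # 60.0
```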
An algorithm that is capable of learning a classification predictive model is called a classification algorithm.
The process of building a classification model involves several steps:
- Data preparation: This involves collecting, cleaning, and preprocessing the data to ensure it is in a format suitable for modelling. This includes removing missing values, handling outliers, and transforming features as necessary.
- Feature selection: This involves selecting the most important features that are likely to be predictive of the target class. This can be done using various feature selection techniques, such as correlation analysis or feature importance rankings.
- Model selection: This involves choosing an appropriate classification algorithm, such as logistic regression, decision trees, or support vector machines. The choice of algorithm depends on the nature of the data and the problem at hand.
- Model training: This involves using the labelled data to train the chosen model on the selected features.
- Model evaluation: This involves assessing the performance of the trained model on a separate test dataset, which was not used during training. Common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and ROC-AUC.
- Model deployment: Once the model has been trained and evaluated, it can be deployed in a production environment to make predictions on new, unseen data points.
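The steps above can be sketched end to end in a few lines, assuming scikit-learn is available. The dataset here is synthetic (generated by `make_classification`) and logistic regression is just one of the algorithm choices the text mentions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation: a small synthetic labelled dataset (features X, labels y).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set that is not used during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model selection and training: logistic regression on the training split.
model = LogisticRegression()
model.fit(X_train, y_train)

# Model evaluation: accuracy on the unseen test data.
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

A real project would add feature selection, cross-validation, and a deployment step, but the train/evaluate split shown here is the core of the workflow.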
Classification predictive modelling has many real-world applications, including spam detection, fraud detection, sentiment analysis, and disease diagnosis, among others.
Regression Predictive Modelling
Regression predictive modelling is a type of machine learning task where the goal is to predict a numerical value or continuous variable based on the input features. This is accomplished by training a regression model on a labelled dataset, where the target variable is a continuous numerical value.
There are many ways to estimate the skill of a regression predictive model, but perhaps the most common is to calculate the root mean squared error, abbreviated as RMSE.
For example, if a regression predictive model made 2 predictions, one of 1.5 where the expected value is 1.0, and another of 3.3 where the expected value is 3.0, then the RMSE would be:
RMSE = sqrt(average(error^2))
RMSE = sqrt(((1.0 - 1.5)^2 + (3.0 - 3.3)^2) / 2)
RMSE = sqrt((0.25 + 0.09) / 2)
RMSE = sqrt(0.17)
RMSE = 0.412
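The same calculation in Python, using only the standard library; the `rmse` helper is illustrative, not from the original text:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# The two predictions from the worked example above.
y_true = [1.0, 3.0]
y_pred = [1.5, 3.3]

print(round(rmse(y_true, y_pred), 3))  # 0.412
```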
A benefit of RMSE is that the units of the error score are in the same units as the predicted value.
An algorithm that is capable of learning a regression predictive model is called a regression algorithm.
Some algorithms have the word “regression” in their name, such as linear regression and logistic regression, which can make things confusing because linear regression is a regression algorithm whereas logistic regression is a classification algorithm.
The process of building a regression model involves several steps:
- Data preparation: This involves collecting, cleaning, and preprocessing the data to ensure it is in a format suitable for modelling. This includes removing missing values, handling outliers, and transforming features as necessary.
- Feature selection: This involves selecting the most important features that are likely to be predictive of the target variable. This can be done using various feature selection techniques, such as correlation analysis or feature importance rankings.
- Model selection: This involves choosing an appropriate regression algorithm, such as linear regression, decision trees, or random forest. The choice of algorithm depends on the nature of the data and the problem at hand.
- Model training: This involves using the labelled data to train the chosen model on the selected features.
- Model evaluation: This involves assessing the performance of the trained model on a separate test dataset, which was not used during training. Common evaluation metrics for regression models include mean squared error, mean absolute error, and R-squared.
- Model deployment: Once the model has been trained and evaluated, it can be deployed in a production environment to make predictions on new, unseen data points.
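The regression workflow above can be sketched the same way, again assuming scikit-learn is available; the dataset is synthetic and linear regression stands in for whichever algorithm is chosen:

```python
import math

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Data preparation: a synthetic dataset with a continuous target.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Hold out a test set that is not used during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model selection and training: ordinary linear regression.
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation: RMSE on the unseen test data, in the target's own units.
rmse = math.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(rmse)
```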
Regression predictive modelling has many real-world applications, including stock price prediction, housing price prediction, demand forecasting, and customer churn prediction, among others.
Classification vs Regression
Classification predictive modelling problems are different from regression predictive modelling problems.
- Classification is the task of predicting a discrete class label.
- Regression is the task of predicting a continuous quantity.
There is some overlap between the algorithms for classification and regression; for example:
- A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.
- A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.
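The first kind of overlap is easy to see in code: a trained classifier exposes both a discrete prediction and the continuous probabilities behind it. A minimal sketch, assuming scikit-learn is available and using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# predict() returns a discrete class label...
label = model.predict(X[:1])[0]
print(label)

# ...while predict_proba() returns continuous per-class probabilities
# (which sum to 1) for the same data point.
proba = model.predict_proba(X[:1])[0]
print(proba)
```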
Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Some algorithms cannot, or cannot easily, be used for both problem types, such as linear regression for regression predictive modelling and logistic regression for classification predictive modelling.
Importantly, the ways that we evaluate classification and regression predictions differ and do not overlap, for example:
- Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
- Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.