Logistic Regression

Overview

Logistic regression is a supervised learning method commonly used for classification and predictive analytics, particularly for binary classification problems. It applies a sigmoid (logistic) function to a linear combination of the independent variables and returns a probability value ranging from 0 to 1. For instance, suppose there are two classes: class 0 and class 1. If the logistic function value for an input is greater than 0.5 (the threshold value), the input is assigned to class 1; otherwise, it is assigned to class 0. It is called regression because it extends linear regression, but it is primarily employed for classification problems.

The primary difference between linear regression and logistic regression is that the output of logistic regression is bounded between 0 and 1. In addition, logistic regression does not require a linear relationship between the independent variables and the dependent variable. Linear regression outputs a continuous value that can take on any magnitude, whereas logistic regression predicts whether an instance belongs to a specific class or not.

Sigmoid Function

The sigmoid function is a mathematical function used to convert predicted values into probabilities. It maps any real value to a value between 0 and 1. Because a logistic regression output must lie between 0 and 1 and cannot exceed this range, the function forms an “S”-shaped curve known as the sigmoid curve. Logistic regression then uses a threshold value to turn this probability into a class: values above the threshold are classified as 1, whereas values below the threshold are classified as 0.
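As a minimal sketch of this idea (not taken from the analysis code), the sigmoid and the 0.5 threshold rule can be written as follows; the input scores are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) for three inputs
scores = np.array([-2.0, 0.3, 1.7])
probs = sigmoid(scores)             # approximately [0.12, 0.57, 0.85]
labels = (probs > 0.5).astype(int)  # threshold at 0.5 -> [0, 1, 1]
print(probs, labels)
```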

Data Preparation

The data used for logistic regression is Spotify data. Every row represents a song and every column represents an audio feature of that song. The data had already been cleaned.

Logistic regression is a supervised learning algorithm, which means that it requires labeled data. Some columns, such as the artist and track columns, were dropped. Below is the data before the splitting process.
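A minimal sketch of this preparation step, assuming the data is loaded into a pandas DataFrame (the file name is a placeholder, and the exact column labels may differ from the real dataset):

```python
import pandas as pd

# Placeholder file name; the actual dataset path may differ
df = pd.read_csv("spotify_songs.csv")

# Drop identifier columns that are not audio features
df = df.drop(columns=["artist", "track"])

print(df.head())
```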

The next step after preparing the data is to split the dataset into two subsets: a training set and a testing set. The training set is used to train the model; this is where the model learns patterns, relationships, and features in the data, and where it is adjusted and optimized. The training set keeps its labels, whereas the labels are separated from the testing set, which is used to evaluate the model’s performance. In this analysis, the data was split into a 70% training set and a 30% testing set.

Training Set

Testing Set

The important thing to note is that although the target variable (y) is separated from the testing set’s features, it is still kept aside and used to evaluate the model’s predictions.
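A minimal sketch of the 70/30 split, assuming the label column is named "hit" (a placeholder name for the target used in this analysis):

```python
from sklearn.model_selection import train_test_split

# "hit" is a placeholder for the actual label column name
X = df.drop(columns=["hit"])  # features
y = df["hit"]                 # target, kept separately for evaluation

# 70% training set, 30% testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```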

Results

The confusion matrix is typically used to assess the performance of a classification model. In this analysis, it is a 2×2 matrix that summarizes the results of the binary classification problem. Each cell in the matrix contains information about the model’s predictions and the actual outcomes. Here is how to interpret the elements of this confusion matrix:

  • True Positives (TP): The model correctly predicted positive instances (e.g., correctly identified an event as positive). 
  • True Negatives (TN): The model correctly predicted negative instances (e.g., correctly identified a non-event as negative). 
  • False Positives (FP): The model incorrectly predicted positive instances (e.g., it predicted an event when there was none). Also known as a Type I error. 
  • False Negatives (FN): The model incorrectly predicted negative instances (e.g., it failed to identify an event when there was one). Also known as a Type II error.

Moreover, accuracy, precision (positive predictive value), recall (sensitivity, true positive rate), and F1-score (a balance between precision and recall) can be calculated from the values above; a short sketch of this computation follows the list below.

  • Accuracy: The number of correct predictions over all predictions.
  • Precision: A measure of how many of the positive predictions made are correct (true positives).
  • Recall (Sensitivity): A measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. 
  • F1-score: The harmonic mean of precision and recall. It provides a balanced measure that considers both false positives and false negatives.
  • Support: The number of actual occurrences of each class in the dataset. It represents the number of instances in each class.
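A minimal sketch of fitting the model and computing these metrics with scikit-learn, reusing the split variables from the previous step; this is illustrative rather than the exact analysis code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Fit the logistic regression model on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the testing set and evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```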

The results of the logistic regression model are shown below.

From the results above:

  • 6157 represents the number of true negatives (TN): Instances that are actually in class 0 (negative) and were correctly classified as class 0. 
  • 0 represents the number of false positives (FP): Instances that are actually in class 0 (negative) but were incorrectly classified as class 1 (positive). 
  • 1303 represents the number of false negatives (FN): Instances that are actually in class 1 (positive) but were incorrectly classified as class 0 (negative). 
  • 0 represents the number of true positives (TP): Instances that are actually in class 1 (positive) and were correctly classified as class 1.

Accuracy measures the overall correctness of the model’s predictions. Here the overall accuracy is (TN + TP) / total = (6157 + 0) / 7460 ≈ 0.83, which means the model is correct for about 83% of the instances. However, this metric can be misleading on imbalanced datasets, since high accuracy can result from correctly classifying the majority class while completely missing the minority class, which is exactly what happens here: the model never predicts the positive class.
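To make the effect of this imbalance concrete, the per-class metrics for the hit class can be recomputed directly from the four counts reported above (a small illustrative calculation, not part of the original analysis code):

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 6157, 0, 1303, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)         # ~0.825
recall = tp / (tp + fn) if (tp + fn) else 0.0      # 0.0 -> every true hit is missed
precision = tp / (tp + fp) if (tp + fp) else 0.0   # undefined when no positives are predicted; reported as 0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)
```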

Conclusion

The logistic regression model was used to categorize songs as hit or non-hit based on the available dataset and its attributes. Several criteria were used to assess the model’s performance, including accuracy, precision, recall, and F1-score.

The model achieved an accuracy of approximately 83%, implying that it correctly classified 83% of the instances. However, it is important to highlight that the model’s performance is far from satisfactory, as evidenced by the other metrics.

Moreover, the precision and recall for the hit class are effectively zero. Because the model never predicts the hit class, there are no true positives: recall is 0 (all 1303 true hits are missed) and precision is undefined, conventionally reported as 0. The F1-score, which balances precision and recall, is therefore also very low, indicating that the model’s performance on the minority class needs substantial improvement.

In conclusion, although the logistic regression model shows potential for classifying songs as hit or non-hit, it may not be the most suitable model for this particular task given these results. The low precision and recall for the hit class are areas of concern that should be addressed to improve the model’s performance.