Machine Learning - Logistic Regression Algorithm

Exploring Logistic Regression, a key machine learning tool for binary classification, predicting outcomes with precision using a sigmoid function.

Posted by Judith Winkler MBA on March 19th, 2024

Logistic Regression is a supervised machine learning algorithm used for classification tasks: it predicts the probability that an instance belongs to a given class. It is most often used for binary classification. It takes the independent variables as input, combines them linearly, and passes the result through a sigmoid function to produce a probability between 0 and 1. For example, it can be used to estimate the probability of a heart attack, the likelihood of enrolling in a university, or whether an email is spam.

Logistic Regression Key Points:

  • Binary Classification- Logistic Regression is frequently used for binary classification; it determines which of two classes an instance belongs to (e.g. Yes/No, True/False, 0/1). It predicts the outcome of a categorical dependent variable. Linear regression fits a straight regression line; logistic regression instead fits an “S”-shaped logistic function whose output is bounded between 0 and 1.
  • Sigmoid Function- Logistic Regression employs the sigmoid function (logistic function) to map predicted values to probabilities, ensuring that the output lies between 0 and 1.
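
As a minimal sketch of the sigmoid mapping described above, the following Python snippet shows how any real-valued score is squeezed into the range 0 to 1 and then turned into a class label. The function names and the 0.5 cut-off are illustrative assumptions, not something this post prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign of z so exp() never receives a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into a 0/1 label using an assumed threshold."""
    return 1 if sigmoid(z) >= threshold else 0

# Negative scores map below 0.5, zero maps to exactly 0.5, positive scores above 0.5.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

The further the score is from zero, the closer the probability gets to 0 or 1, which is what gives the curve its “S” shape.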

Logistic regression can be classified into binomial, multinomial, and ordinal logistic regression. Binomial means that the dependent variable can take only two possible values, such as 0 or 1, Pass or Fail, True or False, etc. Multinomial logistic regression means that the dependent variable can take three or more unordered values, such as “cats”, “dogs”, or “sheep”. In ordinal logistic regression, the dependent variable can take three or more ordered values, such as “low”, “medium”, or “high”.
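
For the multinomial case, the two-class sigmoid is commonly generalized to the softmax function, which turns one score per class into probabilities that sum to 1. This post does not name softmax, so treat the sketch below as an assumption about how a multinomial model would typically produce its output; the scores are made up for illustration.

```python
import math

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert one raw score per class into probabilities that sum to 1."""
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical linear scores for one instance, one per unordered class.
probs = softmax({"cats": 2.0, "dogs": 1.0, "sheep": 0.1})
predicted = max(probs, key=probs.get)  # class with the highest probability
print(probs, predicted)
```

With only two classes, where one score is fixed at zero, softmax reduces to the ordinary sigmoid used in binomial logistic regression.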

Logistic Regression Assumptions:

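  1. For binomial logistic regression, the dependent variable is binary or dichotomous, meaning that it can only have two possible outcomes, such as true/false or pass/fail.
  2. Independent observations, meaning that each observation (row) is unrelated to the others, for example no repeated measurements of the same subject. In addition, the input variables should not be highly correlated with one another (little or no multicollinearity).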
  3. A linear relationship between independent variables and log odds.
  4. There should be no extreme outliers or highly influential observations in the dataset.
  5. It prefers a large sample size.
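
Assumption 3 can be made concrete with a short sketch. Using hypothetical coefficients B0 and B1 (chosen only for illustration), the code below shows that the predicted probability follows an S-curve in x, while the log odds of that probability move in a straight line: each one-unit increase in x adds exactly B1.

```python
import math

B0, B1 = -3.0, 0.8  # hypothetical intercept and coefficient

def probability(x: float) -> float:
    """Logistic function: P(y = 1 | x) for the assumed coefficients."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))

def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p), also called the logit."""
    return math.log(p / (1.0 - p))

# The probability column bends; the log-odds column rises by B1 per step.
for x in range(0, 6):
    p = probability(x)
    print(x, round(p, 3), round(log_odds(p), 3))
```

If a plot of the observed log odds against a predictor is clearly curved, that is a hint the linearity assumption does not hold for that variable.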

Logistic Regression Terminology

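  • Independent variables- the predictor (input) variables used to predict the dependent variable.
  • Dependent variable- the target variable, which we are trying to predict.
  • Logistic function- converts the linear combination of the independent variables into the probability that the dependent variable equals 1. With a single independent variable x, intercept b0, and coefficient b1, it is written as: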

    y = e^(b0 + b1x) / (1 + e^(b0 + b1x))

  • Odds Ratio- represents the odds that an outcome will occur given a particular exposure compared to the odds of the outcome occurring in the absence of that exposure. For a continuous variable, the OR is the factor by which the odds change for each one-unit increase; in logistic regression it equals e raised to the variable’s coefficient. If the OR is greater than 1, the event becomes more likely as the variable increases. On the other hand, if the OR is less than 1, the event becomes less likely as the variable increases.
  • Log-odds- the natural logarithm of the odds, ln(p / (1 - p)), where p is the probability of the event; also called the logit. Logistic regression models the log-odds as a linear function of the independent variables.
  • Coefficient- the change in the log-odds of the outcome for a one-unit increase in the corresponding independent variable, holding the other variables constant.
  • Intercept- represents the log odds when all independent variables are equal to zero.
  • Maximum Likelihood Estimation (MLE)- the method used to estimate the coefficients of the logistic regression model: it chooses the coefficient values that make the observed data most probable.
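
To tie these terms together, here is a minimal sketch of maximum likelihood estimation for a one-variable model, using plain gradient ascent on the log-likelihood. The data set (hours studied versus pass/fail), the learning rate, and the number of steps are all invented for illustration, and gradient ascent is just one simple way to maximize the likelihood; library implementations typically use faster solvers.

```python
import math

# Hypothetical data: hours studied (x) and pass = 1 / fail = 0 (y).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def predict(b0: float, b1: float, x: float) -> float:
    """Logistic function: probability that y = 1 for input x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0: float, b1: float) -> float:
    """Log of the probability of the observed labels under (b0, b1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(lr: float = 0.01, steps: int = 20000) -> tuple[float, float]:
    """Climb the log-likelihood surface from (0, 0) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(y - predict(b0, b1, x) for x, y in zip(xs, ys))
        g1 = sum((y - predict(b0, b1, x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

b0, b1 = fit()
print("intercept (log odds at x = 0):", b0)
print("coefficient (change in log odds per extra hour):", b1)
print("odds ratio per extra hour:", math.exp(b1))
print("P(pass | 3 hours):", predict(b0, b1, 3.0))
```

Because passing becomes more common as hours increase in this made-up data, the fitted coefficient is positive and the odds ratio is greater than 1.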

Logistic Regression Model Evaluation

  • Accuracy- provides the proportion of correctly classified instances.
  • Precision- measures the proportion of predicted positive instances that are actually positive.
  • Recall (Sensitivity or True Positive Rate)- measures the proportion of correctly predicted positive instances among all actual positive instances.
  • F1 Score- the harmonic mean of precision and recall.
  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC)- The ROC curve plots the true positive rate against the false positive rate at various thresholds. AUC-ROC measures the area under this curve, providing an aggregate measure of a model’s performance across different classification thresholds.
  • Area Under the Precision-Recall Curve (AUC-PR)- measures the area under the precision-recall curve, providing a summary of a model’s performance across different precision-recall trade-offs.
  • Threshold Setting- choosing the threshold value is important in logistic regression. The right value depends on the classification problem itself, because it controls the trade-off between precision and recall. The two typical choices are Low Precision/High Recall (a lower threshold, to reduce the number of false negatives without necessarily reducing the number of false positives) or High Precision/Low Recall (a higher threshold, to reduce the number of false positives without necessarily reducing the number of false negatives).
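
The metrics above can be computed directly from the counts of true/false positives and negatives. The sketch below uses a small set of invented labels and predicted probabilities and compares three assumed thresholds to show the precision/recall trade-off. The helper names are illustrative and do not come from any particular library.

```python
# Hypothetical true labels and predicted probabilities from some model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

def metrics(labels, probs, threshold):
    """Accuracy, precision, recall and F1 at a given probability threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(labels, probs):
    """AUC-ROC as the chance that a random positive outscores a random
    negative, with ties counted as half."""
    pos = [p for t, p in zip(labels, probs) if t == 1]
    neg = [p for t, p in zip(labels, probs) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

for threshold in (0.25, 0.5, 0.75):
    print(threshold, metrics(y_true, y_prob, threshold))
print("AUC-ROC:", roc_auc(y_true, y_prob))
```

On this toy data, raising the threshold from 0.25 to 0.75 lifts precision from about 0.57 to 1.0 while recall falls from 1.0 to 0.5, which is the trade-off described in the Threshold Setting point. AUC-ROC does not depend on any single threshold, so it stays the same throughout.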

I have a small favor to ask: if you find this information useful, please share this blog with other business owners who might find it useful as well. I will be putting a lot of effort into posting regular content to help share knowledge about all things related to business and how data analytics can be used to improve companies. Thank you!