Logistic regression is used to obtain odds ratios when there is more than one explanatory variable and the response variable is binomial. The aim is to analyze the impact of each variable on the odds of the observed event of interest.
How Does Logistic Regression Work?
A logistic regression model gives the odds of an outcome based on individual characteristics. Because odds are a ratio, what is actually modeled is the logarithm of the odds, given by:
log(π/(1-π))=β_0+β_1 x_1 + β_2 x_2+ ...+β_m x_m
π indicates the probability of an event
βi are the regression coefficients associated with the explanatory variables.
xi are the explanatory variables.
β0 is the intercept. It represents the reference group: those individuals presenting the reference level of each and every variable x1…m.
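As a minimal sketch of how the equation above is used, the snippet below converts between probability and log-odds and evaluates the right-hand side of the model. The function names and the coefficient values in the example call are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(log_odds):
    """Probability that corresponds to a given log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predicted_probability(intercept, coefficients, x):
    """Evaluate beta_0 + sum(beta_i * x_i), then map the log-odds to a probability."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return inverse_logit(log_odds)

# Hypothetical intercept, coefficients and covariate values (illustration only).
p = predicted_probability(-1.0, [0.5, 2.0], [1, 0])
print(p, logit(p))
```

Taking the logit of the returned probability recovers the linear predictor, which is why the coefficients are interpreted on the log-odds scale.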
Let’s walk through an example from a fictional study that compared the effects of two drug treatments for Staphylococcus aureus (SA) endocarditis (Table 1).
Table 1 Results from a fictional endocarditis treatment study by McHugh
Outcome | Standard Treatment | New Treatment | Totals |
---|---|---|---|
Died | 152 | 17 | 169 |
Survived | 248 | 103 | 351 |
Totals | 400 | 120 | 520 |
The odds ratio (OR) of death for patients on the standard treatment is (152 × 103)/(248 × 17) = 3.71, which means that the odds of dying for patients on the standard treatment are 3.71 times the odds for patients on the new treatment. More complex problems arise when we are interested in the relationship between two or more explanatory variables and one response variable (Table 2).
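The calculation can be sketched in a few lines using the counts in Table 1. The function name `odds_ratio` and the cell labels a, b, c, d are illustrative choices, not notation from the study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: events in group 1        b: events in group 2
    c: non-events in group 1    d: non-events in group 2
    """
    return (a * d) / (b * c)

# Table 1: group 1 = standard treatment, group 2 = new treatment, event = death.
or_treatment = odds_ratio(a=152, b=17, c=248, d=103)

# The same number, written as the ratio of the two groups' odds of death.
odds_standard = 152 / 248
odds_new = 17 / 103

print(round(or_treatment, 2), round(odds_standard / odds_new, 2))
```

The same function applies to any 2x2 table of counts, as long as the cells are passed in the same order.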
Table 2 Results from a fictional endocarditis treatment study by McHugh looking at age
Outcome | Younger (30-45 yrs.) | Older (46-60 yrs.) | Totals |
---|---|---|---|
Died | 120 | 49 | 169 |
Survived | 217 | 134 | 351 |
Totals | 337 | 183 | 520 |
The OR is (120 × 134)/(217 × 49) = 1.51, meaning that the odds of death for a younger individual (30 to 45 years old) are about 1.5 times the odds of death for an older individual (46 to 60 years old).
Now we have two variables related to the event of interest (death) in individuals with SA endocarditis.
Table 3 shows the effect of treatment on death from endocarditis within each age group.
Table 3 Effect of treatment on endocarditis stratified by age.
Age group | Outcome | Standard Treatment | New Treatment | Totals | OR |
---|---|---|---|---|---|
Older (46-60 yrs.) | Died | 43 | 6 | 49 | 2.44 |
Older (46-60 yrs.) | Survived | 100 | 34 | 134 | |
Older (46-60 yrs.) | Totals | 143 | 40 | 183 | |
Younger (30-45 yrs.) | Died | 109 | 11 | 120 | 4.62 |
Younger (30-45 yrs.) | Survived | 148 | 69 | 217 | |
Younger (30-45 yrs.) | Totals | 257 | 80 | 337 | |
Table 3 shows that the impact of treatment is greater in younger individuals, because the OR in the younger subgroup is higher than in the older subgroup. The problem here is that it would be incorrect to look at the treatment results without considering the patient's age. To solve this problem, we should calculate a “weighted” OR, for example with the Mantel-Haenszel OR equation, where n is the sample size of age class i, and a, b, c, and d are the table cells of that age class (a: died on standard treatment, b: died on new treatment, c: survived on standard treatment, d: survived on new treatment), as presented by McHugh.
Mantel-Haenszel OR = ∑_i (a_i d_i / n_i) / ∑_i (b_i c_i / n_i)
The result is that the weighted odds of death associated with the standard treatment are 3.74 times the odds of death for individuals taking the new treatment. As the number of variables increases, the calculations become more complicated, and continuous variables like age must be split at a cut-off point to be categorized (in this example, the cut-off was set arbitrarily at 45 years old). A better approach is to use logistic regression.
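The weighted OR can be checked with a short sketch that applies the Mantel-Haenszel formula to the stratified counts in Table 3. The function name and the tuple layout are illustrative assumptions; the cell order follows OR = (a × d)/(b × c) as used in the earlier calculations.

```python
# Each stratum is (a, b, c, d):
#   a = died on standard treatment,     b = died on new treatment,
#   c = survived on standard treatment, d = survived on new treatment.
strata = [
    (43, 6, 100, 34),    # older (46-60 yrs.)
    (109, 11, 148, 69),  # younger (30-45 yrs.)
]

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across 2x2 strata."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d          # sample size of this stratum
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Stratum-specific ORs, for comparison with the OR column of Table 3.
stratum_ors = [(a * d) / (b * c) for a, b, c, d in strata]
pooled_or = mantel_haenszel_or(strata)

print([round(x, 2) for x in stratum_ors], round(pooled_or, 2))
```

The pooled value lies between the two stratum-specific ORs, which is what a weighted average across age classes should produce.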
Let’s apply logistic regression to this example, using a full model that includes both explanatory variables (age and treatment).
Table 4 Results from multivariate logistic regression model containing all explanatory variables (full model).
Term | β Estimate | Standard Error | P Value |
---|---|---|---|
Intercept (β 0) | -2.121 | 0.303 | <0.001 |
Age: Younger (β 1) | 0.454 | 0.207 | 0.028 |
Treatment: Standard (β 2) | 1.333 | 0.283 | <0.001 |
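The estimates in Table 4 can be approximately reproduced from the stratified counts in Table 3. The sketch below is one way to do it: a pure-Python Newton-Raphson maximum likelihood fit on the four age-by-treatment groups. It is an illustration under that assumption, not necessarily the procedure or software behind the table; the helper names `solve` and `fit_logistic` are hypothetical.

```python
import math

# Grouped data from Table 3: (younger, standard, deaths, total).
# younger and standard are 0/1 indicators; (0, 0) is the reference group.
groups = [
    (0, 0, 6, 40),     # older, new treatment
    (0, 1, 43, 143),   # older, standard treatment
    (1, 0, 11, 80),    # younger, new treatment
    (1, 1, 109, 257),  # younger, standard treatment
]

def solve(a, b):
    """Solve the small linear system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(groups, iterations=25):
    """Newton-Raphson fit of log-odds = b0 + b1*younger + b2*standard."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score = [0.0] * 3                      # gradient of the log-likelihood
        info = [[0.0] * 3 for _ in range(3)]   # Fisher information matrix
        for younger, standard, deaths, total in groups:
            x = [1.0, younger, standard]
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(3):
                score[j] += x[j] * (deaths - total * p)
                for k in range(3):
                    info[j][k] += x[j] * x[k] * total * p * (1.0 - p)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

beta0, beta1, beta2 = fit_logistic(groups)
print(round(beta0, 3), round(beta1, 3), round(beta2, 3))
```

In practice a statistics package would also report the standard errors and P values; this sketch only recovers the coefficient estimates.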
β0: the intercept. exp(β0) = exp(-2.121) = 0.12 is the odds of death among the reference individuals, those who are older and received the new treatment.
β1: individuals who are younger. exp(β1) = exp(0.454) = 1.57. Younger individuals who received the new treatment have odds of death 1.57 times those of the reference individuals.
β2: individuals who received the standard treatment. exp(β2) = exp(1.333) = 3.79. Older individuals who received the standard treatment have odds of death 3.79 times those of the reference individuals.
If individuals are younger and received the standard treatment, we calculate exp(β1 + β2) = exp(1.787) = 5.97, so their odds of death are 5.97 times the odds of the reference individuals.
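The exponentiation steps can be sketched as follows, taking the coefficients as reported in Table 4. The function name `odds_of_death` is a hypothetical helper. The conversion from odds to probability in the loop is a standard identity added for illustration; the source does not carry out that step.

```python
import math

# Coefficients as reported in Table 4.
beta0, beta1, beta2 = -2.121, 0.454, 1.333

def odds_of_death(younger, standard):
    """Model odds of death; younger and standard are 0/1 indicators."""
    return math.exp(beta0 + beta1 * younger + beta2 * standard)

# Reference group: older individuals on the new treatment.
reference_odds = odds_of_death(0, 0)

for younger in (0, 1):
    for standard in (0, 1):
        odds = odds_of_death(younger, standard)
        ratio = odds / reference_odds   # odds ratio versus the reference group
        prob = odds / (1.0 + odds)      # odds converted to a probability
        print(f"younger={younger} standard={standard} "
              f"odds={odds:.2f} OR={ratio:.2f} probability={prob:.3f}")
```

Because the model is additive on the log-odds scale, the odds ratio for the younger, standard-treatment group equals the product of the two individual odds ratios.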
This is a basic interpretation of a logistic regression model. In practice, issues can arise during the analysis, and the results might not be this readily available. It is important to pay attention when constructing the model and to make deliberate modeling decisions rather than simply feeding raw data to the software.
Logistic regression can be used in business to solve binary classification problems such as predicting customer churn, fraud detection, medical diagnosis, credit risk assessment, market segmentation, and employee retention.
Logistic Regression Application in the Construction Industry
Quality Control in Construction:
Scenario: A construction company wants to ensure the quality of its products (e.g., concrete blocks, steel beams, or prefabricated components).
Application: Logistic regression can predict whether a product meets quality standards based on features like material composition, dimensions, and manufacturing process parameters. By analyzing historical data, the model can identify factors that contribute to defects or non-compliance. This information helps improve production processes and reduce waste.
Safety Compliance Prediction:
Scenario: A construction site aims to prevent accidents and ensure compliance with safety regulations.
Application: Logistic regression can analyze safety-related variables (e.g., worker experience, equipment usage, weather conditions) to predict the likelihood of safety violations or accidents. By identifying high-risk situations, safety protocols can be reinforced, and preventive measures can be implemented.
Project Delay Prediction:
Scenario: Construction projects often face delays due to unforeseen circumstances.
Application: Logistic regression can assess project-related factors (e.g., weather, resource availability, subcontractor performance) to predict the likelihood of delays. By understanding critical risk factors, project managers can allocate resources effectively and mitigate potential delays.
Bid Acceptance Probability:
Scenario: Construction firms submit bids for projects, and winning bids are crucial for business growth.
Application: Logistic regression can analyze bid-related features (e.g., bid amount, project complexity, competitor bids) to estimate the probability of winning a contract. By optimizing bidding strategies, companies can increase their chances of securing profitable projects.
Equipment Maintenance Prediction:
Scenario: Construction equipment (e.g., cranes and excavators) requires regular maintenance to prevent breakdowns.
Application: Logistic regression can predict the likelihood of equipment failure based on usage patterns, maintenance history, and environmental conditions. By scheduling preventive maintenance when the risk is high, companies can minimize downtime and repair costs.
I have a small favor to ask: if you find this information useful, please share this blog with other business owners who might find it useful as well. I will be putting a lot of effort into posting regular content to share knowledge about business and how data analytics can be used to improve companies. Thank you!