Forecasting with Relationships: Correlation and Regression
This chapter explores the use of correlation and regression in forecasting, essential skills for management accounting. It covers calculating and interpreting the correlation coefficient, building and evaluating simple linear regression models, and applying the results to budgeting and planning.
Learning objectives
- Calculate and interpret the correlation coefficient to judge the direction and strength of a linear relationship between two variables.
- Build a simple linear regression model to predict a cost or revenue figure from a single activity driver.
- Assess how well a regression line fits the data using r-squared and a structured review of residuals.
- Recognise forecasting risks including outliers, non-linearity, and the dangers of extrapolating beyond the data range.
- Apply correlation and regression results in a practical budgeting and planning context.
Overview & key concepts
Forecasting in management accounting often starts with a simple question: does a change in an activity driver tend to be associated with a change in cost or revenue? Examples include machine hours and maintenance cost, deliveries and distribution cost, or website visits and sales orders.
Correlation and regression help quantify relationships using observed data. They support more defensible forecasts than “rule of thumb” estimates, but they must be applied with judgement.
These tools inform planning and budgeting; they do not create accounting entries and do not directly change reported assets, liabilities, or equity.
Correlation coefficient (r)
What r measures
The correlation coefficient, r, summarises how strongly two variables move together in a straight-line (linear) sense:
- r = +1: perfect positive linear relationship
- r = –1: perfect negative linear relationship
- r ≈ 0: no meaningful linear pattern (a non-linear relationship may still exist)
Correlation does not prove cause-and-effect. It describes co-movement in the observed data.
Positive and negative correlation
- Positive correlation: higher X tends to occur with higher Y (e.g., more machine hours with higher maintenance cost).
- Negative correlation: higher X tends to occur with lower Y (e.g., higher price with lower quantity sold, in some markets).
No (linear) correlation
If r is close to zero, a straight-line relationship is weak. This may occur because the relationship is genuinely weak, because it is non-linear, or because other drivers are influencing Y.
A scatter plot is a useful first step before running calculations.
Simple linear regression
The fitted prediction equation
Simple linear regression fits a straight line to the observed data to predict a dependent variable Y from an independent variable X:
Ŷ = a + bX
- Ŷ: predicted (forecast) value of Y
- Y: actual observed value
- X: driver used to predict Y
- b (slope): estimated change in Ŷ for a one-unit increase in X
- a (intercept): predicted value of Ŷ when X = 0
(Conceptually, you may think of the underlying relationship as Y ≈ a + bX, but forecasting uses the fitted equation Ŷ = a + bX.)
Interpreting slope and intercept in business terms
In many cost-estimation settings:
- Intercept (a) can be treated as a baseline element of cost (often described as a fixed element).
- Slope (b) can be treated as the variable cost rate per unit of activity (e.g., £ per machine hour).
This interpretation is only sensible if the operating range makes X = 0 meaningful or at least not wildly unrealistic. If X = 0 is outside any plausible operating range, treat the intercept as a fitted parameter rather than a literal cost at zero activity.
A useful way to phrase this in performance management is: treat the fitted line as a decision aid, not a law of nature—relationships can shift when operating conditions change.
Residuals and r-squared
Residuals (errors)
A residual is:
Residual = Y − Ŷ
Residuals help assess whether the model is systematically wrong. A useful model typically produces residuals that:
- are mixed positive and negative
- show no clear pattern as X increases
- are not dominated by one or two extreme points
Residual plots are often more informative than a table because patterns are easier to see visually (curvature, widening spread, long runs of the same sign).
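The sign-mixing and run-length checks above can be sketched in a few lines of Python (the run-length threshold is an illustrative rule of thumb, not a fixed standard, and the function names are my own):

```python
def residuals(y_actual, y_pred):
    """Residual = actual minus predicted (Y − Ŷ) for each observation."""
    return [y - yhat for y, yhat in zip(y_actual, y_pred)]

def looks_pattern_free(res):
    """Crude screen: residuals should mix signs and avoid long runs of
    the same sign (a long run suggests curvature or drift)."""
    signs = [1 if r >= 0 else -1 for r in res]
    has_both_signs = (1 in signs) and (-1 in signs)
    longest_run, run = 1, 1
    for prev, curr in zip(signs, signs[1:]):
        run = run + 1 if prev == curr else 1
        longest_run = max(longest_run, run)
    return has_both_signs and longest_run <= 3  # threshold is a rule of thumb

res = residuals([1180, 1290, 1430], [1175.24, 1295.81, 1416.38])
print(looks_pattern_free(res))  # True: signs mix and there are no long runs
```

A plot remains the better diagnostic; a screen like this only flags the most obvious problems.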
r-squared (coefficient of determination)
r-squared (r²) indicates how much of the variation in Y is explained by the fitted line. It ranges from 0 to 1.
In simple linear regression with an intercept, r² equals the square of the Pearson correlation coefficient (r).
A high r² is helpful, but it is not proof that the relationship will hold in the future.
Outliers and extrapolation
Outliers
An outlier is an observation that sits unusually far from the overall pattern. Outliers can distort r, the slope, and the intercept. Always ask why the point is unusual: an operational event, a one-off cost, or a data error.
Extrapolation
Extrapolation means forecasting outside the range of observed X values. It is risky because relationships can change beyond the data range due to step costs, capacity limits, or operational changes.
Where extrapolation is unavoidable, present the forecast with scenarios or sensitivity ranges.
Core theory and frameworks
Deciding on a relationship model
Before applying correlation or regression:
- Define Y and X clearly (what is being predicted, and what driver is being used).
- Check units and logic (the driver should plausibly influence the outcome).
- Plot the data (scatter plot) to assess whether a straight line is reasonable.
- Check data consistency (same time period, consistent cost classification, comparable conditions).
Calculating correlation (Pearson’s r)
Correlation measures how tightly the data points cluster around a straight line. It is calculated from paired observations (X, Y) using:
r = [ n∑XY − (∑X)(∑Y) ] / √( [ n∑X² − (∑X)² ] × [ n∑Y² − (∑Y)² ] )
Where n is the number of paired observations.
A practical way to work is to compute:
- the numerator first (how X and Y move together), then
- divide by the square root of the product of the two variability terms.
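The summary-total formula translates directly into code; a minimal sketch (the function name `pearson_r` is my own):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via the summary-total formula above."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy                     # how X and Y move together
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive linear relationship)
```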
How to interpret r
- Sign: “+” means X and Y tend to rise together; “–” means they tend to move in opposite directions.
- Magnitude: values closer to 1 in absolute terms indicate a stronger linear pattern.
- Sense-check: always test plausibility; correlation alone does not prove cause.
Building a simple linear regression model
The fitted prediction equation is:
Ŷ = a + bX
Slope:
b = [ n∑XY − (∑X)(∑Y) ] / [ n∑X² − (∑X)² ]
Intercept:
a = [ ∑Y − b∑X ] / n
Then the forecast for any chosen X value is:
Ŷ = a + bX
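The slope and intercept formulas can be sketched the same way (function names are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a via the summary-total formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

def forecast(a, b, x):
    """Ŷ = a + bX for a chosen driver value X."""
    return a + b * x

a, b = fit_line([1, 2, 3], [5, 7, 9])  # data lying exactly on Y = 3 + 2X
print(a, b)                # 3.0 2.0
print(forecast(a, b, 10))  # 23.0
```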
Evaluating model quality
Quick model review (exam-friendly):
- Fit: does r² indicate the driver explains a meaningful share of movement in Y?
- Errors: do residuals look small and pattern-free (no curvature, no widening spread, no long runs of the same sign)?
- Business meaning: are the implied baseline cost (a) and rate (b) sensible in units and scale?
- Context stability: have policy, capacity, equipment, product mix, or operating conditions changed in a way that could break the relationship?
- Range: is the forecast inside the observed X range? If not, state the extrapolation risk and use scenarios.
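The range check in the list above can be sketched as a small guard (function name and message wording are my own):

```python
def check_forecast_range(x_new, xs):
    """Flag extrapolation: is the forecast point inside the observed X range?"""
    lo, hi = min(xs), max(xs)
    if lo <= x_new <= hi:
        return "interpolation: inside observed range"
    return f"extrapolation: outside observed range [{lo}, {hi}] - state the risk"

print(check_forecast_range(300, [120, 150, 180, 210, 240, 270]))
```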
Exam technique focus
In a correlation/regression forecasting question, strong answers typically:
- define variables clearly (Y = dependent, X = independent)
- state the direction of the relationship (positive or negative) and what that means for forecasts
- show the method to compute b and then a
- state the fitted equation neatly (Ŷ = a + bX) and use it to forecast
- interpret b in units (e.g., £ per machine hour) and a as a baseline estimate (with caution if X = 0 is unrealistic)
- comment on r and/or r² to support the strength of the relationship
- state limitations (outliers, non-linearity, extrapolation, missing drivers, operational change)
Worked example
Narrative scenario
A manufacturing company records machine hours and total maintenance cost over six months:
- Month 1: 120 machine hours, £1,180
- Month 2: 150 machine hours, £1,290
- Month 3: 180 machine hours, £1,430
- Month 4: 210 machine hours, £1,520
- Month 5: 240 machine hours, £1,650
- Month 6: 270 machine hours, £1,790
The company wants to forecast maintenance cost at 300 machine hours.
Required
- Calculate the correlation coefficient (r) between machine hours and maintenance cost.
- Build a simple linear regression model.
- Evaluate the model using r-squared and residual analysis.
- Forecast maintenance cost at 300 machine hours.
- Identify outliers and any extrapolation risk.
Solution
Step 1: Summary totals
Let X = machine hours, Y = maintenance cost, n = 6.
- ∑X = 1,170
- ∑Y = 8,860
- ∑X² = 243,900
- ∑Y² = 13,338,400
- ∑XY = 1,791,000
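These totals can be reproduced quickly in a spreadsheet or a few lines of Python (a verification sketch, not part of the exam method):

```python
# Worked-example data: machine hours (X) and maintenance cost in £ (Y).
hours = [120, 150, 180, 210, 240, 270]
cost = [1180, 1290, 1430, 1520, 1650, 1790]

totals = {
    "sum_x": sum(hours),
    "sum_y": sum(cost),
    "sum_x2": sum(x * x for x in hours),
    "sum_y2": sum(y * y for y in cost),
    "sum_xy": sum(x * y for x, y in zip(hours, cost)),
}
print(totals)
# {'sum_x': 1170, 'sum_y': 8860, 'sum_x2': 243900, 'sum_y2': 13338400, 'sum_xy': 1791000}
```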
Step 2: Correlation coefficient (r)
r = [ n∑XY − (∑X)(∑Y) ] / √( [ n∑X² − (∑X)² ] × [ n∑Y² − (∑Y)² ] )
Numerator:
- n∑XY − (∑X)(∑Y)
- = 6×1,791,000 − (1,170×8,860)
- = 10,746,000 − 10,366,200
- = 379,800
Denominator components:
- n∑X² − (∑X)²
- = 6×243,900 − 1,170²
- = 1,463,400 − 1,368,900
- = 94,500
- n∑Y² − (∑Y)²
- = 6×13,338,400 − 8,860²
- = 80,030,400 − 78,499,600
- = 1,530,800
Denominator:
- √(94,500 × 1,530,800) ≈ 380,343
Therefore:
- r = 379,800 / 380,343 ≈ 0.9986
Interpretation: r is positive, so higher machine hours are associated with higher maintenance cost. The magnitude is close to 1, so the linear relationship is extremely strong in this dataset.
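The same arithmetic, carried through with unrounded intermediate values, confirms both r and the r² used in Step 4.1 (a verification sketch):

```python
import math

n = 6
num = n * 1_791_000 - 1_170 * 8_860      # 379,800
var_x = n * 243_900 - 1_170 ** 2         # 94,500
var_y = n * 13_338_400 - 8_860 ** 2      # 1,530,800
r = num / math.sqrt(var_x * var_y)

print(round(r, 4))       # 0.9986
print(round(r ** 2, 4))  # 0.9971 (r-squared)
```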
Step 3: Regression model (Ŷ = a + bX)
3.1 Slope (b)
b = [ n∑XY − (∑X)(∑Y) ] / [ n∑X² − (∑X)² ]
- b = 379,800 / 94,500 = 4.0190 (approx.)
Interpretation: predicted maintenance cost increases by about £4.02 per machine hour.
3.2 Intercept (a)
a = [ ∑Y − b∑X ] / n
- a = [ 8,860 − (4.0190 × 1,170) ] / 6
- ≈ [ 8,860 − 4,702.29 ] / 6
- ≈ 692.95
So the fitted prediction equation is:
Ŷ = 692.95 + 4.0190X
Interpretation: the fitted line suggests a baseline monthly maintenance cost of about £693, plus about £4.02 per machine hour.
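Computing the coefficients from the raw data, rather than the printed totals, reproduces the same fitted equation (a verification sketch):

```python
hours = [120, 150, 180, 210, 240, 270]
cost = [1180, 1290, 1430, 1520, 1650, 1790]
n = len(hours)

sx, sy = sum(hours), sum(cost)
sxy = sum(x * y for x, y in zip(hours, cost))
sxx = sum(x * x for x in hours)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope: £ per machine hour
a = (sy - b * sx) / n                          # intercept: baseline £ per month
print(round(b, 4), round(a, 2))  # 4.019 692.95
```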
Step 4: Model evaluation (r-squared and residuals)
4.1 r-squared
In simple linear regression with an intercept:
r² = (r)²
- r² ≈ 0.9971 (squaring the unrounded value of r; squaring the rounded 0.9986 would give 0.9972)
Interpretation: about 99.7% of the variation in maintenance cost is explained by machine hours in this dataset.
4.2 Residuals
Using Ŷ = 692.95 + 4.0190X:
| Month | X (hours) | Actual Y (£) | Predicted Ŷ (£) | Residual (Y − Ŷ) (£) |
|---|---|---|---|---|
| 1 | 120 | 1,180 | 1,175.24 | +4.76 |
| 2 | 150 | 1,290 | 1,295.81 | −5.81 |
| 3 | 180 | 1,430 | 1,416.38 | +13.62 |
| 4 | 210 | 1,520 | 1,536.95 | −16.95 |
| 5 | 240 | 1,650 | 1,657.52 | −7.52 |
| 6 | 270 | 1,790 | 1,778.10 | +11.90 |
Residuals are small and show no obvious pattern. A residual plot would confirm this quickly and may reveal curvature or widening spread more clearly than a table.
Step 5: Forecast at 300 machine hours
Ŷ = 692.95 + 4.0190(300)
= 692.95 + 1,205.70
= 1,898.65
Forecast maintenance cost at 300 hours: approximately £1,899.
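The £1,898.65 above uses the rounded coefficients; carrying b and a at full precision gives about £1,898.67, the same £1,899 to the nearest pound:

```python
b = 379_800 / 94_500         # slope before rounding
a = (8_860 - b * 1_170) / 6  # intercept before rounding
yhat = a + b * 300
print(round(yhat, 2), round(yhat))  # 1898.67 1899
```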
Step 6: Outliers and extrapolation risk
- Outliers: none are obvious from the residuals; no single month appears to dominate the fitted line.
- Extrapolation: 300 hours is above the observed maximum (270 hours). This is a modest extrapolation, but the risk should still be stated (step costs, capacity effects, or policy changes could alter the relationship).
Common pitfalls and misunderstandings
- Confusing correlation with causation
- Forcing a linear model onto a non-linear pattern
- Using inconsistent or poor-quality data
- Ignoring outliers or unusual periods
- Over-relying on r-squared without checking residual behaviour
- Extrapolating without stating uncertainty
- Selecting a driver that lacks operational logic
- Assuming the relationship will remain stable despite operational change
Summary and further reading
Correlation and regression are practical tools for forecasting and cost estimation. Correlation summarises the strength and direction of a linear relationship. Regression produces a fitted prediction equation that can be used to estimate a cost or revenue figure from a driver.
Strong application requires both calculation and judgement: confirm the relationship visually, compute the model correctly, evaluate fit using r-squared and residual behaviour, and communicate limitations—especially outliers and extrapolation risk.
FAQ
What is the difference between correlation and causation?
Correlation describes whether two variables tend to move together. Causation means one factor produces changes in the other. In business settings, causation needs operational evidence beyond a correlation statistic.
How can outliers affect a regression model?
Outliers can pull the fitted line, changing the slope and intercept and therefore changing the forecast. Always investigate whether an outlier reflects an abnormal event or a data issue before deciding how to treat it.
Why is extrapolation risky?
The relationship may change beyond the observed range because of step costs, capacity constraints, or different operating methods. If extrapolation is unavoidable, present scenarios or sensitivity ranges.
What does a high r-squared value indicate?
It suggests the fitted line explains a large share of movement in Y within the dataset. It does not prove the relationship will remain stable or that the driver is the only relevant factor.
How does residual analysis improve a model?
Residual analysis shows how the model is wrong across the data range. Patterns suggest non-linearity, missing variables, or changing variance. Residual plots often make these issues easier to spot than tables.
What are the main limitations of simple linear regression?
It uses one driver and assumes a straight-line relationship. Many real costs and revenues are influenced by multiple factors and may include step changes or non-linear effects.
Summary (Recap)
This chapter explained how correlation and simple linear regression can be used to forecast costs and revenues from business drivers. It covered the calculation and interpretation of the correlation coefficient, how to build a fitted prediction equation using slope and intercept, and how to evaluate model quality using r-squared and residual checks. It also highlighted key risks—outliers, non-linearity, and extrapolation—and demonstrated a complete worked example with a forecast beyond the observed range.
Glossary
Correlation coefficient (r)
A number from –1 to +1 showing the direction and strength of a linear relationship between two variables.
Positive correlation
Higher values of one variable tend to be observed with higher values of the other (r > 0).
Negative correlation
Higher values of one variable tend to be observed with lower values of the other (r < 0).
No (linear) correlation
Little or no straight-line pattern (r near 0), though a non-linear relationship may still exist.
Simple linear regression
A method that fits a straight line to predict a dependent variable from one independent variable: Ŷ = a + bX.
Dependent variable (Y)
The actual observed value being modelled (e.g., total cost or revenue).
Independent variable (X)
The driver used to predict Y (e.g., machine hours or units sold).
Slope (b)
The estimated change in predicted Y (Ŷ) for a one-unit increase in X.
Intercept (a)
The predicted value of Ŷ when X = 0 (interpret cautiously if X = 0 is outside the realistic operating range).
Residual
Actual value minus predicted value (Y − Ŷ). Used to judge model fit and identify patterns.
r-squared (r²)
A measure from 0 to 1 indicating how much of the variation in Y is explained by the fitted line; in simple regression with an intercept, r² = (r)².
Outlier
An observation that is unusually distant from the overall pattern and may distort results.
Extrapolation
Forecasting beyond the observed range of X values, which increases the risk that the relationship will not hold.
Written by
AccountingBody Editorial Team