OLS in good old Fortran!

In this post, we will discuss the algorithm that produces point estimates \hat\beta given inputs y and X in linear regression. After deriving the function producing \hat\beta , I’ll briefly show how to implement this procedure in yet another compiled language, Fortran, and how to call it from within R.

About linear regression

Let y be a response vector and X a matrix of covariates, \beta a vector of population parameters, and \epsilon a vector of error terms. Let furthermore n be the number of observations, k the number of parameters, y_{[n \times 1]} , X_{[n \times k]} , \beta_{[k \times 1]} , \epsilon_{[n \times 1]} and the Gauss-Markov assumptions1 hold such that

\begin{align*} y &= X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I) \\ \Leftrightarrow \epsilon &= y - X\beta \end{align*}

Unfortunately, we do not know the actual population parameters \beta and \sigma^2 , and hence attempt to obtain their respective estimates \hat\beta and \hat\sigma^2 . One way to compute \hat\beta is by minimizing the sum of squared residuals

\begin{align*} e^\prime e &= (y - X\hat\beta)^\prime (y - X\hat\beta) \\ &= y^\prime y - 2\hat\beta^\prime X^\prime y + \hat\beta^\prime X^\prime X \hat\beta \end{align*}

Minimization requires:2

\begin{align*} \frac{\partial e^\prime e}{\partial \hat\beta} &= -2X^\prime y + 2 X^\prime X \hat\beta \overset{!}{=} 0 \\ &\Leftrightarrow 2 X^\prime X \hat\beta = 2 X^\prime y \\ &\Leftrightarrow \hat\beta = (X^\prime X)^{-1} X^\prime y \end{align*}

To assess the uncertainty around \hat\beta , we can derive its variance-covariance matrix Var(\hat\beta|X) . Substituting y = X\beta + \epsilon into \hat\beta = (X^\prime X)^{-1} X^\prime y gives \hat\beta - \beta = (X^\prime X)^{-1} X^\prime \epsilon , so that (also see the Gauss-Markov assumptions):

\begin{align*} Var(\hat\beta|X) &= E[(\hat\beta - \beta)(\hat\beta - \beta)^\prime|X] \\ &= E[((X^\prime X)^{-1} X^\prime \epsilon)((X^\prime X)^{-1} X^\prime \epsilon)^\prime|X] \\ &= E[(X^\prime X)^{-1} X^\prime \epsilon \epsilon^\prime X (X^\prime X)^{-1}|X] \\ &= (X^\prime X)^{-1} X^\prime \, E[\epsilon \epsilon^\prime|X] \, X (X^\prime X)^{-1} \\ &= (X^\prime X)^{-1} X^\prime \, (\sigma^2 I) \, X (X^\prime X)^{-1} \\ &= \sigma^2 (X^\prime X)^{-1} X^\prime X (X^\prime X)^{-1} \\ &= \sigma^2 (X^\prime X)^{-1} \end{align*}

We estimate \sigma^2 with \hat\sigma^2 = \frac{e^\prime e}{n - k} . Hence,

\begin{align*} \widehat{Var}(\hat\beta|X) = \frac{e^\prime e}{n - k} (X^\prime X)^{-1} \end{align*}

With \hat\beta = (X^\prime X)^{-1} X^\prime y and \widehat{Var}(\hat\beta|X) , we have all we need to estimate linear regressions in Fortran.
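As a quick sanity check before turning to Fortran, the closed-form solution can be typed directly into R. The following is a minimal sketch on simulated data (all object names are illustrative and not part of the later Fortran example):

# minimal sketch: closed-form OLS on simulated data
set.seed(42)
n_sim <- 100
X_sim <- cbind(1, rnorm(n_sim), rnorm(n_sim))   # intercept plus two covariates
y_sim <- X_sim %*% c(1, 2, -0.5) + rnorm(n_sim)

beta_hat <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim   # (X'X)^-1 X'y
e_sim    <- y_sim - X_sim %*% beta_hat
vcov_hat <- as.numeric(t(e_sim) %*% e_sim / (n_sim - ncol(X_sim))) * solve(t(X_sim) %*% X_sim)

cbind(beta_hat, sqrt(diag(vcov_hat)))           # estimates and standard errors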

Implement in Fortran

Fortran provides two kinds of procedures: functions and subroutines. Simply put, functions return a result and should not modify their inputs, while subroutines return results by modifying their arguments. Since we’ll call Fortran from within R, we’ll rely on the low-level interface function .Fortran() to call compiled Fortran code in a shared library. Like R’s .C() function, .Fortran() cannot create new R objects; it can only modify (copies of) the objects passed to it. We’ll hence implement the linear regression algorithm as a Fortran subroutine.

The linreg subroutine takes as inputs the response vector y , the covariate matrix X , the number of observations n , and the number of parameters k . It furthermore takes three objects to be modified by the subroutine: the parameter estimates, the variance-covariance matrix, and the vector of standard errors.

subroutine linreg(y, X, n, k, beta, vcov, se)

  ! disable implicit typing (otherwise names starting with i-n would default to integer, all others to real)
  implicit none

  ! declare type of input objects
  integer :: n, k
  real(8) :: y(n), X(n, k), beta(k), vcov(k, k), se(k)

  ! type of intermediary objects
  integer :: ipiv(k), info, i
  real(8) :: XpX(k, k), invXpX(k, k), work(k), Xpy(k), e(n)

  ! compute point estimates:
  ! intermediary steps
  ! (1) X'X
  XpX = matmul(transpose(X), X)

  ! (2) (X'X)^-1: invert XpX via its LU factorization using the LAPACK routines dgetrf and dgetri
  invXpX = XpX
  call dgetrf(k, k, invXpX, k, ipiv, info)
  call dgetri(k, invXpX, k, ipiv, work, k, info)
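  ! note: a more defensive implementation would check info after each LAPACK
  ! call (info = 0 means success; a nonzero value signals an invalid argument
  ! or a singular matrix) and report the failure back to the caller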

  ! (3) X'y
  Xpy = matmul(transpose(X), y)

  ! (4) beta = (X'X)^-1 X'y
  beta = matmul(invXpX, Xpy)

  ! compute variance-covariance matrix:
  ! var(betahat|X) = sigmahat^2 * (X'X)^-1
  ! sigmahat^2 = e'e / (n - k)
  ! e'e = (y - Xbeta)'(y - Xbeta)
  e = y - matmul(X, beta)
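  ! (n - k) is an integer, but it is promoted to real(8) in the mixed-mode
  ! division below, so no explicit conversion is needed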
  vcov = dot_product(e, e) / (n - k) * invXpX

  ! compute standard errors:
  ! square root of main diagonal of variance-covariance matrix
  se = (/ (sqrt(vcov(i, i)), i = 1,k) /)

end subroutine linreg

To make linreg callable from R, we compile the above code into a shared library. Assuming the code is saved in a file named linreg.f95:

gfortran-9 linreg.f95 -o linreg.so -shared -lblas -llapack -std=f95
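
Depending on the platform and toolchain, building a shared library may additionally require position-independent code. If the linker complains about relocations, adding -fPIC usually helps (a variant of the command above, not needed on every system):

gfortran-9 linreg.f95 -o linreg.so -shared -fPIC -lblas -llapack -std=f95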

Run linear regression in R

We test our Fortran linear regression subroutine in R using the attitude data frame. attitude contains aggregated survey responses of clerical employees of a large financial organization. Each observation aggregates the responses of roughly 35 employees from one of 30 departments:

  rating complaints privileges learning raises critical advance
1     43         51         30       39     61       92      45
2     63         64         51       54     63       73      47
3     71         70         68       69     76       86      48
4     61         63         45       47     54       84      35
5     81         78         56       66     71       83      47
6     43         55         49       44     54       49      34

To call linreg from R, we need to load our compiled Fortran library using dyn.load(). Next, we prepare the input data (y, X, n, k) needed to estimate the linear regression, as well as empty result objects into which linreg will write the regression results; their storage modes must match the Fortran declarations (doubles for real(8), integers for integer). We then run both our Fortran linear regression implementation and R’s lm() function.

## ---- implementing linear regression in Fortran ----
data(attitude)

## ---- prepare input data ----
# data inputs
X <- model.matrix(rating ~ ., data = attitude)
y <- attitude$rating

n <- nrow(X)
k <- ncol(X)

# results objects
beta <- double(k)
names(beta) <- colnames(X)
vcov <- matrix(0, nrow = k, ncol = k)
se <- double(k)
names(se) <- colnames(X)

## ---- run regression ----
# R
ols_R <- lm(rating ~ ., data = attitude)

# Fortran
dyn.load("linreg.so") # load Fortran library
ols_fortran <- .Fortran("linreg", y = y, X = X, n = n, k = k, beta = beta, vcov = vcov, se = se)
dyn.unload("linreg.so") # unload library

# summarize results from Fortran function
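# (t value = estimate / standard error; two-sided p-values come from a
#  t distribution with n - k degrees of freedom)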
res <- cbind(ols_fortran$beta, ols_fortran$se, ols_fortran$beta/ols_fortran$se)
res <- cbind(res, 2 * pt(abs(res[, 3]), df = n - k, lower.tail = FALSE))
colnames(res) <- c("Estimate", "Std. Err", "t value", "Pr(>|t|)")

# results
summary(ols_R)
print(res)

A first glance at the results of lm() and linreg indicates that our Fortran implementation produces estimates very similar to those of R’s lm().

Call:
lm(formula = rating ~ ., data = attitude)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9418  -4.3555   0.3158   5.5425  11.5990

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.78708   11.58926   0.931 0.361634
complaints   0.61319    0.16098   3.809 0.000903 ***
privileges  -0.07305    0.13572  -0.538 0.595594
learning     0.32033    0.16852   1.901 0.069925 .
raises       0.08173    0.22148   0.369 0.715480
critical     0.03838    0.14700   0.261 0.796334
advance     -0.21706    0.17821  -1.218 0.235577
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.068 on 23 degrees of freedom
Multiple R-squared:  0.7326,	Adjusted R-squared:  0.6628
F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05
               Estimate   Std. Err    t value     Pr(>|t|)
(Intercept) 10.78707639 11.5892572  0.9307824 0.3616337210
complaints   0.61318761  0.1609831  3.8090182 0.0009028679
privileges  -0.07305014  0.1357247 -0.5382229 0.5955939205
learning     0.32033212  0.1685203  1.9008516 0.0699253459
raises       0.08173213  0.2214777  0.3690310 0.7154800884
critical     0.03838145  0.1469954  0.2611064 0.7963342642
advance     -0.21705668  0.1782095 -1.2179862 0.2355770486

To be sure, let’s check the results more precisely with R’s all.equal() function. Note that for lm(), we extract the model’s variance-covariance matrix and take the square root of its main diagonal to obtain the estimates’ standard errors. Indeed, all.equal() suggests that lm() and our OLS implementation produce the same results.3

# compare coefficients from lm() and linreg()
all.equal(coef(ols_R), res[, 1])

# compare standard errors
all.equal(sqrt(diag(vcov(ols_R))), res[, 2])
[1] TRUE
[1] TRUE
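
As an additional check (not shown in the output above), the full variance-covariance matrices can be compared in the same way; unname() drops the dimnames that lm()'s matrix carries but the Fortran result lacks. This should likewise report (near) equality:

# compare full variance-covariance matrices
all.equal(unname(vcov(ols_R)), ols_fortran$vcov)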

Concluding remarks

This post describes in general terms how to compute parameter estimates using ordinary least squares. As we have seen, the algorithm behind linear regression essentially relies on basic matrix algebra and high-school calculus. More specifically, it offers a closed-form equation to directly compute parameter estimates: no starting values or iterative computations necessary.

Assuming some basic programming experience, implementing linear regression in Fortran has proven quite straightforward as well. Indeed, Fortran – unlike C – comes with built-in support for vectors, matrices, and various linear algebra functions, such that implementing linear regression almost4 boiled down to writing \hat\beta = (X^\prime X)^{-1} X^\prime y in Fortran code.


  1. Linear relation between y and X ( y = X\beta + \epsilon ); X has full rank; zero conditional mean ( E[\epsilon|X] = 0 ); constant and uncorrelated error variance ( Var(\epsilon|X) = E[\epsilon \epsilon^\prime|X] = \sigma^2 I ).
  2. For completeness, minimization also requires the second-order condition that \frac{\partial^2 e^\prime e}{\partial\hat\beta \, \partial\hat\beta^\prime} = 2 X^\prime X is positive definite, which is the case when X has full rank.
  3. Note that all.equal() only tests for “near equality”, allowing numbers to differ within a tolerance of about 1.5 \times 10^{-8} (approximately seven significant digits of precision). This should still be good enough for our purposes.
  4. Some may very well beg to differ.