Fisseha Berhane, PhD

Data Scientist


Gradient descent and normal equation in Linear Regression

In this post, I will show how to implement linear regression in Matlab using both gradient descent and the normal equation. We will look at linear regression with one variable and with multiple variables. The data come from the Machine Learning course on Coursera; I did this as an assignment in that course, so if you are taking it, you are advised not to copy from this page.

Linear regression with one variable

To show the implementation of linear regression, let's predict profit as a function of population size.

Load and explore data

In [2]:
data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);
m = length(y); % number of training examples

Let's visualize the data with a simple plot.

In [3]:
plot(X, y,'rx','Markersize',8);         % plot the data
ylabel('Profit in $10,000s','FontSize',8);           % Set the y-axis label
xlabel('Population of City in 10,000s','FontSize',8) % Set the x-axis label
title('Figure 1: Scatter plot of training data','FontSize',10) % Set the title

Gradient descent

Let's use gradient descent to find the optimal values of the coefficients.

To implement gradient descent, we need to compute the cost, which is given by:

$J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

where the hypothesis $h_\theta$ is given by the linear model

$h_\theta(x) = \theta^Tx = \theta_0+\theta_1x_1$

In this post, we use batch gradient descent, in which each iteration performs the update (simultaneously for all parameters $\theta_j$):

$\theta_j := \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$
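For reference, the loop-based code further below implements exactly this update; a minimal vectorized sketch of the same step (assuming X already contains the column of ones and y is a column vector) is:

% One batch gradient descent step in vectorized form: all parameters are updated at once
theta = theta - (alpha/m) * X' * (X*theta - y);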

The function below computes the cost of a particular choice of theta.

In [ ]:
function J = computeCost(X, y, theta)
% Compute the cost J(theta) for linear regression over the m training examples

m = length(y); % number of training examples

J = 0;
for i = 1:m
    J = J + 1/(2*m) * (theta'*X(i,:)' - y(i))^2;
end

end
In [5]:
X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1); % initialize fitting parameters

% Some gradient descent settings
iterations = 1500;
alpha = 0.01;

% compute and display initial cost
computeCost(X, y, theta)
ans =

   32.0727

The function below performs gradient descent to learn theta.

In [ ]:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
% Perform num_iters steps of batch gradient descent and record the cost history

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % accumulate the gradient for theta(1) and theta(2)
    z1 = 0; z2 = 0;
    for i = 1:m
        z1 = z1 + 1/m * (theta'*X(i,:)' - y(i));
        z2 = z2 + 1/m * (theta'*X(i,:)' - y(i)) * X(i,2);
    end

    % simultaneous update of both parameters
    theta(1) = theta(1) - alpha*z1;
    theta(2) = theta(2) - alpha*z2;

    J_history(iter) = computeCost(X, y, theta);
end

end
In [6]:
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);
In [7]:
plot(X, y,'rx','Markersize',8);         % plot the data
ylabel('Profit in $10,000s','FontSize',8);           % Set the y-axis label
xlabel('Population of City in 10,000s','FontSize',8) % Set the x-axis label
title('Training data with linear regression fit','FontSize',10) % Set the title
hold on; % keep previous plot visible
p2= plot(X(:,2), X*theta, '-');
xlim([5 25])
legend([p2], 'Linear regression','Location','southeast')
legend('boxoff')

Now, let's predict values for population sizes of 35,000 and 70,000.

In [10]:
predict1 = [1, 3.5] * theta; % population of 35,000 (population is in units of 10,000s)
predict2 = [1, 7] * theta;   % population of 70,000
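Since profit is measured in units of $10,000 (see the axis labels above), a quick sketch for printing these predictions in dollars would be:

% Scale the predictions from units of $10,000 to dollars before printing
fprintf('For population = 35,000, we predict a profit of %f\n', predict1*10000);
fprintf('For population = 70,000, we predict a profit of %f\n', predict2*10000);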

Visualizing J(theta_0, theta_1)

In [9]:
% Grid over which we will calculate J
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

% Fill out J_vals
for i = 1:length(theta0_vals)
    for j = 1:length(theta1_vals)
        t = [theta0_vals(i); theta1_vals(j)];
        J_vals(i,j) = computeCost(X, y, t);
    end
end
In [28]:
% Because of the way meshgrids work in the surf command, we need to 
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';
surf(theta0_vals, theta1_vals, J_vals);
zlim([0,800])
xlabel('\theta_0'); ylabel('\theta_1');
In [18]:
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
xlabel('\theta_0'); ylabel('\theta_1');
hold on;
plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);

Linear regression with multiple variables: predicting the prices of houses

In this part, let's predict the prices of houses using size of the house (in square feet) and the number of bedrooms.

In [11]:
%% Load Data
clear ; close all; clc

data = load('ex1data2.txt');
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

The function below is used for feature normalization.

In [ ]:
function [X_norm, mu, sigma] = featureNormalize(X)
% Normalize each feature (column) of X to zero mean and unit standard deviation

X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

for i = 1:size(X, 2)
    mu(i) = mean(X(:,i));
    sigma(i) = std(X(:,i));
    X_norm(:,i) = (X(:,i) - mu(i)) ./ sigma(i);
end

end
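The same normalization can also be written without the explicit loop; a minimal vectorized sketch (assuming each column of X is one feature) is:

% mean and std operate column-wise; bsxfun broadcasts the row vectors mu and sigma over all rows
mu = mean(X);
sigma = std(X);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);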

Scale features and set them to zero mean

In [12]:
[X mu sigma] = featureNormalize(X);

% Add intercept term to X
X = [ones(m, 1) X];

The function below computes cost for multiple variables.

In [ ]:
function J = computeCostMulti(X, y, theta)
% Compute the cost J(theta) for linear regression with multiple variables

m = length(y); % number of training examples

J = 0;
for i = 1:m
    J = J + 1/(2*m) * (theta'*X(i,:)' - y(i))^2;
end

end

In the multivariate case, the cost function can also be written in the following vectorized form:

$J(\theta) = \frac{1}{2m}(X\theta-\vec{y})^T(X\theta-\vec{y})$
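In code, this vectorized form replaces the loop in computeCostMulti with a single line; a minimal sketch (the name computeCostVectorized is introduced here just for illustration):

function J = computeCostVectorized(X, y, theta)
% Vectorized cost: form the residual vector once and scale its squared norm by 1/(2m)
m = length(y);
J = (X*theta - y)' * (X*theta - y) / (2*m);
end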

The function below performs gradient descent for multiple variables.

In [ ]:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% Batch gradient descent for linear regression with multiple variables

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % accumulate the gradient for each parameter
    % (X(:,1) is the column of ones, so no special case is needed for the intercept)
    z = zeros(size(X, 2), 1);
    for j = 1:size(X, 2)
        for i = 1:m
            z(j) = z(j) + 1/m * (theta'*X(i,:)' - y(i)) * X(i,j);
        end
    end

    % simultaneous update of all parameters
    for j = 1:size(X, 2)
        theta(j) = theta(j) - alpha*z(j);
    end

    J_history(iter) = computeCostMulti(X, y, theta);

end

end
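The nested loops above can also be collapsed into a single vectorized update; a sketch of an equivalent function (the name gradientDescentMultiVec is introduced here just for illustration) is:

function [theta, J_history] = gradientDescentMultiVec(X, y, theta, alpha, num_iters)
% Vectorized batch gradient descent: the whole parameter vector is updated at once
m = length(y);
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    theta = theta - (alpha/m) * X' * (X*theta - y);
    J_history(iter) = computeCostMulti(X, y, theta);
end

end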
In [13]:
% Choose some alpha value
alpha = 0.01;
num_iters = 400;

% Init Theta and Run Gradient Descent 
theta = zeros(3, 1);
[theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters);

Plot the convergence graph

In [14]:
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');

Using the normal equation

Instead of gradient descent, we can also use the normal equation. The function below uses the normal equation to compute the coefficients.
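For reference, the closed-form solution that this function implements is

$\theta = (X^TX)^{-1}X^T\vec{y}$

In the function below, the inverse is computed with pinv (the pseudoinverse), so the result is still well defined even when $X^TX$ is singular, for example when features are linearly dependent.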

In [ ]:
function [theta] = normalEqn(X, y)
% Closed-form least-squares solution; pinv keeps the computation stable
% even when X'*X is singular

theta = pinv(X'*X)*X'*y;

end
In [15]:
% Calculate the parameters from the normal equation
theta = normalEqn(X, y);
In [16]:
% Display normal equation's result
fprintf('Theta computed from the normal equations: \n');
fprintf(' %f \n', theta);
fprintf('\n');
Theta computed from the normal equations: 
 340412.659574 
 110631.050279 
 -6649.474271

Now, let's estimate the price of a 1650 sq-ft, 3 br house.

In [18]:
% The features were normalized before fitting, so the query point must be
% normalized with the same mu and sigma before applying theta
price = [1, (1650 - mu(1))/sigma(1), (3 - mu(2))/sigma(2)] * theta;
In [30]:
fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
         '(using normal equations):\n $%f\n'], price);

Summary

This post covered linear regression using both the normal equation and gradient descent. We used a single variable to predict profit and multiple variables to predict house prices. Generally, we use the normal equation when the number of features is not huge; when it is, we have to use gradient descent, because the normal equation is computationally expensive, with complexity of roughly O(n^3) in the number of features n.


