Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume

Gradient descent and normal equation in Linear Regression¶

In this post, I will show how to implement linear regression with Matlab using both gradient descent and normal equation techniques. We will see linear regression with one variable and with multiple variables. The data is from the Machine Learning course on Coursera. I did this as an assignment in that course. So, if you are taking that course, you are advised not to copy from this page.

Linear regression with one variable¶

To show implimentation of linear regression, let's predict profits as a function of population size.

In [2]:
data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);
m = length(y); % number of training examples


Let's visualiza the data using simple plot

In [3]:
plot(X, y,'rx','Markersize',8);         % plot the data
ylabel('Profit in $10,000s','FontSize',8); % Set the y-axis label xlabel('Population of City in 10,000s','FontSize',8) % Set the x-axis label title('Figure 1: Scatter plot of training data','FontSize',10) % Set the title  Gradient descent¶ Let's use gradient descent to find the optimal value of the coefficient. To impliment gradient descent, we need to calculate the cost, which is given by: J($\theta$) =$\frac{1}{2m}\sum\limits_{i=1}^{m} {(h_\theta(x^i)-y^i)^2} $where the hypothesis$h_\theta$is given by the linear model$h_\theta$=$\theta^Tx$=$\theta_0+\theta_1x_1$In this post, we are using batch gradient descent. In batch gradient descent, each iteration performs the update$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^{m} {(h_\theta(x^i)-y^i)x_j^i}$The function below computes the cost of a particular choice of theta. In [ ]: function J = computeCost(X, y, theta) m = length(y); % number of training examples J = 0; for i=1:m; J=J+1./(2*m)*(theta'*X(i,:)'-y(i)).^2; end  In [5]: X = [ones(m, 1), data(:,1)]; % Add a column of ones to x theta = zeros(2, 1); % initialize fitting parameters % Some gradient descent settings iterations = 1500; alpha = 0.01; % compute and display initial cost computeCost(X, y, theta)  ans = 32.0727  The function below helps to Perform gradient descent to learn theta. In [ ]: function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters) m = length(y); % number of training examples J_history = zeros(num_iters, 1); for iter = 1:num_iters; z1=0;z2=0; for i=1:m; z1=z1+1./m*(theta'*X(i,:)'-y(i)); z2= z2+1./m*(theta'*X(i,:)'-y(i)).*X(i,2); end theta(1)=theta(1)-alpha*z1; theta(2)=theta(2)-alpha*z2; J_history(iter) = computeCost(X, y, theta); end end  In [6]: % run gradient descent theta = gradientDescent(X, y, theta, alpha, iterations);  In [7]: plot(X, y,'rx','Markersize',8); % plot the data ylabel('Profit in$10,000s','FontSize',8);           % Set the y-axis label
xlabel('Population of City in 10,000s','FontSize',8) % Set the x-axis label
title('Figure 1: Scatter plot of training data','FontSize',10) % Set the title
hold on; % keep previous plot visible
p2= plot(X(:,2), X*theta, '-');
xlim([5 25])
legend([p2], 'Linear regression','Location','southeast')
legend('boxoff')


Now, let's predict values for population sizes of 35,000 and 70,000.

In [10]:
predict1 = [1, 3.5] *theta;
predict2 = [1, 7] * theta;


Visualizing J(theta_0, theta_1)¶

In [9]:
% Grid over which we will calculate J
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

% Fill out J_vals
for i = 1:length(theta0_vals)
for j = 1:length(theta1_vals)
t = [theta0_vals(i); theta1_vals(j)];
J_vals(i,j) = computeCost(X, y, t);
end
end

In [28]:
% Because of the way meshgrids work in the surf command, we need to
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';
surf(theta0_vals, theta1_vals, J_vals);
zlim([0,800])
xlabel('\theta_0'); ylabel('\theta_1');

In [18]:
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
xlabel('\theta_0'); ylabel('\theta_1');
hold on;
plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);


Linear regression with multiple variables: predicting the prices of houses¶

In this part, let's predict the prices of houses using size of the house (in square feet) and the number of bedrooms.

In [11]:
%% Load Data
clear ; close all; clc

X = data(:, 1:2);
y = data(:, 3);
m = length(y);


The function below is used for feature normalization.

In [ ]:
function [X_norm, mu, sigma] = featureNormalize(X)
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

for i=1:size(X,2);

mu(i) = mean(X(:,i));
sigma(i) = std(X(:,i));

X_norm(:,i) =(X(:,i)-mu(i))./sigma(i);

end
end


Scale features and set them to zero mean

In [12]:
[X mu sigma] = featureNormalize(X);

% Add intercept term to X
X = [ones(m, 1) X];


The function below computes cost for multiple variables.

In [ ]:
function J = computeCostMulti(X, y, theta)

m = length(y);

J = 0;

for i=1:m;
J=J+1./(2*m)*(theta'*X(i,:)'-y(i)).^2;
end
end


In multivariate case, the cost function can also be written in the following vectorized form

J($\theta$) = $\frac{1}{2m}(X\theta-y^→)^T(X\theta-y^→)$

The function below is gradient descent for multiple variables.

In [ ]:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters;

z = zeros(size(X, 2),1);

for j=1:size(X,2);
for i=1:m;
z(j)=z(j)+1./m*(theta'*X(i,:)'-y(i)).*X(i,j)^min(j-1,1);
end
end

for i=1:size(X,2);
theta(i)=theta(i)-alpha*z(i);

end

J_history(iter) = computeCostMulti(X, y, theta);

end

end

In [13]:
% Choose some alpha value
alpha = 0.01;
num_iters = 400;

% Init Theta and Run Gradient Descent
theta = zeros(3, 1);
[theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters);


Plot the convergence graph

In [14]:
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');


Using Normal equation¶

Instead of gradient descent, we can also use normal equation. The function below uses normal equation to get the coefficients.

In [ ]:
function [theta] = normalEqn(X, y)

theta = zeros(size(X, 2), 1);

theta=pinv(X'*X)*X'*y;

end

In [15]:
% Calculate the parameters from the normal equation
theta = normalEqn(X, y);

In [16]:
% Display normal equation's result
fprintf('Theta computed from the normal equations: \n');
fprintf(' %f \n', theta);
fprintf('\n');

Theta computed from the normal equations:
340412.659574
110631.050279
-6649.474271


Now, let's estimate the price of a 1650 sq-ft, 3 br house.

In [18]:
price = [1,1650,3]*theta;

In [30]:
fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
'(using normal equations):\n $%f\n'], round(price,0));  Predicted price of a 1650 sq-ft, 3 br house (using normal equations):$182861697.000000


Summary¶

This post is on linear regression using both normal equation and gradient descent techniques. We used single variable to predict profit and multiple variables to predict house prices. Generally, we use normal equation if the size of the data set is not huge but if it is, we have to use gradient descent because normal equation is computationally intensive and it complexity is O(n^3).