This is the first programming exercise in the coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave. Since R is the lingua franca data science tool, I plan to do all the programming exercises in Andrew's course with R. Last time, I posted the R version of the anomaly detection excercise. This exercise focuses on linear regression with both analytical (normal equation) and numerical (gradient descent) methods.

The exercise starts with linear regression with one variable. From this part of the exercise, we will create plots that help to visualize how gradient descent gets the coeffient of the predictor and the intercept.

In [3]:

```
library(ggplot2)
library(data.table)
library(magrittr)
library(caret)
library(fields)
library(plot3D)
```

In [48]:

```
ex1data1 <- fread("ex1data1.txt",col.names=c("population","profit"))
head(ex1data1)
```

In [8]:

```
ex1data1%>%ggplot(aes(x=population, y=profit))+
geom_point(color="blue",size=4,alpha=0.5)+
ylab('Profit in $10,000s')+
xlab('Population of City in 10,000s')+ggtitle ('Figure 1: Scatter plot of training data')+
theme(plot.title = element_text(size = 16,colour="red"))
```

To impliment gradient descent, we need to calculate the cost, which is given by:

$$J(\theta)=\frac{1}{2m} \sum\limits_{i=1}^{m}( h_\theta(x^{(i)})-y^{(i)})^2$$

where the hypothesis $h_\theta(x) $ is given by the linear model

when we have one variable and $$h_\theta(x)=\theta^TX=\theta_0+\theta_1x1 + ...+\theta_nx_n$$ when we have n variables (features).

The parameters of our model are the $\theta_j$ values. These are the values we will adjust to minimize cost J($\theta$). One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update

$$\theta_j:=\theta_j-\alpha\frac{1}{2m}\sum\limits_{i=1}^{m}( h_\theta(x^{(i)})-y^{(i)})x_j^i$$(simultaneously update $\theta_j$ for all j)

With each step of gradient descent, our parameters $\theta_j$ come closer to the optimal values that will achieve the lowest cost J($\theta$).

But remember, before we go any further we need to add the $\theta_0$ intercept term.

In [49]:

```
X=cbind(1,ex1data1$population)
y=ex1data1$profit
head(X)
```

The function below calcuates cost based on the equation given above.

In [11]:

```
computeCost=function(X,y,theta){
z=((X%*%theta)-y)^2
return(sum(z)/(2*nrow(X)))
}
```

Now, we can calculate the initial cost by initilizating the initial parameters to 0.

In [12]:

```
theta=matrix(rep(0,ncol(X)))
```

In [13]:

```
round(computeCost(X,y,theta),2)
```

**computeCost** function above. A good way to verify that gradient descent is working correctly is to look at the value of J($\theta$) and check that it is decreasing with each step. Our value of J($\theta$) should never increase, and should
converge to a steady value by the end of the algorithm.

**alpha** specified.

In [14]:

```
gradientDescent=function(X, y, theta, alpha, iters){
gd=list()
cost=rep(0,iters)
for(k in 1:iters){
z=rep(0,ncol(X))
for(i in 1:ncol(X)){
for(j in 1:nrow(X)){
z[i]=z[i]+(((X[j,]%*%theta)-y[j])*X[j,i])
}
}
theta= theta-((alpha/nrow(X))*z)
cost[k]=computeCost(X,y,theta)
}
gd$theta= theta
gd$cost=cost
gd
}
```

**gradientDescent** function to find the paramets and we have to make sure that our cost never increases. Let's use 1500 iterations and a learning rate alpha of 0.04 for now. Later, we will see the effect of these values in our application.

In [51]:

```
iterations = 1500
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta
theta
```

In [16]:

```
data.frame(Cost=gradientDescent_results$cost,Iterations=1:iterations)%>%
ggplot(aes(x=Iterations,y=Cost))+geom_line(color="blue")+
ggtitle("Cost as a function of number of iteration")+
theme(plot.title = element_text(size = 16,colour="red"))
```

In [17]:

```
ex1data1%>%ggplot(aes(x=population, y=profit))+
geom_point(color="blue",size=3,alpha=0.5)+
ylab('Profit in $10,000s')+
xlab('Population of City in 10,000s')+
ggtitle ('Figure 1: Scatter plot of training data') +
geom_abline(intercept = theta[1], slope = theta[2],col="red",show.legend=TRUE)+
theme(plot.title = element_text(size = 16,colour="red"))+
annotate("text", x = 12, y = 20, label = paste0("Profit = ",round(theta[1],4),"+",round(theta[2],4)," * Population"))
```

In [21]:

```
Intercept=seq(from=-10,to=10,length=100)
Slope=seq(from=-1,to=4,length=100)
# initialize cost values to a matrix of 0's
Cost = matrix(0,length(Intercept), length(Slope));
for(i in 1:length(Intercept)){
for(j in 1:length(Slope)){
t = c(Intercept[i],Slope[j])
Cost[i,j]= computeCost(X, y, t)
}
}
persp3D(Intercept,Slope,Cost,theta=-45, phi=25, expand=0.75,lighting = TRUE,
ticktype="detailed", xlab="Intercept", ylab="Slope",
zlab="",axes=TRUE, main="Surface")
image.plot(Intercept,Slope,Cost, main="Contour, showing minimum")
contour(Intercept,Slope,Cost, add = TRUE,n=30,labels='')
points(theta[1],theta[2],col='red',pch=4,lwd=6)
```

$\theta=(X^TX)^{-1}X^Ty$

In [22]:

```
theta2=solve((t(X)%*%X))%*%t(X)%*%y
theta2
```

The parameters we got from gradient descent are:

In [23]:

```
theta
```

In [52]:

```
iterations = 15000
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta
theta
```

As you can see, now the results from normal equation and gradient descent are the same.

**caret** package, which is among the most commonly used machine learning packages in R.

In [25]:

```
my_lm <- train(profit~population, data=ex1data1,method = "lm")
my_lm$finalModel$coefficients
```

**painless!!**

In [32]:

```
ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price"))
head(ex1data2)
```

In [33]:

```
ex1data2=as.data.frame(ex1data2)
class(ex1data2)
for(i in 1:(ncol(ex1data2)-1)){
ex1data2[,i]=(ex1data2[,i]-mean(ex1data2[,i]))/sd(ex1data2[,i])
}
```

In [34]:

```
round(apply(ex1data2,2,mean),10)
```

In [35]:

```
apply(ex1data2,2,sd)
```

In [36]:

```
X=cbind(1,ex1data2$size,ex1data2$bedrooms)
y=ex1data2$price
head(X)
```

In [40]:

```
iterations = 6000
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta
theta
```

In [41]:

```
theta2=solve((t(X)%*%X))%*%t(X)%*%y
theta2
```

In [42]:

```
ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price"))
my_lm <- train(price~size+bedrooms, data=ex1data2,method = "lm",
preProcess = c("center","scale"))
my_lm$finalModel$coefficients
```