Fisseha Berhane, PhD

Data Scientist

The Role of Regular Expressions in Creating Tidy Data

In this analysis, let's use regular expressions in R to prepare a tidy data set for later analysis and, along the way, demonstrate how useful regular expressions are for this kind of cleaning.

Let's check whether the package 'downloader' is installed on our computer. If it is not, we install it.

In [3]:
if(!require(downloader)){
    install.packages("downloader")   # install only when the package is missing
}

In [4]:
library(downloader)         # load the package for use


In [71]:
require(downloader)
library(plyr)


One of the most exciting areas in data science right now is wearable computing (see, for example, this article). Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users.

The data we will use in this analysis were collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available here.

In [72]:
setwd("C:/Fish/Ranalysis/Re")  # set working directory
rm(list=ls())  # clear workspace


In [49]:
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
download(url, destfile = "dataset.zip", mode = "wb")  # binary mode keeps the zip intact on Windows
unzip("dataset.zip")


Let's see the list of files and folders

In [50]:
dir()

Out[50]:
1. "dataset.zip"
2. "UCI HAR Dataset"

"UCI HAR Dataset" is the directory created by unzipping dataset.zip.

Change the working directory to the unzipped folder, "UCI HAR Dataset".

In [74]:
setwd("C:/Fish/Ranalysis/Re/UCI HAR Dataset")


Let's see the list of files and folders in the "UCI HAR Dataset" folder.

In [75]:
dir()

Out[75]:
1. "activity_labels.txt"
2. "features.txt"
3. "features_info.txt"
4. "merged_and_cleaned_data.txt"
5. "merged_and_cleaned_data2.txt"
7. "test"
8. "tidydata2.txt"
9. "train"

Let's see the contents of the folders test and train.

In [76]:
list.files("./train")

Out[76]:
1. "Inertial Signals"
2. "subject_train.txt"
3. "X_train.txt"
4. "y_train.txt"
In [77]:
list.files("./test")

Out[77]:
1. "Inertial Signals"
2. "subject_test.txt"
3. "X_test.txt"
4. "y_test.txt"

The files to be used in this analysis are shown in the figure below. Files in the Inertial Signals folders are not being used here. From the figure, we see that we will use Activity, Subject and Features as part of descriptive variable names for the final data frame we will create.

Merge the training and the test sets to create one data set

As a first step in creating a tidy data set, let's merge the training and test sets.

In [78]:
X_train <- read.table("train/X_train.txt")   # training measurements
X_test  <- read.table("test/X_test.txt")     # test measurements

In [79]:
dim(X_train)

Out[79]:
1. 7352
2. 561
In [80]:
dim(X_test)

Out[80]:
1. 2947
2. 561
In [83]:
X <- rbind(X_train, X_test)

In [84]:
dim(X)

Out[84]:
1. 10299
2. 561
In [85]:
y_train <- read.table("train/y_train.txt")   # training activity codes
y_test  <- read.table("test/y_test.txt")     # test activity codes

In [86]:
dim(y_train)

Out[86]:
1. 7352
2. 1
In [87]:
dim(y_test)

Out[87]:
1. 2947
2. 1
In [88]:
Y <- rbind(y_train, y_test)

In [89]:
dim(Y)

Out[89]:
1. 10299
2. 1
In [90]:
subject_train <- read.table("train/subject_train.txt")   # training subject ids
subject_test  <- read.table("test/subject_test.txt")     # test subject ids

In [91]:
dim(subject_train)

Out[91]:
1. 7352
2. 1
In [92]:
dim(subject_test)

Out[92]:
1. 2947
2. 1
In [93]:
Subject <- rbind(subject_train, subject_test)

In [94]:
dim(Subject)

Out[94]:
1. 10299
2. 1
In [95]:
Features <- read.table("features.txt")

dim(Features)

Out[95]:
1. 561
2. 2
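
The merging steps above rely on rbind stacking rows in the order given, so the measurements, activity codes and subject ids stay aligned as long as the training set always comes first. A minimal sketch on made-up values (the toy_ names and numbers are illustrative and not part of the original analysis):

```r
# Toy stand-ins for the training and test measurements (hypothetical values)
toy_train <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(1.1, 1.2, 1.3))
toy_test  <- data.frame(V1 = c(0.4, 0.5),      V2 = c(1.4, 1.5))

# rbind() stacks rows: training rows first, then test rows
toy_X <- rbind(toy_train, toy_test)

# Activity codes stacked in the same order stay aligned with the measurements
toy_Y <- rbind(data.frame(V1 = c(5, 5, 4)), data.frame(V1 = c(1, 2)))

dim(toy_X)                    # rows add up, columns are unchanged
nrow(toy_X) == nrow(toy_Y)    # same number of rows, so cbind() will line them up later
```

Checking that the row counts agree, as the dim calls above do, is a cheap guard against stacking the pieces in different orders.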

Let's assign names to the variables

In [96]:
names(Subject)<-c("subject")
names(Y)<- c("activity")
names(X)<- Features[ ,2]


Regular expressions

Now, let's extract only the measurements on the mean and standard deviation for each measurement using regular expression functions.

help(grep) shows us the functions we can use for regular expressions in R.

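Let's find the indices of the feature names that contain -mean() or -std(). Parentheses are special characters in a regular expression, so each one is escaped with \\ to match it literally.

We use the function grep, which returns the indices of the entries in the second column of Features that match the pattern.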

In [97]:
indices <- grep("-mean\\(\\)|-std\\(\\)", Features[, 2])
extracted <- X[, indices]
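
To see why the parentheses are escaped, here is a small sketch on a handful of feature names written in the style of features.txt (the toy_features vector is an illustrative sample, not the full list). Without the escaped parentheses, the pattern would also pick up the meanFreq() features:

```r
# A few feature names in the style of features.txt (illustrative sample)
toy_features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                  "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X")

# "\\(" and "\\)" match literal parentheses; "|" means "or"
idx_strict <- grep("-mean\\(\\)|-std\\(\\)", toy_features)

# Without the parentheses, "-mean" also matches "-meanFreq()"
idx_loose <- grep("-mean|-std", toy_features)

toy_features[idx_strict]
toy_features[idx_loose]
```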


Let's name the columns of extracted using the matching entries of Features.

Let's remove "()" using the regular expression function gsub and change the characters to lower case for readability.

In [99]:
names(extracted) <- Features[indices, 2]
names(extracted) <- gsub("\\(|\\)", "", names(extracted))
names(extracted) <- tolower(names(extracted))

In [107]:
dim(extracted)

Out[107]:
1. 10299
2. 66
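
As a quick illustration of the renaming step, the sketch below applies the same gsub and tolower calls to two sample names (toy_names is a made-up vector for illustration). A character class such as [()] is an equivalent way to write the pattern, since parentheses need no escaping inside square brackets:

```r
# Two sample column names in the style of the data set (illustrative)
toy_names <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")

# Remove every "(" or ")" and convert to lower case
clean_names <- tolower(gsub("\\(|\\)", "", toy_names))

# Equivalent pattern written as a character class
clean_names2 <- tolower(gsub("[()]", "", toy_names))

clean_names
identical(clean_names, clean_names2)
```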

Descriptive activity names

Now, let's use descriptive activity names to name the activities in the data set.

In [100]:
activities <- read.table("activity_labels.txt")


Let's see activities

In [101]:
activities

Out[101]:
  V1 V2
1  1 WALKING
2  2 WALKING_UPSTAIRS
3  3 WALKING_DOWNSTAIRS
4  4 SITTING
5  5 STANDING
6  6 LAYING

Let's remove "_" using the regular expression function gsub and change the characters to lower case for readability.

In [102]:
activities[, 2] = gsub("_", "", tolower(as.character(activities[, 2])))
Y[,1] = activities[Y[,1], 2]
names(Y) <- "activity"

In [109]:
Y[1:10,1]   # Just checking that it has been appropriately renamed

Out[109]:
1. "standing"
2. "standing"
3. "standing"
4. "standing"
5. "standing"
6. "standing"
7. "standing"
8. "standing"
9. "standing"
10. "standing"
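
The replacement activities[Y[,1], 2] works by using each numeric code as a row index into the label table, which is valid here because row i of the table holds code i. A small sketch on made-up codes (the toy_ objects are illustrative), with a match-based lookup shown as a variant that does not depend on the row order of the table:

```r
# A small label table in the style of activity_labels.txt (illustrative subset)
toy_labels <- data.frame(V1 = 1:3,
                         V2 = c("WALKING", "WALKING_UPSTAIRS", "SITTING"),
                         stringsAsFactors = FALSE)

# Drop underscores and convert to lower case
toy_labels[, 2] <- gsub("_", "", tolower(toy_labels[, 2]))

# Hypothetical activity codes, one per observation
toy_codes <- c(3, 1, 2, 2)

# Index by position: code i picks row i of the label table
toy_named <- toy_labels[toy_codes, 2]

# Variant: look the code up in V1 explicitly, independent of row order
toy_named2 <- toy_labels$V2[match(toy_codes, toy_labels$V1)]

toy_named
```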

Label the data set with descriptive variable names

Now, let's appropriately label the data set with descriptive variable names.

In [103]:
names(Subject) <- "subject"

Now, let's create a data frame and save it as merged_and_cleaned_data.txt

In [111]:
clean <- cbind(Subject,Y,extracted)

write.table(clean, "merged_and_cleaned_data.txt")

dim(clean)

Out[111]:
1. 10299
2. 68

From the clean data set, let's create a second, independent tidy data set with the average of each variable for each activity and each subject.

We can use the handy function aggregate for this purpose.

In [112]:
clean2<-aggregate(. ~subject + activity, clean, mean)
clean2<-clean2[order(clean2$subject,clean2$activity),]
write.table(clean2, file = "merged_and_cleaned_data2.txt", row.names = FALSE)

dim(clean2)

Out[112]:
1. 180
2. 68
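
To make the formula in aggregate concrete, here is a minimal sketch on a made-up data frame (toy_clean and its columns m1 and m2 are illustrative). The dot on the left-hand side stands for every column not named on the right, so each measurement column is averaged within each subject and activity pair:

```r
# Hypothetical cleaned data: two subjects, two activities, two measurements
toy_clean <- data.frame(
  subject  = c(1, 1, 1, 2, 2),
  activity = c("walking", "walking", "sitting", "walking", "walking"),
  m1 = c(1, 3, 5, 2, 4),
  m2 = c(10, 30, 50, 20, 40)
)

# Average every remaining column for each subject/activity combination
toy_avg <- aggregate(. ~ subject + activity, toy_clean, mean)

# Sort by subject, then by activity, as in the analysis above
toy_avg <- toy_avg[order(toy_avg$subject, toy_avg$activity), ]

toy_avg
```

Only combinations that occur in the data produce a row, so this toy example yields three rows rather than four.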

This has been a quick overview of how regular expressions in R can help create a tidy data set for later analysis.