Fisseha Berhane, PhD

Data Scientist


The Role of Regular Expressions in Creating Tidy Data

In this analysis, we use regular expressions in R to prepare a tidy data set for later analysis, and in doing so demonstrate how useful regular expressions are for cleaning data.

Let's check whether the package 'downloader' is installed on our computer. If it is not, we install it.

In [3]:
if(!require(downloader)){
    install.packages('downloader', repos='http://cran.us.r-project.org')
}
In [4]:
library(downloader)         # load the package for use
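
The install-if-missing check above can be wrapped in a small function when several packages are needed. This is only a sketch: `ensure_package` is a hypothetical helper name, not part of the original notebook, and the CRAN mirror URL is an assumption.

```r
# Hypothetical helper (not from the original notebook):
# install a package only when it is missing, then attach it.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  library(pkg, character.only = TRUE)  # attach by name held in a string
  invisible(TRUE)
}

# "utils" ships with R, so this call installs nothing.
ensure_package("utils")
```

In the notebook's setting one would call `ensure_package("downloader")` and `ensure_package("plyr")` instead.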

Load other required packages.

In [71]:
require(downloader)
library(plyr)

One of the most exciting areas in data science right now is wearable computing; see, for example, this article. Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users.

The data used in this analysis were collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available here.

In [72]:
setwd("C:/Fish/Ranalysis/Re")  # set working directory
rm(list=ls())  # clear workspace

Download the data as dataset.zip from here and unzip it.

In [49]:
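url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

# download() passes extra arguments on to download.file();
# mode = "wb" keeps the zip file intact on Windows
download(url, destfile = "dataset.zip", mode = "wb")

unzip("dataset.zip")  # creates the "UCI HAR Dataset" folder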
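
Re-running the notebook downloads the archive again each time. A possible guard is sketched below; `fetch_once` is a hypothetical helper, and it uses base R's `download.file` rather than the downloader package, which is a substitution made here for self-containment. The demonstration uses an already existing temporary file, so no network access takes place.

```r
# Hypothetical helper: download `url` to `destfile` only if the file
# is not already present. Returns TRUE if a download happened.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)  # file already there, nothing downloaded
  }
  download.file(url, destfile = destfile, mode = "wb")  # binary mode for zip files
  TRUE
}

# Demonstration without network access: the target file already exists,
# so the placeholder URL is never contacted.
existing <- tempfile(fileext = ".zip")
file.create(existing)
downloaded <- fetch_once("https://example.com/unused.zip", existing)
downloaded
```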

Let's list the files and folders in the working directory.

In [50]:
dir()
Out[50]:
  1. "dataset.zip"
  2. "UCI HAR Dataset"

"UCI HAR Dataset" is the directory created by unzipping dataset.zip.

Change the working directory to the unzipped folder, "UCI HAR Dataset".

In [74]:
setwd("C:/Fish/Ranalysis/Re/UCI HAR Dataset")

Let's see the list of files and folders in the "UCI HAR Dataset" folder.

In [75]:
dir()
Out[75]:
  1. "activity_labels.txt"
  2. "features.txt"
  3. "features_info.txt"
  4. "merged_and_cleaned_data.txt"
  5. "merged_and_cleaned_data2.txt"
  6. "README.txt"
  7. "test"
  8. "tidydata2.txt"
  9. "train"

Let's see the contents of the folders test and train.

In [76]:
list.files("./train")
Out[76]:
  1. "Inertial Signals"
  2. "subject_train.txt"
  3. "X_train.txt"
  4. "y_train.txt"
In [77]:
list.files("./test")
Out[77]:
  1. "Inertial Signals"
  2. "subject_test.txt"
  3. "X_test.txt"
  4. "y_test.txt"

"README.txt" contains detailed information about the different data sets.

The files used in this analysis are shown in the figure below. Files in the Inertial Signals folders are not used here. From the figure, we see that Activity, Subject and Features supply the descriptive variable names for the final data frame we will create.


Merge the training and the test sets to create one data set

As a first step in creating a tidy data set, let's merge the training and test sets.

In [78]:
X_train <- read.table("train/X_train.txt")
X_test <- read.table("test/X_test.txt")
In [79]:
dim(X_train)
Out[79]:
  1. 7352
  2. 561
In [80]:
dim(X_test)
Out[80]:
  1. 2947
  2. 561
In [83]:
X <- rbind(X_train, X_test)
In [84]:
dim(X)
Out[84]:
  1. 10299
  2. 561
In [85]:
y_train <- read.table("train/y_train.txt")
y_test <- read.table("test/y_test.txt")
In [86]:
dim(y_train)
Out[86]:
  1. 7352
  2. 1
In [87]:
dim(y_test)
Out[87]:
  1. 2947
  2. 1
In [88]:
Y <- rbind(y_train, y_test)
In [89]:
dim(Y)
Out[89]:
  1. 10299
  2. 1
In [90]:
subject_train <- read.table("train/subject_train.txt")
subject_test <- read.table("test/subject_test.txt")
In [91]:
dim(subject_train)
Out[91]:
  1. 7352
  2. 1
In [92]:
dim(subject_test)
Out[92]:
  1. 2947
  2. 1
In [93]:
Subject <- rbind(subject_train, subject_test)
In [94]:
dim(Subject)
Out[94]:
  1. 10299
  2. 1
In [95]:
Features <- read.table("features.txt")

dim(Features)
Out[95]:
  1. 561
  2. 2
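
The read-then-`rbind` pattern above is repeated three times (for X, y and subject). A compact sketch of the same idea follows; `read_and_stack` is a hypothetical helper, and the demonstration builds a tiny stand-in folder in a temporary directory rather than using the real data.

```r
# Hypothetical helper: read "<root>/train/<prefix>_train.txt" and
# "<root>/test/<prefix>_test.txt", then stack them with train rows first.
read_and_stack <- function(prefix, root = ".") {
  parts <- lapply(c("train", "test"), function(set) {
    read.table(file.path(root, set, paste0(prefix, "_", set, ".txt")))
  })
  do.call(rbind, parts)
}

# Tiny stand-in for the real folder layout (3 training rows, 2 test rows).
root <- tempfile("har_demo")
dir.create(file.path(root, "train"), recursive = TRUE)
dir.create(file.path(root, "test"))
write.table(data.frame(1:3), file.path(root, "train", "y_train.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(4:5), file.path(root, "test", "y_test.txt"),
            row.names = FALSE, col.names = FALSE)

demo <- read_and_stack("y", root)
dim(demo)  # rows add up (3 + 2), the column count is unchanged
```

The same check applies to the real data: 7352 training rows plus 2947 test rows give the 10299 rows reported above.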

Let's name the variables.

In [96]:
names(Subject) <- "subject"
names(Y) <- "activity"
names(X) <- as.character(Features[, 2])  # the second column holds the feature names

Regular expressions

Now, let's use regular expression functions to extract only the mean and standard deviation of each measurement.

help(grep) shows us the functions we can use for regular expressions in R.

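Let's find the indices of the names that contain -mean() or -std(). Parentheses are special characters in a regular expression, so each one is escaped with \\ to match it literally.

We use the function grep to return the indices of the entries in the second column of Features that contain -mean() or -std().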

In [97]:
indices <- grep("-mean\\(\\)|-std\\(\\)", Features[, 2])
extracted <- X[, indices]
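
To see what this pattern does and does not match, here is a small sketch on four feature names taken as a sample from features.txt. Because the pattern requires "()" directly after "mean", names such as `fBodyAcc-meanFreq()-X` are left out. The `fixed = TRUE` variant is shown only as an alternative that avoids escaping; it is not what the notebook uses.

```r
# A small sample of feature names for illustration.
sample_names <- c("tBodyAcc-mean()-X",
                  "tBodyAcc-std()-Y",
                  "fBodyAcc-meanFreq()-X",
                  "angle(tBodyAccMean,gravity)")

pattern <- "-mean\\(\\)|-std\\(\\)"

hits <- grep(pattern, sample_names)                   # positions of the matches
matched <- grep(pattern, sample_names, value = TRUE)  # the matching names themselves

hits     # 1 2
matched  # "tBodyAcc-mean()-X" "tBodyAcc-std()-Y"

# Alternative: treat the search string literally, so no escaping is needed.
# fixed = TRUE does not support alternation, hence one call per string.
literal_hits <- grep("-mean()", sample_names, fixed = TRUE)
literal_hits  # 1
```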

Let's name the columns of extracted using the matching entries of Features.

Then let's remove "()" with the regular expression function gsub and convert the names to lower case for readability.

In [99]:
names(extracted) <- Features[indices, 2]
names(extracted) <- gsub("\\(|\\)", "", names(extracted))
names(extracted) <- tolower(names(extracted))
In [107]:
dim(extracted)
Out[107]:
  1. 10299
  2. 66
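
The effect of the two cleaning steps can be seen on a couple of names. This is a minimal sketch on two sample feature names; the character class `[()]` is shown as an equivalent way to write the pattern and is not the form used in the notebook.

```r
raw <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")

no_parens <- gsub("\\(|\\)", "", raw)  # drop every "(" and ")"
tidy <- tolower(no_parens)             # then convert to lower case

tidy  # "tbodyacc-mean-x" "fbodygyro-std-z"

# Equivalent pattern: inside a character class, parentheses need no escaping.
same <- gsub("[()]", "", raw)
identical(no_parens, same)  # TRUE
```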

Descriptive activity names

Now, let's use descriptive activity names to name the activities in the data set.

In [100]:
activities <- read.table("activity_labels.txt")

Let's see activities

In [101]:
activities
Out[101]:
  V1  V2
1  1  WALKING
2  2  WALKING_UPSTAIRS
3  3  WALKING_DOWNSTAIRS
4  4  SITTING
5  5  STANDING
6  6  LAYING

Let's remove "_" using the regular expression function gsub and change the characters to lower case for readability.

In [102]:
activities[, 2] <- gsub("_", "", tolower(as.character(activities[, 2])))
Y[, 1] <- activities[Y[, 1], 2]   # replace each activity code with its label
names(Y) <- "activity"
In [109]:
Y[1:10,1]   # Just checking that it has been appropriately renamed
Out[109]:
  1. "standing"
  2. "standing"
  3. "standing"
  4. "standing"
  5. "standing"
  6. "standing"
  7. "standing"
  8. "standing"
  9. "standing"
  10. "standing"
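
The renaming above indexes the label table by the activity code, which works because the codes 1 to 6 coincide with the row numbers. The sketch below rebuilds the label table by hand and shows both that positional lookup and a `match`-based lookup, which would still work if the rows were in a different order; the `match` variant is an addition here, not part of the notebook.

```r
# Label table rebuilt by hand from activity_labels.txt for illustration.
labels <- data.frame(
  V1 = 1:6,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
         "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)
labels$V2 <- gsub("_", "", tolower(labels$V2))  # lower case, underscores removed

codes <- c(5, 5, 1, 2, 6)  # made-up activity codes

by_position <- labels$V2[codes]                    # relies on code == row number
by_match    <- labels$V2[match(codes, labels$V1)]  # looks the code up explicitly

by_position  # "standing" "standing" "walking" "walkingupstairs" "laying"
```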

Label the data set with descriptive activity names

Now, let's appropriately label the data set with descriptive activity names.

In [103]:
names(Subject) <- "subject"

Now, let's create a data frame and save it as merged_and_cleaned_data.txt.

In [111]:
clean <- cbind(Subject,Y,extracted)

write.table(clean, "merged_and_cleaned_data.txt")

dim(clean)
Out[111]:
  1. 10299
  2. 68

From the clean data set, let's create a second, independent tidy data set with the average of each variable for each activity and each subject.

We can use the handy function aggregate for this purpose.

In [112]:
clean2<-aggregate(. ~subject + activity, clean, mean)
clean2<-clean2[order(clean2$subject,clean2$activity),]
write.table(clean2, file = "merged_and_cleaned_data2.txt", row.names = FALSE)

dim(clean2)
Out[112]:
  1. 180
  2. 68
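
To make the formula `. ~ subject + activity` concrete, here is a small sketch on made-up numbers. The dot stands for every column not named on the right-hand side, so each measurement column is averaged within each subject and activity pair. The data frame `toy` and its values are invented for illustration.

```r
# Made-up data: two subjects, two activities, two measurement columns.
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2),
  activity = c("walking", "walking", "sitting", "walking", "walking"),
  x        = c(1, 3, 10, 4, 6),
  y        = c(2, 4, 20, 0, 2)
)

# Average x and y within every observed (subject, activity) combination.
avg <- aggregate(. ~ subject + activity, data = toy, FUN = mean)
avg <- avg[order(avg$subject, avg$activity), ]  # sort as in the notebook

avg
```

The result has one row per observed combination, three here. In the real data, the 180 rows reported above are consistent with one row for each subject and activity pair.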


This was a quick overview of how regular expressions in R can be applied to create a tidy data set for later analysis.
