443-970-2353
[email protected]
CV Resume
When we analyze data in R or Python, it is a common practice to save data in csv, or other native formats. However, saving our data in a database has many advantages. Using databases has huge advantage when our data is big where loading it is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole data, we can work with subsets or aggregates of the data that we are interested in. This helps to have plenty of memory for our actual analysis. Further, if we have many data files, putting our data in a database, rather than in csv or other format, has better security and is easier to manage. Putting our data in a database is also more convinient for collaboration since we can give access to the tables and operations we want.
In this blog post, we will see how to use R and Python with Amazon Relational Database Service (RDS). Amazon RDS is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. Amazon RDS does frequent backups and easy replication across instances and this saves us from losing our data. Amazon RDS cosists of MySQL, MariaDB, PostgreSQL, Oracle, SQL Server and Aurora. In this blog post, we will see how to connect R and Python with MySQL, transfer data to the database, query it and use the queried data for further analysis using Pandas in Python and dplyr in R.
Amazon provides MySQL service for free for one year. You check it here
To connect R with the database, we will use the dplyr package.
library(dplyr)
We use src_mysql to connect to MySQL.
host="forrpython.cab6rvva6lca.us-east-1.rds.amazonaws.com"
port=3306
dbname="ForRPython"
user="your_username"
password="your_password"
my_db=src_mysql(dbname=dbname,host=host,port=port,user=user,password=password)
To list the tables in the database, we use the src_tbls function.
src_tbls(my_db)
Since we have not put any tables yet, src_tbls dispalys nothing.
Now, let's transfer data to the databse. We will transfer flights data from the nycflights13 package.This dataset contains all 336776 flights that departed from New York City in 2013.
Some information about the data:
library('nycflights13')
paste0("The flights data has ", nrow(flights), " observations and the variables in the dataset are:")
names(flights)
The following function changes "NA" to "NaN" or empty string. If we have NA in our data frame, our data transfer to a database will fail.
make_mysql_compatible=function(data_frame){
for(this_col in colnames(data_frame)){
isna=is.na(data_frame[[this_col]])
if(any(isna)){
if(is.numeric(data_frame[[this_col]])){
data_frame[[this_col]][isna]=NaN
}
if(is.character(data_frame[[this_col]])){
data_frame[[this_col]][isna]=''
}
}
}
return(data_frame)
}
flights=make_mysql_compatible(flights)
To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to change temporary to false to make it permament.
copy_to(my_db,flights,temporary = FALSE) # uploading flights data
Now, let's check if the table have been transferred.
src_tbls(my_db)
We have transfered the data. Now, we can query agaist the database. To connect to a table, we use the tbl function. We use the same dplyr verbs that we use in data manipulation to work with databases. dplyr translates the R code we write to SQL code.
flights_DB = tbl(my_db,"flights")
class(flights_DB)
cat("Number of Distinct flights:")
flights_DB%>%
distinct(origin)%>%
count()
filter(flights_DB, month==1, day==1)%>%count()
flights_DB%>%
group_by(carrier)%>%
summarize(Count=n())%>%
arrange(desc(Count))%>%head(1)
library(ggplot2)
flights_DB%>%
group_by(month)%>%
summarize(Total=n())%>%
collect()%>%
ggplot(aes(x=as.factor(month),y=Total))+
geom_bar(stat="identity",color="purple",fill="skyblue")+
coord_cartesian(ylim = c(25000,30000)) +xlab("Months")
flights_DB%>%
group_by(carrier)%>%
summarize(Longest_distance=max(distance))%>%
collect()%>%
ggplot(aes(x=carrier,y=Longest_distance))+
geom_point(color="red",aes(size=Longest_distance))
flights_DB%>%
group_by(origin,dest)%>%
summarize(Total_flights=n())%>%
arrange(desc(Total_flights))%>%
head(5)
cat("The Carrier with the highest departure delay is ")
flights_DB%>%
group_by(carrier)%>%
summarize(Highest_departure_delay=max(dep_delay))%>%
arrange(desc(Highest_departure_delay))%>%
head(1)
cat("The Carrier with the highest arrival delay is ")
flights_DB%>%
group_by(carrier)%>%
summarize(Highest_arrival_delay=max(arr_delay))%>%
arrange(desc(Highest_arrival_delay))%>%
head(1)
flights_DB%>%
distinct(dest, month)%>%
group_by(month)%>%
summarize(distinct_destinations=n())%>%
collect()%>%
mutate(month = factor(month,levels = month[order(distinct_destinations,decreasing =T)]))%>%
ggplot(aes(x=month,y=distinct_destinations))+
geom_bar(stat='identity',color='skyblue',fill='#b35900')+
xlab("")+ggtitle('Number of unique destinantions for each month')+
ylab('Count')+coord_cartesian(ylim = c(80,100))
We will use the pymysql package to connect to the database. After that, we will use Pandas for data analysis.
import pandas as pd
import pymysql
To connect to the database, we use the connect function from pymysql. The connect function returns a Connection object.
host="forrpython.cab6rvva6lca.us-east-1.rds.amazonaws.com"
port=3306
dbname="ForRPython"
user="your_username"
password="your_password"
conn = pymysql.connect(host, user=user,port=port,
passwd=password, db=dbname)
pd.read_sql('select count(distinct origin) AS "Distinct Origins" from flights;', con=conn)
pd.read_sql('select count(year) from flights where month=1 AND day=1;', con=conn)
pd.read_sql('select carrier, count(*) AS Total from flights group by carrier order by Total desc limit 1;', con=conn)
%matplotlib inline
import matplotlib.pyplot as plt
pandas_df=pd.read_sql('select month from flights;', con=conn)
pandas_df.groupby(pandas_df.month).size().plot(kind='bar',ylim =[24000,30000])
plt.xlabel('Month')
plt.ylabel('# of observation')
plt.show()
pandas_df=pd.read_sql('select carrier, distance from flights ;', con=conn)
pandas_df.groupby(pandas_df.carrier).max().plot(kind='bar')
plt.xlabel('Carrier')
plt.ylabel('Longest distance')
plt.show()
pandas_df=pd.read_sql('select origin, dest, count(*) as Total from flights group by origin, dest order by Total desc limit 5 ;', con=conn)
pandas_df
highest_departure_delay=pd.read_sql('select carrier, max(dep_delay) as Highest_departure_delay from flights group by carrier order by Highest_departure_delay desc limit 1 ;', con=conn)
highest_arrival_delay=pd.read_sql('select carrier, max(arr_delay) as Highest_arrival_delay from flights group by carrier order by Highest_arrival_delay desc limit 1 ;', con=conn)
print(highest_departure_delay)
print(highest_arrival_delay)
pandas_df=pd.read_sql('select distinct month,dest from flights ;', con=conn)
pandas_df.groupby(pandas_df.month).size().plot(kind='bar',ylim =[80,100])
plt.xlabel('Month')
plt.ylabel('# of observation')
plt.show()