Fisseha Berhane, PhD

Data Scientist

443-970-2353 [email protected] CV Resume Linkedin GitHub twitter twitter

Using Amazon Relational Database Service with Python and R

When we analyze data in R or Python, it is a common practice to save data in csv, or other native formats. However, saving our data in a database has many advantages. Using databases has huge advantage when our data is big where loading it is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole data, we can work with subsets or aggregates of the data that we are interested in. This helps to have plenty of memory for our actual analysis. Further, if we have many data files, putting our data in a database, rather than in csv or other format, has better security and is easier to manage. Putting our data in a database is also more convinient for collaboration since we can give access to the tables and operations we want.

In this blog post, we will see how to use R and Python with Amazon Relational Database Service (RDS). Amazon RDS is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. Amazon RDS does frequent backups and easy replication across instances and this saves us from losing our data. Amazon RDS cosists of MySQL, MariaDB, PostgreSQL, Oracle, SQL Server and Aurora. In this blog post, we will see how to connect R and Python with MySQL, transfer data to the database, query it and use the queried data for further analysis using Pandas in Python and dplyr in R.

Amazon provides MySQL service for free for one year. You check it here

Launching an AWS Database (DB) instance

  1. If you do not have an AWS account, the first step is opening one. You can visit this to open an account.
  2. Once you create an account, click at "Services" at the top left corner and then select "Database RDS"
  3. Click "Get Started Now" and select "MySQL"
  4. On the next page, on step 2, select "Dev/Test". Note the production version is not free.
  5. On the next page, on step 3, specifiy DB details. Make sure to choose free tier by selecting the radio buttion that says "Only show options that are eligible for RDS Free Tier". Provide username and password and make a note of them as we will use them to connect to the database using R and Python.
  6. On the next page, provide database name and keep a note of the database port (by default it is 3306)
  7. Once the DB instance is available, you will see the instance Endpoint, which is what we will use to specify the host when we connect R and Python to the database instance.

Working with AWS RDS MySQL with R

To connect R with the database, we will use the dplyr package.

In [2]:
library(dplyr)

We use src_mysql to connect to MySQL.

In [3]:
host="forrpython.cab6rvva6lca.us-east-1.rds.amazonaws.com"
port=3306
dbname="ForRPython"
user="your_username"
password="your_password"

my_db=src_mysql(dbname=dbname,host=host,port=port,user=user,password=password)

To list the tables in the database, we use the src_tbls function.

In [ ]:
src_tbls(my_db)

Since we have not put any tables yet, src_tbls dispalys nothing.

Put data in the database
Data

Now, let's transfer data to the databse. We will transfer flights data from the nycflights13 package.This dataset contains all 336776 flights that departed from New York City in 2013.

Some information about the data:

In [123]:
library('nycflights13')
paste0("The flights data has ", nrow(flights), " observations and the variables in the dataset are:")
names(flights)
'The flights data has 336776 observations and the variables in the dataset are:'
  1. 'year'
  2. 'month'
  3. 'day'
  4. 'dep_time'
  5. 'sched_dep_time'
  6. 'dep_delay'
  7. 'arr_time'
  8. 'sched_arr_time'
  9. 'arr_delay'
  10. 'carrier'
  11. 'flight'
  12. 'tailnum'
  13. 'origin'
  14. 'dest'
  15. 'air_time'
  16. 'distance'
  17. 'hour'
  18. 'minute'
  19. 'time_hour'

The following function changes "NA" to "NaN" or empty string. If we have NA in our data frame, our data transfer to a database will fail.

In [124]:
make_mysql_compatible=function(data_frame){
    for(this_col in colnames(data_frame)){
      isna=is.na(data_frame[[this_col]])
      if(any(isna)){
            if(is.numeric(data_frame[[this_col]])){
                data_frame[[this_col]][isna]=NaN
            }
             if(is.character(data_frame[[this_col]])){
                data_frame[[this_col]][isna]=''
            }
          }
        }
    return(data_frame)
    }
In [125]:
flights=make_mysql_compatible(flights)
Transfer data

To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to change temporary to false to make it permament.

In [126]:
copy_to(my_db,flights,temporary = FALSE) # uploading flights data
Source:   query [?? x 19]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <dbl>          <int>     <dbl>    <dbl>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <chr>

Now, let's check if the table have been transferred.

In [128]:
src_tbls(my_db)
'flights'

We have transfered the data. Now, we can query agaist the database. To connect to a table, we use the tbl function. We use the same dplyr verbs that we use in data manipulation to work with databases. dplyr translates the R code we write to SQL code.

In [129]:
flights_DB = tbl(my_db,"flights")
In [130]:
class(flights_DB)
  1. 'tbl_mysql'
  2. 'tbl_sql'
  3. 'tbl_lazy'
  4. 'tbl'
  • How many distinct origins are there in the flights table?
In [131]:
cat("Number of Distinct flights:")
flights_DB%>%
      distinct(origin)%>%
      count()
Number of Distinct flights:
Source:   query [?? x 1]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

      n
  <dbl>
1     3
  • How many flights were made on January first?
In [132]:
filter(flights_DB, month==1, day==1)%>%count()
Source:   query [?? x 1]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

      n
  <dbl>
1   842
  • What is the most common carrier?
In [99]:
flights_DB%>%
      group_by(carrier)%>%
      summarize(Count=n())%>%
      arrange(desc(Count))%>%head(1)
Source:   query [?? x 2]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

  carrier Count
    <chr> <dbl>
1      UA 58665
  • Which months have the highest number of flights?
In [133]:
library(ggplot2)

flights_DB%>%
       group_by(month)%>%
       summarize(Total=n())%>%
       collect()%>%
       ggplot(aes(x=as.factor(month),y=Total))+
        geom_bar(stat="identity",color="purple",fill="skyblue")+
        coord_cartesian(ylim = c(25000,30000)) +xlab("Months")
  • What is the longest distance for each carrier?
In [136]:
flights_DB%>%
       group_by(carrier)%>%
       summarize(Longest_distance=max(distance))%>%
       collect()%>%
       ggplot(aes(x=carrier,y=Longest_distance))+
       geom_point(color="red",aes(size=Longest_distance))
  • What are the five most common flight routes?
In [137]:
flights_DB%>%
     group_by(origin,dest)%>%
     summarize(Total_flights=n())%>%
     arrange(desc(Total_flights))%>%
     head(5)
Source:   query [?? x 3]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]
Groups: origin

  origin  dest Total_flights
   <chr> <chr>         <dbl>
1    JFK   LAX         11262
2    LGA   ATL         10263
3    LGA   ORD          8857
4    JFK   SFO          8204
5    LGA   CLT          6168
  • Is the carrier with the highest departure delay also with the highest arrival delay?
In [138]:
cat("The Carrier with the highest departure delay is ")
flights_DB%>%
     group_by(carrier)%>%
     summarize(Highest_departure_delay=max(dep_delay))%>%
     arrange(desc(Highest_departure_delay))%>%
     head(1)
The Carrier with the highest departure delay is 
Source:   query [?? x 2]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

  carrier Highest_departure_delay
    <chr>                   <dbl>
1      HA                    1301
In [139]:
cat("The Carrier with the highest arrival delay is ")
flights_DB%>%
     group_by(carrier)%>%
     summarize(Highest_arrival_delay=max(arr_delay))%>%
     arrange(desc(Highest_arrival_delay))%>%
     head(1)
The Carrier with the highest arrival delay is 
Source:   query [?? x 2]
Database: mysql 5.6.27-log [[email protected]m:/ForRPython]

  carrier Highest_arrival_delay
    <chr>                 <dbl>
1      HA                  1272
  • Are the unique destinations the same in all months?
In [147]:
flights_DB%>%
      distinct(dest, month)%>%
      group_by(month)%>%
      summarize(distinct_destinations=n())%>%
      collect()%>%
      mutate(month = factor(month,levels = month[order(distinct_destinations,decreasing =T)]))%>%
      ggplot(aes(x=month,y=distinct_destinations))+
          geom_bar(stat='identity',color='skyblue',fill='#b35900')+
          xlab("")+ggtitle('Number of unique destinantions for each month')+
          ylab('Count')+coord_cartesian(ylim = c(80,100))

Working with AWS RDS MySQL with Python

We will use the pymysql package to connect to the database. After that, we will use Pandas for data analysis.

In [10]:
import pandas as pd
import pymysql
Connect to database

To connect to the database, we use the connect function from pymysql. The connect function returns a Connection object.

In [14]:
host="forrpython.cab6rvva6lca.us-east-1.rds.amazonaws.com"
port=3306
dbname="ForRPython"
user="your_username"
password="your_password"

conn = pymysql.connect(host, user=user,port=port,
                           passwd=password, db=dbname)
  • How many distinct origins are there in the flights table?
In [24]:
pd.read_sql('select count(distinct origin) AS "Distinct Origins" from flights;', con=conn) 
Out[24]:
Distinct Origins
0 3
  • How many flights were made on January first?
In [27]:
pd.read_sql('select count(year)  from flights where month=1 AND day=1;', con=conn) 
Out[27]:
count(year)
0 842
  • What is the most common carrier?
In [30]:
pd.read_sql('select carrier, count(*) AS Total from flights group by carrier order by Total desc  limit 1;', con=conn) 
Out[30]:
carrier Total
0 UA 58665
  • Which months have the highest number of flights?
In [58]:
%matplotlib inline
import matplotlib.pyplot as plt
pandas_df=pd.read_sql('select month from flights;', con=conn) 
pandas_df.groupby(pandas_df.month).size().plot(kind='bar',ylim =[24000,30000])
plt.xlabel('Month')
plt.ylabel('# of observation')
plt.show()
  • What is the longest distance for each carrier?
In [ ]:
In [84]:
pandas_df=pd.read_sql('select carrier, distance  from flights ;', con=conn) 
pandas_df.groupby(pandas_df.carrier).max().plot(kind='bar')
plt.xlabel('Carrier')
plt.ylabel('Longest distance')
plt.show()
  • What are the five most common flight routes?
In [87]:
pandas_df=pd.read_sql('select origin, dest, count(*) as Total  from flights group by origin, dest order by Total desc limit 5 ;', con=conn)
pandas_df
Out[87]:
origin dest Total
0 JFK LAX 11262
1 LGA ATL 10263
2 LGA ORD 8857
3 JFK SFO 8204
4 LGA CLT 6168
  • Is the carrier with the highest departure delay also with the highest arrival delay?
In [88]:
highest_departure_delay=pd.read_sql('select carrier, max(dep_delay) as Highest_departure_delay  from flights group by carrier order by Highest_departure_delay desc limit 1 ;', con=conn)
highest_arrival_delay=pd.read_sql('select carrier, max(arr_delay) as Highest_arrival_delay  from flights group by carrier order by Highest_arrival_delay desc limit 1 ;', con=conn)
In [94]:
print(highest_departure_delay)
  carrier  Highest_departure_delay
0      HA                   1301.0
In [95]:
print(highest_arrival_delay)
  carrier  Highest_arrival_delay
0      HA                 1272.0
  • Are the unique destinations the same in all months?
In [100]:
pandas_df=pd.read_sql('select distinct month,dest from flights ;', con=conn)
pandas_df.groupby(pandas_df.month).size().plot(kind='bar',ylim =[80,100])
plt.xlabel('Month')
plt.ylabel('# of observation')
plt.show()
comments powered by Disqus