Fisseha Berhane, PhD

Data Scientist

443-970-2353 [email protected] CV Resume Linkedin GitHub twitter twitter

Using MongoDB with R and Python

MongoDB is a NoSQL database program which uses JSON-like documents with schemas. It is free and open-source cross-platform database. MongoDB, top NoSQL database engine in use today, could be a good data storage alternative when analysing large volume data.

To use MongoDB with R and Python, first, we have to download and install MongoDB

Next, start MongoDB. We can start MongoDB like so:

In [ ]:
mongod

For R, we can use the mongolite package which is a fast and simple MongoDB client for R. For Python, on the other hand, we can use the PyMongo Python driver which is a Python distribution containing tools for working with MongoDB.

Installation
R
In [ ]:
install.packages("mongolite",dependencies =TRUE,repos='http://cran.us.r-project.org')
Python

We can use pip or easy_install as below:

In [ ]:
pip install pymongo
In [ ]:
easy_install pymongo

Or if we are using Anaconda distribution, we can install it with conda:

In [ ]:
conda install -c anaconda pymongo=3.3.0
Inserting data

Let's insert the crimes data from data.gov to MongoDB and query it using both R and Python. The dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago since 2001.

In [69]:
library(ggplot2)
library(dplyr)
library(maps)
library(ggmap)
library(mongolite)
library(lubridate)
library(gridExtra)
In [4]:
crimes=data.table::fread("Crimes_2001_to_present.csv")
Read 6261148 rows and 22 (of 22) columns from 1.371 GB file in 00:00:51
In [8]:
cat("The crimes data has ",nrow(crimes)," observation and ",ncol(crimes)," columns")
The crimes data has  6261148  observation and  22  columns
In [9]:
names(crimes)
  1. 'ID'
  2. 'Case Number'
  3. 'Date'
  4. 'Block'
  5. 'IUCR'
  6. 'Primary Type'
  7. 'Description'
  8. 'Location Description'
  9. 'Arrest'
  10. 'Domestic'
  11. 'Beat'
  12. 'District'
  13. 'Ward'
  14. 'Community Area'
  15. 'FBI Code'
  16. 'X Coordinate'
  17. 'Y Coordinate'
  18. 'Year'
  19. 'Updated On'
  20. 'Latitude'
  21. 'Longitude'
  22. 'Location'

Let's remove spaces in the column names to avoid any problems related with that.

In [10]:
names(crimes) = gsub(" ","",names(crimes)) 
names(crimes)
  1. 'ID'
  2. 'CaseNumber'
  3. 'Date'
  4. 'Block'
  5. 'IUCR'
  6. 'PrimaryType'
  7. 'Description'
  8. 'LocationDescription'
  9. 'Arrest'
  10. 'Domestic'
  11. 'Beat'
  12. 'District'
  13. 'Ward'
  14. 'CommunityArea'
  15. 'FBICode'
  16. 'XCoordinate'
  17. 'YCoordinate'
  18. 'Year'
  19. 'UpdatedOn'
  20. 'Latitude'
  21. 'Longitude'
  22. 'Location'

With R, we can use the insert function from the mongolite package to insert rows to a collection in MongoDB. In Python, we can use the insert_one() method and the insert_many() method to add documents to a collection in MongoDB. In the next steps, I will show the commands that perform similar tasks with both R and Python. The Python codes are commented out.

Let's create a database called Chicago and call the collection crimes.

In [5]:
my_collection = mongo(collection = "crimes", db = "Chicago") # create connection, database and collection
my_collection$insert(crimes)                                 # insert data



# from pymongo import MongoClient    #   Create a MongoClient to the running mongod instance.
# client = MongoClient()             #   Connect on the default host and port
# db = client.Chicago                #   database
# my_collection=db.crimes            # collection
# my_collection.insert_one(crimes)   # insert crimes data

Let's check if we have inserted the "crimes" data.

In [6]:
my_collection$count()

# my_collection.count()
6261148

We see that the count() function above and nrow() give the same number of records

Performing a query and retrieving data

First, let's look what the data looks like by displaying one record:

In [13]:
my_collection$iterate()$one()

# In Python, we use find_one() to get the first document in the collection
# my_collection.find_one()
$ID
1454164
$CaseNumber
'G185744'
$Date
'04/01/2001 06:00:00 PM'
$Block
'049XX N MENARD AV'
$IUCR
'0910'
$PrimaryType
'MOTOR VEHICLE THEFT'
$Description
'AUTOMOBILE'
$LocationDescription
'STREET'
$Arrest
'false'
$Domestic
'false'
$Beat
1622
$District
16
$FBICode
'07'
$XCoordinate
1136545
$YCoordinate
1932203
$Year
2001
$UpdatedOn
'08/17/2015 03:03:40 PM'
$Latitude
41.970129962
$Longitude
-87.773302309
$Location
'(41.970129962, -87.773302309)'

How many distinct "Primary Type" do we have?

In [7]:
length(my_collection$distinct("PrimaryType"))

# len(my_collection.distinct("PrimaryType"))
35

As shown above, there are 35 different crime primary types in the database. We will see the patterns of the most common crime types below.

From the $iterate()$$one() command above, we have seen that one of the columns is Domestic, which shows whether the crime is domestic or not. Now, let's see how many domestic assualts there are in the collection.

In [16]:
my_collection$count('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }')
# my_collection.count({"PrimaryType" : "ASSAULT", "Domestic" : "true" })
82470

We can check with dplyr, like so:

In [17]:
crimes%>%filter(PrimaryType=="ASSAULT",Domestic=="true")%>%nrow()
82470

To get the filtered data:

In [18]:
query1= my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }')
# query1=my_collection.find({"Primary Type" : "ASSAULT", "Domestic" : "true" }) #  find() returns a Cursor instance, which allows us to iterate over all matching documents
 Imported 82470 records. Simplifying into dataframe...

We can also retrieve only the columns of interest:

In [8]:
query2= my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }',
                           fields = '{"_id":0, "PrimaryType":1, "Domestic":1}')

#query2= my_collection.find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }','{"_id":0, "PrimaryType":1, "Domestic":1}')
 Imported 82470 records. Simplifying into dataframe...
In [20]:
ncol(query1) # with all the columns
ncol(query2) # only the selected columns
22
2
In [11]:
head(query2,4)
PrimaryTypeDomestic
ASSAULTtrue
ASSAULTtrue
ASSAULTtrue
ASSAULTtrue

How do we do the above two queries with dplyr:

In [21]:
query1=crimes%>%filter(PrimaryType=="ASSAULT", Domestic=="true" )
query2=crimes%>%filter(PrimaryType=="ASSAULT", Domestic=="true" )%>%select(PrimaryType,Domestic)
In [22]:
ncol(query1)
ncol(query2)
22
2

Where do most crimes take pace?

In [90]:
my_collection$aggregate('[{"$group":{"_id":"$LocationDescription", "Count": {"$sum":1}}}]')%>%na.omit()%>%
arrange(desc(Count))%>%head(10)%>%
ggplot(aes(x=reorder(`_id`,Count),y=Count))+
geom_bar(stat="identity",color='skyblue',fill='#b35900')+geom_text(aes(label = Count), color = "blue") +coord_flip()+xlab("Location Description")
 Imported 174 records. Simplifying into dataframe...

If loading the entire dataset we are working with does not slow down our analysis, we can use data.table or dplyr but when dealing with big data, using MongoDB can give us performance boost as the whole data will not be loaded into mememory. We can reproduce the above plot without using MongoDB, like so:

In [89]:
crimes%>%group_by(`LocationDescription`)%>%summarise(Total=n())%>% arrange(desc(Total))%>%head(10)%>%
ggplot(aes(x=reorder(`LocationDescription`,Total),y=Total))+
geom_bar(stat="identity",color='skyblue',fill='#b35900')+geom_text(aes(label = Total), color = "blue") +coord_flip()+xlab("Location Description")

What if we want to query all records for certain columns only? This helps us to load only the columns we want and to save memory for our analysis.

In [12]:
query3= my_collection$find('{}', fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
# query3= my_collection.find('{}','{"_id":0, "Latitude":1, "Longitude":1}')
 Imported 6261148 records. Simplifying into dataframe...
Let's explore domestic crimes

We can explore any patterns of domestic crimes. For example, are they common in certain days/hours/months?

In [115]:
domestic=my_collection$find('{"Domestic":"true"}', fields = '{"_id":0, "Domestic":1,"Date":1}')
domestic$Date= mdy_hms(domestic$Date)
domestic$Weekday = weekdays(domestic$Date)
domestic$Hour = hour(domestic$Date)
domestic$month = month(domestic$Date,label=TRUE)
 Imported 806839 records. Simplifying into dataframe...
In [24]:
WeekdayCounts = as.data.frame(table(domestic$Weekday))
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1),size=2,color="red") + xlab("Day of the Week") + ylab("Total Domestic Crimes")+
ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_blank(),axis.text.y = element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
      axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
     axis.title.y = element_text(size=14),
     plot.title=element_text(size=16,color="purple",hjust=0.5))

Domestic crimes are common over the weekend than in weekdays? What could be the reason?

We can also see the pattern for each day by hour

In [39]:
DayHourCounts = as.data.frame(table(domestic$Weekday, domestic$Hour))
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=1.4)+ylab("Count")+
ylab("Total Domestic Crimes")+ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_text(size=14),axis.text.y = element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
      axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
     axis.title.y = element_text(size=14),
     legend.title=element_blank(),
     plot.title=element_text(size=16,color="purple",hjust=0.5))

The crimes peak mainly around mid-night. We can also use one color for weekdays and another color for weekend as shown below.

In [41]:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")

ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5) +
ylab("Total Domestic Crimes")+ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_text(size=14),axis.text.y = element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
      axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
     axis.title.y = element_text(size=14),
     legend.title=element_blank(),
     plot.title=element_text(size=16,color="purple",hjust=0.5))

The difference between weekend and weekdays are more clear from this figure than from the previous plot. We can also see the above pattern from a heatmap.

In [42]:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") +
ggtitle("Domestic Crimes in the City of Chicago Since 2001")+theme(axis.title.y = element_blank())+ylab("")+
theme(axis.title.x=element_text(size=14),axis.text.y = element_text(size=13),
      axis.text.x = element_text(size=13),
     axis.title.y = element_text(size=14),
     legend.title=element_blank(),
     plot.title=element_text(size=16,color="purple",hjust=0.5))

From the heatmap, we can see more crimes over weekends and at night.

In [120]:
monthCounts = as.data.frame(table(domestic$month))
ggplot(monthCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1),size=2,color="red") + xlab("Day of the Week") + ylab("Total Domestic Crimes")+
ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_blank(),axis.text.y = element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
      axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
     axis.title.y = element_text(size=14),
     plot.title=element_text(size=16,color="purple",hjust=0.5))

Is domestic crime associated with tempreature? Domestic crimes tremendoulsy increases during the warm months.

Now, let's see the pattern of other crime types. Since there are 35 primary types, we cannot see all of them in this post. Let's focus on four of the most common ones.

In [44]:
crimes=my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Year":1}')
crimes%>%group_by(PrimaryType)%>%summarize(Count=n())%>%arrange(desc(Count))%>%head(4)
 Imported 6261148 records. Simplifying into dataframe...
PrimaryTypeCount
THEFT 1301434
BATTERY 1142377
CRIMINAL DAMAGE 720143
NARCOTICS 687790

As shown in the table above, the most common crime type is theft followed by battery. Narcotics is fourth most common while criminal damage is the third most common crime type in the city of Chicago.

Now, let's generate plots by day and hour.

In [126]:
four_most_common=crimes%>%group_by(PrimaryType)%>%summarize(Count=n())%>%arrange(desc(Count))%>%head(4)
four_most_common=four_most_common$PrimaryType
In [127]:
crimes=my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Date":1}')
crimes=filter(crimes,PrimaryType %in%four_most_common)
crimes$Date= mdy_hms(crimes$Date)
crimes$Weekday = weekdays(crimes$Date)
crimes$Hour = hour(crimes$Date)
crimes$month=month(crimes$Date,label = TRUE)
 Imported 6261148 records. Simplifying into dataframe...
In [128]:
g = function(data){
    WeekdayCounts = as.data.frame(table(data$Weekday))
    WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
    ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1),size=2,color="red") + xlab("Day of the Week") +
    theme(axis.title.x=element_blank(),axis.text.y = element_text(color="blue",size=10,angle=0,hjust=1,vjust=0),
          axis.text.x = element_text(color="blue",size=10,angle=0,hjust=.5,vjust=.5),
         axis.title.y = element_text(size=11),
         plot.title=element_text(size=12,color="purple",hjust=0.5))
    }
In [129]:
g1=g(filter(crimes,PrimaryType=="THEFT"))+ggtitle("Theft")+ylab("Total Count")
g2=g(filter(crimes,PrimaryType=="BATTERY"))+ggtitle("BATTERY")+ylab("Total Count")
g3=g(filter(crimes,PrimaryType=="CRIMINAL DAMAGE"))+ggtitle("CRIMINAL DAMAGE")+ylab("Total Count")
g4=g(filter(crimes,PrimaryType=="NARCOTICS"))+ggtitle("NARCOTICS")+ylab("Total Count")
grid.arrange(g1,g2,g3,g4,ncol=2)

. From the plots above, we see that theft is most common on Friday. Battery and criminal damage, on the other hand, are highest at weekend. We also observe that narcotics decreases over weekend.

We can also see the pattern of the above four crime types by hour:

In [88]:
g=function(data){
    DayHourCounts = as.data.frame(table(data$Weekday, data$Hour))
    DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
    ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=1.4)+ylab("Count")+
    theme(axis.title.x=element_text(size=14),axis.text.y = element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
      axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
     axis.title.y = element_text(size=14),
     legend.title=element_blank(),
     plot.title=element_text(size=16,color="purple",hjust=0.5))
    }
In [89]:
g1=g(filter(crimes,PrimaryType=="THEFT"))+ggtitle("Theft")+ylab("Total Count")
g2=g(filter(crimes,PrimaryType=="BATTERY"))+ggtitle("BATTERY")+ylab("Total Count")
g3=g(filter(crimes,PrimaryType=="CRIMINAL DAMAGE"))+ggtitle("CRIMINAL DAMAGE")+ylab("Total Count")
g4=g(filter(crimes,PrimaryType=="NARCOTICS"))+ggtitle("NARCOTICS")+ylab("Total Count")
grid.arrange(g1,g2,g3,g4,ncol=2)