Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter twitter



Data Manipulation with Python Pandas and R Data.Table

Pandas is a commonly used data manipulation library in Python. Data.Table, on the other hand, is among the best data manipulation packages in R. Data.Table is succinct and we can do a lot with Data.Table in just a single line. Further, data.table is, generally, faster than Pandas (see benchmark here) and it may be a go-to package when performance is a constraint. For someone who knows one of these packages, I thought it could help to show codes that perform the same tasks in both packages side by side to help them quickly study the other. If you know either of them and want to learn the other, this blog post is for you.

The data is Hair and Eye Color of Statistics Students from this website.




Python Pandas

R Data.Table

In [42]:
# Import Library

import pandas as pd
In [1]:
library(data.table)
In [3]:
# Read data from a url

url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/HairEyeColor.csv"

df = pd.read_csv(url)
In [2]:
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/HairEyeColor.csv"

df = fread(url)
In [4]:
# Type of the df object

type(df)
Out[4]:
pandas.core.frame.DataFrame
In [7]:
class(df)

# data.table is data.frame
  1. 'data.table'
  2. 'data.frame'
In [5]:
# Column names

list(df)
Out[5]:
['Unnamed: 0', 'Hair', 'Eye', 'Sex', 'Freq']
In [8]:
names(df)
  1. 'V1'
  2. 'Hair'
  3. 'Eye'
  4. 'Sex'
  5. 'Freq'
In [6]:
# Show first few rows

df.head()
Out[6]:
Unnamed: 0 Hair Eye Sex Freq
0 1 Black Brown Male 32
1 2 Brown Brown Male 53
2 3 Red Brown Male 10
3 4 Blond Brown Male 3
4 5 Black Blue Male 11
In [9]:
head(df)
V1HairEyeSexFreq
1 BlackBrownMale 32
2 BrownBrownMale 53
3 Red BrownMale 10
4 BlondBrownMale 3
5 BlackBlue Male 11
6 BrownBlue Male 50
In [7]:
# Show last few rows

df.tail()
Out[7]:
Unnamed: 0 Hair Eye Sex Freq
27 28 Blond Hazel Female 5
28 29 Black Green Female 2
29 30 Brown Green Female 14
30 31 Red Green Female 7
31 32 Blond Green Female 8
In [10]:
tail(df)
V1HairEyeSexFreq
27 Red Hazel Female 7
28 Blond Hazel Female 5
29 Black Green Female 2
30 Brown Green Female14
31 Red Green Female 7
32 Blond Green Female 8
In [8]:
# Data type of each column

df.dtypes
Out[8]:
Unnamed: 0     int64
Hair          object
Eye           object
Sex           object
Freq           int64
dtype: object
In [11]:
str(df)
Classes 'data.table' and 'data.frame':	32 obs. of  5 variables:
 $ V1  : chr  "1" "2" "3" "4" ...
 $ Hair: chr  "Black" "Brown" "Red" "Blond" ...
 $ Eye : chr  "Brown" "Brown" "Brown" "Brown" ...
 $ Sex : chr  "Male" "Male" "Male" "Male" ...
 $ Freq: int  32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, ".internal.selfref")=<externalptr> 
In [9]:
# Return number of columns and rows of dataframe

df.shape
Out[9]:
(32, 5)
In [12]:
dim(df)
  1. 32
  2. 5
In [10]:
#  Number of rows

len(df.index)
Out[10]:
32
In [13]:
nrow(df)
32
In [11]:
# Number of columns

len(df.columns)
Out[11]:
5
In [14]:
ncol(df)
5
In [12]:
# Basic statistics

df.describe()
Out[12]:
Unnamed: 0 Freq
count 32.000000 32.000000
mean 16.500000 18.500000
std 9.380832 18.242099
min 1.000000 2.000000
25% 8.750000 7.000000
50% 16.500000 10.000000
75% 24.250000 29.250000
max 32.000000 66.000000
In [15]:
summary(df)
      V1                Hair               Eye                Sex           
 Length:32          Length:32          Length:32          Length:32         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
      Freq      
 Min.   : 2.00  
 1st Qu.: 7.00  
 Median :10.00  
 Mean   :18.50  
 3rd Qu.:29.25  
 Max.   :66.00  
In [13]:
# Extract first three rows

df[0:3]

   # or
    
df.iloc[:3]
Out[13]:
Unnamed: 0 Hair Eye Sex Freq
0 1 Black Brown Male 32
1 2 Brown Brown Male 53
2 3 Red Brown Male 10
In [16]:
df[1:3]
V1HairEyeSexFreq
1 BlackBrownMale 32
2 BrownBrownMale 53
3 Red BrownMale 10
In [14]:
# Filter for black hair

df[df['Hair']=="Black"]

     # or
    
df.query("Hair =='Black'")
Out[14]:
Unnamed: 0 Hair Eye Sex Freq
0 1 Black Brown Male 32
4 5 Black Blue Male 11
8 9 Black Hazel Male 10
12 13 Black Green Male 3
16 17 Black Brown Female 36
20 21 Black Blue Female 9
24 25 Black Hazel Female 5
28 29 Black Green Female 2
In [6]:
df[Hair == "Black"]


# or 

setkey(df, Hair)
df["Black"]
HairEyeSexFreq
Black Brown Male 32
Black Blue Male 11
Black Hazel Male 10
Black Green Male 3
Black Brown Female36
Black Blue Female 9
Black Hazel Female 5
Black Green Female 2
In [15]:
# Filter for males who have black hair

df[(df['Hair']=="Black")  & (df["Sex"]=="Male")]


# or
df.query("Hair == 'Black' & Sex =='Male'")
Out[15]:
Unnamed: 0 Hair Eye Sex Freq
0 1 Black Brown Male 32
4 5 Black Blue Male 11
8 9 Black Hazel Male 10
12 13 Black Green Male 3
In [19]:
df[Hair == "Black" & Sex == "Male"]


# or

setkey(df, Hair, Sex)

df[.("Black","Male")]
V1HairEyeSexFreq
5 BlackBlue Male 11
1 BlackBrownMale 32
13 BlackGreenMale 3
9 BlackHazelMale 10
In [16]:
# Filter for those who have brown eye or black hair

z = df[(df['Hair']=="Black") | (df["Eye"]=="Brown")]


# or
z = df.query("Hair == 'Black' | Eye =='Brown'")

z.head(6)
Out[16]:
Unnamed: 0 Hair Eye Sex Freq
0 1 Black Brown Male 32
1 2 Brown Brown Male 53
2 3 Red Brown Male 10
3 4 Blond Brown Male 3
4 5 Black Blue Male 11
8 9 Black Hazel Male 10
In [8]:
z = df[Hair== "Black" | Eye == "Brown"]

head(z)
HairEyeSexFreq
Black Brown Male 32
Black Blue Male 11
Black Hazel Male 10
Black Green Male 3
Black Brown Female36
Black Blue Female 9
In [17]:
# Filter for eye color of blue, hazel and green

df[df.Eye.isin(['Blue','Hazel','Green'])].head()
Out[17]:
Unnamed: 0 Hair Eye Sex Freq
4 5 Black Blue Male 11
5 6 Brown Blue Male 50
6 7 Red Blue Male 10
7 8 Blond Blue Male 30
8 9 Black Hazel Male 10
In [21]:
z = df[Eye %in% 
       c('Blue','Hazel',
         'Green')]

head(z)

# or

setkey(df, Eye)

z = df[c("Blue","Hazel", "Green")]
head(z)
V1HairEyeSexFreq
21 Black Blue Female 9
5 Black Blue Male 11
29 Black Green Female 2
13 Black Green Male 3
25 Black Hazel Female 5
9 Black Hazel Male 10
In [21]:
# Select one column

df[["Eye"]].head()

# or

df.Eye.head()
Out[21]:
Eye
0 Brown
1 Brown
2 Brown
3 Brown
4 Blue
In [22]:
z = df[,.(Eye)]
head(z)
Eye
Blue
Blue
Blue
Blue
Blue
Blue
In [22]:
# Select two columns

df[["Eye","Sex"]].head()
Out[22]:
Eye Sex
0 Brown Male
1 Brown Male
2 Brown Male
3 Brown Male
4 Blue Male
In [23]:
z = df[,.(Eye,Sex)]
head(z)
EyeSex
Blue Female
Blue Male
Blue Female
Blue Male
Blue Female
Blue Male
In [23]:
# Unique Eye colors

df["Eye"].unique()
Out[23]:
array(['Brown', 'Blue', 'Hazel', 'Green'], dtype=object)
In [24]:
df[,unique(Eye)]
  1. 'Blue'
  2. 'Brown'
  3. 'Green'
  4. 'Hazel'
In [24]:
# Maximum of the "Freq" column

df.Freq.max()
Out[24]:
66
In [25]:
df[,max(Freq)]
66
In [25]:
# Call functions on multiple columns 

import numpy as np

pd.DataFrame({'Max_freq': [df.Freq.max()], \
             'Min_freq': [df.Freq.min()],\
             'Std_freq': [np.std(df.Freq)]})
Out[25]:
Max_freq Min_freq Std_freq
0 66 2 17.954804
In [26]:
df[,.(Max_freq =
      max(Freq),
      Min_freq =
      min(Freq),
      Std_freq =
      sd(Freq))]
Max_freqMin_freqStd_freq
66 2 18.2421
In [26]:
# Maximum Frequency by Sex

df.groupby("Sex").agg({"Freq":"max"})
Out[26]:
Freq
Sex
Female 66
Male 53
In [27]:
df[, max(Freq),
    by = Sex]
SexV1
Female66
Male 53
In [27]:
# Count by Eye color and Sex

df.groupby(["Eye","Sex"]).agg({"Freq":"count"}).rename(columns={"Freq":"Count"})
Out[27]:
Count
Eye Sex
Blue Female 4
Male 4
Brown Female 4
Male 4
Green Female 4
Male 4
Hazel Female 4
Male 4
In [28]:
df[,.(Count = .N), 
   by = .(Eye, Sex)]
EyeSexCount
Blue Female4
Blue Male 4
Brown Female4
Brown Male 4
Green Female4
Green Male 4
Hazel Female4
Hazel Male 4
In [28]:
# Call functions for grouping

df.assign(Gt50 = (df.Freq > 50)).groupby("Gt50").agg({"Gt50":"count"})\
.rename(columns ={"Gt50":"Count"})
Out[28]:
Count
Gt50
False 29
True 3
In [29]:
df[,.(Count = .N), 
   by = .(Gt50 = 
             Freq > 50)]
Gt50Count
FALSE29
TRUE 3
In [44]:
# Do the analysis on selected rows only

pd.DataFrame({'Max_freq': [df[0:10].Freq.max()], \
             'Min_freq': [df[0:10].Freq.min()],\
             'Std_freq': [np.std(df[0:10].Freq)]})
Out[44]:
Max_freq Min_freq Std_freq
0 53 3 16.770212
In [3]:
df[1:10,.(Max_freq =
      max(Freq),
      Min_freq =
      min(Freq),
      Std_freq =
      sd(Freq))]
Max_freqMin_freqStd_freq
53 3 17.67736
In [30]:
# Remove a column

df.drop('Unnamed: 0', 1).head()
Out[30]:
Hair Eye Sex Freq
0 Black Brown Male 32
1 Brown Brown Male 53
2 Red Brown Male 10
3 Blond Brown Male 3
4 Black Blue Male 11
In [4]:
head(df[, V1 := NULL])
HairEyeSexFreq
BlackBrownMale 32
BrownBrownMale 53
Red BrownMale 10
BlondBrownMale 3
BlackBlue Male 11
BrownBlue Male 50
In [32]:
# Return the first occurance

df.query("Eye == 'Blue'")[:1]
Out[32]:
Unnamed: 0 Hair Eye Sex Freq
4 5 Black Blue Male 11
In [36]:
setkey(df, Eye)
df["Blue", mult = "first"]
V1HairEyeSexFreq
5 BlackBlue Male 11
In [33]:
# Return the last occurance

df.query("Eye == 'Blue'")[-1:]
Out[33]:
Unnamed: 0 Hair Eye Sex Freq
23 24 Blond Blue Female 64
In [37]:
setkey(df, Eye)
df["Blue", mult = "last"]
V1HairEyeSexFreq
24 Blond Blue Female64
In [34]:
# Return a count

df[df.Eye.isin(['Blue','Hazel']) & (df.Sex=="Male")].shape[0]
Out[34]:
8
In [48]:
setkey(df, Eye, Sex)

df[.(c("Blue","Hazel"), "Male"), .N]
8
In [35]:
# Count for each group

df[df.Eye.isin(['Blue','Hazel']) & (df.Sex=="Male")].\
groupby(["Eye","Sex"]).agg({"Freq":"count"}).rename(columns={"Freq":"Count"})
Out[35]:
Count
Eye Sex
Blue Male 4
Hazel Male 4
In [44]:
setkey(df, Eye, Sex)
df[.(c("Blue","Hazel"),  
      "Male"), .N,
   by =.EACHI]
EyeSexN
Blue Male 4
HazelMale 4
In [36]:
# Order in ascending order

df.sort_values(by='Freq').tail(6)
Out[36]:
Unnamed: 0 Hair Eye Sex Freq
21 22 Brown Blue Female 34
16 17 Black Brown Female 36
5 6 Brown Blue Male 50
1 2 Brown Brown Male 53
23 24 Blond Blue Female 64
17 18 Brown Brown Female 66
In [49]:
tail(df[order(Freq)])
V1HairEyeSexFreq
22 Brown Blue Female34
17 Black Brown Female36
6 Brown Blue Male 50
2 Brown Brown Male 53
24 Blond Blue Female64
18 Brown Brown Female66
In [37]:
# Order in descending order

df.sort_values(by='Freq', ascending = False).tail(6)
Out[37]:
Unnamed: 0 Hair Eye Sex Freq
11 12 Blond Hazel Male 5
27 28 Blond Hazel Female 5
19 20 Blond Brown Female 4
3 4 Blond Brown Male 3
12 13 Black Green Male 3
28 29 Black Green Female 2
In [50]:
tail(setorder(df, -Freq))
V1HairEyeSexFreq
28 Blond Hazel Female5
12 Blond Hazel Male 5
20 Blond Brown Female4
4 Blond Brown Male 3
13 Black Green Male 3
29 Black Green Female2
In [38]:
# "Freq" in descending and "Eye" in ascending

df.sort_values(by=['Freq','Eye'], ascending = [False,True]).tail(6)
Out[38]:
Unnamed: 0 Hair Eye Sex Freq
24 25 Black Hazel Female 5
27 28 Blond Hazel Female 5
19 20 Blond Brown Female 4
3 4 Blond Brown Male 3
12 13 Black Green Male 3
28 29 Black Green Female 2
In [51]:
tail(setorder(df, -Freq, Eye))
V1HairEyeSexFreq
28 Blond Hazel Female5
12 Blond Hazel Male 5
20 Blond Brown Female4
4 Blond Brown Male 3
13 Black Green Male 3
29 Black Green Female2
In [39]:
# Rename columns

df.rename(columns = {"Freq":"Frequency","Eye":"Eye_Color"}).tail()
Out[39]:
Unnamed: 0 Hair Eye_Color Sex Frequency
27 28 Blond Hazel Female 5
28 29 Black Green Female 2
29 30 Brown Green Female 14
30 31 Red Green Female 7
31 32 Blond Green Female 8
In [53]:
setnames(df,c("Freq","Eye"),c("Frequency", "Eye_Color"))
tail(df)
V1HairEye_ColorSexFrequency
28 Blond Hazel Female5
12 Blond Hazel Male 5
20 Blond Brown Female4
4 Blond Brown Male 3
13 Black Green Male 3
29 Black Green Female2
In [40]:
# Unique rows

df[["Eye","Sex"]].drop_duplicates()
Out[40]:
Eye Sex
0 Brown Male
4 Blue Male
8 Hazel Male
12 Green Male
16 Brown Female
20 Blue Female
24 Hazel Female
28 Green Female
In [56]:
unique(df[,.(Eye_Color,Sex)])
Eye_ColorSex
Brown Female
Blue Female
Brown Male
Blue Male
Hazel Female
Hazel Male
Green Male
Green Female
In [41]:
# Create new column

df.assign(Eye_Hair =df.Eye + df.Hair)[["Eye","Hair","Eye_Hair"]].head()
Out[41]:
Eye Hair Eye_Hair
0 Brown Black BrownBlack
1 Brown Brown BrownBrown
2 Brown Red BrownRed
3 Brown Blond BrownBlond
4 Blue Black BlueBlack
In [65]:
z = df[, Eye_Hair := paste(Eye_Color, Hair)][,.(Eye_Color, Hair, Eye_Hair)]

head(z)
Eye_ColorHairEye_Hair
Brown Brown Brown Brown
Blue Blond Blue Blond
Brown Brown Brown Brown
Blue Brown Blue Brown
Brown Black Brown Black
Blue Brown Blue Brown
comments powered by Disqus