Most of the data out there is unstructured, and Spark is an excellent tool for analyzing this type of data. Here, we will analyze the Bible and the Quran. We will see the distribution of words, the most common words in both scriptures and the average frequency. This could also be scaled to find the most common words and distribution of all words on the Internet. The books have been retrieved from Project Gutenberg. The Bible can be downloaded from here and the Quran from here.
pyspark is the Spark Python API, which exposes the Spark programming model to Python. SparkContext is the main entry point for Spark functionality. In Spark, communication occurs between a driver and executors. The driver has Spark jobs to run; these jobs are split into tasks that are submitted to the executors for completion, and the results of these tasks are delivered back to the driver. To use Spark and its API, we need a SparkContext: you start a new Spark application by creating one. When the SparkContext is created, it asks the master for some cores to do the work. The master sets these cores aside just for you; they won't be used for other applications. In the code below, we also set two configuration parameters: the application name, and a master of local[*], which runs Spark on the local machine using all available cores, since I am using my PC for this tutorial.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("miniProject").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
A Spark context can be used to create Resilient Distributed Datasets (RDDs) on a cluster.
To convert a text file into an RDD, we use the SparkContext.textFile() method.
bibleRDD = sc.textFile("bible.txt")
quranRDD = sc.textFile("quran.txt")
collect() returns a list that contains all of the elements in the RDD. Here we call it on a small random sample of each RDD rather than on the full text.
bibleRDD.sample(withReplacement = False, fraction = 0.0002, seed = 80).collect()
quranRDD.sample(withReplacement = False, fraction = 0.0002, seed = 80).collect()
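With withReplacement = False, sample() treats fraction as the probability of keeping each element, so the size of the result is only approximately fraction times the number of lines. The sketch below imitates that behaviour in plain Python. The helper bernoulli_sample and the list lines are hypothetical stand-ins, and Spark uses its own random number generator, so this does not reproduce the exact lines Spark would pick.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

# Hypothetical stand-in for an RDD of text lines
lines = ["line {}".format(i) for i in range(1000)]

sample_a = bernoulli_sample(lines, fraction=0.01, seed=80)
sample_b = bernoulli_sample(lines, fraction=0.01, seed=80)

# The sample size is about 10 on average, but it is not guaranteed to be exactly 10
print(len(sample_a))

# The same seed reproduces the same sample
print(sample_a == sample_b)
```

This is why a fixed seed is passed in the calls above: it makes the sampled lines repeatable between runs.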
Let's count the number of lines in each RDD.
print('The number of lines in the Bible text file is {}'.format(bibleRDD.count()))
print('The number of lines in the Quran text file is {}'.format(quranRDD.count()))
Words should be counted independently of their capitalization, so we will change all words to lower case. We will also remove all punctuation. Further, any leading or trailing spaces on a line should be removed.
The function below removes all characters that are not alphanumeric or whitespace. It also converts the text to lower case and strips leading and trailing whitespace. As you can see, we are using the Python module re.
import re
def wordclean(x):
    # Drop everything except ASCII letters, digits and whitespace, then normalize
    return re.sub(r"[^a-zA-Z0-9\s]+", "", x).lower().strip()
Let's check the above function:
x = [" The Sun rises in the East and sets in the West!\n ",
     " He said, 'I am sure you know the answer!\n' "]
for i in x:
    print(wordclean(i))
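Two side effects of this character class are worth keeping in mind. The inputs below are made-up strings, not lines from either book: apostrophes are removed, so contractions collapse into a single token, and non-ASCII letters are dropped as well.

```python
import re

def wordclean(x):
    # Same cleaning rule as above: keep ASCII letters, digits and whitespace only
    return re.sub(r"[^a-zA-Z0-9\s]+", "", x).lower().strip()

contraction = wordclean("  Don't stop!  ")
accented = wordclean("Café")

print(contraction)  # dont stop
print(accented)     # caf
```

If accented characters matter for a text, a broader character class would be needed; for plain ASCII texts the simple version is enough.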
Now, we can apply it to our Bible and Quran RDDs. We use the map RDD method.
bibleRDDList = bibleRDD.map(lambda x : wordclean(x))
quranRDDList = quranRDD.map(lambda x : wordclean(x))
Now, let's see what the cleaned RDDs above look like. As shown below, all punctuation has been removed and all letters are lower case.
bibleRDDList.take(60)[41: ]
quranRDDList.take(450)[414 : ]
Next, we apply a transformation that splits each element of the RDD on spaces, using Python's string split() function. Note that we use flatMap here rather than map, so the result is a single RDD of words instead of an RDD of per-line word lists.
bibleRDDwords = bibleRDDList.flatMap( lambda x: x.split(" "))
quranRDDwords = quranRDDList.flatMap( lambda x: x.split(" "))
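The difference between map and flatMap can be sketched in plain Python without Spark. The two lines below are hypothetical cleaned lines; the list comprehension plays the role of map and itertools.chain.from_iterable plays the role of flatMap.

```python
from itertools import chain

# Hypothetical cleaned lines
lines = ["in the beginning", "god created the heaven"]

# map-like behaviour: one output element per input line, giving a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-like behaviour: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['in', 'the', 'beginning'], ['god', 'created', 'the', 'heaven']]
print(flat_mapped)  # ['in', 'the', 'beginning', 'god', 'created', 'the', 'heaven']
```

For word counting we want the flat version, because each word has to become its own element.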
Let's show sample words from each RDD.
bibleRDDwords.sample(withReplacement = False, fraction = 0.00001, seed = 90).collect()
quranRDDwords.sample(withReplacement = False, fraction = 0.00005, seed = 90).collect()
Now, let's remove the empty strings that split(" ") produces when a line is blank or contains consecutive spaces. We use the filter method to achieve this.
bibleRDDwords = bibleRDDwords.filter(lambda x: len(x) != 0)
quranRDDwords = quranRDDwords.filter(lambda x: len(x) != 0)
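To see where the zero-length elements come from, here is a small plain Python illustration on a made-up line with repeated spaces. It also shows that calling split() with no argument would avoid them in the first place, which is an alternative and not what the code above does.

```python
# Hypothetical line with repeated spaces
line = "and god  said let   there be light"

pieces = line.split(" ")
print(pieces)
# ['and', 'god', '', 'said', 'let', '', '', 'there', 'be', 'light']

# Same test as the filter above: keep only non-empty strings
words = [w for w in pieces if len(w) != 0]
print(words)
# ['and', 'god', 'said', 'let', 'there', 'be', 'light']

# split() without an argument splits on runs of whitespace and drops empty strings
print(line.split() == words)  # True
```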
Next, let's create word pairs. This helps us to count the frequency of each word and to select the most common words in each RDD.
bibleRDDwordPairs = bibleRDDwords.map(lambda x: (x,1))
quranRDDwordPairs = quranRDDwords.map(lambda x: (x, 1))
Let's show the first ten elements of each RDD.
bibleRDDwordPairs.take(10)
quranRDDwordPairs.take(10)
Now, we can find the frequency of each word.
The reduceByKey() transformation gathers together pairs that have the same key and applies a function to two associated values at a time. reduceByKey() operates by applying the function first within each partition on a per-key basis and then across the partitions.
bibleRDDwordCount = bibleRDDwordPairs.reduceByKey(lambda a, b : a + b)
quranRDDwordCount = quranRDDwordPairs.reduceByKey(lambda a, b : a + b)
bibleRDDwordCount.take(10)
quranRDDwordCount.take(10)
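The two-stage behaviour of reduceByKey() can be imitated in plain Python. This is only a sketch of the idea, not Spark's implementation: reduce_by_key is a hypothetical helper, and the two partitions are made-up data.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with `func`, two values at a time."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

def add(a, b):
    return a + b

# Two hypothetical partitions of (word, 1) pairs
partition_1 = [("god", 1), ("lord", 1), ("god", 1)]
partition_2 = [("lord", 1), ("god", 1), ("earth", 1)]

# Step 1: reduce within each partition
partial_1 = reduce_by_key(partition_1, add)   # {'god': 2, 'lord': 1}
partial_2 = reduce_by_key(partition_2, add)   # {'lord': 1, 'god': 1, 'earth': 1}

# Step 2: reduce the partial results across partitions
word_count = reduce_by_key(list(partial_1.items()) + list(partial_2.items()), add)
print(word_count)  # {'god': 3, 'lord': 2, 'earth': 1}
```

Because the function is applied in two stages and in no fixed order, it should be associative and commutative, as addition is.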
The takeOrdered() action returns the first n elements of the RDD, using either their natural order or a custom key function. The key advantage of using takeOrdered() instead of first() or take() is that takeOrdered() returns a deterministic result, while the other two actions may return differing results depending on the number of partitions or the execution environment. takeOrdered() returns the list sorted in ascending order. Note that below we use -x[1] as the key to get descending order by count.
bibleRDDwordCount.takeOrdered(10, lambda x : -x[1])
quranRDDwordCount.takeOrdered(10, lambda x : -x[1])
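The effect of negating the key can be checked locally with heapq.nsmallest, which, like takeOrdered(), returns the n smallest elements according to a key function. The counts below are made up for illustration.

```python
import heapq

# Hypothetical (word, count) pairs
word_counts = [("earth", 4), ("god", 9), ("lord", 7), ("light", 2)]

# Ascending by count: the smallest counts come first
least_common = heapq.nsmallest(2, word_counts, key=lambda x: x[1])
print(least_common)  # [('light', 2), ('earth', 4)]

# Negating the count flips the order, so the largest counts come first
most_common = heapq.nsmallest(2, word_counts, key=lambda x: -x[1])
print(most_common)   # [('god', 9), ('lord', 7)]
```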
Next, let's remove stop words from our RDDs. Note that archaic English stop words may not all be included in the NLTK stop word list we are using here.
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
len(stopwords)
stopwords[0:10]
bibleRDDwordCount = bibleRDDwordCount.filter(lambda x : x[0] not in stopwords)
quranRDDwordCount = quranRDDwordCount.filter(lambda x : x[0] not in stopwords)
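The filter above tests membership in a Python list, which scans the list for every word. A possible refinement, sketched here on made-up data rather than on the NLTK list, is to convert the stop words to a set first, since set lookups take constant time on average. The names stop_word_set and word_counts are hypothetical.

```python
# Hypothetical short stop word list
stop_words = ["the", "and", "of", "in"]
stop_word_set = set(stop_words)

# Hypothetical (word, count) pairs
word_counts = [("the", 50), ("god", 9), ("of", 30), ("lord", 7)]

# Same predicate as the filter above, but with a set lookup
filtered = [pair for pair in word_counts if pair[0] not in stop_word_set]
print(filtered)  # [('god', 9), ('lord', 7)]
```

With a stop word list of this size the difference is small, so the list version used in this tutorial is fine too.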
Now, let's see the most frequent words in each RDD after removing the stop words. As shown below, God is the most frequent word in the Quran but the sixth most frequent word in the Bible. In the Bible, the word lord, which usually means God, is the third most frequent word. Lord is the fifth most common word in the Quran.
bibleRDDwordCount.takeOrdered(10, lambda x : -x[1])
quranRDDwordCount.takeOrdered(15, lambda x : -x[1])
But how many unique words do we have now in each RDD?
unique_words_bible = bibleRDDwordCount.count()
unique_words_quran = quranRDDwordCount.count()
print(" The total number of unique words in the bible is {} while the unique number of words in the Quran is {}".\
format(unique_words_bible, unique_words_quran ))
To find the average occurrence of a word, let's find the total number of words and divide that by the number of unique words. Note that stop words have already been removed at this point, so both totals exclude them.
total_words_bible = bibleRDDwordCount.map(lambda a: a[1]).reduce(lambda a, b : a + b)
print("Total number of words in the Bible: {}".format(total_words_bible))
total_words_quran = quranRDDwordCount.map(lambda a: a[1]).reduce(lambda a, b : a + b)
print("Total number of words in the Quran: {}".format(total_words_quran))
Average_word_count_bible = total_words_bible/unique_words_bible
Average_word_count_quran = total_words_quran/unique_words_quran
print('Average word frequency in the Bible is {} while the average word frequency in the Quran is {}'.\
format(round(Average_word_count_bible,1), round(Average_word_count_quran,1)))
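As a quick check of the arithmetic, here is the same calculation on a tiny made-up dictionary of counts: six word occurrences spread over three unique words give an average frequency of 2.0.

```python
# Hypothetical counts after stop word removal
word_counts = {"god": 3, "lord": 2, "earth": 1}

total_words = sum(word_counts.values())          # 3 + 2 + 1 = 6
unique_words = len(word_counts)                  # 3
average_frequency = total_words / unique_words   # 6 / 3

print(round(average_frequency, 1))  # 2.0
```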
We can now analyze the distribution of the words using standard Python libraries such as numpy, pandas and matplotlib.
import numpy as np
Below, we collect the word frequencies from the RDDs into Python lists, convert them to numpy arrays and plot them using matplotlib.
bibleRDDwordCount_numeric_values = bibleRDDwordCount.map(lambda x : x[1]).collect()
quranRDDwordCount_numeric_values = quranRDDwordCount.map(lambda x : x[1]).collect()
We can see the first ten elements of each list as below.
bibleRDDwordCount_numeric_values[:10]
quranRDDwordCount_numeric_values[:10]
Below, we are converting the lists to numpy arrays.
bibleRDDwordCount_numeric_values_np = np.array(bibleRDDwordCount_numeric_values)
quranRDDwordCount_numeric_values_np = np.array(quranRDDwordCount_numeric_values)
Check type of one of them:
type(bibleRDDwordCount_numeric_values_np)
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize = (20, 10))
plt.hist(np.log10(bibleRDDwordCount_numeric_values_np), color = "orange")
plt.title("Distribution of words in the Bible", fontsize = 28)
plt.xlabel("Word frequency (log10)", fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.show()
plt.figure(figsize = (20, 10))
plt.hist(np.log10(quranRDDwordCount_numeric_values_np))
plt.title("Distribution of words in the Quran", fontsize = 28)
plt.xlabel("Word frequency (log10)", fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.show()
From the above histograms, we see that most words occur fewer than 10 times, that is, they fall below 1 on the log10 scale.
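Reading the histograms relies on the fact that a value below 1 on the log10 axis corresponds to a frequency below 10. The sketch below makes that explicit on made-up frequencies, using math.log10 from the standard library; np.log10 applies the same function to each element of an array.

```python
import math

# Hypothetical word frequencies
counts = [1, 1, 2, 3, 5, 8, 40, 250, 4000]

log_counts = [math.log10(c) for c in counts]

# log10(count) < 1 is the same condition as count < 10
below_ten = [c for c, log_c in zip(counts, log_counts) if log_c < 1]
share_below_ten = len(below_ten) / len(counts)

print(below_ten)                   # [1, 1, 2, 3, 5, 8]
print(round(share_below_ten, 2))   # 0.67
```

The same share could be computed on the collected frequency lists to put a number on "most words".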
Now, let's create a dataframe using the top 15 most common words.
import pandas as pd
bible_top15_words = bibleRDDwordCount.takeOrdered(15, lambda x : -x[1])
quran_top15_words = quranRDDwordCount.takeOrdered(15, lambda x : -x[1])
bible_words = [x[0] for x in bible_top15_words]
bible_count = [x[1] for x in bible_top15_words]
bible_dict = {"word": bible_words, "frequency": bible_count}
quran_words = [x[0] for x in quran_top15_words]
quran_count = [x[1] for x in quran_top15_words]
quran_dict = {"word": quran_words, "frequency": quran_count}
df_bible = pd.DataFrame(bible_dict)
df_quran = pd.DataFrame(quran_dict)
df_bible.head()
df_quran.tail()
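The two list comprehensions that separate words from counts can also be written as a single unzip step with zip(*pairs). This is only an alternative sketch on made-up pairs; the resulting dictionary has the same shape as bible_dict and quran_dict above.

```python
# Hypothetical top words as (word, count) pairs, in descending order of count
top_words = [("god", 9), ("lord", 7), ("earth", 4)]

# zip(*pairs) transposes the list of pairs into a tuple of words and a tuple of counts
words, counts = zip(*top_words)

word_dict = {"word": list(words), "frequency": list(counts)}
print(word_dict)  # {'word': ['god', 'lord', 'earth'], 'frequency': [9, 7, 4]}
```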
Finally, let's create a bar chart of the 15 most common words from each scripture.
my_plot = df_bible.plot(figsize = (20, 10),
x = "word", y = "frequency", kind = "barh", legend = False )
my_plot.invert_yaxis()
plt.title("Frequency of the most common words in the Bible", fontsize = 28)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.ylabel("")
plt.show()
my_plot = df_quran.plot(figsize = (20, 10),
x = "word", y = "frequency", kind = "barh", legend = False )
my_plot.invert_yaxis()
plt.title("Frequency of the most common words in the Quran", fontsize = 28)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.ylabel("")
plt.show()
In this tutorial, we analyzed the Bible and the Quran using Spark, particularly the pyspark module. We calculated the average word frequency, the most common words and the distribution of words in each scripture. God is the most frequent word in the Quran but the sixth most frequent word in the Bible. In the Bible, the word lord, which usually means God, is the third most frequent word. Lord is the fifth most common word in the Quran. We also saw that most words occur fewer than 10 times. I plan to post more Spark tutorials, so if you are interested in Spark, stay tuned.