Fisseha Berhane, PhD

Data Scientist


Benefits of and Tips on Hortonworks Apache Spark Certification



Recently, I took the hands-on, performance-based Spark certification exam on the Hortonworks Data Platform (HDPCD: Spark), and in this article I will share the benefits of going through the certification process and some tips on how to prepare for it.

Before attempting the certification, you have to master the methods and functions needed to perform the tasks you will be asked to do during the exam. Even though you are provided with the Apache Spark documentation, believe me, you will not have time to consult it. You will be given seven questions and you have to answer five of them correctly to pass. There is no partial credit: even if you get 90% of the steps in a question right, you receive no points unless the answer is 100% correct. For each question you score either zero or one. Therefore, it is a good idea to attempt all seven questions to increase your probability of passing, because if you answer only five and there is a small error in one of your solutions, you will fail.

Since you have only two hours, the best preparation is to go thoroughly through the Spark documentation and practice examples with the methods and functions it covers. Further, commit as many function names to memory as possible. You take the test in a terminal, so there is no auto-completion like in an IDE, and you have to get the case of functions and methods right. I believe this also helps you in your job: if you are not constantly checking the documentation and copying code snippets, you will be more productive. The more you can do without referring to the documentation, the less time it takes to implement your big data applications.

The other advantage of the certification is related to the breadth of Apache Spark's capabilities. At work, we may use Spark for a certain kind of analysis, relying on a few functions from one of the APIs and a limited set of actions and transformations. For the exam, however, we have to know a wide range of actions and transformations, which gives us the opportunity to explore, learn and master the Apache Spark framework. We may have experience with the DataFrame API or with Spark SQL, but for the exam we have to be comfortable with RDDs, DataFrames and Spark SQL. We have to know how to convert an RDD to a DataFrame, analyze it using the DataFrame API and Spark SQL, and save it to a Hive table or to HDFS. We also have to master converting a DataFrame to an RDD and saving it to HDFS. Being well versed in how the different APIs integrate with low-level RDDs gives us flexibility and makes us more efficient at work.

Some tips

Tip 1:

Be comfortable with converting a DataFrame to an RDD. In PySpark, we can do it as shown below. Here, I am converting the DataFrame to an RDD and separating each field with a tab; DF is the DataFrame. You can do this in a couple of different ways, but the one below is succinct and less error-prone.

In [ ]:
DF.rdd.map(lambda line: '\t'.join([str(item) for item in line]))

The code above does not keep the column names. Hence, if you want to save your result with the column names, you have to use the spark-csv library, which is discussed below. The other important point is that NULL values in the DataFrame become Python None when you convert it to an RDD with the code above. Fields that would be saved as empty by spark-csv will therefore be written as the literal string "None" if you save the RDD created this way without handling them.
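If you also need to save that RDD to HDFS, a minimal sketch could look like the one below; it replaces None with an empty string before joining the fields, and the output path is a placeholder.

In [ ]:
(DF.rdd
   .map(lambda row: '\t'.join('' if item is None else str(item) for item in row))
   .saveAsTextFile('path to output'))   # writes plain text files to HDFS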

Tip 2:

Master how to convert an RDD to a DataFrame. This is very useful, especially when your data has many fields (say 50) and you need only a few of them (say 5) to answer your question. Let's consider tab-delimited data that has 50 fields, and let's assume we want fields 1, 10, 15, 25 and 30 to answer the question at hand. There are a couple of different ways to do this, but the one below could be the easiest. DF is the DataFrame we get from the RDD. Note the Row() function and the toDF() method, and remember that Python uses zero-based indexing. As you can see, I am also casting the columns to appropriate data types.

In [ ]:
from pyspark.sql import Row

rdd = sc.textFile("path to file")

DF = rdd.map(lambda line: Row(field_name1 = float(line.split("\t")[0]),
                              field_name2 = line.split("\t")[9],
                              field_name3 = int(line.split("\t")[14]),
                              field_name4 = line.split("\t")[24],
                              field_name5 = float(line.split("\t")[29]))).toDF()

In Spark 2+, you could solve the problem above with the built-in CSV reader (DataFrameReader.csv), but Hortonworks still administers the test with Spark 1.6. Therefore, you either have to use the approach I showed above or use the spark-csv library. To use spark-csv, you have to start Spark with the necessary package. In PySpark, it should be something like below. In the terminal, write:

In [ ]:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0

Then you can read and write using spark-csv as shown below:

In [ ]:
df = sqlContext.read.format("com.databricks.spark.csv").options(header = 'true', inferSchema = 'true').load('path to file')

df.write.format("com.databricks.spark.csv").options(header = 'true').save('path to output')
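For comparison only, the built-in CSV support mentioned above for Spark 2+ would look roughly like the sketch below; it is not available in the Spark 1.6 exam environment, and the paths are placeholders.

In [ ]:
# Spark 2+ only -- the SparkSession entry point (spark) replaces sqlContext
df = spark.read.csv('path to file', header=True, inferSchema=True)
df.write.csv('path to output', header=True)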

Even when you use spark-csv, if you only need some of the fields to answer your question, the RDD-to-DataFrame approach shown above could be easier. Otherwise, you either have to pass a schema while reading in the data or cast the data types one by one, because the inferSchema option used above may not be reliable. It also takes more time on a large volume of data, since Spark has to scan the data to infer the schema. Remember, if you do not care about the data types, you can read the data without inferSchema; it is false by default and all columns are read as strings. If you want to pass a schema for data that has many fields, it may not be an enjoyable experience, but it should look something like this:

In [ ]:
from pyspark.sql.types import *

my_schema  = StructType([StructField("field_name1", FloatType(), True),
                    StructField("field_name2", StringType(), True),
                    ...
                    ...
                    ..
                    StructField("field_name48", DoubleType(), True),
                    StructField("field_name49", StringType(), True )
                    ])

Then, you can pass the schema as below.

In [ ]:
df = sqlContext.read.format("com.databricks.spark.csv").options(header = 'true').load('path to file', schema = my_schema)

Actually, if you need only some of the columns, you do not need to specify the data type of every one of them. In the example below, I have given meaningful names and data types only to the columns I need to answer my question; the rest I declared as StringType with simple placeholder names so it does not take much time.

In [ ]:
my_schema  = StructType([StructField("country", StringType(), True),
                    StructField("col2", StringType(), True),
                    StructField("col3", StringType(), True),
                    StructField("population", IntegerType(), True),
                    StructField("col5", StringType(), True),
                    StructField("col6", StringType(), True),
                    ...
                    ...
                    StructField("GDP", DoubleType(), True),
                    StructField("perCapita", DoubleType(), True),
                    ..
                    ..
                    StructField("developed", BooleanType(), True),
                    StructField("field_name49", StringType(), True )
                    ])

Now, you can process the data using the DataFrame API or Spark SQL (through Hive), save it in various formats including ORC, Parquet and JSON, or insert it into a Hive table.
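As an illustration, a minimal sketch with the Spark 1.6 API is shown below; the table name, query and output paths are hypothetical, and the column names come from the schema above.

In [ ]:
df.registerTempTable("countries")   # Spark 1.6 API; use createOrReplaceTempView in Spark 2+

result = sqlContext.sql("SELECT country, population, GDP FROM countries WHERE developed = true")

# ORC and Hive tables need a HiveContext in Spark 1.6 (the default sqlContext on HDP)
result.write.format("orc").save("path to output")             # or "parquet", "json"
result.write.saveAsTable("my_database.developed_countries")   # save as a Hive table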

Tip 3:

Know the caveats of accumulators.

  1. Accumulator updates inside transformations such as map, filter, etc., are not applied immediately, because transformations are lazy.
  2. An action has to be called for the accumulator updates to be executed.
  3. Updates made inside a transformation may be applied more than once if a task is re-executed, so accumulators are only guaranteed to be reliable when updated inside actions such as foreach.

One simple use of an accumulator in PySpark is with the foreach action, as shown below.

In [ ]:
acc = sc.accumulator(0)

def my_accumulator(line):
    if "error" in line:        # placeholder condition -- replace with your own check
        acc.add(1)

rdd.foreach(my_accumulator)    # rdd is the RDD I want to use the accumulator with

Now we can access the accumulator value as below:

In [ ]:
acc.value

Tip 4:

Start with the easier questions. Solving them first will boost your confidence, and since you have to answer five questions 100% correctly to pass the exam, it is wise to make sure you get the easier ones right. You will be given two hours to solve seven questions, and attempting all of them reduces the probability of failing.

Finally,

  1. Know how to use spark-submit to run both packaged (jar) Spark applications and Python/Scala scripts on YARN, in both client and cluster deploy modes (see the sketch after this list).
  2. Time is short, so practice well before attempting the test.
  3. Have a large screen and a fast Internet connection. The test can be a frustrating experience if your Internet is slow or your screen is small, such as on a laptop: the font size is small and you have to keep a couple of windows open during the test. Therefore, I recommend a large screen.
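For example, the spark-submit invocations might look roughly like the sketch below; the script, jar and class names are placeholders.

In [ ]:
# Python script on YARN, client mode
spark-submit --master yarn --deploy-mode client my_script.py

# packaged (jar) application on YARN, cluster mode
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my_app.jar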

If you have any questions, feel free to drop them below or send me an email.
