# Fisseha Berhane, PhD

## Benefits of and Tips on Hortonworks Apache Spark Certification

### Some tips

Tip 1:

Be comfortable with transforming a DataFrame into an RDD. In PySpark, we can do it as shown below. Here, DF is the DataFrame, and I am converting it to an RDD in which the fields of each record are separated by tabs. You can do this in a couple of different ways, but the one below is succinct and less error-prone.

In [ ]:
# join the fields of each Row with tabs; DF is the DataFrame
DF.rdd.map(lambda line: '\t'.join([str(item) for item in line]))


The code above does not keep the column names, so if you want to save your result with a header you have to use the spark-csv library, which is discussed below. Another important point: NULL values in the DataFrame become Python None when you convert it to an RDD this way. Fields that would be saved as empty by spark-csv will therefore be written as the string "None" if you save the RDD above as text.
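
The None-versus-empty-field difference is easy to check without Spark. In the sketch below, a plain tuple with made-up values stands in for a Row coming out of DF.rdd:

```python
# a Row with a NULL middle field, as it would come out of DF.rdd
row = (1, None, 'Ethiopia')

# what the '\t'.join(...) above produces: NULL becomes the string 'None'
joined = '\t'.join([str(item) for item in row])

# if you want NULLs saved as empty fields instead (matching spark-csv),
# map None to '' before converting to str
joined_empty = '\t'.join(['' if item is None else str(item) for item in row])
```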

Tip 2:

Master how to convert an RDD to a DataFrame. This is very useful, especially when your data has many fields (say 50) and you need only a few of them (say 5) to answer your question. Let's consider tab-delimited data with 50 fields, and assume we want fields 1, 10, 15, 25, and 30. There are a couple of ways to do this, but the one below may be the easiest. DF is the DataFrame we get from the RDD; note the Row() function and the toDF() method, and remember that Python uses zero-based indexing. As you can see, I am also casting the columns to appropriate data types.

In [ ]:
from pyspark.sql import Row

rdd = sc.textFile("path to file")

DF = rdd.map(lambda line: line.split("\t")) \
        .map(lambda fields: Row(field_name1 = float(fields[0]),
                                field_name2 = fields[9],
                                field_name3 = int(fields[14]),
                                field_name4 = fields[24],
                                field_name5 = float(fields[29]))).toDF()
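
The indexing and casts can be checked without Spark. The sketch below applies the same extraction to one fake tab-delimited record; the field names are the placeholders from above and the data is made up:

```python
# one fake record with 50 tab-separated fields: '0', '1', ..., '49'
sample_line = '\t'.join(str(i) for i in range(50))

fields = sample_line.split('\t')
record = {
    'field_name1': float(fields[0]),   # field 1  -> index 0
    'field_name2': fields[9],          # field 10 -> index 9
    'field_name3': int(fields[14]),    # field 15 -> index 14
    'field_name4': fields[24],         # field 25 -> index 24
    'field_name5': float(fields[29]),  # field 30 -> index 29
}
```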


In Spark 2+, you could solve the problem above with DataFrameReader.csv, but Hortonworks is still giving the test with Spark 1.6. Therefore, you either have to use the approach shown above or the spark-csv library. To use spark-csv, you have to start Spark with the necessary package. In PySpark, it should be something like below. In the terminal, write:

In [ ]:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0


Then you can read data with spark-csv as below:

In [ ]:
df = sqlContext.read.format("com.databricks.spark.csv").options(header = 'true', inferSchema = 'true').load('path to file')



Even when you use spark-csv, if you only need some of the fields to answer your question, the RDD-to-DataFrame approach shown above may be easier. Otherwise, you either have to pass a schema while reading the data or cast the data types one by one, because the schema inference above may not always be reliable. It also takes more time on a large volume of data, since Spark scans the data to infer the schema. Remember, if you do not care about the data types, you can read the data without setting inferSchema; it is false by default and all columns are read as strings. Passing a schema for data with many fields may not be an enjoyable experience, but it should look something like this:

In [ ]:
from pyspark.sql.types import *

my_schema = StructType([StructField("field_name1", FloatType(), True),
                        StructField("field_name2", StringType(), True),
                        ...
                        StructField("field_name48", DoubleType(), True),
                        StructField("field_name49", StringType(), True)])


Then, you can pass the schema as below.

In [ ]:
df = sqlContext.read.format("com.databricks.spark.csv").options(header = 'true').load('path to file', schema = my_schema)


Actually, if you want only some of the columns, you do not need to specify the data type of each one. In the example below, I have given proper names and data types only to the columns I need to answer my question. The rest I passed as StringType with generic names, so it does not take me much time.

In [ ]:
my_schema = StructType([StructField("country", StringType(), True),
                        StructField("col2", StringType(), True),
                        StructField("col3", StringType(), True),
                        StructField("population", IntegerType(), True),
                        StructField("col5", StringType(), True),
                        StructField("col6", StringType(), True),
                        ...
                        StructField("GDP", DoubleType(), True),
                        StructField("perCapita", DoubleType(), True),
                        ...
                        StructField("developed", BooleanType(), True),
                        StructField("field_name49", StringType(), True)])


Now you can process the data using the DataFrame API or Spark SQL with Hive, and save it in various formats, including ORC, Parquet, and JSON, or insert it into a Hive table.
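
As a rough sketch of saving in those formats with the Spark 1.6 DataFrameWriter API (the paths and table name below are placeholders, not real locations):

```python
def save_outputs(df):
    """Save a processed DataFrame in several formats (Spark 1.6 API).

    df is assumed to be the processed DataFrame; paths and the table
    name are placeholders.
    """
    df.write.format("orc").save("path to orc output")
    df.write.format("parquet").save("path to parquet output")
    df.write.format("json").save("path to json output")
    # inserting into a Hive table requires a HiveContext-backed DataFrame
    df.write.mode("append").saveAsTable("my_db.my_table")
```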

Tip 3:

Know the caveats of accumulators.

1. Accumulator updates inside transformations such as map and filter are not applied when the transformation is defined, because transformations are lazy.
2. An action has to be called for the accumulator updates to actually execute.
3. When an accumulator is updated inside a transformation, the update may be applied more than once if a task is re-executed, so use accumulators inside transformations only for debugging; updates inside actions are applied exactly once.

One simple use of an accumulator in PySpark is with the foreach action, as shown below.

In [ ]:
acc = sc.accumulator(0)

def my_accumulator(line):
    if some_condition(line):   # replace with the condition you care about
        acc.add(1)

rdd.foreach(my_accumulator)   # rdd is the RDD that I want to use the accumulator with


Now we can access the accumulator value as below:

In [ ]:
acc.value

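Caveats 1 and 2 can be seen without Spark: a lazy pipeline, like an RDD transformation, does not run until it is consumed, like an action. The plain-Python analog below uses a generator in place of map() and a dict counter in place of the accumulator:

```python
# a plain-Python stand-in for an accumulator
counter = {'updates': 0}

def bump(x):
    counter['updates'] += 1   # stands in for acc.add(1) inside map()
    return x

lazy = (bump(x) for x in range(5))   # like rdd.map(...): nothing has run yet
before = counter['updates']          # still 0

consumed = list(lazy)                # like calling an action, e.g. count()
after = counter['updates']           # now 5
```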

Tip 4:

Start with the easier questions. Solving them first will boost your confidence, and since you have to answer five of the seven questions 100% correctly to pass the exam, it is wise to make sure you get the easier ones right. You will be given two hours; it is a good idea to attempt all seven questions to reduce the probability of failing.

Finally,

1. Know how to use spark-submit to run a Spark application packaged as a JAR, as well as Python and Scala scripts, on YARN in both client and cluster modes.
2. Time is short, so practice well before attempting the test.
3. Have a large screen and a fast Internet connection. The test can be a frustrating experience with slow Internet or a small screen such as a laptop's: the font size is small and you have to keep a couple of windows open. Therefore, I recommend a large screen.
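
As a sketch, the spark-submit invocations look like the following; the class name, file names, and arguments are illustrative placeholders, not from the exam:

```shell
# Scala/Java application packaged as a JAR, on YARN in cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar arg1 arg2

# Python script on YARN in client mode
spark-submit --master yarn --deploy-mode client my_script.py arg1
```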

If you have any questions, feel free to drop them below or send me an email.