Things you should know about Spark, part 1: the basics


This post is part of the Hadoop Series.

This post and the next couple of posts cover the basics of Spark and other topics you should know to use Spark correctly. The Spark version covered here is 2.1.1 and later.

You might have heard that Spark does all of its processing in memory and is therefore extremely fast. Indeed, it is fast when you know how Spark works and configure it correctly. But it is often overkill: many data processing tasks can be done with Hive alone, and in some cases a lot faster. So the first question you should ask yourself is:

Should I use Hive or Spark?

If you can get the data you need with Hive SQL, use Hive. If the Hive SQL for your processing logic becomes extremely complicated, consider using Spark. Spark is good for fast processing of large datasets, and when Hive is slow, a correctly written Spark job can be faster. But Spark is not easy to master, so you should understand the basics before you go ahead and write Spark code.

The Basics

Where can Spark run?

Spark can run standalone on a single machine, on a cluster of servers built just for Spark, or on a Hadoop cluster. The most common case is to run on Hadoop so that the same cluster can serve multiple types of workloads such as MapReduce, Hive, Spark, and Pig. We'll only cover Spark running on a Hadoop cluster with YARN as the resource manager.

YARN? Job? Application?

YARN is the resource manager in most Hadoop distributions. It allocates the cluster's compute resources (memory and CPU) to frameworks such as Hive and Spark, which in turn read and write data on HDFS, and it orchestrates the jobs running in the cluster by spinning containers up and shutting them down.

In YARN's world, everything runs in YARN containers. A YARN application can be a MapReduce job, a Hive query, or a Spark session; they are all applications from YARN's point of view. Each application can have many containers working together to process data. In the scope of this post, we only consider Spark applications.

Each application submitted to the Hadoop cluster is assigned a unique application ID. You can use this ID later to kill the application if needed.

The cluster is configured with a minimum container memory size; we'll assume it is 5 GB in this post. Each container gets the minimum memory if no size is specified when the application is submitted.
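If you are curious what the minimum is on your cluster, it is controlled by the yarn.scheduler.minimum-allocation-mb property in yarn-site.xml. A minimal check, assuming the typical /etc/hadoop/conf location (your cluster's config path may differ):

# Print the minimum YARN container allocation (value is in MB)
grep -A 1 "yarn.scheduler.minimum-allocation-mb" /etc/hadoop/conf/yarn-site.xml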

Below are some useful commands to interact with YARN:

  • List all YARN applications:
    yarn application -list
  • Kill an application:
    yarn application -kill <application_id>
    where <application_id> can be obtained from the first command or right after an application is submitted to YARN.
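A related command that often comes in handy once you have the application ID (the ID below is a placeholder): yarn logs pulls the aggregated container logs of a finished or killed application, which helps when you need to see why it failed.

# Fetch the aggregated logs of a finished or killed application
yarn logs -applicationId <application_id>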

YARN Queue

In an enterprise Hadoop cluster, YARN queues are set up for different teams. Queues are defined as a percentage of the cluster's total memory or CPU. Each queue can be configured with an Active Directory group to control who can submit jobs to it. In this setup, in order to run any job (Hive, Spark) a user needs access to a queue through the Active Directory group configured by the cluster admin.
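The queue is chosen at submission time. A minimal sketch, where the queue name and script name are placeholders for whatever your cluster admin and project use:

# Submit a Spark job to a specific YARN queue
spark-submit --master yarn --queue my_team_queue my_job.py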

Spark Session

A Spark session is what we'll call a Spark application that has been submitted to, and runs in, a certain queue inside the Hadoop cluster.

A Spark session has a number of containers working together to process data. There are two kinds of containers in a Spark session: the driver and the executors. The Spark driver (master) container manages the entire session and communicates with the worker containers (executors).

Where is the Spark driver running?

There are two ways (modes) a Spark session can be set up, depending on where the Spark driver runs.

  • Client mode. When the Spark driver runs on the Hadoop server (client node) from which the user submits the Spark job, the session is said to run in yarn-client mode. The benefit of this mode is that the driver's output is printed to your terminal, which makes the code easy to debug.
  • Cluster mode. The Spark driver runs on a data node inside the cluster. There is no interaction with the user, which makes this mode a good fit for scheduled production jobs.

The only difference between these two modes is where the driver runs.
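The mode is chosen at submission time. A minimal sketch with a placeholder script name (note that in Spark 2.x, --master yarn combined with --deploy-mode replaces the older yarn-client and yarn-cluster master values):

# Client mode: driver runs on the machine you submit from (easy to debug)
spark-submit --master yarn --deploy-mode client my_job.py

# Cluster mode: driver runs inside the cluster (good for scheduled production jobs)
spark-submit --master yarn --deploy-mode cluster my_job.py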

Large Driver Memory = Faster execution?

The answer is no! Do not use a driver with more than 5 GB of memory. Most of the time you don't need it, and you would just be holding memory on the Hadoop server. It is important that you don't use a large driver by default. The only time the driver needs a lot of memory is when you run in client mode and pull a large amount of data back to the driver, for example with toPandas() to build a Pandas dataframe.
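As an illustrative sketch (the memory values and script names are made up), the two cases look like this:

# Typical job: keep the driver small
spark-submit --master yarn --deploy-mode client --driver-memory 2g my_job.py

# Only when a large result is collected back to the driver, e.g. with toPandas()
spark-submit --master yarn --deploy-mode client --driver-memory 10g my_collect_job.py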

How much executor memory?

Since executors run on data nodes, memory is less of a concern there. But start small! Start with 5 GB (remember the minimum container memory size?) and increase in multiples of 5 GB.
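For example (values are illustrative only):

# Start at the minimum container size...
spark-submit --master yarn --executor-memory 5g my_job.py

# ...and only step up in 5 GB increments if the job really needs it
spark-submit --master yarn --executor-memory 10g my_job.py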

Be aware of zombie Spark processes

When running in client mode, the driver runs on the Hadoop server. If your Spark job is killed by the cluster or by administrators because it has been running for a very long time (more than 15 hours, say), the driver process is most likely still running on the Hadoop server where the Spark code was submitted. You need to kill the driver process manually.

The following command will print out your Spark processes on that server:

ps -ef | grep $USER | grep spark

The output may look like:

john 1447 1300 6 Jan 19 ? 00:08:23 java org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=10g --conf spark.dynamicAllocation.executorIdleTimeout=60 --conf spark.master=yarn-client --conf spark.driver.memory=10g --conf spark.sql.catalogImplementation=hive --conf spark.scheduler.minRegisteredResourcesRatio=1 --conf spark.app.name=spark-cluster-pyconnect --conf spark.ui.port=9998 --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=100 --conf spark.dynamicAllocation.minExecutors=5 --conf spark.dynamicAllocation.initialExecutors=5 pyspark-shell

The above Spark job also has an issue: a large driver. 10 GB is too much!

Once you confirm that you no longer need it, kill the process using the second field of the output, 1447 (the PID):

kill -9 1447

Spark UI

Due to the nature of big data, a Spark job can run for a very long time before it finishes or errors out. The first thing you should do after submitting a Spark job is to check its progress in the Spark UI. You can find the Spark UI from the YARN Resource Manager UI by looking for the job you just submitted and following the link in the ApplicationMaster column.
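If you have the application ID handy, you can also get the link to the Spark UI from the command line; the application report printed by the command below includes a Tracking-URL field that points to the UI while the job is running (the ID is a placeholder):

# Print the application report, including its Tracking-URL
yarn application -status <application_id>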

How to determine the size of a Spark session?

Below are some steps to size a Spark session.

First, find out the sizes of all the Hive tables you'll be using in the Spark session (see the Hive post for details). One thing to pay attention to: the size reported by the hadoop fs command is not necessarily the size of the data once it is read into memory; it is only the space the data currently occupies on HDFS. If the data is compressed, its size when read into memory can be several times the compressed size.
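A quick way to check the on-disk size of a table is to sum up its HDFS directory; the warehouse path below is a common default and may be different on your cluster:

# Human-readable total size of the table's files on HDFS (on-disk, possibly compressed)
hadoop fs -du -s -h /apps/hive/warehouse/mydb.db/my_table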

Do the math

The Spark session's total executor memory should be at least 3 times the size of the data to process.

  • Driver memory: you don't need a large driver to process data. Keep it at 2 GB if you can.
  • Executor memory: start with the minimum YARN container size in the cluster, for example 5 GB. If you end up needing more than 20 executors, you can instead increase executor memory in 5 GB steps, up to 20 GB. If executor memory is too large, the cluster might not be able to accommodate the request, and the Spark job will sit in the ACCEPTED state for a very long time.
  • Executor cores: increasing executor cores to 2 or 3 will give you some more speed, as in the worked example below. DO NOT use more than 4 executor cores!
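As a worked example with made-up numbers: say the tables you read add up to roughly 50 GB once decompressed in memory; three times that is 150 GB of total executor memory, which 30 executors of 5 GB each would cover. A sketch of the corresponding submission (queue and script names are placeholders):

# ~150 GB total executor memory = 30 executors x 5 GB, 2 cores each, small driver
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue my_team_queue \
  --driver-memory 2g \
  --executor-memory 5g \
  --executor-cores 2 \
  --num-executors 30 \
  my_job.py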

Basic Principles

  • Do not use the same configuration (number of executors, executor memory) for every Spark job. These numbers should be evaluated separately for each job.
  • Do not use large driver memory (> 4 GB). It eats into the limited memory of the server you submit from instead of using the Hadoop cluster.
  • Do not set the maximum number of executors to more than 200. You can increase executor cores to get more parallelism.
  • Use the executors, not the driver, to process data.

What’s next

Now that we are done with the basics, move on to the next post for more in-depth content on configuration and more.
