A Scenic Route through PySpark Internals

A Data Analyst’s account.

I had been using Apache Spark’s Python API, a.k.a. PySpark, for over a year when one day I ran into this error:

We’ve all seen this one, right?

Usually this leads to me trawling through similar errors on Stack Overflow, trying the best-voted solutions until I find one that works. At which point I heave a well-deserved sigh and head towards the coffee machine.

Not this time.

After spending what seemed like an eternity jiggling different combinations of whatever knobs I stumbled upon, it finally dawned on me: the day had come for me to actually read the PySpark distribution code and try to figure out how the damned thing really works. This is my attempt to chronicle what followed.

Thing is, the Spark documentation, detailed though it is, does not explore the inner workings of PySpark. What’s more, I could not find a single blog, video or any other resource that would hand-hold one through the actual lines of code. For now, I was on my own.

The PySpark API is a “very thin layer” on top of the Java API for Spark, which itself is only a wrapper around the core Scala functionality. What this means is that when you use the PySpark API, even though the actual ‘data-processing’ is done by Python processes, data persistence and transfer are still handled by the Spark JVM. Things like scheduling (both DAG and task), broadcast, networking, fault recovery etc. are all reused from the core Scala Spark package. This also means that the Java and Python processes need a means to talk to each other.
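You can see how thin that layer is by peeking at the JVM objects a PySpark object carries around. A quick sketch (the underscore-prefixed attributes are internals, not a stable API):

```python
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="thin_layer_peek")
rdd = sc.parallelize(range(10))

# The Python SparkContext holds a reference to a JavaSparkContext in the JVM...
print(sc._jsc)      # a Py4j proxy for org.apache.spark.api.java.JavaSparkContext

# ...and every Python RDD wraps a reference to an RDD living on the JVM side.
print(rdd._jrdd)    # a Py4j proxy as well

sc.stop()
```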

The data-flow diagram in the PySpark Internals wiki gives a brief overview of the data transfers involved in running a typical PySpark application. In short: in the Python driver program, the SparkContext uses Py4j to launch a JVM and create a JavaSparkContext; Py4j is only used for this local, driver-side communication, large data transfers happen through a different mechanism, and RDDs of pickled Python objects show up on the JVM side as special PythonRDD objects.

Curiouser and curiouser. The explanation answers some questions but begs new ones. What is Py4j? Do I run two Spark contexts in one PySpark app? And why are there special PythonRDD objects? Time to crack open the hood.

But where does one even begin? Let’s take it one step at a time. What is the one piece of code that signals to you that you are now in the realm of Spark? Probably something like this:
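A minimal sketch (the app name and master URL here are just placeholders):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("scenic_route").setMaster("local[*]")
sc = SparkContext(conf=conf)
```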

Let’s locate the pertinent code (I’m using Spark 2.3.2, but all the code in this post should work with any 2.x.y version):

SparkContext apparently belongs to a `context` module in pyspark
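The constructor in `pyspark/context.py` looks roughly like this in the 2.x line (abridged, with the body elided):

```python
# Abridged sketch of SparkContext in pyspark/context.py (2.x line).
from pyspark.serializers import PickleSerializer
from pyspark.profiler import BasicProfiler


class SparkContext(object):
    """Main entry point for Spark functionality. A SparkContext represents the
    connection to a Spark cluster, and can be used to create RDDs and broadcast
    variables on that cluster."""

    def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
                 environment=None, batchSize=0, serializer=PickleSerializer(),
                 conf=None, gateway=None, jsc=None, profiler_cls=BasicProfiler):
        """Create a new SparkContext. At least the master and app name should
        be set, either through the named parameters here or through `conf`."""
        ...
```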

OK, like the doc says, all you really need to create a SparkContext is an appName and a master URL. You can provide these either via a SparkConf object (as we did in the code above) or through named parameters.

This is valid too.
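For instance (master and app name are again placeholders):

```python
from pyspark import SparkContext

# Same effect as the SparkConf version above: master and app name
# passed directly as named parameters.
sc = SparkContext(master="local[*]", appName="scenic_route")
```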

It also optionally takes as parameters a gateway and a jsc (a Java Spark Context). We’ll come back to these in a minute.

First, a brief overview of what goes on when you create a new SparkContext. I’ll rely on a pseudo-code-like format, hoping to gain in brevity and clarity what I lose in accuracy. Keep in mind the ‘Local’ half of the data-flow diagram we discussed, and try to identify the portions where an arrow is being set up. Here goes:

Note: I’ve also removed the gateway and jsc parameters to simplify my representation. Simply put, if these are provided when creating the SparkContext, we won’t create new ones. This also helps us capture a typical SparkContext invocation.
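In rough pseudo-code (this is my paraphrase, not the real source; the helper names are invented and error handling is dropped):

```python
# Pseudo-code for SparkContext.__init__ (heavily abridged paraphrase).
def SparkContext__init__(self, conf):
    # Only one SparkContext may be active per driver/JVM; complain otherwise.
    ensure_no_other_active_context()

    # Launch a JVM (via spark-submit) and connect a Py4j gateway to it.
    self._gateway = launch_gateway(conf)
    self._jvm = self._gateway.jvm

    # Ask that JVM, through the gateway, to build the JavaSparkContext.
    self._jsc = self._jvm.JavaSparkContext(java_version_of(conf))

    # Create a local temp directory, used when larger chunks of data or
    # files need to be shipped over to the JVM.
    self._temp_dir = create_local_temp_dir()
```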

Okay, so every SparkContext (the big white box in the diagram) has an associated gateway (the grey box marked Py4j), and that gateway is linked with a JVM. There can only be one SparkContext per JVM. And we somehow associate a JavaSparkContext (the inner grey box) with the JVM. Also, we create a temp folder to help with the bigger data transfers.

Makes sense? Kind of. But what is a gateway and why do we need it? Let’s dig deeper:

Again, with the pseudo-code:
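Here is my pseudo-code paraphrase (helper names invented; see `pyspark/java_gateway.py` for the real thing):

```python
# Pseudo-code for pyspark.java_gateway.launch_gateway (abridged paraphrase).
def launch_gateway(conf):
    # Spawn a JVM by running the spark-submit script with the special
    # 'pyspark-shell' resource; that JVM starts a Py4j GatewayServer.
    jvm_process = spawn("spark-submit", "pyspark-shell")

    # The JVM picks a free port for the GatewayServer and reports it back,
    # and the Python side reads it.
    port = read_port_reported_by(jvm_process)

    # Connect a Py4j JavaGateway to that port: a plain local socket
    # connection between this Python process and the JVM.
    gateway = connect_java_gateway(port)

    # Import the Spark Java/Scala classes we'll need (JavaSparkContext,
    # SparkConf, the python.* helpers) so they're reachable as gateway.jvm.<name>.
    import_spark_classes(gateway.jvm)
    return gateway
```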

Done? Ok. I’ll just reiterate a couple of salient points about Py4j:
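In sketch form, runnable in a PySpark shell (the underscore-prefixed attributes are internals, so treat this as a peek rather than an API):

```python
# Py4j is a general-purpose library, not something Spark-specific: it lets a
# Python process reach into a running JVM and drive Java objects over a local
# socket connection.
gateway = sc._gateway
print(type(gateway))        # py4j.java_gateway.JavaGateway

# Through the gateway you can touch anything on the JVM's classpath,
# Spark-related or not.
print(sc._jvm.java.lang.Math.max(40, 2))                      # 42, computed in the JVM
print(sc._jvm.java.lang.System.getProperty("java.version"))   # the JVM's version string
```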

So, is this JavaSparkContext the ‘real’ Spark context, then? Well… not really. But, in a very loose sense, it could be considered that. It is a part of the Java API for Spark and a way for the JVM to access the core Scala packages. And, since the Python API sits on top of the Java API, it is Python’s entry point into functionality written in Scala.
Convinced? Yes? Cool.

Follow the JavaSparkContext one level further down and you land on the Scala SparkContext, whose doc also opens with “Main entry point for Spark functionality”. Hmm… where have I read that before? Yes, you have now reached Castle Aaargh. This is the actual context on top of which the “very thin layer” of PySpark sits. Pay your respects and leave before the guard blows his nose in your general direction.
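You can even poke at it from Python: the JavaSparkContext exposes the Scala SparkContext it wraps via its `sc()` method (internals again, so a peek, not an API):

```python
# In a PySpark shell: sc._jsc is the JavaSparkContext living in the JVM,
# and JavaSparkContext.sc() hands back the Scala SparkContext it wraps.
scala_sc = sc._jsc.sc()
print(scala_sc.appName())   # same app name, reported by the Scala object
print(scala_sc.version())   # same Spark version
```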

So, hopefully things seem a bit clearer now. Probably the error traceback makes a bit more sense? No? Then open up the source code and read through the myriad things this post glossed over.

Also, keep in mind that we only discussed the ‘Local’ half of the data flow in a PySpark app. If you want an explanation of why Spark written in Python is so much slower than Scala, you need to explore the ‘Cluster’ half. Might I suggest opening up the worker and rdd modules? Try to understand, in particular, the cases where you incur serialization (and deserialization) costs. Read about Unix pipes.
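To get a feel for where those costs show up, here is a rough illustration (local mode; the contrast, not the numbers, is the point):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("serde_sketch").getOrCreate()
df = spark.range(1000000)

# Stays in the JVM: the filter is planned and executed by the Scala engine,
# so no Python worker processes or pickling are involved.
jvm_only = df.filter(df.id % 2 == 0).count()

# Crosses the boundary: each partition gets serialized, piped to a Python
# worker process, deserialized, run through the lambda, and serialized back.
python_side = df.rdd.map(lambda row: row.id * 2).count()

spark.stop()
```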

Path to the worker and rdd modules: python/pyspark/worker.py and python/pyspark/rdd.py in the Spark source tree.

Which brings me to one final point. Now that we know how PySpark interacts with the underlying JVM, what’s stopping us from using custom Scala libraries in our PySpark code?

There could be a number of reasons you would need to do this. Probably there is a Scala library that the engineering team whipped up which is much faster (albeit less readable) than the one you wrote in Python. Or maybe you want a shared data abstraction layer between the jobs written in Python and those written in Scala.

To get started, you need a small Scala library on the JVM side, say an object exposing the method you want to call from Python, compiled and packaged into a jar. We are going to use this function in our PySpark shell (you could do the same thing via spark-submit). Launch your PySpark shell with that jar on the classpath, for example with `pyspark --jars /path/to/your-library.jar`.

Done? Then follow along with the code below.
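A minimal sketch, assuming you packaged a Scala object `com.example.StringUtils` with a method `reverse(s: String): String` into the jar you put on the classpath (every name in this snippet is hypothetical):

```python
# Inside the PySpark shell, `sc` already exists, and its Py4j gateway exposes
# every class on the JVM's classpath under sc._jvm.
scala_utils = sc._jvm.com.example.StringUtils    # hypothetical Scala object

# Py4j converts the Python str to a java.lang.String, invokes the method in
# the JVM, and converts the result back.
print(scala_utils.reverse("pyspark"))            # -> 'krapsyp'
```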

Try doing the same thing with a Spark SQL DataFrame.
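The wrinkle with DataFrames is that you hand the Scala side the underlying JVM object via the DataFrame’s `_jdf` attribute and wrap whatever comes back. A rough sketch, assuming a hypothetical Scala method `com.example.DfUtils.addGreeting(df: DataFrame): DataFrame` is on the classpath:

```python
from pyspark.sql import DataFrame

# In the PySpark shell, `spark` and `sc` already exist.
df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])

# Pass the underlying JVM DataFrame to the (hypothetical) Scala method...
result_jdf = sc._jvm.com.example.DfUtils.addGreeting(df._jdf)

# ...and wrap the JVM object it returns back into a Python DataFrame.
result = DataFrame(result_jdf, df.sql_ctx)
result.show()
```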

Congrats! You’ve now used a Spark Scala library in Python. And that is what the PySpark API is all about.
