spark-notes

This project is maintained by spoddutur

Did you ever thought of updating or re-broadcasting a broadcast variable?

Why would we need this?

Dealing with such streaming applications which need a way to weave (filter, map etc) the streaming data using a changing reference data (from DB, files etc) has become a relatively common use-case.

Is this requirement only a relatively-common use-case?

I believe that this is more than just being a relatively-common use-case in the world of Machine Learning applications or Active Learning systems. Let me illustrate the situations which will help us understand this necessity:

What is common in both these cases?

The reference data, be it the cluster centroids or the phrases mined, in both the tasks would need to:

  1. Broadcast it to have a local cached copy per executor and
  2. Iteratively keep refining this broadcasted cache.
  3. Most importantly, the reference-data that we are learning/mining is very small.

For the cases discussed above, one would think that we want a way to broadcast our periodically changing reference data. But, given that such cases have very small sized reference data, is it really needed to have a local copy per executor? Let’s see alternative perspectives in which we can think to handle such cases.

Why should we not think of workarounds to update broadcast variable?

Before going further into alternative perspectives, please do note that the Broadcast object is not Serializable and needs to be final. So, stop thinking about or searching for a solution to update it.

Now, hopefully, you are also in the same page as me to stop thinking of modifying a broadcast variable. Let’s explore the right approaches to handle such cases..

Solutions:

In this blog, am going to discuss how to handle following listed requirements:

  1. How to keep track of periodically changing ref-data as more and more data is seen.
  2. How to keep track of periodically changing ref-data + weave it with our streaming application.

I’ve discussed two completely contrasting solutions covering this spectrum as listed below:

1. Spark-Native Solutions:

I’ve discussed how to handle the two problems at hand within Spark:

Hope this helps!!