spark-notes

This project is maintained by spoddutur

Continuation from part2

Weaving periodically changing cached-data into your streaming application…

Problem Statement

In part2, we saw how to keep track of periodically changing cached-data. Now, we want to not only track it but also weave it (i.e., apply filter and map transformations) into the input DataFrames or RDDs of our streaming application.

Nomenclature: In this blog, I refer to cached-data interchangeably as refdata or reference-data.

Naive thoughts to handle this:

I’ve noticed people thinking of crazy ideas such as:

Spark-Native Solution:

I’ve proposed a spark-native solution that handles this case in a much more sensible manner than the naive approaches mentioned above. The table of contents below walks you through the solution:

  1. Core idea: Short briefing of the solution
  2. Demo: Implementation of the solution
  3. Theory: Discuss the theory to understand the workings behind this spark-native solution better.
  4. Conclusion: Lastly, I’ll conclude with some key takeaways from the spark-native solutions discussed in part2 and part3.

1. Core Idea:

Per every batch:

  a. Broadcast the cached-data, with any latest changes to it.
  b. Update our input using this broadcasted (latest) cached-data.
  c. Lastly, unpersist the broadcast (not the cached-data itself, but the broadcasted copy of it).

Repeat steps a through c for every batch.
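These steps can be sketched as a minimal, self-contained simulation, where a plain Scala loop stands in for the micro-batches; the local SparkContext, the Map used as cached-data, and all names here are my own illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("refdata-demo"))
var refData = Map("a" -> 1, "b" -> 2)             // the periodically changing cached-data
val batchOutputs = ListBuffer[Map[String, Int]]() // collects each batch's result for inspection

for (batch <- 1 to 3) {
  val broadcasted = sc.broadcast(refData)         // (a) broadcast the latest cached-data
  val updated = sc.parallelize(Seq("a", "b", "c"))
    .map(key => (key, broadcasted.value.getOrElse(key, 0))) // (b) update input using it
    .collect()
  broadcasted.unpersist()                         // (c) remove the broadcast, not the cached-data
  batchOutputs += updated.toMap
  refData = refData.updated("c", batch * 10)      // the cached-data changes between batches
}
sc.stop()
```

Each iteration sees the refreshed cached-data because a new broadcast is created per batch, while unpersist() frees the previous copy from the executors.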

Demo:

How do we do this?


import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import scala.reflect.ClassTag

// Generic function that broadcasts refData, invokes `f()` and unpersists the broadcast.
def withBroadcast[T: ClassTag, Q](refData: T)(f: Broadcast[T] => Q)(implicit sc: SparkContext): Q = {
    val broadcast = sc.broadcast(refData)
    val result = f(broadcast)
    broadcast.unpersist()
    result
}

// transforming inputRdd using refData (Input and RefData are placeholder types)
def updateInput(inputRdd: RDD[Input], broadcastedRefData: Broadcast[RefData]): RDD[Input] = {
  inputRdd.mapPartitions(inputs => {
    val refData = broadcastedRefData.value // access broadcasted refData to change input
    inputs.map(input => input /* apply the refData-based transformation here */)
  })
}

// invoking `updateInput()` using withBroadcast()
val updatedRdd = withBroadcast(refData) { broadcastedRefData =>
  updateInput(inputRdd, broadcastedRefData)
}
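Putting the pieces together, here is a hypothetical end-to-end invocation; the local SparkContext, the Map[String, Int] refData, and the enrichment logic are illustrative assumptions, not part of the original solution:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// broadcasts refData, invokes `f()` and unpersists the broadcast
def withBroadcast[T: ClassTag, Q](refData: T)(f: Broadcast[T] => Q)(implicit sc: SparkContext): Q = {
  val broadcast = sc.broadcast(refData)
  val result = f(broadcast)
  broadcast.unpersist()
  result
}

implicit val sc: SparkContext =
  new SparkContext(new SparkConf().setMaster("local[2]").setAppName("withBroadcast-demo"))

val inputRdd: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
val refData = Map("a" -> 1, "b" -> 2)

// enrich every input record with the broadcasted refData
val results = withBroadcast(refData) { broadcastedRefData =>
  inputRdd.mapPartitions { inputs =>
    inputs.map(key => (key, broadcastedRefData.value.getOrElse(key, 0)))
  }.collect().toSeq
}
sc.stop()
```

Note that the action (collect) runs inside withBroadcast, so the broadcast is still alive when the job executes and is cleaned up right after.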

Conclusion:

Some key takeaways from the part1 and part2 solutions, which are primarily spark-native (i.e., they do not use any external service):

So far, we have looked at spark-native solutions to address this issue. Now, let’s widen the horizon and look outside Spark.

Alternatives:

One can also think of converting the reference data to an RDD and then joining it with the input stream, using either join() or zip(), such that we end up with a stream of Pair<MyObject, RefData>.
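As a sketch of that alternative, assuming the reference data is keyed and small enough to live in an RDD (all names and data here are illustrative), a single batch could be joined like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join-refdata"))

val inputRdd = sc.parallelize(Seq("a", "b"))                       // stands in for one input batch
val refRdd   = sc.parallelize(Seq("a" -> "ref-a", "b" -> "ref-b")) // reference data as an RDD

// key the input by itself, join against the reference RDD, and keep the (input, refData) pairs
val pairs = inputRdd.map(x => (x, x)).join(refRdd).values.collect().toSeq.sorted
sc.stop()
```

Keep in mind that join() reshuffles both sides on every batch, which is exactly the cost the broadcast-based approach above avoids.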

