spark-notes


Knowledge Browser

Problem Statement: Despite being first-class citizens in Spark and holding the key corporate asset, i.e., data, Datasets are not getting the attention they need when it comes to making them searchable.

Cover Points:

In this blog, we’ll look at the following items:

  1. How can we make datasets searchable
    • esp. without using any search engine
    • SQL-like queries for search
  2. Same data, different representations
    • Rearrange the same data into different schemas
    • Understand the impact of the data schema on search time
  3. Conclusion - Performance testing
    • Evaluate each of these approaches and see which is the more appropriate way to structure the datasets when it comes to making them searchable.

1. How can we make datasets searchable

I’ve implemented a working sample application using World Bank open data that makes the dataset searchable with SQL-like queries, without using any search engine. A demo is shown below:

[Demo animation and sample output of the knowledge browser application]

Please refer to my git repository here for further details.
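To give a feel for the idea, here is a minimal sketch of searching a Spark dataset with plain SQL and no search engine involved. The file path and the countryId column name are assumptions for illustration; see the repository for the actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object KnowledgeBrowserSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("knowledge-browser")
      .master("local[*]")
      .getOrCreate()

    // Load the World Bank country data (path and header layout are assumptions).
    val countries = spark.read
      .option("header", "true")
      .csv("data/worldbank-countries.csv")

    // Expose the DataFrame as a temp view so it can be searched with plain SQL,
    // i.e., no external search engine involved.
    countries.createOrReplaceTempView("countries")

    // SQL-like search over the dataset (countryId column name is an assumption).
    spark.sql("SELECT * FROM countries WHERE countryId = 'IND'").show()

    spark.stop()
  }
}
```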

2. Same data, different representations:

I’ve used country profile information from World Bank open data (~4 million rows) as the knowledge base in this project. The following table displays some sample rows:

[Sample rows from the World Bank country dataset]

I tried different ways to structure this data and evaluated their performance using a simple search-by-CountryId query. Let’s jump in and take a look at what these different schemas are and how they performed when queried by CountryId…
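For reference, a minimal sketch of how such a query can be timed; the exact harness in the repository may differ, and the countryId column name plus the countries DataFrame from the sketch above are assumptions.

```scala
import org.apache.spark.sql.functions.col

// Tiny timing helper: run the query, force it with collect(), report wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.currentTimeMillis()
  val result = block
  println(s"$label took ${System.currentTimeMillis() - start} ms")
  result
}

// Example: time a search-by-CountryId query over the countries DataFrame
// loaded in the earlier sketch.
timed("query by CountryId") {
  countries.filter(col("countryId") === "IND").collect()
}
```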

2.1 Data Frames:

The very first attempt to structure the data was naturally the simplest of all, i.e., DataFrames, where all the information about a country is in one row.

Schema:

[Schema: one row per country, with all country attributes as columns]

Query by CountryId response time: 100ms
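A minimal sketch of this flat representation and the search over it, using toy rows and assumed column names (a SparkSession named spark is assumed to be in scope, as in spark-shell):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Flat schema with toy rows: one row per country, every attribute is a column.
// The real dataset has many more columns and rows.
val countriesDF = Seq(
  ("IND", "India", "South Asia", "Lower middle income"),
  ("USA", "United States", "North America", "High income")
).toDF("countryId", "name", "region", "incomeLevel")

// Search is a single filter over the flat rows:
countriesDF.filter(col("countryId") === "IND").show()
```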

2.2 RDF Triplets

Next, I represented the same data as RDF triplets. In this approach, we take each row and convert it into triplets of Subject, Predicate, and Object, as shown in the table below:

Schema:

Query by CountryId response time: 6501ms
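A minimal sketch of how the flat rows can be exploded into (Subject, Predicate, Object) triplets and searched, reusing the toy countriesDF from the previous sketch; the field names are assumptions.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// One triplet row per (country, attribute) pair. The field is named `obj`
// to avoid the reserved word `object`.
case class Triplet(subject: String, predicate: String, obj: String)

// Explode the flat countriesDF into triplets, e.g.
// ("IND", "name", "India"), ("IND", "region", "South Asia"), ...
val triplets = countriesDF.rdd.flatMap { row =>
  val id = row.getAs[String]("countryId")
  row.schema.fieldNames.filter(_ != "countryId").map { field =>
    Triplet(id, field, Option(row.getAs[Any](field)).map(_.toString).orNull)
  }
}.toDF("subject", "predicate", "object")

// A search by CountryId now has to gather every triplet whose subject matches:
triplets.filter(col("subject") === "IND").show()
```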

2.3 RDF Triplets as LinkedData

Next, I represented the same data as RDF triplets of linked data. The only difference between the earlier approach and this one is that the Subject here is linked to a unique id, which in turn holds the actual info, as shown below:

Schema:

Query by CountryId response time: 25014ms
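A minimal sketch of this linked-data layout with toy rows; the node ids and column names are assumptions. Note that the query now needs a join before it can reach the attributes, which is where the extra cost shows up.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// The triplet Subject is now an opaque node id, and a separate mapping
// links that node id to the actual CountryId.
val nodes = Seq(("node-1", "IND"), ("node-2", "USA")).toDF("nodeId", "countryId")

val linkedTriplets = Seq(
  ("node-1", "name", "India"),
  ("node-1", "region", "South Asia"),
  ("node-2", "name", "United States")
).toDF("subject", "predicate", "object")

// A search by CountryId first resolves the node id, then joins to the triplets:
val matched = nodes.filter(col("countryId") === "IND")
matched.join(linkedTriplets, matched("nodeId") === linkedTriplets("subject")).show()
```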

2.4 Graph Frames

The last approach I tried was to structure the data as a graph with vertices, edges, and attributes. The picture below gives an idea of what the country info looks like in this approach:

Schema:
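A minimal sketch of the graph layout using the GraphFrames package (which must be added to Spark, e.g. via --packages); the toy vertices, edges, and column choices are assumptions.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// Toy graph: countries and attribute values are vertices, attributes are edges.
// GraphFrames expects an "id" column on vertices and "src"/"dst" on edges.
val vertices = Seq(
  ("IND", "country"),
  ("South Asia", "region-value"),
  ("India", "name-value")
).toDF("id", "type")

val edges = Seq(
  ("IND", "South Asia", "region"),
  ("IND", "India", "name")
).toDF("src", "dst", "relationship")

val graph = GraphFrame(vertices, edges)

// Query by CountryId: pull the country's outgoing edges (its attributes).
graph.edges.filter(col("src") === "IND").show()
```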

3. Conclusion:

I wrote this blog to demonstrate how the choice of data schema impacts search performance in Spark. For this, I tried different ways to structure the same data and evaluated their performance. I hope it helps you and gives you a better perspective on structuring the data for your Spark application.

3.1 Key takeaways:

The flatter the schema, the faster the search: the plain DataFrame answered the query by CountryId in ~100ms, the RDF-triplet representation took ~6501ms, and the linked-data triplets took ~25014ms. Every extra level of indirection in the schema added significant cost to the same simple query.

3.2 Note:

The search query here essentially filters the dataset and returns the results, i.e., it is a filter() transformation applied on the data. So the observed response times per schema apply not only to search but also to any transformation we apply on Spark data. This experiment definitely shows how big an impact the data schema has on the performance of your Spark application. Happy data structuring!!