This project is maintained by spoddutur
Problem Statement: Despite being first-class citizens in Spark and holding the key corporate asset, i.e., data, Datasets are not getting the attention they need in terms of making them searchable.
In this blog, we’ll look at the following items:
I’ve implemented a working sample application using World Bank open data that does the following:
Please refer to my git repository here for further details.
I’ve used ~4 million country profile records from World Bank open data as the knowledge base in this project. The following table displays some sample rows:
I tried different ways to structure this data and evaluated their performance using a simple search-by-CountryId query.
Let’s jump in and take a look at the different schemas I tried and how each performed when queried by CountryId…
The very first attempt at structuring the data was naturally the simplest of all, i.e., a DataFrame where all of a country’s information is in one row.
Schema:
Query by CountryId response time: 100ms
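To make the flat layout concrete, here is a minimal plain-Python sketch (the column names are illustrative, not the actual World Bank fields). In the Spark version, the same lookup would be a DataFrame filter such as `df.filter(df.countryId == "IND")`:

```python
# Flat schema: all of a country's attributes live in one row/record.
rows = [
    {"countryId": "IND", "name": "India", "region": "South Asia"},
    {"countryId": "USA", "name": "United States", "region": "North America"},
]

# Search-by-CountryId is a single filter over the rows;
# with a Spark DataFrame: df.filter(df.countryId == "IND")
hit = [r for r in rows if r["countryId"] == "IND"]
print(hit)
```

Because each country occupies exactly one row, the filter touches the minimum possible number of rows, which is why this schema is the fastest of the four.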
Next, I represented the same data as triplets. In this approach, we basically take each row and convert it into triplets of Subject, Predicate and Object,
as shown in the table below:
Schema:
Query by CountryId response time: 6501ms
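A plain-Python sketch of the same conversion (attribute names illustrative) shows why the lookup slows down: each country now spans many rows, all of which must be matched by subject:

```python
# Triplet schema: one (subject, predicate, object) row per attribute.
# The flat row {"countryId": "IND", "name": "India", ...} explodes into:
triplets = [
    ("IND", "name", "India"),
    ("IND", "region", "South Asia"),
    ("USA", "name", "United States"),
    ("USA", "region", "North America"),
]

# Search-by-CountryId must now collect every triplet for the subject,
# scanning far more rows than the flat layout did.
india_facts = [(p, o) for (s, p, o) in triplets if s == "IND"]
print(india_facts)
```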
Next, I represented the same data as RDF triplets of linked data. The only difference between the earlier approach and this one is that the Subject here is linked to a unique id, which in turn holds the actual info, as shown below:
Schema:
Query by CountryId response time: 25014ms
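Sketching the linked variant the same way (the node ids are made up), the lookup gains an extra resolution step, effectively one more join before the triplet filter:

```python
# Linked-data schema: subjects are opaque node ids; a separate
# relation maps CountryId -> node id (ids here are invented).
node_of_country = {"IND": "node:1", "USA": "node:2"}
triplets = [
    ("node:1", "name", "India"),
    ("node:1", "region", "South Asia"),
    ("node:2", "name", "United States"),
]

# Query by CountryId now needs two steps: resolve the node id,
# then filter the triplets by that id.
node = node_of_country["IND"]
facts = [(p, o) for (s, p, o) in triplets if s == node]
print(facts)
```

In Spark, that resolution step is a join between two datasets rather than a dictionary lookup, which is consistent with this schema being the slowest to query.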
The last approach I tried was to structure the data as a graph with vertices, edges, and attributes. The picture below gives an idea of how a country’s info looks in this approach:
Schema:
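A plain-Python sketch of the graph layout (vertex ids are made up): the country’s attributes are reached by filtering the edges leaving the country vertex and resolving their destination vertices, which in GraphFrames corresponds to joining `g.edges` with `g.vertices`:

```python
# Graph schema: the country and its attribute values are vertices;
# edges carry the relationship name.
vertices = {"IND": "country", "v1": "India", "v2": "South Asia"}
edges = [("IND", "v1", "name"), ("IND", "v2", "region")]

# Query by CountryId: take the edges whose source is the country
# vertex and resolve their destination vertices.
info = {rel: vertices[dst] for (src, dst, rel) in edges if src == "IND"}
print(info)
```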
I wrote this blog to demonstrate:
For this, I tried different ways to structure the data and evaluated their performance. I hope this helps and gives you a better perspective on structuring the data for your Spark application.
The search query here essentially filters the dataset and returns the results, i.e., it is a filter() transformation applied to the data. So the observed response times per schema apply not only to search but also to any transformation we apply to Spark data. This experiment definitely shows how big an impact the data schema has on the performance of your Spark application. Happy data structuring!!