Knowledge Browser

Problem Statement: Despite being the first class citizen in Spark, holding the key corporate asset i.e., data, Datasets are not getting enough attention it needs in terms of making them searchable.

Cover Points:

In this blog, we’ll look at following items:

How can we make datasets searchable
- esp. without using any search engine
- SQL like queries for search
Same data, different representations
- Rearrange the same data in different schema’s
- Understand the impact of data schema on search time.
Conclusion - Performance testing
- Evaluate each of these approaches and see which is more appropriate way to structure the datasets when it comes to making them searchable.

1. How can we make datasets searchable

I’ve implemented a working sample application using worldbank open data that does the following:

A UI Dashboard built on top of Spark to browse knowledge (a.k.a data)
Real-time query spark and visualise it as graph.
Supports SQL query syntax.
This is just a sample application to get an idea on how to go about building any kind of analytics dashboard on top of the data that Spark processed. One can customize it according to their needs.

Demo

out

Please refer to my git repository here for further details.

2. Same data, different representations:

I’ve used ~4 million countries profile information as knowledge base from world bank open data in this project. Following table displays some sample rows: screen shot 2017-08-03 at 11 45 41 pm

I tried different ways to structure this data and evaluated their performance using simple search by country id query. Let’s jump in and take a look at what are these different schema’s that I tried and how they performed when queried by CountryId…

2.1 Data Frames:

The very first attempt to structure the data was naturally the simplest of all i.e., DataFrames, where all the information of the country is in one row.

Schema:

Query by CountryId response time: 100ms

2.2 RDF Triplets

Next, I represented the same data as RDF triplets . In this approach, we basically, take each row and convert it into triplets of Subject, Predicate and Object as shown in the table below:

Schema:

Query by CountryId response time: 6501ms

2.3 RDF Triplets as LinkedData

Next, I represented the same data as RDF triplets of linked data. The only difference between earlier approach and this one is that, the Subject here is linked to a unique id which inturn holds the actual info as shown below:

Schema:

Query by CountryId response time: 25014ms

2.4 Graph Frames

The last attempt that I tried was to structure the data as graph with vertices, edges, and attributes. Below picture gives an idea of how country info looks like in this approach:

Schema:

Number of Vertices: 4,278,235
Number of Edges: 15,357,957
Query by CountryId response time: 7637ms

3. Conclusion:

I wrote this blog to demonstrate:

How to make datasets searchable and
The impact of data schema on search response time.

For this, I tried different ways to structure the data and evaluated its performance. I hope it helps you and gives a better perspective in structuring your data for your Spark application.

3.1 Key takeouts:

When the nature of your data is homogenous, capturing all the information of a record in a single row gives the best performance in terms of search time.
If you are dealing with heterogenous data, where all the entities cannot share a generic schema, then simple RDF triplets yields better response times compared to LinkedData RDF Triplets representation.
Though higher response time is a downside of LinkedData, it is the recommended approach to connect, expose and make your data available for semantic search.
GraphFrames is very intuitive for user to structure the data in many cases. Its response times are comparable to RDF triplets search and they also open up doors to exploit graph algorithms like triangle count, connected commonents, BFS etc

3.2 Note:

The search query here, essentially filters the datasets and returns the results i.e., it is a filter() transformation applied on data. So, the observed response times per schema not only applies to search but it also applies to any transformations that we apply on spark data. This experiment definetely tells us how big is the impact of dataschema on the performance of your spark application. Happy data structuring !!