Project Name

Sparklens visualisation extension

Mentor Details

Name: 

Mayur Bhosale

Organisation:

Qubole

Designation:

Distributed Systems Engineer

Project Description

Sparklens is an open-source Spark application profiler. Debugging and tuning a Spark application is tedious and requires a subject matter expert, and even then, validating the suggestions can be expensive. Sparklens helps narrow down an application's bottlenecks and helps set the optimal number of executors using a built-in simulator. Currently, Sparklens writes its output to the command line (see https://github.com/qubole/sparklens#what-does-it-report), which is not at all intuitive. We need a local static service that can take this output (internally it's a JSON file) and create a static web UI version of it.
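
To make the task concrete, here is a minimal sketch of the JSON-to-static-page step, written in TypeScript. The actual schema of the Sparklens JSON output is not described here, so the field names (appName, executorCount, stages, stageId, wallClockMs) and the function names (escapeHtml, renderReport) are hypothetical placeholders, not the real Sparklens format or API.

```typescript
// Hypothetical report shape: every field name below is an assumption made
// for illustration, not the real Sparklens JSON schema.
interface StageSummary {
  stageId: number;
  wallClockMs: number;
}

interface SparklensReport {
  appName: string;
  executorCount: number;
  stages: StageSummary[];
}

// Escape text before embedding it in HTML so names cannot break the markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn one parsed report into a self-contained static HTML page.
function renderReport(report: SparklensReport): string {
  const rows = report.stages
    .map((s) => `<tr><td>${s.stageId}</td><td>${s.wallClockMs}</td></tr>`)
    .join("\n");
  return [
    "<!DOCTYPE html>",
    "<html>",
    `<head><title>${escapeHtml(report.appName)}</title></head>`,
    "<body>",
    `<h1>${escapeHtml(report.appName)}</h1>`,
    `<p>Executors: ${report.executorCount}</p>`,
    "<table>",
    "<tr><th>Stage</th><th>Wall clock (ms)</th></tr>",
    rows,
    "</table>",
    "</body>",
    "</html>",
  ].join("\n");
}

// Made-up sample input standing in for the profiler's JSON file.
const sample: SparklensReport = JSON.parse(
  '{"appName":"demo <job>","executorCount":4,"stages":[{"stageId":0,"wallClockMs":1200}]}'
);
const html: string = renderReport(sample);
```

In a local tool, the resulting string could be written to an .html file (for example with Node's fs.writeFileSync) and opened in a browser. A real implementation would likely use one of the JavaScript frameworks and multiple pages, as in the reference UI.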

Expected outcome -
A locally running web UI (reference UI: http://sparklens.qubole.com/report_view/1b3868a49388e7ab6a16) in which the user can navigate through the pages. Beyond what the reference UI shows, there are additional components that need to be added.

If time permits and the student is interested, we can take up enhancements to the Sparklens core as well and submit them to the open-source project.

Programming Languages

Any JavaScript framework, Scala

Project Pre-requisites

Basic understanding of Git. If the student ends up working on core Sparklens tasks, basic knowledge of Java/Scala is required.

Project Duration (in Months)

1.5

Number of openings

2

Project Difficulty

Novice

Additional Information

Spark application tuning is a difficult problem, and many companies and projects are trying to tackle it. Here are a few references: https://www.pepperdata.com/, https://github.com/linkedin/dr-elephant
Looking into the performance side is a great way to understand the internals of a distributed system.

Proposal requirements

Research open-source distributed computing frameworks (Spark, Hive, Presto) and write a 2-3 line summary of each, explaining their pros and cons.

Have questions or feedback? Interested in working with us?  Email us at connectinternlink@gmail.com