A 5min POC using google sheets in Zeppelin.

Testing how easy it would be to drop data into a spreadsheet, then analyze it using Zeppelin.

First go into https://docs.google.com and create a new spreadsheet, which would be used for collecting your data. Alternatively, my copy of bank.csv, which is used in this example, can be cloned for yours.

Next, you will want to create a shareable link for this spreadsheet, by clicking on [Share] on the upper right hand corner of your Google Sheet:

Next, copy the shareable link:

If you don’t have access to Zeppelin service, you can initialize one using docker, which would make zeppelin accessible through http://localhost:8080

docker run apache/zeppelin:0.8.0

This shareable link ( https://docs.google.com/spreadsheets/d/16VVaWKbCMzzZMZum5BBLQ6jEHn-DAzIvd02NOO5ApP4/edit?usp=sharing )  is broken down into the https://docs.google.com/spreadsheets/d/%5Bsheet-id%5D/edit?usp=sharing

Next use that ID in the following link format in your Zeppelin notebook, to download and process the CSV as follows:

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
// So you don't need create them manually
val fileId="16VVaWKbCMzzZMZum5BBLQ6jEHn-DAzIvd02NOO5ApP4"
val googleCsv = "https://docs.google.com/spreadsheets/u/0/d/"+ fileId+ "/export?format=csv&id="+fileId+"&gid=0"
// load bank data
val bankText = sc.parallelize(
        new URL(googleCsv),

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(",")).filter(s => s(0) != "age").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt

Next, go ahead and execute your notebook and run queries against the data as you need.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.