Querying HBase using Spark

For more information and examples, see HBase Example Using HBase Spark Connector.
  1. Grant the Spark user permission to perform CRUD operations in HBase, acting as the "hbase" superuser (note that kinit uses the hbase headless keytab, so the principal must be the hbase principal, not the Spark user):
    sudo -u hbase bash
    kinit -kt /etc/security/keytabs/hbase.headless.keytab <hbase-principal>
    hbase shell
    grant 'spark', 'RWXCA'
    exit
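To confirm the grant took effect before moving on, you can list the Spark user's permissions from the HBase shell with the standard `user_permission` command (a quick sanity-check sketch; run it in the same superuser session):

```
hbase shell
user_permission 'spark'
```

The output should list the 'spark' user with the RWXCA actions granted above.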
  2. Sign in to Ranger.
  3. Click the HBase service.
  4. Add or update a policy that grants the Spark user "create, read, write, execute" access.
  5. Sign in with the Spark user account and create a table in HBase with two column families, 'p' and 'c':
    sudo su spark
    (run kinit as the spark user if required)
    hbase shell
    hbase(main):001:0> create 'person', 'p', 'c'
  6. Start spark-shell:
    spark-shell --jars /usr/lib/hbase/hbase-spark.jar,/usr/lib/hbase/hbase-spark-protocol-shaded.jar,/usr/lib/hbase/* \
      --files /etc/hbase/conf/hbase-site.xml \
      --conf spark.driver.extraClassPath=/etc/hbase/conf
  7. Insert and read data using spark-shell:
    • Inserting data:
      val sql = spark.sqlContext

      import java.sql.Date

      case class Person(name: String,
                        email: String,
                        birthDate: Date,
                        height: Float)

      val personDS = Seq(
        Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
        Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)
      ).toDS

      personDS.write.format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping",
          "name STRING :key, email STRING c:email, " +
          "birthDate DATE p:birthDate, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .save()

      Results:

      shell> scan 'person'
      ROW       COLUMN+CELL
       alice    column=c:email, timestamp=1568723598292, value=alice@alice.com
       alice    column=p:birthDate, timestamp=1568723598292, value=\x00\x00\x00\xDCl\x87 \x00
       alice    column=p:height, timestamp=1568723598292, value=@\x90\x00\x00
       bob      column=c:email, timestamp=1568723598521, value=bob@bob.com
       bob      column=p:birthDate, timestamp=1568723598521, value=\x00\x00\x00\xE9\x99u\x95\x80
       bob      column=p:height, timestamp=1568723598521, value=@\xA333
      2 row(s)
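The raw bytes in the scan output are expected: the connector stores FLOAT columns as 4-byte big-endian IEEE-754 values and DATE columns as 8-byte epoch timestamps, so `scan` shows binary cell values rather than readable numbers. For example, '@' is byte 0x40, so alice's height cell `@\x90\x00\x00` is the bit pattern 0x40900000, which decodes back to 4.5. A minimal check in spark-shell:

```scala
// Decode the raw FLOAT bytes shown by `scan 'person'` for alice's height.
// The cell value @\x90\x00\x00 is the big-endian bit pattern 0x40900000.
val heightBits = 0x40900000
println(java.lang.Float.intBitsToFloat(heightBits))  // prints 4.5
```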
    • Reading data back:
      val sql = spark.sqlContext

      val df = sql.read.format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping",
          "name STRING :key, email STRING c:email, " +
          "birthDate DATE p:birthDate, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .load()
      df.createOrReplaceTempView("personView")
      
      val results = sql.sql("SELECT * FROM personView WHERE name = 'alice'")
      results.show()
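The same DataFrame can also be queried directly with the DataFrame API instead of SQL, and the connector can push simple comparison filters down to HBase rather than scanning everything back into Spark. A sketch using the column names mapped above:

```scala
// Filter with the DataFrame API; simple predicates on mapped columns
// can be pushed down to HBase by the connector.
val tall = df.filter(df("height") > 5.0f).select("name", "email")
tall.show()
```

With the sample data above, this returns only bob's row.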