I am trying to store a Spark Dataset to HBase in an efficient way. We first tried something like this with a lambda in Java:
sparkDF.foreach(l->this.hBaseConnector.persistMappingToHBase(l,"name_of_hBaseTable") );
The function persistMappingToHBase uses the HBase Java client (Put) to store rows in HBase.
I get an exception: Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Then we tried this:
sparkDF.foreachPartition(partition -> {
    final HBaseConnector hBaseConnector = new HBaseConnector();
    hBaseConnector.connect(hbaseProps);
    while (partition.hasNext()) {
        hBaseConnector.persistMappingToHBase(partition.next());
    }
    hBaseConnector.closeConnection();
});
This seems to work, but it still feels quite inefficient, I guess because we create and close a connection for each partition of the DataFrame.
What is a good way to store a Spark Dataset to HBase? I saw a connector developed by IBM but have never used it.
The following can be used to save the content to HBase:
val hbaseConfig = HBaseConfiguration.create
hbaseConfig.set("hbase.zookeeper.quorum", "xx.xxx.xxx.xxx")
hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")
val job = Job.getInstance(hbaseConfig)
job.setOutputFormatClass(classOf[TableOutputFormat[_]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "test_table")
// saveAsNewAPIHadoopDataset needs a pair RDD, so map over the underlying RDD
val result = sparkDF.rdd.map(row => {
  // Using UUID as my rowkey, you can use your own rowkey
  val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))
  // setting the value of each row on the Put object
  ....
  ....
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(), put)
})
// save result to hbase table
result.saveAsNewAPIHadoopDataset(job.getConfiguration)
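For illustration only, the elided part that sets the cell values might look like the following; "cf" and "name" are placeholder names for a column family and a DataFrame column, not from the original answer:
// hypothetical: write one cell on the Put, using placeholder family/qualifier names
put.addColumn(
  Bytes.toBytes("cf"),
  Bytes.toBytes("name"),
  Bytes.toBytes(row.getAs[String]("name")))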
I have the following dependencies in my build.sbt file:
libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.3.0"
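As an aside, and not part of the answer above: if you prefer to stay with foreachPartition from the question, one common pattern is to open a single connection per partition and write all Puts for that partition in one batch. A minimal sketch, assuming hbase-site.xml is on the executor classpath and with buildPut as a hypothetical helper that turns a Row into a Put:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}

sparkDF.rdd.foreachPartition { partition =>
  // one connection per partition, not per row
  val connection = ConnectionFactory.createConnection()
  val table = connection.getTable(TableName.valueOf("name_of_hBaseTable"))
  try {
    // buildPut is a hypothetical Row => Put helper; batch all Puts of this partition
    val puts: java.util.List[Put] = partition.map(buildPut).toList.asJava
    if (!puts.isEmpty) table.put(puts)
  } finally {
    table.close()
    connection.close()
  }
}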
I have tens of thousands of files in a directory demotxt, like:
demotxt/
aa.txt
this is aaa1
this is aaa2
this is aaa3
bb.txt
this is bbb1
this is bbb2
this is bbb3
this is bbb4
cc.txt
this is ccc1
this is ccc2
I would like to efficiently compute a WordCount for each .txt file in this directory with Spark 2.4 (Scala or Python).
# target result is:
aa.txt: (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1)
bb.txt: (this,4), (is,4), (bbb1,1), (bbb2,1), (bbb3,1), (bbb4,1)
cc.txt: (this,2), (is,2), (ccc1,1), (ccc2,1)
The code might look something like this:
def dealWithOneFile(path2File):
    res = wordcountFor(path2File)
    saveResultToDB(res)
sc.wholeTextFiles(rootDir).foreach(dealWithOneFile)
It seems that with sc.textFile(".../demotxt/") Spark will load all files at once, which may cause memory issues; it also treats all files as one, which is not what I want.
So how should I do this? Many thanks!
Here is an approach. It can be done with a DataFrame or an RDD; since you do not state a preference, I show the RDD version in Scala, running on Databricks.
It is hard to explain, but it works; try it on some input.
%scala
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._
val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
// read the files as (file name, line) pairs, then drop to an RDD
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name(), $"value").as[(String, String)].rdd
// one ((file, word), 1) pair per word
val rdd2 = rdd.flatMap(x => x._2.split("\\s+").map(y => ((x._1, y), 1)))
// count per (file, word), then re-key by file
val rdd3 = rdd2.reduceByKey(_ + _).map { case (x, y) => (x._1, (x._2, y)) }
rdd3.collect
// group the (word, count) pairs per file
val rdd4 = rdd3.groupByKey()
rdd4.collect
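If you prefer to stay in the DataFrame API, a minimal sketch of the same idea (my own addition, not part of the answer above) could be:
import org.apache.spark.sql.functions.{col, explode, input_file_name, split}

// one row per (file, word), then count occurrences per file and word
val words = spark.read.text(paths: _*)
  .select(input_file_name().as("file"), explode(split(col("value"), "\\s+")).as("word"))
val countsPerFile = words.groupBy("file", "word").count()
countsPerFile.show(false)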
I'm trying to generate some test data using a collection and write that data to S3. Flink doesn't seem to do any checkpointing at all when I do this, but it does checkpoint when the source comes from S3.
For example, this DOES checkpoint and leaves output files in a completed state:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = {
  val path = "s3a://my_bucket/simple_job/in"
  env.readFile(
    inputFormat = new TextInputFormat(new Path(path)),
    filePath = path,
    watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
    interval = 5000L
  )
}
val sinkFunction: BucketingSink[String] =
new BucketingSink[String]("s3a://my_bucket/simple_job/out")
.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
Meanwhile, this DOES NOT checkpoint, and leaves files in a .pending state even after the job has finished:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = env.fromCollection((1 to 100).map(_.toString))
val sinkFunction: BucketingSink[String] =
new BucketingSink[String]("s3a://my_bucket/simple_job/out")
.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
It turns out that this is because of this ticket: https://issues.apache.org/jira/browse/FLINK-2646. It simply comes about because the stream from the collection finishes before the app ever has time to make a single checkpoint.
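One workaround, shown here only as a sketch of my own and assuming the same org.apache.flink.streaming.api.scala._ imports as in the snippets above, is to wrap the test data in a SourceFunction that keeps running after it has emitted everything, so the job stays alive long enough for checkpoints to fire:
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Emits the given elements, then idles until cancelled, so checkpoints can still be taken.
class KeepAliveCollectionSource(data: Seq[String]) extends SourceFunction[String] {
  @volatile private var running = true
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    data.foreach(ctx.collect)
    while (running) Thread.sleep(100)
  }
  override def cancel(): Unit = running = false
}

val lines: DataStream[String] = env.addSource(new KeepAliveCollectionSource((1 to 100).map(_.toString)))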
I want to post a message in JSON format to RabbitMQ and have that message consumed successfully. I'm attempting to use Camel to integrate producers and consumers. However, I'm struggling to understand how to create a route to make this happen. I'm using JSON Schema to define the interface between the producer and consumer. My application creates JSON, converts it to a byte[], and a Camel ProducerTemplate is used to send the message to RabbitMQ. On the consumer end, the byte[] message needs to be converted to a String, then to JSON, and then marshalled to an object so I can process it. The following route line doesn't work, however:
from(startEndpoint).transform(body().convertToString()).marshal().json(JsonLibrary.Jackson, classOf[Payload]).bean(classOf[JsonBeanExample])
It's as if the bean is passed the original byte[] content and not the object created by json(JsonLibrary.Jackson, classOf[Payload]). All the Camel examples I've seen that use the json(..) call seem to be followed by a to(..), which ends the route. Here is the error message:
Caused by: org.apache.camel.InvalidPayloadException: No body available of type: uk.co.techneurons.messaging.Payload but has value: [B@48898819 of type: byte[] on: Message: "{\"id\":1}". Caused by: No type converter available to convert from type: byte[] to the required type: uk.co.techneurons.messaging.Payload with value [B@48898819. Exchange[ID-Tonys-iMac-local-54996-1446407983661-0-2][Message: "{\"id\":1}"]. Caused by: [org.apache.camel.NoTypeConversionAvailableException - No type converter available to convert from type: byte[] to the required type: uk.co.techneurons.messaging.Payload with value [B@48898819]
I don't really want to use Spring, annotations, etc.; I would like to keep service activation as simple as possible and use Camel as much as possible.
This is the producer:
package uk.co.techneurons.messaging
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
object RabbitMQProducer extends App {
val camelContext = new DefaultCamelContext
val rabbitMQEndpoint: String = "rabbitmq:localhost:5672/advert?autoAck=false&threadPoolSize=1&username=guest&password=guest&exchangeType=topic&autoDelete=false&declare=false"
val rabbitMQRouteBuilder = new RouteBuilder() {
override def configure(): Unit = {
from("direct:start").to(rabbitMQEndpoint)
}
}
camelContext.addRoutes(rabbitMQRouteBuilder)
camelContext.start
val producerTemplate = camelContext.createProducerTemplate
producerTemplate.setDefaultEndpointUri("direct:start")
producerTemplate.sendBodyAndHeader("{\"id\":1}","rabbitmq.ROUTING_KEY","advert.edited")
camelContext.stop
}
This is the consumer:
package uk.co.techneurons.messaging
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.camel.model.dataformat.JsonLibrary
object RabbitMQConsumer extends App {
val camelContext = new DefaultCamelContext
val startEndpoint = "rabbitmq:localhost:5672/advert?queue=es_index&exchangeType=topic&autoDelete=false&declare=false&autoAck=false"
val consumer = camelContext.createConsumerTemplate
val routeBuilder = new RouteBuilder() {
override def configure(): Unit = {
from(startEndpoint).transform(body().convertToString()).marshal().json(JsonLibrary.Jackson, classOf[Payload]).bean(classOf[JsonBeanExample])
}
}
camelContext.addRoutes(routeBuilder)
camelContext.start
Thread.sleep(1000)
camelContext.stop
}
case class Payload(id: Long)
class JsonBeanExample {
def process(payload: Payload): Unit = {
println(s"JSON ${payload}")
}
}
For completeness, this is the sbt file for easy replication:
name := """camel-scala"""
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= {
val scalaTestVersion = "2.2.4"
val camelVersion: String = "2.16.0"
val rabbitVersion: String = "3.5.6"
val slf4jVersion: String = "1.7.12"
val logbackVersion: String = "1.1.3"
Seq(
"org.scala-lang.modules" %% "scala-xml" % "1.0.3",
"org.apache.camel" % "camel-core" % camelVersion,
"org.apache.camel" % "camel-jackson" % camelVersion,
"org.apache.camel" % "camel-scala" % camelVersion,
"org.apache.camel" % "camel-rabbitmq" % camelVersion,
"com.rabbitmq" % "amqp-client" % rabbitVersion,
"org.slf4j" % "slf4j-api" % slf4jVersion,
"ch.qos.logback" % "logback-classic" % logbackVersion,
"org.apache.camel" % "camel-test" % camelVersion % "test",
"org.scalatest" %% "scalatest" % scalaTestVersion % "test")
}
Thanks
I decided that I needed to create a bean and register it (easier said than done: for some as yet unknown reason JNDIRegistry didn't work with DefaultCamelContext, so I used a SimpleRegistry):
val registry: SimpleRegistry = new SimpleRegistry()
registry.put("myBean", new JsonBeanExample())
val camelContext = new DefaultCamelContext(registry)
Then I changed the consuming RouteBuilder; it seems I had been over-transforming the message:
from(startEndpoint).unmarshal.json(JsonLibrary.Jackson, classOf[Payload]).to("bean:myBean?method=process")
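(For context: in Camel, marshal goes from an object to a wire format, while unmarshal goes from the wire format, here the JSON bytes, to an object, which is why unmarshal is the right direction on the consumer side.)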
I also changed Payload so that setter methods were available (@BeanProperty) and added a toString:
import scala.beans.BeanProperty

class Payload {
  @BeanProperty var id: Long = _
  override def toString = s"Payload($id)"
}
class JsonBeanExample() {
def process(payload: Payload): Unit = {
println(s"recieved ${payload}")
}
}
The next problem is to get dead letter queues working, and to ensure that failures in the bean handler make their way properly back up the stack.
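A minimal sketch of what that might look like, purely as an assumption on my part and with "advert.dlq" as a placeholder endpoint, would be to declare a dead letter channel error handler in the same RouteBuilder:
val routeBuilder = new RouteBuilder() {
  override def configure(): Unit = {
    // retry a few times, then park the failed exchange on a placeholder dead-letter endpoint
    errorHandler(deadLetterChannel("rabbitmq:localhost:5672/advert.dlq?declare=false")
      .maximumRedeliveries(3)
      .redeliveryDelay(1000))
    from(startEndpoint)
      .unmarshal.json(JsonLibrary.Jackson, classOf[Payload])
      .to("bean:myBean?method=process")
  }
}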
I've run into an issue when attempting to parse JSON in my Spark job. I'm using Spark 1.1.0, json4s, and the Cassandra Spark Connector, with DSE 4.6. The exception thrown is:
org.json4s.package$MappingException: Can't find constructor for BrowserData org.json4s.reflect.ScalaSigReader$.readConstructor(ScalaSigReader.scala:27)
org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:108)
org.json4s.reflect.Reflector$ClassDescriptorBuilder$$anonfun$6.apply(Reflector.scala:98)
org.json4s.reflect.Reflector$ClassDescriptorBuilder$$anonfun$6.apply(Reflector.scala:95)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
My code looks like this:
case class BrowserData(navigatorObjectData: Option[NavigatorObjectData],
flash_version: Option[FlashVersion],
viewport: Option[Viewport],
performanceData: Option[PerformanceData])
.... other case classes
def parseJson(b: Option[String]): Option[String] = {
implicit val formats = DefaultFormats
for {
browserDataStr <- b
browserData = parse(browserDataStr).extract[BrowserData]
navObject <- browserData.navigatorObjectData
userAgent <- navObject.userAgent
} yield (userAgent)
}
def getJavascriptUa(rows: Iterable[com.datastax.spark.connector.CassandraRow]): Option[String] = {
implicit val formats = DefaultFormats
rows.collectFirst { case r if r.getStringOption("browser_data").isDefined =>
parseJson(r.getStringOption("browser_data"))
}.flatten
}
def getRequestUa(rows: Iterable[com.datastax.spark.connector.CassandraRow]): Option[String] = {
rows.collectFirst { case r if r.getStringOption("ua").isDefined =>
r.getStringOption("ua")
}.flatten
}
def checkUa(rows: Iterable[com.datastax.spark.connector.CassandraRow], sessionId: String): Option[Boolean] = {
for {
jsUa <- getJavascriptUa(rows)
reqUa <- getRequestUa(rows)
} yield (jsUa == reqUa)
}
def run(name: String) = {
val rdd = sc.cassandraTable("beehive", name).groupBy(r => r.getString("session_id"))
val counts = rdd.map(r => (checkUa(r._2, r._1)))
counts
}
I use :load to load the file into the REPL and then call the run function. The failure is happening in the parseJson function, as far as I can tell. I've tried a variety of things to get this to work. From similar posts, I've made sure my case classes are at the top level of the file. I've tried compiling just the case class definitions into a jar and including the jar like this: /usr/bin/dse spark --jars case_classes.jar
I've tried adding them to the conf like this: sc.getConf.setJars(Seq("/home/ubuntu/case_classes.jar"))
And still the same error. Should I compile all of my code into a jar? Is this a Spark issue or a json4s issue? Any help at all is appreciated.
I have a case class named Rdv:
case class Rdv(
id: Option[Int],
nom: String,
prénom: String,
sexe: Int,
telPortable: String,
telBureau: String,
telPrivé: String,
siteRDV: String,
typeRDV: String,
libelléRDV: String,
numRDV: String,
étape: String,
dateRDV: Long,
heureRDVString: String,
statut: String,
orderId: String)
and I would like to save a list of such elements on disk, and reload them later.
I tried with Java classes (ObjectOutputStream, FileOutputStream, ObjectInputStream, FileInputStream), but I get an error in the retrieval step: the statement
val n2 = ois.readObject().asInstanceOf[List[Rdv]]
always fails with a ClassNotFoundException: Rdv, although the correct import is in place.
Do you know a workaround to save such an object?
Please provide a little piece of code!
Thanks,
Olivier
PS: I have the same error while using the Marshal class, as in this code:
object Application extends Controller {
def index = Action {
//implicit val Rdv2Writes = Json.writes[rdv2]
def rdvTordv2(rdv: Rdv): rdv2 = new rdv2(
rdv.nom,
rdv.prénom,
rdv.dateRDV,
rdv.heureRDVString,
rdv.telPortable,
rdv.telBureau,
rdv.telPrivé,
rdv.siteRDV,
rdv.typeRDV,
rdv.libelléRDV,
rdv.orderId,
rdv.statut)
val n = variables.manager.liste_locale
val out = new FileOutputStream("out")
out.write(Marshal.dump(n))
out.close
val in = new FileInputStream("out")
val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
val bar: List[Rdv] = Marshal.load[List[Rdv]](bytes) <--------------
val n3 = bar.map(rdv =>
rdvTordv2(rdv))
println("n3:" + n3.size)
Ok(views.html.Application.olivier2(n3))
}
}
The error occurs in the line marked with the arrow.
It seems that the conversion to the type List[Rdv] encounters problems, but why? Is it a problem linked to Play?
OK, there is a problem with Play:
I created a new Scala project with this code:
object Test1 extends App {
// for testing purposes
case class Person(name:String,age:Int)
val liste_locale=List(new Person("paul",18))
val n = liste_locale
val out = new FileOutputStream("out")
out.write(Marshal.dump(n))
out.close
val in = new FileInputStream("out")
val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
val bar: List[Person] = Marshal.load[List[Person]](bytes)
println(s"bar:size=${bar.size}")
}
and the output is correct ("bar:size=1").
Then I modified my previous code in the Play project, in the controller class, like this:
object Application extends Controller {
def index = Action {
// for testing purposes
case class Person(name:String,age:Int)
val liste_locale=List(new Person("paul",18))
val n = liste_locale
val out = new FileOutputStream("out")
out.write(Marshal.dump(n))
out.close
val in = new FileInputStream("out")
val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
val bar: List[Person] = Marshal.load[List[Person]](bytes)
println(s"bar:size=${bar.size}")
Ok(views.html.Application.olivier2(Nil))
}
}
and I have an error saying:
play.api.Application$$anon$1: Execution exception[[ClassNotFoundException: controllers.Application$$anonfun$index$1$Person$3]]
Does anyone have the answer?
Edit: I thought the error could come from sbt, so I modified build.scala like this:
import sbt._
import Keys._
import play.Project._
object ApplicationBuild extends Build {
val appName = "sms_play_2"
val appVersion = "1.0-SNAPSHOT"
val appDependencies = Seq(
// Add your project dependencies here,
jdbc,
anorm,
"com.typesafe.slick" % "slick_2.10" % "2.0.0",
"com.github.nscala-time" %% "nscala-time" % "0.6.0",
"org.xerial" % "sqlite-jdbc" % "3.7.2",
"org.quartz-scheduler" % "quartz" % "2.2.1",
"com.esotericsoftware.kryo" % "kryo" % "2.22",
"io.argonaut" %% "argonaut" % "6.0.2")
val mySettings = Seq(
(javaOptions in run) ++= Seq("-Dconfig.file=conf/dev.conf"))
val playCommonSettings = Seq(
Keys.fork := true)
val main = play.Project(appName, appVersion, appDependencies).settings(
Keys.fork in run := true,
resolvers += Resolver.sonatypeRepo("snapshots")).settings(mySettings: _*)
.settings(playCommonSettings: _*)
}
But without success; the error is still there (class Person not found).
Can you help me?
Scala Pickling has reasonable momentum, and the approach has many advantages (lots of the heavy lifting is done at compile time). There is a pluggable serialization mechanism, and formats like JSON are supported.
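A minimal sketch of what that could look like for a case class, assuming the scala-pickling JSON format (the exact imports and API vary between pickling versions, so treat this as an illustration rather than code verified against your setup):
import scala.pickling.Defaults._
import scala.pickling.json._

case class Person(name: String, age: Int)

val people = List(Person("paul", 18))
val pickled = people.pickle                    // a JSON pickle; pickled.value is the JSON string
val restored = pickled.unpickle[List[Person]]  // read it back as List[Person]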