FlinkCEP pattern detection doesn't happen in real time - java

I'm still new to Flink CEP library and yet I don't understand the pattern detection behavior.
Considering the example below, I have a Flink app that consumes data from a kafka topic, data is produced periodically, I want to use Flink CEP pattern to detect when a value is bigger than a given threshold.
The code is below:
public class CEPJob{
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
properties);
consumer.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
DataStream<String> stream = env.addSource(consumer);
// Process incoming data.
DataStream<Stock> inputEventStream = stream.map(new MapFunction<String, Stock>() {
private static final long serialVersionUID = -491668877013085114L;
#Override
public Stock map(String value) {
String[] data = value.split(":");
System.out.println("Date: " + data[0] + ", Adj Close: " + data[1]);
Stock stock = new Stock(data[0], Double.parseDouble(data[1]));
return stock;
}
});
// Create the pattern
Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
private static final long serialVersionUID = -6301755149429716724L;
#Override
public boolean filter(Stock value) throws Exception {
return (value.getAdj_Close() > 140.0);
}
});
// Create a pattern stream from our warning pattern
PatternStream<Stock> myPatternStream = CEP.pattern(inputEventStream, myPattern);
// Generate alert for each matched pattern
DataStream<Stock> warnings = myPatternStream .select((Map<String, List<Stock>> pattern) -> {
Stock first = pattern.get("first").get(0);
return first;
});
warnings.print();
env.execute("CEP job");
}
}
What happens when I run the job, pattern detection doesn't happen in real-time, it outputs the warning for the detected pattern of the current record only after a second record is produced, it looks like it's delayed to print to the log the warining, I really didn't understand how to make it outputs the warning the time it detect the pattern without waiting for next record and thank you :) .
Data coming from Kafka are in string format: "date:value", it produce data every 5 secs.
Java version: 1.8, Scala version: 2.11.12, Flink version: 1.12.2, Kafka version: 2.3.0

The solution I found that to send a fake record (a null object for example) in the Kafka topic every time I produce a value to the topic, and on the Flink side (in the pattern declaration) I test if the received record is fake or not.
It seems like FlinkCEP always waits for the upcoming event before it outputs the warning.

Related

How to drain the window after a Flink join using coGroup()?

I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.
This works, but there is one problem: If there is no message at all, there is always at least one record which stays in an internal buffer for good. It gets pushed out when new messages arrive. Otherwise it is stuck.
The expected behaviour is that all records should be pushed out after the configured idle timeout has been reached.
Some information which might be relevant
Flink 1.14.4
The Flink parallelism is set to 8, so is the number of partitions in both Kafka topics.
Flink checkpointing is enabled
Event-time processing is to be used
Lombok is used: So val is like final var
Some code snippets:
Relevant join settings
public static final int AUTO_WATERMARK_INTERVAL_MS = 500;
public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
Create KafkaSource
private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
val properties = KafkaConfigUtils.createConsumerConfig(config);
val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
#Override
public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
val m = JsonUtils.deserialize(record.value(), JoinRecord.class);
val copy = m.toBuilder()
.partition(record.partition())
.build();
out.collect(copy);
}
#Override
public TypeInformation<JoinRecord> getProducedType() {
return TypeInformation.of(JoinRecord.class);
}
};
return KafkaSource.<JoinRecord>builder()
.setProperties(properties)
.setBootstrapServers(config.kafkaBootstrapServers)
.setTopics(topic)
.setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
.setDeserializer(deserializationSchema)
.setStartingOffsets(OffsetsInitializer.latest())
.build();
}
Create DataStreamSource
Then the DataStreamSource is built on top of the KafkaSource:
Configure "max out of orderness"
Configure "idleness"
Extract timestamp from record, to be used for event time processing
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
StreamExecutionEnvironment env) {
val leftKafkaSource = createLeftKafkaSource(config);
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
.withIdleness(SOURCE_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);
return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}
Use keyBy
The keyed sources are created on top of the DataSource instances like this:
Again configure "out of orderness" and "idleness"
Again extract timestamp
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
.withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> {
if (VERBOSE_JOIN)
log.info("Left : " + joinRecord);
return joinRecord.timestamp.toEpochSecond() * 1000L;
});
val leftKeyedSource = leftSource
.keyBy(jr -> jr.id)
.assignTimestampsAndWatermarks(leftWms)
.name("left-keyed-source");
Join using coGroup
The join then combines the left and the right keyed sources
val joinedStream = leftKeyedSource
.coGroup(rightKeyedSource)
.where(left -> left.id)
.equalTo(right -> right.id)
.window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
.apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
#Override
public void coGroup(Iterable<JoinRecord> leftRecords,
Iterable<JoinRecord> rightRecords,
Collector<JoinRecord> out) {
// Transform
val result = ...;
out.collect(result);
}
Write stream to console
The resulting joinedStream is written to the console:
val consoleSink = new PrintSinkFunction<JoinRecord>();
joinedStream.addSink(consoleSink);
How can I configure this join operation, so that all records are pushed downstream after the configured idle timeout?
If it can't be done this way: Is there another option?
This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.
To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing time timer. Here's an implementation that uses the legacy watermark API.
On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or if you use bounded sources this will happen automatically.

Flink Kafka - how to make App run in Parallel?

I am creating a app in Flink to
Read Messages from a topic
Do some simple process on it
Write Result to a different topic
My code does work, however it does not run in parallel
How do I do that?
It seems my code runs only on one thread/block?
On the Flink Web Dashboard:
App goes to running status
But, there is only one block shown in the overview subtasks
And Bytes Received / Sent, Records Received / Sent is always zero ( no Update )
Here is my code, please assist me in learning how to split my app to be able to run in parallel, and am I writing the app correctly?
public class SimpleApp {
public static void main(String[] args) throws Exception {
// create execution environment INPUT
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment();
// event time characteristic
env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// production Ready (Does NOT Work if greater than 1)
env_in.setParallelism(Integer.parseInt(args[0].toString()));
// configure kafka consumer
Properties properties = new Properties();
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("auto.offset.reset", "earliest");
// create a kafka consumer
final DataStream<String> consumer = env_in
.addSource(new FlinkKafkaConsumer09<>("test", new
SimpleStringSchema(), properties));
// filter data
SingleOutputStreamOperator<String> result = consumer.filter(new
FilterFunction<String>(){
#Override
public boolean filter(String s) throws Exception {
return s.substring(0, 2).contentEquals("PS");
}
});
// Process Data
// Transform String Records to JSON Objects
SingleOutputStreamOperator<JSONObject> data = result.map(new
MapFunction<String, JSONObject>()
{
#Override
public JSONObject map(String value) throws Exception
{
JSONObject jsnobj = new JSONObject();
if(value.substring(0, 2).contentEquals("PS"))
{
// 1. Raw Data
jsnobj.put("Raw_Data", value.substring(0, value.length()-6));
// 2. Comment
int first_index_comment = value.indexOf("$");
int last_index_comment = value.lastIndexOf("$") + 1;
// - set comment
String comment =
value.substring(first_index_comment, last_index_comment);
comment = comment.substring(0, comment.length()-6);
jsnobj.put("Comment", comment);
}
else {
jsnobj.put("INVALID", value);
}
return jsnobj;
}
});
// Write JSON to Kafka Topic
data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
"FilteredData",
new SimpleJsonSchema()));
env_in.execute();
}
}
My code does work, but it seems to run only on a single thread
( One block shown ) in web interface ( No passing of data, hence the bytes sent / received are not updated ).
How do I make it run in parallel ?
To run your job in parallel you can do 2 things:
Increase the parallelism of your job at the env level - i.e. do something like
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);
But this would only increase parallelism at flink end after it reads the data, so if the source is producing data faster it might not be fully utilized.
To fully parallelize your job, setup multiple partitions for your kafka topic, ideally the amount of parallelism you would want with your flink job. So, you might want to do something like below when you are creating your kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test

Null Pointer Exception when using KStream to change string to avro inKafka Streams

I'm stuck on this problem and I can't figure out what's going on. I am trying to use Kafka streams to write a log to a topic. On the other end, I have Kafka-connect entering the each entry into MySQL. So, basically what I need is a Kafka streams program which reads a topic as strings and parses it into Avro format and then enters it into a different topic.
Here's the code I wrote:
//Define schema
String userSchema = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"ID\", \"type\":\"int\" },"
+ " { \"name\":\"COL_NAME_1\", \"type\":\"string\" },"
+ " { \"name\":\"COL_NAME_2\", \"type\":\"string\" }"
+ "]}";
String key = "key1";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
//Settings
System.out.println("Kafka Streams Demonstration");
//Settings
Properties settings = new Properties();
// Set a few key parameters
settings.put(StreamsConfig.APPLICATION_ID_CONFIG, APP_ID);
// Kafka bootstrap server (broker to talk to); ubuntu is the host name for my VM running Kafka, port 9092 is where the (single) broker listens
settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Apache ZooKeeper instance keeping watch over the Kafka cluster; ubuntu is the host name for my VM running Kafka, port 2181 is where the ZooKeeper listens
settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
// default serdes for serialzing and deserializing key and value from and to streams in case no specific Serde is specified
settings.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.STATE_DIR_CONFIG ,"/tmp");
// to work around exception Exception in thread "StreamThread-1" java.lang.IllegalArgumentException: Invalid timestamp -1
// at org.apache.kafka.clients.producer.ProducerRecord.<init>(ProducerRecord.java:60)
// see: https://groups.google.com/forum/#!topic/confluent-platform/5oT0GRztPBo
// Create an instance of StreamsConfig from the Properties instance
StreamsConfig config = new StreamsConfig(getProperties());
final Serde < String > stringSerde = Serdes.String();
final Serde < Long > longSerde = Serdes.Long();
final Serde<byte[]> byteArraySerde = Serdes.ByteArray();
// building Kafka Streams Model
KStreamBuilder kStreamBuilder = new KStreamBuilder();
// the source of the streaming analysis is the topic with country messages
KStream<byte[], String> instream =
kStreamBuilder.stream(byteArraySerde, stringSerde, "sqlin");
final KStream<byte[], GenericRecord> outstream = instream.mapValues(new ValueMapper<String, GenericRecord>() {
#Override
public GenericRecord apply(final String record) {
System.out.println(record);
GenericRecord avroRecord = new GenericData.Record(schema);
String[] array = record.split(" ", -1);
for (int i = 0; i < array.length; i = i + 1) {
if (i == 0)
avroRecord.put("ID", Integer.parseInt(array[0]));
if (i == 1)
avroRecord.put("COL_NAME_1", array[1]);
if (i == 2)
avroRecord.put("COL_NAME_2", array[2]);
}
System.out.println(avroRecord);
return avroRecord;
}
});
outstream.to("sqlout");
Here's the output after I get a Null Pointer Exception:
java -cp streams-examples-3.2.1-standalone.jar io.confluent.examples.streams.sql
Kafka Streams Demonstration
Start
Now started CountriesStreams Example
5 this is
{"ID": 5, "COL_NAME_1": "this", "COL_NAME_2": "is"}
Exception in thread "StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:81)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.kstream.internals.KStreamMapValues$KStreamMapProcessor.process(KStreamMapValues.java:42)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:70)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:197)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:627)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
The topic sqlin contains a few message which consists of a digit followed by two words. Note the two print lines: The function gets one message, and successfully parses it before catching a null pointer. The problem is I am new to Java, Kafka, and Avro so I'm not sure where I'm going wrong. Did I set up the Avro Schema right? Or am using kstream wrong? Any help here would be greatly appreciated.
I think the problem is the following line:
outstream.to("sqlout");
Your application is configured to, by default, use a String serde for record keys and record values:
settings.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
Since outstream has type KStream<byte[], GenericRecord>, you must provide explicit serdes when calling to():
// sth like
outstream.to(Serdes.ByteArray(), yourGenericAvroSerde, "sqlout");
FYI: The next version of Confluent Platform (ETA: this month = June
2017) will ship with a ready-to-use generic + specific Avro
serde
that integrates with Confluent schema
registry. This
should make your life easier.
See my answer at https://stackoverflow.com/a/44433098/1743580 for further details.

Why does my Kafka Consumer consume messages quickly on first run, but slows down considerably in future runs?

I am a student researching and playing around with Kafka. After following the examples on the Apache documentation, I'm playing around with the examples portion in the trunk of their current Github repo.
As of right now, the example implements an 'older' version of their Consumer and does not employ the new KafkaConsumer. Following the documentation, I have written my own version of the KafkaConsumer thinking that it would be faster.
This is a vague question, but on runthrough I produce 5000 simple messages such as "Message_CurrentMessageNumber" to a topic "test" and then use my consumer to fetch these messages and print them to stdout. When I run the example code replacing the provided consumer with the newer KafkaConsumer (v 0.8.2 and up) it works pretty quickly and comparably to the example in its first runthrough, but slows down considerably anytime after that.
I notice that my Kafka Server outputs
Rebalancing group group1 generation 3 (kafka.coordinator.ConsumerCoordinator)
or similar messages often which leads me to believe that Kafka has to do some sort of load balancing that slows stuff down but I was wondering if anyone else had insight as to what I am doing wrong.
public class AlternateConsumer extends Thread {
private final KafkaConsumer<Integer, String> consumer;
private final String topic;
private final Boolean isAsync = false;
public AlternateConsumer(String topic) {
Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("group.id", "newestGroup");
properties.put("partition.assignment.strategy", "roundrobin");
properties.put("enable.auto.commit", "true");
properties.put("auto.commit.interval.ms", "1000");
properties.put("session.timeout.ms", "30000");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.IntegerDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumer = new KafkaConsumer<Integer, String>(properties);
consumer.subscribe(topic);
this.topic = topic;
}
public void run() {
while (true) {
ConsumerRecords<Integer, String> records = consumer.poll(100);
for (ConsumerRecord<Integer, String> record : records) {
System.out.println("We received message: " + record.value() + " from topic: " + record.topic());
}
}
// ConsumerRecords<Integer, String> records = consumer.poll(0);
// for (ConsumerRecord<Integer, String> record : records) {
// System.out.println("We received message: " + record.value() + " from topic: " + record.topic());
// }
// consumer.close();
}
}
To start:
package kafka.examples;
public class KafkaConsumerProducerDemo implements KafkaProperties
{
public static void main(String[] args) {
final boolean isAsync = args.length > 0 ? !args[0].trim().toLowerCase().equals("sync") : true;
Producer producerThread = new Producer("test", isAsync);
producerThread.start();
AlternateConsumer consumerThread = new AlternateConsumer("test");
consumerThread.start();
}
}
The producer is the default producer located here: https://github.com/apache/kafka/blob/trunk/examples/src/main/java/kafka/examples/Producer.java
This should not be the case. If the setup is similar between your two consumers you should expect better result with new consumer unless there is issue in the client/consumer implementation, which seems to be the case here.
Can you share your benchmark results and the frequency of reported rebalancing and/or any pattern (i.e. sluggish once at startup, after fixed message consumption, after the queue is drained, etc) you are observing. Also if you can share some details about your consumer implementation.

What is the fastest way to bulk load data into HBase programmatically?

I have a Plain text file with possibly millions of lines which needs custom parsing and I want to load it into an HBase table as fast as possible (using Hadoop or HBase Java client).
My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At this point the line is parsed to form a Put object which is written to the context. Then, TableOutputFormat takes the Put object and inserts it to table.
This solution yields an average insertion rate of 1,000 rows per second, which is less than what I expected. My HBase setup is in pseudo distributed mode on a single server.
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
Here is the code for my current solution:
public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
protected void map(LongWritable key, Text value, Context context) throws IOException {
Map<String, String> parsedLine = parseLine(value.toString());
Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
for (String currentKey : parsedLine.keySet()) {
row.add(Bytes.toBytes(currentKey),Bytes.toBytes(currentKey),Bytes.toBytes(parsedLine.get(currentKey)));
}
try {
context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
return -1;
}
conf.set("hbase.mapred.outputtable", args[1]);
// I got these conf parameters from a presentation about Bulk Load
conf.set("hbase.hstore.blockingStoreFiles", "25");
conf.set("hbase.hregion.memstore.block.multiplier", "8");
conf.set("hbase.regionserver.handler.count", "30");
conf.set("hbase.regions.percheckin", "30");
conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");
Job job = new Job(conf);
job.setJarByClass(BulkLoadMapReduce.class);
job.setJobName(NAME);
TextInputFormat.setInputPaths(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(CustomMap.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TableOutputFormat.class);
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
Long startTime = Calendar.getInstance().getTimeInMillis();
System.out.println("Start time : " + startTime);
int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);
Long endTime = Calendar.getInstance().getTimeInMillis();
System.out.println("End time : " + endTime);
System.out.println("Duration milliseconds: " + (endTime-startTime));
System.exit(errCode);
}
I've gone through a process that is probably very similar to yours of attempting to find an efficient way to load data from an MR into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR.
Below is the basis of my code that I have to generate the job and the Mapper map function which writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records in under a minute.
Here is the (stripped down) function I wrote to generate the job for my MapReduce process to put data into HBase
private Job createCubeJob(...) {
//Build and Configure Job
Job job = new Job(conf);
job.setJobName(jobName);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(HiveToHBaseMapper.class);//Custom Mapper
job.setJarByClass(CubeBuilderDriver.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HFileOutputFormat.class);
TextInputFormat.setInputPaths(job, hiveOutputDir);
HFileOutputFormat.setOutputPath(job, cubeOutputPath);
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);
HTable hTable = new HTable(hConf, tableName);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
return job;
}
This is my map function from the HiveToHBaseMapper class (slightly edited ).
public void map(WritableComparable key, Writable val, Context context)
throws IOException, InterruptedException {
try{
Configuration config = context.getConfiguration();
String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
String column = strs[COLUMN_INDEX];
String Value = strs[VALUE_INDEX];
String sKey = generateKey(strs, config);
byte[] bKey = Bytes.toBytes(sKey);
Put put = new Put(bKey);
put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
? Bytes.toBytes(Double.MIN_VALUE)
: Bytes.toBytes(value));
ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
context.write(ibKey, put);
context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
}
catch(Exception e){
context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
}
}
I pretty sure this isn't going to be a Copy&Paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in a MR job before this one). The main thing I want to provide out of this is the HFileOutputFormat. The rest is just an example of how I used it. :)
I hope it gets you onto a solid path to a good solution. :
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
mapreduce.tasktracker.map.tasks.maximum parameter which is defaulted to 2 determines the maximum number of tasks that can run in parallel on a node. Unless changed, you should see 2 map tasks running simultaneously on each node.

Categories