Oracle to MongoDB data migration using Kafka - Java

I am trying to migrate data from Oracle to MongoDB using Kafka. I took a sample set of 10 million records with a column length of 90; each row is about 5 KB.
I divide the data across 10 threads, but one of the threads does not run every time. When I check the data, I see that 1 million records are missing in MongoDB.
main class:
int totalRec = countNoOfRecordsToBeProcessed;
int minRownum = 0;
int maxRownum = 0;
int recInThread = totalRec / 10;
System.out.println("oracle " + new Date());
for (int i = minRownum; i <= totalRec; i = i + recInThread + 1) {
    KafkaThread kth = new KafkaThread(i, i + recInThread, conn);
    Thread th = new Thread(kth);
    th.start();
}
System.out.println("oracle done+ " + new Date());
kafka producer thread class:
JSONObject obj = new JSONObject();
while (rs.next()) {
    int total_rows = rs.getMetaData().getColumnCount();
    for (int i = 0; i < total_rows; i++) {
        obj.put(rs.getMetaData().getColumnLabel(i + 1).toLowerCase(), rs.getObject(i + 1));
    }
    //System.out.println("object->" + serializedObject);
    producer.send(new ProducerRecord<String, String>("oracle_1", obj.toString()));
    obj = new JSONObject();
    //System.out.println(counter++);
}
consumer class:
KafkaConsumer consumer = new KafkaConsumer<>(props);
// subscribe to topic
consumer.subscribe(Arrays.asList(topicName));
MongoClientURI clientURI = new MongoClientURI(mongoURI);
MongoClient mongoClient = new MongoClient(clientURI);
MongoDatabase database = mongoClient.getDatabase(clientURI.getDatabase());
final MongoCollection<Document> collection = database.getCollection(clientURI.getCollection());
while (true) {
    final ConsumerRecords<Long, String> consumerRecords = consumer.poll(10000);
    if (consumerRecords.count() != 0) {
        List<InsertOneModel> list1 = new ArrayList<>();
        consumerRecords.forEach(record -> {
            //System.out.printf("Consumer Record:(%d, %s, %d, %d)\n",
            //        record.key(), record.value(), record.partition(), record.offset());
            String row = record.value();
            Document doc = Document.parse(row);
            InsertOneModel t = new InsertOneModel<>(doc);
            list1.add(t);
        });
        collection.bulkWrite((List<? extends WriteModel<? extends Document>>) (list1), new BulkWriteOptions().ordered(false));
        consumer.commitAsync();
        list1.clear();
    }
}

My advice: use the Kafka Connect JDBC source connector to pull the data in, and a Kafka Connect MongoDB sink connector to push the data out. Otherwise you are just reinventing the wheel. Kafka Connect is part of Apache Kafka; a rough sketch of the connector configuration follows the links below.
Getting started with Kafka Connect:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/blogthe-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
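As a rough illustration only (this is not from the original answer), the two connectors might be configured along the following lines. It assumes the Confluent JDBC source connector and the MongoDB sink connector are installed, and every host, table, topic, and credential below is a hypothetical placeholder:

# JDBC source (assumed: Confluent JDBC source connector installed)
name=oracle-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=db_user
connection.password=db_password
table.whitelist=MY_TABLE
mode=incrementing
incrementing.column.name=ID
topic.prefix=oracle_

# MongoDB sink (assumed: MongoDB Kafka sink connector installed)
name=mongo-sink
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
topics=oracle_MY_TABLE
connection.uri=mongodb://localhost:27017
database=mydb
collection=mycollection

Connect tracks its own source and sink offsets, so a failed task can be restarted and resume where it left off, which avoids the kind of silent data loss described in the question.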

Related

Storm KafkaSpout does not get key (only value)

Using org.apache.kafka.clients.producer.* I try to send Kafka messages to a Storm Kafka spout with key:Long and value:String.
When I check the created record before sending, the key and value are both set, but at the receiving Kafka spout only the values arrive; the key is empty or a tab.
Does anybody know of such an issue?
My producer looks like:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "KafkaDataProducer");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
The record is created and sent by:
final ProducerRecord<Long, String> record = new ProducerRecord<>(TOPIC, key, value);
RecordMetadata metadata = producer.send(record).get();
At the CLI I receive the messages with:
kafka-console-consumer.sh --topic taxi --from-beginning --property print.key=true --property key.separator=" : " --bootstrap-server kafka1:9092
EDIT1:
The Storm Kafka consumer looks like this:
Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "1");
KafkaSpoutConfig spoutConfig = KafkaSpoutConfig
        .builder("PLAINTEXT://kafka1:9092,PLAINTEXT://kafka2:9092,PLAINTEXT://kafka3:9092,", TOPIC)
        .setProp(props)
        .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.EARLIEST)
        .setProcessingGuarantee(KafkaSpoutConfig.ProcessingGuarantee.AT_MOST_ONCE)
        .setOffsetCommitPeriodMs(100)
        .build();
builder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1);
EDIT2:
The producer's data flow is as follows. Select from the database and add the rows to an ArrayList:
private static ArrayList<Pair<Long, String>> selectData(String start, String end) {
    Statement statement = null;
    ResultSet resultSet;
    ArrayList<Pair<Long, String>> results = new ArrayList<>();
    try {
        if (conn != null) {
            statement = conn.createStatement();
        }
        resultSet = statement.executeQuery("SELECT data1, data2, data3, data4 FROM " +
                "tdrive " +
                "where date_time between '" + start + "' and '" + end + "' order by date_time asc;");
        while (resultSet.next()) {
            int id = resultSet.getInt("data1");
            String result = "";
            result += id;
            result += ";";
            result += resultSet.getTimestamp("data2");
            result += ";";
            result += resultSet.getDouble("data3");
            result += ";";
            result += resultSet.getDouble("data4");
            results.add(new Pair<>((long) id, result));
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return results;
}
After the data are stored in the ArrayList, all of it is sent via Kafka:
for (Pair data : selectData(covertTime(selectStartTime), covertTime(selectEndTime))) {
    String result = (String) data.getValue1();
    produceMessage((Long) data.getValue0(), result);
}
producer.flush();
produceMessage is like:
private static void produceMessage(long key, String value) {
    long time = System.currentTimeMillis();
    try {
        final ProducerRecord<Long, String> record = new ProducerRecord<>(TOPIC, key, value);
        RecordMetadata metadata = producer.send(record).get();
        long elapsedTime = System.currentTimeMillis() - time;
        System.out.printf("sent record(key=%s value=%s) " +
                "meta(partition=%d, offset=%d) time=%d\n", // key:id, value:"id;timestamp;long;lat"
                record.key(), record.value(), metadata.partition(),
                metadata.offset(), elapsedTime);
    } catch (Exception e) {
        System.err.println(e);
    }
}
I hope that after EDIT2 there is not too much code.
Thank you in advance.

Retrieve last n messages of Kafka consumer from a particular topic

Kafka version: 0.9.0.1
If n = 20, I have to get the last 20 messages of a topic.
I tried
kafkaConsumer.seekToBeginning();
but it retrieves all the messages. I need only the last 20, and the topic may have hundreds of thousands of records.
public List<JSONObject> consumeMessages(String kafkaTopicName) {
    KafkaConsumer<String, String> kafkaConsumer = null;
    boolean flag = true;
    List<JSONObject> messagesFromKafka = new ArrayList<>();
    int recordCount = 0;
    int i = 0;
    int maxMessagesToReturn = 20;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "project.group.id");
    props.put("max.partition.fetch.bytes", "1048576000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    kafkaConsumer = new KafkaConsumer<>(props);
    kafkaConsumer.subscribe(Arrays.asList(kafkaTopicName));
    TopicPartition topicPartition = new TopicPartition(kafkaTopicName, 0);
    LOGGER.info("Subscribed to topic " + kafkaConsumer.listTopics());

    while (flag) {
        // will consume all the messages and store in records
        ConsumerRecords<String, String> records = kafkaConsumer.poll(1000);
        kafkaConsumer.seekToBeginning(topicPartition);
        // getting total records count
        recordCount = records.count();
        LOGGER.info("recordCount " + recordCount);
        for (ConsumerRecord<String, String> record : records) {
            if (record.value() != null) {
                if (i >= recordCount - maxMessagesToReturn) {
                    // adding last 20 messages to messagesFromKafka
                    LOGGER.info("kafkaMessage " + record.value());
                    messagesFromKafka.add(new JSONObject(record.value()));
                }
                i++;
            }
        }
        if (recordCount > 0) {
            flag = false;
        }
    }
    kafkaConsumer.close();
    return messagesFromKafka;
}
You can use kafkaConsumer.seekToEnd(Collection<TopicPartition> partitions) to seek to the last offset of the given partition(s). As per the documentation:
"Seek to the last offset for each of the given partitions. This function evaluates lazily, seeking to the final offset in all partitions only when poll(Duration) or position(TopicPartition) are called. If no partitions are provided, seek to the final offset for all of the currently assigned partitions."
Then you can retrieve the position of a particular partition using position(TopicPartition partition).
Then you can subtract 20 from it and use kafkaConsumer.seek(TopicPartition partition, long offset) to get the most recent 20 messages.
Simply,
kafkaConsumer.seekToEnd(partitionList);
long endPosition = kafkaConsumer.position(topicPartiton);
long recentMessagesStartPosition = endPosition - maxMessagesToReturn;
kafkaConsumer.seek(topicPartition, recentMessagesStartPosition);
Now you can retrieve the most recent 20 messages using poll().
This is the simple logic; if you have multiple partitions, you have to handle those cases as well. I did not try this, but I hope you get the concept. A rough sketch of the whole flow is below.
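A minimal sketch of that idea, assuming a recent Java client where seekToEnd takes a Collection and poll takes a Duration (the 0.9 client in the question uses slightly different signatures), a single-partition topic, and hypothetical names throughout:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LastNConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "last-n-reader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        int n = 20;
        TopicPartition partition = new TopicPartition("my-topic", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() instead of subscribe() so we control the position ourselves
            consumer.assign(Collections.singletonList(partition));

            // find the end offset, then step back n messages (not below 0)
            consumer.seekToEnd(Collections.singletonList(partition));
            long endOffset = consumer.position(partition);
            long startOffset = Math.max(0, endOffset - n);
            consumer.seek(partition, startOffset);

            // assumes a plain topic where every offset in the range holds a record
            long read = 0;
            while (read < endOffset - startOffset) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.offset() + " -> " + record.value());
                    read++;
                }
            }
        }
    }
}

For multiple partitions you would repeat the seekToEnd/position/seek step per partition and merge the results.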

How to improve performance when using OrientDB in a multi-threaded way?

I am doing something wrong when using OrientDB from multiple threads.
There are 20k records in total in the database and I want to get the top 200 records per thread.
If I use one thread at a time I get the result in 0.5 s, but when I use 10 threads at a time it takes 5 s to get all the results. More threads cost more time; 50 threads take 50 s. That is too slow for an API response.
How can I improve the performance of OrientDB here?
I have already read the OrientDB documentation on performance tuning and tried updating the network connection pool parameters, but it did not help.
The OrientDB version is 2.2.37, running as a single instance.
Here is a code sample that just reads records:
public class Test3 {
    public static void main(String[] args) {
        try {
            OServerAdmin serverAdmin = new OServerAdmin("remote:localhost").connect("root", "root");
            if (!serverAdmin.existsDatabase("metadata", "plocal")) {
                serverAdmin.createDatabase("metadata", "graph", "plocal");
            }
            serverAdmin.close();
        } catch (IOException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }
        OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/metadata", "root", "root");
        factory.setAutoStartTx(false);
        factory.setProperty("minPool", 5);
        factory.setProperty("maxPool", 50);
        factory.setupPool(5, 50);
        int threadCount = 5;
        for (int i = 0; i < threadCount; i++) {
            new Thread(() -> {
                long start = System.currentTimeMillis();
                OrientGraph orientGraph = factory.getTx();
                String sql = "select * from mytable skip 0 limit 100";
                Iterable<Vertex> vertices = orientGraph.command(new OCommandSQL(sql.toString())).execute();
                System.out.println(Thread.currentThread().getName() + "===" + "execute sql cost:" + (System.currentTimeMillis() - start));
            }).start();
        }
    }
}

How to write a Kafka consumer client in Java to consume messages from multiple brokers?

I was looking for a Java client (Kafka consumer) to consume messages from multiple brokers. Please advise.
Below is the code written to publish messages to multiple brokers using a simple partitioner.
The topic is created with a replication factor of 2 and 3 partitions.
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    logger.info("Number of Partitions " + numPartitions);
    if (keyBytes == null) {
        int nextValue = counter.getAndIncrement();
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = toPositive(nextValue) % availablePartitions.size();
            int selectedPartition = availablePartitions.get(part).partition();
            logger.info("Selected partition is " + selectedPartition);
            return selectedPartition;
        } else {
            // no partitions are available, give a non-available partition
            return toPositive(nextValue) % numPartitions;
        }
    } else {
        // hash the keyBytes to choose a partition
        return toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
public void publishMessage(String message, String topic) {
    Producer<String, String> producer = null;
    try {
        producer = new KafkaProducer<>(producerConfigs());
        logger.info("Topic to publish the message --" + this.topic);
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<String, String>(this.topic, message));
            logger.info("Message Published Successfully");
        }
    } catch (Exception e) {
        logger.error("Exception Occured " + e.getMessage());
    } finally {
        producer.close();
    }
}
public Map<String, Object> producerConfigs() {
    loadPropertyFile();
    Map<String, Object> propsMap = new HashMap<>();
    propsMap.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
    propsMap.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    propsMap.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    propsMap.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, SimplePartitioner.class);
    propsMap.put(ProducerConfig.ACKS_CONFIG, "1");
    return propsMap;
}
public Map<String, Object> consumerConfigs() {
    Map<String, Object> propsMap = new HashMap<>();
    System.out.println("properties.getBootstrap()" + properties.getBootstrap());
    propsMap.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, properties.getBootstrap());
    propsMap.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    propsMap.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, properties.getAutocommit());
    propsMap.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, properties.getTimeout());
    propsMap.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    propsMap.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    propsMap.put(ConsumerConfig.GROUP_ID_CONFIG, properties.getGroupid());
    propsMap.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, properties.getAutooffset());
    return propsMap;
}
@KafkaListener(id = "ID1", topics = "${config.topic}", group = "${config.groupid}")
public void listen(ConsumerRecord<?, ?> record) {
    logger.info("Message Consumed " + record);
    logger.info("Partition From which Record is Received " + record.partition());
    this.message = record.value().toString();
}
bootstrap.servers = [localhost:9092, localhost:9093, localhost:9094]
If you use a regular Java consumer, it will automatically read from multiple brokers. There is no special code you need to write. Just subscribe to the topic(s) you want to consume and the consumer will connect to the corresponding brokers automatically. You only provide a "single entry point" broker -- the client figures out all the other brokers of the cluster automatically.
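As a rough illustration (this sketch is not part of the original answer), a plain consumer against the three local brokers from the question could look like the following; the group id and topic name are hypothetical placeholders:

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiBrokerConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // one broker would suffice for bootstrapping; listing several only adds redundancy
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("group.id", "my-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("my-topic"));
            while (true) {
                // the consumer fetches from whichever brokers lead the topic's partitions
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}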
The number of Kafka broker nodes in the cluster has nothing to do with consumer logic. Nodes in the cluster are only used for fault tolerance and the bootstrap process. Placing messages in different partitions of the topic based on some custom logic also does not affect consumer logic. Even if you have a single consumer, that consumer will consume messages from all partitions of the subscribed topic. I suggest you check your code against a Kafka cluster with a single broker node...

DynamoDB pagination query in Java

I am new to DynamoDB and have to implement pagination: I need to show ten records per page in my HTML page. Can anyone share a sample query for pagination in DynamoDB? I have studied the Amazon DynamoDB tutorial but did not get the idea.
Can I implement pagination using the high-level and low-level APIs? Can anyone suggest where to start?
As yegor256 suggested, you could use query(QueryRequest) or scan(ScanRequest) with setExclusiveStartKey instead. Here's a code snippet of how to do it:
HashMap<String, Condition> scanFilter = new HashMap<String, Condition>();
Condition condition = new Condition()
        .withComparisonOperator(ComparisonOperator.LT.toString())
        .withAttributeValueList(new AttributeValue().withN("100"));
scanFilter.put("column1", condition);
Boolean lastEval = true;
int count = 0;
ScanRequest scanRequest = new ScanRequest(tableName).withScanFilter(scanFilter);
while (lastEval) {
    ScanResult scanResult = dynamoDB.scan(scanRequest);
    count += scanResult.getCount();
    System.out.println("Page Size: " + scanResult.getCount());
    System.out.println("Total count = " + count);
    if (scanResult.getLastEvaluatedKey() != null)
        lastEval = scanResult.getLastEvaluatedKey().isEmpty() == false;
    else
        lastEval = false;
    if (lastEval) {
        scanRequest.setExclusiveStartKey(scanResult.getLastEvaluatedKey());
    }
}
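If the goal is ten records per HTML page, a variation of the same idea (a sketch only, reusing the dynamoDB client and AWS SDK classes from the snippet above; the table name is a hypothetical placeholder) is to cap each page with withLimit(10) and treat the LastEvaluatedKey as the cursor for the next page:

// first page: no exclusive start key
ScanRequest pageRequest = new ScanRequest("myTable").withLimit(10);

// for subsequent pages, pass back the previous page's LastEvaluatedKey:
// if (previousLastKey != null && !previousLastKey.isEmpty()) {
//     pageRequest.setExclusiveStartKey(previousLastKey);
// }

ScanResult page = dynamoDB.scan(pageRequest);
for (Map<String, AttributeValue> item : page.getItems()) {
    System.out.println(item); // render one row of the HTML page
}
// store page.getLastEvaluatedKey() as the cursor for the "next" link;
// a null or empty map means there are no more pages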
You should use query(QueryRequest) or scan(ScanRequest) with addExclusiveStartKeyEntry().
Also, check this library: jcabi-dynamo
