I'm working with Kafka, and I made a producer like this:
synchronized (obj) {
    while (true) {
        long start = Instant.now().toEpochMilli();
        for (int i = 0; i < NUM_MSG_SEC; i++) {
            PriceStreamingData data = PriceStreamingData.newBuilder()
                    .setUser(getRequest().getUser())
                    .setSecurity(getRequest().getSecurity())
                    .setTimestamp(Instant.now().toEpochMilli())
                    .setPrice(new Random().nextDouble() * 200)
                    .build();
            record = new ProducerRecord<>(topic, keyBuilder.build(data), data);
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata arg0, Exception arg1) {
                    counter.incrementAndGet();
                    if (arg1 != null) {
                        arg1.printStackTrace();
                    }
                }
            });
        }
        long diffCiclo = Instant.now().toEpochMilli() - start;
        long diff = Instant.now().toEpochMilli() - startTime;
        System.out.println("Number of sent: " + counter.get() +
                " Millisecond: " + diff + " - NumberOfSent/Diff(K): " + counter.get() / diff);
        try {
            if (diffCiclo >= 1000) {
                System.out.println("over 1 second: " + diffCiclo);
            } else {
                obj.wait(1000 - diffCiclo);
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
As you can see, it is extremely simple: it just builds a new message and sends it.
If I look at the logs for
NumberOfSent/Diff(K)
in the first 10 seconds it performs very badly, just
30k per second.
After 60 seconds I get
180k per second.
Why? And how can I make the process start at 180k right away?
My Kafka producer configuration is the following (async producer, but the situation does not change with a sync producer either):
ACKS_CONFIG = 0
BATCH_SIZE_CONFIG = 20000
COMPRESSION_TYPE_CONFIG = none
LINGER_MS_CONFIG = 0
One last detail:
NUM_MSG_SEC is set to 200000 or a bigger number.
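For reference, the configuration above corresponds roughly to the following properties (a sketch; the bootstrap address and serializer setup are assumptions, not part of my actual code):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
props.put(ProducerConfig.ACKS_CONFIG, "0");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 20000);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
props.put(ProducerConfig.LINGER_MS_CONFIG, 0);
// key/value serializers depend on the schema setup and are omitted here
KafkaProducer<Object, PriceStreamingData> producer = new KafkaProducer<>(props);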
I found the solution by myself and I hope this post can be useful for other people too.
The problem lies in
ProducerConfig.BATCH_SIZE_CONFIG
and
ProducerConfig.LINGER_MS_CONFIG
My parameters were 20000 and 0; to fix the issue I set them to the higher values 200000 and 1000. Finally, I started the JVM with the parameters:
-XX:MinMetaspaceFreeRatio=100
-XX:MaxMetaspaceFreeRatio=100
because I saw that it takes a while for the metaspace to grow to a decent size.
Now the producer starts directly at 140k and within 1 second it is already at 180k.
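In code form, the fix boils down to these two settings (a sketch of just the changed lines):

props.put(ProducerConfig.BATCH_SIZE_CONFIG, 200000); // was 20000
props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);    // was 0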
I have state which holds data for 2 minutes; sometimes processElement still emits the record even though there is state present for that key.
@Override
public void processElement(EngagerEvents value, KeyedProcessFunction<String, EngagerEvents, String>.Context ctx, Collector<String> out) throws Exception {
    if (anonymousIdHasBeenSeen.value() == null) {
        System.out.println("time stamp emitting: " + jsonNode.get("server_timestamp"));
        // key is not available in the state
        anonymousIdHasBeenSeen.update(true);
        System.out.println("TIMER START TIME: " + ctx.timestamp());
        out.collect(value.getEventString());
        ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + (stateTtl * 1000));
    }
}
TIMER TRIGGER
-------------
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
        throws Exception {
    // triggers after the TTL has passed
    System.out.println("Call back triggered : time : " + timestamp + " value : " + anonymousIdHasBeenSeen.value());
    if (anonymousIdHasBeenSeen.value()) {
        anonymousIdHasBeenSeen.clear();
    }
}
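For completeness, the snippets above assume a ValueState declared in the function's open() method, roughly like this (the descriptor name is an assumption):

private transient ValueState<Boolean> anonymousIdHasBeenSeen;

@Override
public void open(Configuration parameters) {
    ValueStateDescriptor<Boolean> descriptor =
            new ValueStateDescriptor<>("anonymousIdHasBeenSeen", Boolean.class); // name assumed
    anonymousIdHasBeenSeen = getRuntimeContext().getState(descriptor);
}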
My simulator code, which produces data to Kafka (I'm pumping anonymousId 111 five times):
public static void main(String[] args) {
    int n = 1; // number of threads
    for (int i = 0; i < n; i++) {
        Thread thread = new Thread(new ExecutorThread());
        thread.start();
    }
    for (int i = 0; i < 5; i++) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("topic", key, "{\"anonymousId\": \"111\", \"device\": \"ios\"}");
        try {
            producer.send(record);
            Thread.sleep(500);
        } catch (SerializationException | InterruptedException e) {
            // may need to do something with it
        }
    }
}
I'm keying by anonymousId. In my case there is only one anonymousId, 111. The callback trigger is 60 seconds:
DataStream<String> keyedStream = mappedEngagerEventsDataStream
        .keyBy(EngagerEvents::getAnonymousId)
        .process(new KeyedProcessingWithCallBack(60L))
        .uid("engager-events-keyed-processing");
Am I doing something wrong here? I tried debugging with breakpoints: even when control does not enter the if block (the printlns inside it are not printed), I still see that particular event emitted.
Why is the event emitted even though out.collect is inside the if statement? Can someone please point out what I am doing wrong?
I am getting events from Kafka, enriching/filtering/transforming them in Spark, and then storing them in ES. I am committing the offsets back to Kafka.
I have two questions/problems:
(1) My current Spark job is VERY slow
I have 50 partitions for a topic and 20 executors. Each executor has 2 cores and 4g of memory. My driver has 8g of memory. I am consuming 1000 events/partition/second and my batch interval is 10 seconds. This means I am consuming 500,000 events per 10-second batch.
My ES cluster is as follows:
20 shards / index
3 master instances c5.xlarge.elasticsearch
12 instances m4.xlarge.elasticsearch
disk / node = 1024 GB so 12 TB in total
And I am getting huge scheduling and processing delays.
(2) How can I commit offsets on executors?
Currently, I enrich/transform/filter my events on executors and then send everything to ES using BulkRequest. It's a synchronous process. If I get positive feedback, I send the offset list to the driver; if not, I send back an empty list. On the driver, I commit offsets to Kafka. I believe there should be a way to commit offsets on the executors, but I don't know how to pass the Kafka stream to the executors:
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges, this::onComplete);
This is the code for committing offsets to Kafka, and it requires the Kafka stream.
Here is my overall code:
kafkaStream.foreachRDD( // kafka topic
rdd -> { // runs on driver
rdd.cache();
String batchIdentifier =
Long.toHexString(Double.doubleToLongBits(Math.random()));
LOGGER.info("## [" + batchIdentifier + "] Starting batch ...");
Instant batchStart = Instant.now();
List<OffsetRange> offsetsToCommit =
rdd.mapPartitionsWithIndex( // kafka partition
(index, eventsIterator) -> { // runs on worker
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
LOGGER.info(
"## Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
);
if (!eventsIterator.hasNext()) {
return Collections.emptyIterator();
}
// get single ES documents
List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);
// build request wrappers
List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);
LOGGER.info(
"## Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
);
BulkResponse bulkItemResponses = elasticSearchRepository.addElasticSearchDocumentsSync(requestWrapperList);
if (!bulkItemResponses.hasFailures()) {
return Arrays.asList(offsetRanges).iterator();
}
elasticSearchRepository.close();
return Collections.emptyIterator();
},
true
).collect();
LOGGER.info(
"## [" + batchIdentifier + "] Collected all offsets in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
OffsetRange[] offsets = new OffsetRange[offsetsToCommit.size()];
for (int i = 0; i < offsets.length ; i++) {
offsets[i] = offsetsToCommit.get(i);
}
try {
offsetManagementMapper.commit(offsets);
} catch (Exception e) {
// ignore
}
LOGGER.info(
"## [" + batchIdentifier + "] Finished batch of " + offsetsToCommit.size() + " messages " +
"in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
rdd.unpersist();
});
You can move the offset logic above the RDD loop. I am using the template below for better offset handling and performance:
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
kafkaStream.foreachRDD( kafkaStreamRDD -> {
// fetch Kafka offsets for manually committing them later
OffsetRange[] offsetRanges = ((HasOffsetRanges) kafkaStreamRDD.rdd()).offsetRanges();
//filter unwanted data
kafkaStreamRDD.filter(
new Function<ConsumerRecord<String, String>, Boolean>() {
@Override
public Boolean call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
if(kafkaRecord!=null) {
if(!StringUtils.isAnyBlank(kafkaRecord.key() , kafkaRecord.value())) {
return Boolean.TRUE;
}
}
return Boolean.FALSE;
}
}).foreachPartition( kafkaRecords -> {
// init connections here
while(kafkaRecords.hasNext()) {
ConsumerRecord<String, String> kafkaConsumerRecord = kafkaRecords.next();
// work here
}
});
//commit offsets
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
I have several data sources that I want to query in parallel (each request is an HTTP call and may be quite time-consuming), but I am going to use only one response from these requests. So I prioritize them: if the first response is invalid I check the second one, if that is also invalid I use the third, and so on.
However, I want to stop processing and return the result as soon as I receive the first correct response.
To simulate the problem I created the following code, where I'm trying to use Java parallel streams. The problem is that I receive the final result only after processing all requests.
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class ParallelExecution {
private static Supplier<Optional<Integer>> testMethod(String strInt) {
return () -> {
Optional<Integer> result = Optional.empty();
try {
result = Optional.of(Integer.valueOf(strInt));
System.out.printf("converted string %s to int %d\n",
strInt,
result.orElse(null));
} catch (NumberFormatException ex) {
System.out.printf("CANNOT CONVERT %s to int\n", strInt);
}
try {
int randomValue = result.orElse(10000);
TimeUnit.MILLISECONDS.sleep(randomValue);
System.out.printf("converted string %s to int %d in %d milliseconds\n",
strInt,
result.orElse(null), randomValue);
} catch (InterruptedException e) {
e.printStackTrace();
}
return result;
};
}
public static void main(String[] args) {
Instant start = Instant.now();
System.out.println("Starting program: " + start.toString());
List<Supplier<Optional<Integer>>> listOfFunctions = new ArrayList<>();
for (String arg: args) {
listOfFunctions.add(testMethod(arg));
}
Integer value = listOfFunctions.parallelStream()
.map(function -> function.get())
.filter(optValue -> optValue.isPresent()).map(val-> {
System.out.println("************** VAL: " + val);
return val;
}).findFirst().orElse(null).get();
Instant end = Instant.now();
Long diff = end.toEpochMilli() - start.toEpochMilli();
System.out.println("final value:" + value + ", worked during " + diff + "ms");
}
}
So when I execute the program using the following command:
$java ParallelExecution dfafj 34 1341 4656 dfad 245df 5767
I want to get the result "34" as soon as possible (after around 34 milliseconds), but in fact I am waiting for more than 10 seconds.
Could you help me find the most efficient solution to this problem?
ExecutorService#invokeAny looks like a good option.
List<Callable<Optional<Integer>>> tasks = listOfFunctions
.stream()
.<Callable<Optional<Integer>>>map(f -> f::get)
.collect(Collectors.toList());
ExecutorService service = Executors.newCachedThreadPool();
Optional<Integer> value = service.invokeAny(tasks);
service.shutdown();
I converted your List<Supplier<Optional<Integer>>> into a List<Callable<Optional<Integer>>> so it can be passed to invokeAny (you could also build Callables in the first place). Then I created an ExecutorService and submitted the tasks.
The result of the first task to complete successfully is returned as soon as it is available; the remaining tasks end up interrupted.
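Note that invokeAny declares checked exceptions, so a complete call site looks roughly like this (a sketch reusing the names above):

ExecutorService service = Executors.newCachedThreadPool();
try {
    Optional<Integer> value = service.invokeAny(tasks);
    System.out.println("first result: " + value.orElse(null));
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
} catch (ExecutionException e) {
    e.getCause().printStackTrace();     // every task failed
} finally {
    service.shutdownNow();
}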
You may also want to look into CompletionService.
List<Callable<Optional<Integer>>> tasks = Arrays
.stream(args)
.<Callable<Optional<Integer>>>map(arg -> () -> testMethod(arg).get())
.collect(Collectors.toList());
final ExecutorService underlyingService = Executors.newCachedThreadPool();
final ExecutorCompletionService<Optional<Integer>> service = new ExecutorCompletionService<>(underlyingService);
tasks.forEach(service::submit);
Optional<Integer> value = service.take().get();
underlyingService.shutdownNow();
You can use a queue to put your results in:
private static void testMethod(String strInt, BlockingQueue<Integer> queue) {
// your code, but instead of returning anything:
result.ifPresent(queue::add);
}
and then call it with
BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
for (String s : args) {
    CompletableFuture.runAsync(() -> testMethod(s, queue));
}
Integer result = queue.take();
Note that this will only handle the first result, as in your sample; the remaining tasks keep running on the common pool until they finish on their own.
I have tried it using CompletableFuture and the anyOf method, which returns when any one of the futures completes. The key to stopping the other tasks is to provide your own executor service to the CompletableFutures and to shut it down when required.
public static void main(String[] args) {
    Instant start = Instant.now();
    System.out.println("Starting program: " + start.toString());
    CompletableFuture<Optional<Integer>>[] completableFutures = new CompletableFuture[args.length];
    ExecutorService es = Executors.newFixedThreadPool(args.length, r -> {
        Thread t = new Thread(r);
        t.setDaemon(false);
        return t;
    });
    for (int i = 0; i < args.length; i++) {
        completableFutures[i] = CompletableFuture.supplyAsync(testMethod(args[i]), es);
    }
    CompletableFuture.anyOf(completableFutures)
            .thenAccept(res -> {
                System.out.println("Result - " + res + ", Time Taken : " + (Instant.now().toEpochMilli() - start.toEpochMilli()));
                es.shutdownNow();
            });
}
PS: It will throw InterruptedExceptions, which you can catch and ignore rather than printing the stack trace. Also, your thread pool size should ideally equal the length of the args array.
Something goes wrong when I use OrientDB from multiple threads.
There are 20k records in total in the database, and I want to get the top 200 records per thread.
If I use one thread at a time I get the result in 0.5 sec, but when I use 10 threads at a time I get all the results in 5 sec.
More threads cost more time: 50 threads take 50 sec. That is too much time for an API reply.
How can I improve the performance of OrientDB here?
I have already read the OrientDB documentation on performance tuning. I tried updating the parameters for the network connection pool, but it was useless.
The OrientDB version is 2.2.37, single instance.
Here is a code sample that just reads records:
public class Test3 {
    public static void main(String[] args) {
        try {
            OServerAdmin serverAdmin = new OServerAdmin("remote:localhost").connect("root", "root");
            if (!serverAdmin.existsDatabase("metadata", "plocal")) {
                serverAdmin.createDatabase("metadata", "graph", "plocal");
            }
            serverAdmin.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/metadata", "root", "root");
        factory.setAutoStartTx(false);
        factory.setProperty("minPool", 5);
        factory.setProperty("maxPool", 50);
        factory.setupPool(5, 50);
        int threadCount = 5;
        for (int i = 0; i < threadCount; i++) {
            new Thread(() -> {
                long start = System.currentTimeMillis();
                OrientGraph orientGraph = factory.getTx();
                String sql = "select * from mytable skip 0 limit 100";
                Iterable<Vertex> vertices = orientGraph.command(new OCommandSQL(sql)).execute();
                System.out.println(Thread.currentThread().getName() + "===" + "execute sql cost:" + (System.currentTimeMillis() - start));
                orientGraph.shutdown(); // return the graph instance to the pool
            }).start();
        }
    }
}
I am writing a UI that starts a SwingWorker to call some outside library functions, specifically from the neuroph library, to simulate neural networks. In the SwingWorker I either generate a population of Genomes or I run some population through a genetic algorithm to find the best Genomes.
The worker generates an initial population and returns fast enough that I can't tell whether the calls to SwingWorker.process complete before the SwingWorker calls SwingWorker.done. However, running the population through the genetic algorithm causes the UI to freeze until it has completed (currently preventing me from testing any further). No .process messages reach the UI while the genetic algorithm logic runs, until it completes.
I also noticed that the library writes to standard output for each LearningEvent generated by the instantiated neural network. So when the SwingWorker is processing the population of neural networks, "tons" (3 lines per network learning and testing) of output is generated. Could this be causing the backlog of .process calls to the UI?
Is there a way to force a SwingWorker to wait until all of its .process messages have been sent and received by the UI?
Here is a code sample of the SwingWorker
public class MLPEnvironment extends SwingWorker<Boolean, String>
{
int gensRan = 0;
boolean usingGA;
DataSet envData;
MainView mainView;
LinkedList<Genome> population;
EnvironmentParameters envParms;
public MLPEnvironment(MainView inView, EnvironmentParameters inParms, LinkedList<Genome> inPop, DataSet inData)
{
envData = inData;
mainView = inView;
envParms = inParms;
population = inPop;
usingGA = envParms.evolveAtleastOneParameter();
}
// Main logic of worker
@Override
protected Boolean doInBackground() throws Exception
{
Boolean retVal = Boolean.TRUE;
// Generate a initial population if this flag is set
if(envParms.m_bGenerateInitPop)
{
newStatus("> Generating initial population...");
generateInitialPopulation();
}
// If we are not just generating a population, but running the GA
if(!envParms.m_bOnlyGenInitPop)
{
newStatus("> Running evolution on population...");
startBigBang();
newStatus("- Number of generations ran: " + gensRan);
}
// Otherwise just push the initial population to the UI for the user to see
else
{
newStatus("> Pushing population to UI...");
newStatus("ClearTable");
for(int i = 0; i < population.size(); i++)
{
Genome curGen = population.get(i);
String layerWidths = "";
for(int j = 0; j < curGen.getLayerWidths().size(); j++)
{
layerWidths += curGen.getLayerWidths().get(j).toString();
if(j != curGen.getLayerWidths().size()-1)
layerWidths += "-";
}
newStatus("NewRow" + GenomeFitnessResults.getResultsCSV(curGen) + curGen.getTFType() + "," + curGen.getLayerWidths().size() + "," + layerWidths + ",");
}
newStatus("- Done displaying initial population");
}
newStatus("Environment worker thread finished");
return retVal;
}
// Generate the initial population
private void generateInitialPopulation()
{
newStatus(" Initial population size: " + envParms.m_iInitPopSize);
newStatus(" DataInSize: " + envData.getInputSize() + " DataOutSize: " + envData.getOutputSize());
newStatus(" Trans: " + envParms.m_bEvolveTransferFunction + " Count: " + envParms.m_bEvolveHiddenLayerCount + " Widths: " + envParms.m_bEvolveHiddenLayerWidth);
for(int i = 0; i < envParms.m_iInitPopSize; i++)
{
population.add(Genome.getGenomeFromParms(envParms));
}
newStatus("- Finished generating initial population");
}
// The start of the GA, the beginning of the network's "universe"
private void startBigBang()
{
newStatus(" Using genetic algorithm: " + usingGA);
newStatus(" Evaluating initial population...");
population = Genome.evaluate(population, envData, envParms);
newStatus(" Done evaluating initial population");
if(usingGA)
{
newStatus(" > Starting genetic algorithm...");
for(int i = 0; i < envParms.m_iNumGenerations; i++)
{
gensRan++;
newStatus(" Generation: " + gensRan);
population = Genome.select(population, envParms);
population = Genome.crossOver(population, envParms);
population = Genome.mutate(population, envParms);
population = Genome.evaluate(population, envData, envParms);
}
newStatus(" - Genetic algorithm terminated");
}
newStatus("- Done running algorithm");
}
// Clean-up and closure after main process
@Override
protected void done()
{
try
{
final Boolean retVal = get();
mainView.environmentRunComplete(retVal, population);
}
catch (InterruptedException ex)
{
// Not sure who I can tell...
System.out.println("DC: InterruptedException");
mainView.environmentRunComplete(Boolean.FALSE, null);
}
catch (ExecutionException ex)
{
// Not sure who I can tell...
System.out.println("DC: ExecutionException");
mainView.environmentRunComplete(Boolean.FALSE, null);
}
}
// These are used to write updates to the main view
private void newStatus(String arg)
{
publish(arg);
}
@Override
protected void process(List<String> list)
{
list.stream().forEach((line) -> { mainView.newStatusLine(line); });
}
}
EDIT: To put it another way.
I understand that
publish("a");
publish("b", "c");
publish("d", "e", "f");
might actually result in
process("a", "b", "c", "d", "e", "f")
being called. Is there any defined interval at which the process "batches" go to the UI? When I start the SwingWorker with a button click, the UI becomes unresponsive; the library prints its output lines, and then, once all of the SwingWorker's computation is done, I see all of the calls to newStatus appear in the UI at once.
So I know the worker is doing some intense work, but why are all the newStatus calls made over the few seconds of work batched into what looks like a single publish after the work completes? Shouldn't some publish calls reach the UI before and during the intensive task?
If anything, shouldn't the UI remain responsive, given that none of the messages are being shown while the SwingWorker is working?
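For reference, here is a minimal self-contained sketch (not your code) of the intended division of labor: doInBackground publishes from the worker thread, process runs on the EDT, and the UI stays responsive as long as the button handler only starts the worker and returns immediately:

import java.awt.BorderLayout;
import java.util.List;
import javax.swing.*;

public class WorkerDemo {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("SwingWorker demo");
            JTextArea log = new JTextArea(10, 40);
            JButton start = new JButton("start");
            start.addActionListener(e -> {
                // Only start the worker here; never call get() or loop on the EDT.
                new SwingWorker<Void, String>() {
                    @Override
                    protected Void doInBackground() throws Exception {
                        for (int i = 0; i < 5; i++) {
                            Thread.sleep(500);    // simulated work, off the EDT
                            publish("step " + i); // several calls may be coalesced into one process()
                        }
                        return null;
                    }
                    @Override
                    protected void process(List<String> chunks) {
                        // runs on the EDT; chunks holds everything published since the last call
                        chunks.forEach(line -> log.append(line + "\n"));
                    }
                }.execute();
            });
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(new JScrollPane(log), BorderLayout.CENTER);
            frame.add(start, BorderLayout.SOUTH);
            frame.pack();
            frame.setVisible(true);
        });
    }
}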