elasticsearch java bulk batch size

elasticsearch java bulk batch size - java

I want to use the elasticsearch bulk api using java and wondering how I can set the batch size.
Currently I am using it as:
BulkRequestBuilder bulkRequest = getClient().prepareBulk();
while(hasMore) {
bulkRequest.add(getClient().prepareIndex(indexName, indexType, artist.getDocId()).setSource(json));
hasMore = checkHasMore();
}
BulkResponse bResp = bulkRequest.execute().actionGet();
//To check failures
log.info("Has failures? {}", bResp.hasFailures());
Any idea how I can set the bulk/batch size?

It mainly depends on the size of your documents, available resources on the client and the type of client (transport client or node client).
The node client is aware of the shards over the cluster and sends the documents directly to the nodes that hold the shards where they are supposed to be indexed. On the other hand the transport client is a normal client that sends its requests to a list of nodes in a round-robin fashion. The bulk request would be sent to one node then, which would become your gateway when indexing.
Since you're using the Java API, I would suggest you to have a look at the BulkProcessor, which makes it much easier and flexibile to index in bulk. You can either define a maximum number of actions, a maximum size and a maximum time interval since the last bulk execution. It's going to execute the bulk automatically for you when needed. You can also set a maximum number of concurrent bulk requests.
After you created the BulkProcessor like this:
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
#Override
public void beforeBulk(long executionId, BulkRequest request) {
logger.info("Going to execute new bulk composed of {} actions", request.numberOfActions());
}
#Override
public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
logger.info("Executed bulk composed of {} actions", request.numberOfActions());
}
#Override
public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
logger.warn("Error executing bulk", failure);
}
}).setBulkActions(bulkSize).setConcurrentRequests(maxConcurrentBulk).build();
You just have to add your requests to it:
bulkProcessor.add(indexRequest);
and close it at the end to flush any eventual requests that might have not been executed yet:
bulkProcessor.close();
To finally answer your question: the nice thing about the BulkProcessor is also that it has sensible defaults: 5 MB of size, 1000 actions, 1 concurrent request, no flush interval (which might be useful to set).

you need to count your bulk request builder when it hits your batch size limit then index them and flush older bulk builds .
here is example of code
Settings settings = ImmutableSettings.settingsBuilder()
.put("cluster.name", "MyClusterName").build();
TransportClient client = new TransportClient(settings);
String hostname = "myhost ip";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(hostname, port));
BulkRequestBuilder bulkBuilder = client.prepareBulk();
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("my_file_path"))));
long bulkBuilderLength = 0;
String readLine = "";
String index = "my_index_name";
String type = "my_type_name";
String id = "";
while((readLine = br.readLine()) != null){
id = somefunction(readLine);
String json = new ObjectMapper().writeValueAsString(readLine);
bulkBuilder.add(client.prepareIndex(index, type, id).setSource(json));
bulkBuilderLength++;
if(bulkBuilderLength % 1000== 0){
logger.info("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
}
br.close();
if(bulkBuilder.numberOfActions() > 0){
logger.info("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
hope this helps you
thanks

Related

Grpc Server keeps processing the data even when client get disconnected

I have a server which streams the data for a given request below is the method which does that function
#Override
public void getChangeFeed(ChangeFeedRequest request, StreamObserver<ChangeFeedResponse> responseObserver) {
long queryDate = request.getFromDate();
long offset = request.getPageNo();
ChangeFeedResponse changeFeedResponse = processData(responseObserver, queryDate, offset);
while(true){
if(changeFeedResponse!=null && !changeFeedResponse.getFinalize()){
responseObserver.onNext(changeFeedResponse);
changeFeedResponse = processData(responseObserver, changeFeedResponse.getToDate(), changeFeedResponse.getPageNo());
}else{
break;
}
}
responseObserver.onNext(changeFeedResponse);
responseObserver.onCompleted();
}
When the client get disconnected the server still keeps on processing, this might be issue when multiple clients are fetching the data. Need to know how to tell server to stop processing

There's two fairly-equivalent ways. One is to use the Context, which is cancelled when the RPC is completed/cancelled:
while(!Context.current().isCancelled()){ // THIS LINE CHANGED
if(changeFeedResponse!=null && !changeFeedResponse.getFinalize()){
responseObserver.onNext(changeFeedResponse);
changeFeedResponse = processData(responseObserver, changeFeedResponse.getToDate(), changeFeedResponse.getPageNo());
}else{
break;
}
}
The other would be to use the ServerCallStreamObserver:
// THE NEXT TWO LINES CHANGED
ServerCallStreamObserver scso = (ServerCallStreamObserver) responseObserver;
while(!scso.isCancelled()){
if(changeFeedResponse!=null && !changeFeedResponse.getFinalize()){
responseObserver.onNext(changeFeedResponse);
changeFeedResponse = processData(responseObserver, changeFeedResponse.getToDate(), changeFeedResponse.getPageNo());
}else{
break;
}
}
Both approaches can also provide notification when a cancellation occurs, but polling is easiest in your case.

DropwizardMetricServices doesn't submit the gauge metric to JMX for second time (after removing the first time)

DropwizardMetricServices#submit() I'm using doesn't submit the gauge metric for second time.
i.e. My use-case is to remove the gauge metric from JMX after reading it. And my application can send the same metric (with different value).
For the first time the gauge metric is submitted successfully (then my application removes it once it reads the metric). But, the same metric is not submitted the second time.
So, I'm a bit confused what would be the reason for DropwizardMetricServices#submit() not to work for the second time?
Below is the code:
Submit metric:
private void submitNonSparseMetric(final String metricName, final long value) {
validateMetricName(metricName);
metricService.submit(metricName, value); // metricService is the DropwizardMetricServices
log(metricName, value);
LOGGER.debug("Submitted the metric {} to JMX", metricName);
}
Code that reads and removes the metric:
protected void collectMetrics() {
// Create the connection
Long currTime = System.currentTimeMillis()/1000; // Graphite needs
Socket connection = createConnection();
if (connection == null){
return;
}
// Get the output stream
DataOutputStream outputStream = getDataOutputStream(connection);
if (outputStream == null){
closeConnection();
return;
}
// Get metrics from JMX
Map<String, Gauge> g = metricRegistry.getGauges(); // metricRegistry is com.codahale.metrics.MetricRegistry
for(Entry<String, Gauge> e : g.entrySet()){
String key = e.getKey();
if(p2cMetric(key)){
String metricName = convertToMetricStandard(key);
String metricValue = String.valueOf(e.getValue().getValue());
String metricToSend = String.format("%s %s %s\n", metricName, metricValue, currTime);
try {
writeToStream(outputStream, metricToSend);
// Remove the metric from JMX after successfully sending metric to graphite
removeMetricFromJMX(key);
} catch (IOException e1) {
LOGGER.error("Unable to send metric to Graphite - {}", e1.getMessage());
}
}
}
closeOutputStream();
closeConnection();
}

I think I found the issue.
As per the DropwizardMetricServices doc - https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/actuate/metrics/dropwizard/DropwizardMetricServices.html#submit-java.lang.String-double- ,
submit() method Set the specified gauge value.
So, I think it's recommended to use DropwizardMetricServices#submit() method to only set the values of any existing gauge metric in JMX and not for adding any new metric to JMX.
So, once I replaced DropwizardMetricServices#submit() with MetricRegistry#register() (com.codahale.metrics.MetricRegistry) method to submit all my metrics it worked as expected and my metrics are readded to JMX (once they were removed by my application).
But, I'm just wondering what makes DropwizardMetricServices#submit() to only add new metrics to JMX and not any metric that's already been removed (from JMX). Does DropwizardMetricServices cache (in memory) all the metrics submitted to JMX? that makes DropwizardMetricServices#submit() method not to resubmit the metric?

Vert.x performance drop when starting with -cluster option

I'm wondering if any one experienced the same problem.
We have a Vert.x application and in the end it's purpose is to insert 600 million rows into a Cassandra cluster. We are testing the speed of Vert.x in combination with Cassandra by doing tests in smaller amounts.
If we run the fat jar (build with Shade plugin) without the -cluster option, we are able to insert 10 million records in about a minute. When we add the -cluster option (eventually we will run the Vert.x application in cluster) it takes about 5 minutes for 10 million records to insert.
Does anyone know why?
We know that the Hazelcast config will create some overhead, but never thought it would be 5 times slower. This implies we will need 5 EC2 instances in cluster to get the same result when using 1 EC2 without the cluster option.
As mentioned, everything runs on EC2 instances:
2 Cassandra servers on t2.small
1 Vert.x server on t2.2xlarge

You are actually running into corner cases of the Vert.x Hazelcast Cluster manager.
First of all you are using a worker Verticle to send your messages (30000001). Under the hood Hazelcast is blocking and thus when you send a message from a worker the version 3.3.3 does not take that in account. Recently we added this fix https://github.com/vert-x3/issues/issues/75 (not present in 3.4.0.Beta1 but present in 3.4.0-SNAPSHOTS) that will improve this case.
Second when you send all your messages at the same time, it runs into another corner case that prevents the Hazelcast cluster manager to use a cache of the cluster topology. This topology cache is usually updated after the first message has been sent and sending all the messages in one shot prevents the usage of the ache (short explanation HazelcastAsyncMultiMap#getInProgressCount will be > 0 and prevents the cache to be used), hence paying the penalty of an expensive lookup (hence the cache).
If I use Bertjan's reproducer with 3.4.0-SNAPSHOT + Hazelcast and the following change: send message to destination, wait for reply. Upon reply send all messages then I get a lot of improvements.
Without clustering : 5852 ms
With clustering with HZ 3.3.3 :16745 ms
With clustering with HZ 3.4.0-SNAPSHOT + initial message : 8609 ms
I believe also you should not use a worker verticle to send that many messages and instead send them using an event loop verticle via batches. Perhaps you should explain your use case and we can think about the best way to solve it.

When you're you enable clustering (of any kind) to an application you are making your application more resilient to failures but you're also adding a performance penalty.
For example your current flow (without clustering) is something like:
client ->
vert.x app ->
in memory same process eventbus (negletible) ->
handler -> cassandra
<- vert.x app
<- client
Once you enable clustering:
client ->
vert.x app ->
serialize request ->
network request cluster member ->
deserialize request ->
handler -> cassandra
<- serialize response
<- network reply
<- deserialize response
<- vert.x app
<- client
As you can see there are many encode decode operations required plus several network calls and this all gets added to your total request time.
In order to achive best performance you need to take advantage of locality the closer you are of your data store usually the fastest.

Just to add the code of the project. I guess that would help.
Sender verticle:
public class ProviderVerticle extends AbstractVerticle {
#Override
public void start() throws Exception {
IntStream.range(1, 30000001).parallel().forEach(i -> {
vertx.eventBus().send("clustertest1", Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
});
}
#Override
public void stop() throws Exception {
super.stop();
}
}
And the inserter verticle
public class ReceiverVerticle extends AbstractVerticle {
private int messagesReceived = 1;
private Session cassandraSession;
#Override
public void start() throws Exception {
PoolingOptions poolingOptions = new PoolingOptions()
.setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
.setMaxConnectionsPerHost(HostDistance.LOCAL, 3)
.setCoreConnectionsPerHost(HostDistance.REMOTE, 1)
.setMaxConnectionsPerHost(HostDistance.REMOTE, 3)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 20)
.setMaxQueueSize(32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 20);
Cluster cluster = Cluster.builder()
.withPoolingOptions(poolingOptions)
.addContactPoints(ClusterSetup.SEEDS)
.build();
System.out.println("Connecting session");
cassandraSession = cluster.connect("kiespees");
System.out.println("Session connected:\n\tcluster [" + cassandraSession.getCluster().getClusterName() + "]");
System.out.println("Connected hosts: ");
cassandraSession.getState().getConnectedHosts().forEach(host -> System.out.println(host.getAddress()));
PreparedStatement prepared = cassandraSession.prepare(
"insert into clustertest1 (id, value, created) " +
"values (:id, :value, :created)");
PreparedStatement preparedTimer = cassandraSession.prepare(
"insert into timer (name, created_on, amount) " +
"values (:name, :createdOn, :amount)");
BoundStatement timerStart = preparedTimer.bind()
.setString("name", "clusterteststart")
.setInt("amount", 0)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStart);
EventBus bus = vertx.eventBus();
System.out.println("Bus info: " + bus.toString());
MessageConsumer<String> cons = bus.consumer("clustertest1");
System.out.println("Consumer info: " + cons.address());
System.out.println("Waiting for messages");
cons.handler(message -> {
TestCluster1 tc = Json.decodeValue(message.body(), TestCluster1.class);
if (messagesReceived % 100000 == 0)
System.out.println("Message received: " + messagesReceived);
BoundStatement boundRecord = prepared.bind()
.setInt("id", tc.getId())
.setString("value", tc.getValue())
.setTimestamp("created", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(boundRecord);
if (messagesReceived % 100000 == 0) {
BoundStatement timerStop = preparedTimer.bind()
.setString("name", "clusterteststop")
.setInt("amount", messagesReceived)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStop);
}
messagesReceived++;
//message.reply("OK");
});
}
#Override
public void stop() throws Exception {
super.stop();
cassandraSession.close();
}
}

Apache Curator - Zookeeper connection loss exception, possible memory leak

I have been working on a process that continuously monitors a distributed atomic long counter. It monitors it every minute using the following class ZkClient's method getCounter. In fact, I have multiple threads running each of which are monitoring a different counter (distributed atomic long) stored in the Zookeeper nodes. Each thread specifies the path of the counter via the parameters of the getCounter method.
public class TagserterZookeeperManager {
public enum ZkClient {
COUNTER("10.11.18.25:2181"); // Integration URL
private CuratorFramework client;
private ZkClient(String servers) {
Properties props = TagserterConfigs.ZOOKEEPER.getProperties();
String zkFromConfig = props.getProperty("servers", "");
if (zkFromConfig != null && !zkFromConfig.isEmpty()) {
servers = zkFromConfig.trim();
}
ExponentialBackoffRetry exponentialBackoffRetry = new ExponentialBackoffRetry(1000, 3);
client = CuratorFrameworkFactory.newClient(servers, exponentialBackoffRetry);
client.start();
}
public CuratorFramework getClient() {
return client;
}
}
public static String buildPath(String ... node) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < node.length; i++) {
if (node[i] != null && !node[i].isEmpty()) {
sb.append("/");
sb.append(node[i]);
}
}
return sb.toString();
}
public static DistributedAtomicLong getCounter(String taskType, int hid, String jobId, String countType) {
String path = buildPath(taskType, hid+"", jobId, countType);
Builder builder = PromotedToLock.builder().lockPath(path + "/lock").retryPolicy(new ExponentialBackoffRetry(10, 10));
DistributedAtomicLong count = new DistributedAtomicLong(ZkClient.COUNTER.getClient(), path, new RetryNTimes(5, 20), builder.build());
return count;
}
}
From within the threads, this is how I am calling this method:
DistributedAtomicLong counterTotal = TagserterZookeeperManager
.getCounter("testTopic", hid, jobId, "test");
Now it seems like after the threads have run for a few hours, at one stage I start getting the following org.apache.zookeeper.KeeperException$ConnectionLossException exception inside the getCounter method where it tries to read the count:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /contentTaskProd
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:215)
at org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:148)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:141)
at org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:99)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.getCurrentValue(DistributedAtomicValue.java:254)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.get(DistributedAtomicValue.java:91)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicLong.get(DistributedAtomicLong.java:72)
...
I keep getting this exception from thereon for a while and I get the feeling it is causing some internal memory leaks that eventually causes an OutOfMemory error and the whole process bails out. Does anybody have any idea what the reason for this could be? Why would Zookeeper suddenly start throwing the connection loss exception? After the process bails out, I can manually connect to Zookeeper through another small console program that I have written (also using curator) and all look good there.

In order to monitor a node in Zookeeper using curator you can use the NodeCache this won't solve your connection problems.... but instead of polling the node once a minute you can get a push event when it changes.
In my experience, the NodeCache handles quite well disconnection and resume of connections.

PUSH Notifications for >1000 devices through GCM Server(Java)

I have a GCM-backend Java server and I'm trying to send to all users a notification msg. Is my approach right? To just split them into 1000 each time before giving the send request? Or is there a better approach?
public void sendMessage(#Named("message") String message) throws IOException {
int count = ofy().load().type(RegistrationRecord.class).count();
if(count<=1000) {
List<RegistrationRecord> records = ofy().load().type(RegistrationRecord.class).limit(count).list();
sendMsg(records,message);
}else
{
int msgsDone=0;
List<RegistrationRecord> records = ofy().load().type(RegistrationRecord.class).list();
do {
List<RegistrationRecord> regIdsParts = regIdTrim(records, msgsDone);
msgsDone+=1000;
sendMsg(regIdsParts,message);
}while(msgsDone<count);
}
}
The regIdTrim method
private List<RegistrationRecord> regIdTrim(List<RegistrationRecord> wholeList, final int start) {
List<RegistrationRecord> parts = wholeList.subList(start,(start+1000)> wholeList.size()? wholeList.size() : start+1000);
return parts;
}
The sendMsg method
private void sendMsg(List<RegistrationRecord> records,#Named("message") String message) throws IOException {
if (message == null || message.trim().length() == 0) {
log.warning("Not sending message because it is empty");
return;
}
Sender sender = new Sender(API_KEY);
Message msg = new Message.Builder().addData("message", message).build();
// crop longer messages
if (message.length() > 1000) {
message = message.substring(0, 1000) + "[...]";
}
for (RegistrationRecord record : records) {
Result result = sender.send(msg, record.getRegId(), 5);
if (result.getMessageId() != null) {
log.info("Message sent to " + record.getRegId());
String canonicalRegId = result.getCanonicalRegistrationId();
if (canonicalRegId != null) {
// if the regId changed, we have to update the datastore
log.info("Registration Id changed for " + record.getRegId() + " updating to " + canonicalRegId);
record.setRegId(canonicalRegId);
ofy().save().entity(record).now();
}
} else {
String error = result.getErrorCodeName();
if (error.equals(Constants.ERROR_NOT_REGISTERED)) {
log.warning("Registration Id " + record.getRegId() + " no longer registered with GCM, removing from datastore");
// if the device is no longer registered with Gcm, remove it from the datastore
ofy().delete().entity(record).now();
} else {
log.warning("Error when sending message : " + error);
}
}
}
}

Quoting from Google Docs:
GCM is support for up to 1,000 recipients for a single message. This capability makes it much easier to send out important messages to your entire user base. For instance, let's say you had a message that needed to be sent to 1,000,000 of your users, and your server could handle sending out about 500 messages per second. If you send each message with only a single recipient, it would take 1,000,000/500 = 2,000 seconds, or around half an hour. However, attaching 1,000 recipients to each message, the total time required to send a message out to 1,000,000 recipients becomes (1,000,000/1,000) / 500 = 2 seconds. This is not only useful, but important for timely data, such as natural disaster alerts or sports scores, where a 30 minute interval might render the information useless.
Taking advantage of this functionality is easy. If you're using the GCM helper library for Java, simply provide a List collection of registration IDs to the send or sendNoRetry method, instead of a single registration ID.

We can not send more than 1000 push notification at time.I searched a lot but not result then i did this with same approach split whole list in sub lists of 1000 items and send push notification.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.