I post data to Elasticsearch like this:
Settings settings = Settings.settingsBuilder()
        .put("cluster.name", "cluster-name")
        .build();
client = TransportClient.builder()
        .settings(settings)
        .build();
client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("my.elastic.server"), 9300));
IndexResponse response = client
        .prepareIndex("myindex", "info")
        .setSource(data) // here data is stored in a Map
        .get();
But the data could be about 2 MB or more, and I care about how long posting it to Elasticsearch takes. What is the best way to limit that time? Is there an Elasticsearch Java API feature for this, should I run the posting in a separate thread, or is there something else? Thanks.
You could use Spring Data Elasticsearch and Spring Batch in Java to create an index batch job. That way you can break the data up into smaller chunks, for more frequent but smaller writes to your index.
If your job is big enough (millions of records), you can use a multi-threaded batch job and significantly reduce the time it takes to build your index. That may be overkill for a smaller index, though.
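If you stay on the plain TransportClient from the question instead of Spring, a rough sketch of the same chunking idea might look like this (the index/type names and the chunk size are just placeholders to tune):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import java.util.List;
import java.util.Map;

public class ChunkedIndexer {

    private static final int CHUNK_SIZE = 500; // tune to your document size

    // Index the documents in chunks so each request stays small.
    public static void indexInChunks(Client client, List<Map<String, Object>> documents) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Map<String, Object> doc : documents) {
            bulk.add(client.prepareIndex("myindex", "info").setSource(doc));
            if (bulk.numberOfActions() >= CHUNK_SIZE) {
                BulkResponse response = bulk.get();
                if (response.hasFailures()) {
                    throw new IllegalStateException(response.buildFailureMessage());
                }
                bulk = client.prepareBulk(); // start a fresh chunk
            }
        }
        if (bulk.numberOfActions() > 0) {
            bulk.get(); // flush the final partial chunk
        }
    }
}

Whether you chunk by hand or via Spring Batch, the point is the same: many small requests are easier to time-bound and retry than one 2 MB request.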
Is there a way to live stream data using spring-data-cassandra? Basically, I want to send data to the client whenever there is a new addition to the database.
This is what I'm trying to do:
@GetMapping(path = "mapping", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<Mapping> getMapping() {
    Flux<Mapping> flux = reactiveMappingByExternalRepository.findAll();
    Flux<Long> durationFlux = Flux.interval(Duration.ofSeconds(1));
    return Flux.zip(flux, durationFlux).map(Tuple2::getT1);
}
But it doesn't return new additions once the initial stream is complete.
The short answer is no, there's no live-streaming of real-time changes through the Cassandra driver. Although Cassandra has CDC (Change Data Capture), it's quite low-level and you need to consume the commit logs on the server. See "Listen to a cassandra database with datastax" for further details.
I'm attempting to stream data from a Kafka installation into BigQuery using Java, based on Google samples. The data is JSON rows ~12K in length. I'm batching these into blocks of 500 (roughly 6 MB) and streaming them as:
InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);

for (String record : bqStreamingPacket.getRecords()) {
    Map<String, Object> mapObject = objectMapper.readValue(record.replaceAll("\\{,", "{"),
            new TypeReference<Map<String, Object>>() {});
    // remove nulls
    mapObject.values().removeIf(Objects::isNull);
    // create an id for each row - use to retry / avoid duplication
    builder.addRow(String.valueOf(System.nanoTime()), mapObject);
}

insertAllRequest = builder.build();

...

BigQueryOptions bigQueryOptions = BigQueryOptions.newBuilder()
        .setCredentials(Credentials.getAppCredentials())
        .build();
BigQuery bigQuery = bigQueryOptions.getService();
InsertAllResponse insertAllResponse = bigQuery.insertAll(insertAllRequest);
I'm seeing insert times of 3-5 seconds for each call. Needless to say, this makes BQ streaming less than useful. From the documentation I was worried about hitting per-table insert quotas (I'm streaming from Kafka at ~1M rows/min), but now I'd be happy to be dealing with that problem instead.
All rows insert fine. No errors.
I must be doing something very wrong with this setup. Please advise.
We measure between 1200-2500 ms for each streaming request, and this has been consistent over the last three years, as you can see in the chart; we stream from Softlayer to Google.
Try varying the batch size from hundreds to thousands of rows, or until you reach some streaming API limit, and measure each call.
Based on this you can deduce more information, such as a bandwidth problem between you and the BigQuery API, latency, SSL handshake time, and eventually optimize it for your environment.
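As a rough sketch of that measurement (how you build the varying-size requests is up to you; InsertAllRequest and InsertAllResponse are the same classes used in the question):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

public class StreamingBenchmark {

    // Time each insertAll call so latency can be compared across batch sizes.
    public static void measure(BigQuery bigQuery, Iterable<InsertAllRequest> requests) {
        for (InsertAllRequest request : requests) {
            long start = System.nanoTime();
            InsertAllResponse response = bigQuery.insertAll(request);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("rows=" + request.getRows().size()
                    + " hasErrors=" + response.hasErrors()
                    + " elapsedMs=" + elapsedMs);
        }
    }
}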
You can also leave your project id/table here and maybe some Google engineer will check it.
I'm attempting to use the bulk HTTP API in Java on AWS Elasticsearch 2.3.
When I use a REST client for the bulk load, I get the following error:
504 GATEWAY_TIMEOUT
When I run it as a Lambda in Java, for HTTP POSTs, I get:
{
"errorMessage": "2017-01-09T19:05:32.925Z 8e8164a7-d69e-11e6-8954-f3ac8e70b5be Task timed out after 15.00 seconds"
}
Through testing I noticed the bulk API doesn't work with these settings:
"number_of_shards" : 5,
"number_of_replicas" : 5
When shards and replicas are set to 1, I can do a bulk load no problem.
I have tried using this setting to allow for my bulk load as well:
"refresh_interval" : -1
but so far it has made no impact at all. In the Java Lambda, I load my data as an InputStream from an S3 location.
What are my options at this point for Java HTTP?
Is there anything else in index settings I could try?
Is there anything else in AWS access policy I could try?
Thank you for your time.
Edit 1:
I have also tried these params: _bulk?action.write_consistency=one&refresh but it makes no difference so far.
Edit 2:
Here is what made my bulk load work - setting the consistency param (I did NOT need to set refresh_interval):
URIBuilder uriBuilder = new URIBuilder(myuri);
uriBuilder = uriBuilder.addParameter("consistency", "one");
HttpPost post = new HttpPost(uriBuilder.build());
HttpEntity entity = new InputStreamEntity(myInputStream);
post.setEntity(entity);
From my experience, this issue can occur when your index replication settings cannot be satisfied by your cluster. This happens either during a network partition, or if you simply set a replication requirement that your physical cluster cannot satisfy.
In my case, it happens when I apply my production settings (number_of_replicas: 3) to my development cluster (which is a single-node cluster).
Your two solutions (setting the replicas to 1, or setting consistency to one) resolve the issue because they allow Elasticsearch to continue the bulk index without waiting for additional replicas to come online.
Elasticsearch could probably have a more intuitive failure message; maybe it does in Elastic 5.
Setting your cluster to a single node while requiring replicas means those replicas can never be assigned, so writes that wait on them time out.
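If you take the first option and lower the replica count on the existing index, one way to do it in the same Apache HttpClient style the question already uses is sketched below (the index URL is a placeholder; number_of_replicas is a dynamic setting, so no reindex is needed):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ReplicaSettings {

    // PUT {"index":{"number_of_replicas":N}} to <index>/_settings
    public static void setReplicas(String indexUrl, int replicas) throws Exception {
        String body = "{\"index\":{\"number_of_replicas\":" + replicas + "}}";
        HttpPut put = new HttpPut(indexUrl + "/_settings");
        put.setEntity(new StringEntity(body, ContentType.APPLICATION_JSON));
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(put)) {
            System.out.println(response.getStatusLine());
        }
    }
}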
This is a problem I've been trying to deal with for almost a week without finding a real solution. Here's the problem:
On my Angular client's side I have a button to generate a CSV file, which works this way:
User clicks a button.
A POST request is sent to a REST JAX-RS webservice.
The webservice runs a database query and returns a JSON with all the lines the client needs.
The AngularJS client receives the JSON, processes it, and generates the CSV.
All good when there's a low volume of data to return; problems start when I have to return big amounts of data. Starting from 2000 lines I feel like the JBoss server starts to struggle to send the data, as if I've reached some limit in data capacity (my Eclipse, where the server is running, becomes very slow until the end of the data transmission).
The thing is, after testing I've found out it's not the database query or the formatting of the data that takes time, but rather the sending of the data (3000 lines that are 2 MB in size take around 1 minute to reach the client), even though on my developer setup both the Angular client and the JBoss server are running on the same machine.
This is my Server side code :
@POST
@GZIP
@Path("/{id_user}/transactionsCsv")
@Produces(MediaType.APPLICATION_JSON)
@ApiOperation(value = "Transactions de l'utilisateur connecté sous forme CSV", response = TransactionDTO.class, responseContainer = "List")
@RolesAllowed(value = SecurityRoles.PORTAIL_ACTIVITE_RUBRIQUE)
public Response getOperationsCsv(@PathParam("id_user") long id_user,
        @Context HttpServletRequest request,
        @Context HttpServletResponse response,
        final TransactionFiltreDTO filtre) throws IOException {
    final UtilisateurSession utilisateur = (UtilisateurSession) request.getSession().getAttribute(UtilisateurSession.SESSION_CLE);
    if (!utilisateur.getId().equals(id_user)) {
        return genererReponse(new ResultDTO(Status.UNAUTHORIZED, null, null));
    }
    // database query
    transactionDAO.getTransactionsDetailLimite(utilisateur.getId(), filtre);
    // database query
    List<Transaction> resultat = detailTransactionDAO.getTransactionsByUtilisateurId(utilisateur.getId(), filtre);
    // to format the list to the export format
    List<TransactionDTO> liste = Lists.transform(resultat, TransactionDTO.transactionToDTO);
    return Response.ok(liste).build();
}
Do you guys have any idea what is causing this problem, or know another way of doing things that might avoid it? I would be grateful.
Thank you :)
Here's the link for the JBOSS thread Dump :
http://freetexthost.com/y4kpwbdp1x
I've found in other contexts (using RMI) that the more local you are, the less compression is worth it. Your machine is probably losing most of its time on the processing work that compression and decompression require, and the larger the amount of data, the greater the loss.
Unless you really need to send this as one list, you might consider sending pages of entries: requesting them page-wise reduces the amount of data sent with one response. Even if you really need a single list on the client side, you can assemble it after transport.
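A hypothetical paged variant of the endpoint from the question could look like the sketch below. The page/size query parameters and the paged DAO overload are assumptions for illustration, not part of the original code; the client would call it repeatedly and assemble the CSV locally.

// Sketch only: reuses the DAO, DTO and filter types from the resource above.
@POST
@Path("/{id_user}/transactionsCsvPage")
@Produces(MediaType.APPLICATION_JSON)
public Response getOperationsCsvPage(@PathParam("id_user") long id_user,
        @QueryParam("page") @DefaultValue("0") int page,
        @QueryParam("size") @DefaultValue("500") int size,
        final TransactionFiltreDTO filtre) {
    // hypothetical paged DAO method: returns only one page of transactions
    List<Transaction> resultat =
            detailTransactionDAO.getTransactionsByUtilisateurId(id_user, filtre, page, size);
    List<TransactionDTO> liste = Lists.transform(resultat, TransactionDTO.transactionToDTO);
    return Response.ok(liste).build();
}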
I'm convinced that the problem comes from the server trying to send a big amount of data at once. Is there a way I can send the HTTP response in several small chunks instead of a single big one?
To measure performance, we need to check the complete trace.
There are many ways to do this; here is one I find easier.
Compress the output to ZIP; this reduces the data transferred over the network.
Index the relevant column in the database so that query execution time decreases.
Check the processing time between the different layers of code, if any (REST -> Service -> DAO -> DB and vice versa); see the timing sketch after this list.
If the database doesn't change much, introduce a secondary caching mechanism and lower the cache eviction time, or pick the cache eviction policy that suits your requirement.
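As a rough illustration of the layer-timing point (the logger is a placeholder, the ArrayList copy just forces Guava's lazy transform to run here, and the calls are the ones from the resource in the question):

// Crude layer timing: log how long the DAO call and the DTO mapping take,
// so the slow layer stands out.
long daoStart = System.nanoTime();
List<Transaction> resultat = detailTransactionDAO.getTransactionsByUtilisateurId(id_user, filtre);
long daoMs = (System.nanoTime() - daoStart) / 1_000_000;

long mapStart = System.nanoTime();
// copy the transformed view so the mapping work actually happens here
List<TransactionDTO> liste = new ArrayList<>(Lists.transform(resultat, TransactionDTO.transactionToDTO));
long mapMs = (System.nanoTime() - mapStart) / 1_000_000;

LOGGER.info("dao={} ms, mapping={} ms, rows={}", daoMs, mapMs, resultat.size());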
To find the exact reason:
Collect a thread dump from a single run of the process. From that thread dump we can check the exact time consumption of each layer and pinpoint the problem.
Hope that helps!
[EDIT]
You should analyse the stack trace in the dump, not just the one added in the link.
If the request cannot handle the larger portion of data in one go, the following might help (see the sketch after the sample URLs):
Pagination: a page size with a number of pages (only for the non-CSV case).
A limit on the number of lines that can be processed.
Additional query criteria like dates, users, etc.
Sample REST URLs:
http://localhost:8080/App/{id_user}/transactionCSV?limit=1000
http://localhost:8080/App/{id_user}/transactionCSV?fromDate=2011-08-01&toDate=2016-08-01
http://localhost:8080/App/{id_user}/transactionCSV?user=Admin
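As a sketch of how those URLs could map onto the resource from the question (the query parameters and the setter methods on the filter DTO are hypothetical, added only for illustration):

// Sketch only: limit/date criteria read via @QueryParam and pushed into the filter.
@POST
@Path("/{id_user}/transactionCSV")
@Produces(MediaType.APPLICATION_JSON)
public Response getOperationsCsvLimited(@PathParam("id_user") long id_user,
        @QueryParam("limit") @DefaultValue("1000") int limit,
        @QueryParam("fromDate") String fromDate,
        @QueryParam("toDate") String toDate,
        final TransactionFiltreDTO filtre) {
    // hypothetical setters on the filter DTO
    filtre.setLimit(limit);
    filtre.setFromDate(fromDate);
    filtre.setToDate(toDate);
    List<Transaction> resultat = detailTransactionDAO.getTransactionsByUtilisateurId(id_user, filtre);
    return Response.ok(Lists.transform(resultat, TransactionDTO.transactionToDTO)).build();
}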
Is it possible for my (Java) app to check DynamoDB's provisioned throughput for reads/writes? For stability reasons it would be useful if I could get these numbers programmatically.
I am aware that if I get a ProvisionedThroughputExceededException then I have exceeded my limit, but is there a way to find out what my read/write limits are before that happens?
I have also found some docs referring to describing limits, but that doesn't seem to correspond to anything I can use in code.
This is the first time I've used DynamoDB, so if this is fundamentally bad practice please say so!
Cheers
The aws-java-sdk allows you to do that. Similar to http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPIWorkingWithTables.html you can do either
AmazonDynamoDB dynamoClient = new AmazonDynamoDBClient();
DescribeTableResult result = dynamoClient.describeTable("MyTable");
Long readCapacityUnits = result.getTable()
        .getProvisionedThroughput().getReadCapacityUnits();
or
AmazonDynamoDB dynamoClient = new AmazonDynamoDBClient();
DynamoDB dynamoDB = new DynamoDB(dynamoClient);
Table table = dynamoDB.getTable("MyTable");
Long readCapacityUnits = table.describe()
        .getProvisionedThroughput().getReadCapacityUnits();
DynamoDB is a higher-level wrapper which sometimes has simpler APIs, while AmazonDynamoDBClient is a fairly direct implementation of the HTTP APIs.
For more on autoscaling DynamoDB, see
How to auto scale Amazon DynamoDB throughput?
You want to use the AWS SDK for Java, like this:
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
client.describeTable("tableName").getTable().getProvisionedThroughput();
Here is the AWS CLI command to get the provisioned throughput for a table.
aws dynamodb describe-table --table-name <table name>