BigQuery - streaming via Java is very slow

I'm attempting to stream data from a Kafka installation into BigQuery using Java, based on Google's samples. The data is JSON rows, each ~12 KB in length. I'm batching these into blocks of 500 (roughly 6 MB) and streaming them as:
InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);
for (String record : bqStreamingPacket.getRecords()) {
    Map<String, Object> mapObject = objectMapper.readValue(
            record.replaceAll("\\{,", "{"),
            new TypeReference<Map<String, Object>>() {});
    // remove nulls
    mapObject.values().removeIf(Objects::isNull);
    // create an id for each row - used to retry / avoid duplication
    builder.addRow(String.valueOf(System.nanoTime()), mapObject);
}
insertAllRequest = builder.build();
...
BigQueryOptions bigQueryOptions = BigQueryOptions.newBuilder()
        .setCredentials(Credentials.getAppCredentials())
        .build();
BigQuery bigQuery = bigQueryOptions.getService();
InsertAllResponse insertAllResponse = bigQuery.insertAll(insertAllRequest);
I'm seeing insert times of 3-5 seconds for each call. Needless to say, this makes BQ streaming less than useful. From their documentation I was worried about hitting per-table insert quotas (I'm streaming from Kafka at ~1M rows/min), but now I'd be happy to have that problem.
All rows insert fine. No errors.
I must be doing something very wrong with this setup. Please advise.

We measure between 1,200 and 2,500 ms for each streaming request, and this has been consistent over the last three years of measurements; we stream from SoftLayer to Google.
Try varying the batch size from hundreds to thousands of rows, or until you reach the streaming API limits, and measure each call (a rough sketch follows below).
Based on this you can deduce more information, such as whether there is a bandwidth problem between you and the BigQuery API, latency, or SSL handshake cost, and eventually optimize for your environment.
You can also post your project ID/table, and maybe a Google engineer will check it.
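As an illustration of such a measurement loop, here is a rough sketch using the same google-cloud-bigquery classes as the question; the dataset/table name and row contents are placeholders, not anything from the original post:
// Rough sketch (not a drop-in benchmark): time insertAll() for a few batch sizes
// to see where latency per row levels off. Classes are from com.google.cloud.bigquery.
BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService(); // build the client once and reuse it
TableId tableId = TableId.of("my_dataset", "my_table");                // placeholder names

for (int batchSize : new int[] {100, 500, 1000, 5000}) {
    InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);
    for (int i = 0; i < batchSize; i++) {
        // dummy single-column rows; substitute your real JSON-derived maps
        builder.addRow(String.valueOf(i), Collections.singletonMap("payload", "row-" + i));
    }
    long start = System.currentTimeMillis();
    InsertAllResponse response = bigQuery.insertAll(builder.build());
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.printf("batch=%d rows, %d ms, hasErrors=%b%n", batchSize, elapsedMs, response.hasErrors());
}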

Related

Spark Streaming: new job after stream start

I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I am also able to get any updates (if any) on Kafka via the stream.
The issue comes in when I have a new request to stream a new topic. Since there can be only one SparkStreamingContext per JVM, I cannot create a new stream for every new request.
The ways I figured out are:
First, once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDirectStream (for a new topic2) does not return a stream and further processing is stopped. The streaming keeps going on the first request (say topic1).
Second, I thought to stop the stream, create a DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old topic's stream (topic1) is lost and it streams only the new one.
Here is the code, have a look
JavaStreamingContext javaStreamingContext; // presumably a field reused across requests
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}

Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
    .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
    .foreachRDD(impl);

if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl; this is a custom class which is an implementation of VoidFunction.
The above is approach 1, where I do not stop the existing stream. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error, and the further steps are not executed.
I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming done has to be done on multiple different topics, and each of them is handled differently.
Please help.
The thing is, the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: Every time you want to subscribe/stream from a different topic, instead of touching the already running job, start a new job. From your API body you can supply the parameters or topic name. If you want to stop streaming from a specific topic, just stop the respective job. This will give you a lot of flexibility and control over resources.
[Theoretical] Topic filter: Subscribe to all topics you think you will want; when records are pulled for a duration, filter out records based on a list of topics. Manipulate this list of topics through an API to increase or decrease your scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all (a rough sketch follows below).
Another workaround is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it, and stop it when you don't.
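For the theoretical topic-filter option above, a very rough and untested sketch of what it might look like with the same direct-stream API the question uses; the topic names, the JavaSparkContext handle and the allow-list are assumptions:
// Subscribe once to the superset of topics, then keep only records whose topic
// is in an allow-list. Note: a broadcast value cannot be updated in place, which
// is part of why this stays a theoretical idea.
Collection<String> allTopics = Arrays.asList("topic1", "topic2");        // assumed topic names
Broadcast<Set<String>> activeTopics = javaSparkContext.broadcast(        // assumed JavaSparkContext handle
        new HashSet<>(Collections.singletonList("topic1")));

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(allTopics, getKafkaParamMap()))
    .filter(record -> activeTopics.value().contains(record.topic()))     // drop topics outside the allow-list
    .map(record -> record.value())
    .foreachRDD(impl);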

Java REST service response takes too much time

This is a problem I've been trying to deal with for almost a week without finding a real solution. Here's the problem.
On my Angular client's side I have a button to generate a CSV file, which works this way:
User clicks a button.
A POST request is sent to a REST JAX-RS webservice.
Webservice launches a database query and returns a JSON with all the lines needed to the client.
The AngularJS client receives the JSON, processes it and generates the CSV.
All good here when there's a low volume of data to return; problems start when I have to return big amounts of data. Starting from 2,000 lines I feel like the JBoss server starts to struggle to send the data, as if I've reached some limit on data capacity (the Eclipse instance where the server is running becomes very slow until the end of the data transmission).
The thing is that after testing I've found out it's not the database query or the formatting of the data that takes time, but rather the sending of the data (3,000 lines that are 2 MB in size take around 1 minute to reach the client), even though on my developer setup both the Angular client and the JBoss server are running on the same machine.
This is my server-side code:
@POST
@GZIP
@Path("/{id_user}/transactionsCsv")
@Produces(MediaType.APPLICATION_JSON)
@ApiOperation(value = "Transactions de l'utilisateur connecté sous forme CSV", response = TransactionDTO.class, responseContainer = "List")
@RolesAllowed(value = SecurityRoles.PORTAIL_ACTIVITE_RUBRIQUE)
public Response getOperationsCsv(@PathParam("id_user") long id_user,
                                 @Context HttpServletRequest request,
                                 @Context HttpServletResponse response,
                                 final TransactionFiltreDTO filtre) throws IOException {

    final UtilisateurSession utilisateur = (UtilisateurSession) request.getSession().getAttribute(UtilisateurSession.SESSION_CLE);
    if (!utilisateur.getId().equals(id_user)) {
        return genererReponse(new ResultDTO(Status.UNAUTHORIZED, null, null));
    }

    // database query
    transactionDAO.getTransactionsDetailLimite(utilisateur.getId(), filtre);
    // database query
    List<Transaction> resultat = detailTransactionDAO.getTransactionsByUtilisateurId(utilisateur.getId(), filtre);
    // format the list to the export format
    List<TransactionDTO> liste = Lists.transform(resultat, TransactionDTO.transactionToDTO);

    return Response.ok(liste).build();
}
Do you guys have any idea what is causing this problem, or know another way of doing things that would avoid it? I would be grateful.
Thank you :)
Here's the link to the JBoss thread dump:
http://freetexthost.com/y4kpwbdp1x
I've found in other contexts (using RMI) that the more local you are, the less worthwhile compression is. Your machine is probably losing most of its time on the processing work that compression and decompression require. The larger the amount of data, the greater the losses.
Unless you really need to send this as one list, you might consider sending lists of entries, requesting them page-wise to reduce the amount of data sent in one response. Even if you really need a single list on the client side, you could assemble it after transport.
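If page-wise requests are an option, here is a hedged sketch of what such an endpoint could look like, reusing the method from the question; the page/size query parameters, the path and the paging DAO overload are hypothetical, not part of the original code:
@POST
@Path("/{id_user}/transactionsCsvPage")
@Produces(MediaType.APPLICATION_JSON)
public Response getOperationsCsvPage(@PathParam("id_user") long idUser,
                                     @QueryParam("page") @DefaultValue("0") int page,
                                     @QueryParam("size") @DefaultValue("500") int size,
                                     final TransactionFiltreDTO filtre) {
    // session/role checks from the original method omitted for brevity
    // hypothetical paging overload of the existing DAO call
    List<Transaction> resultat = detailTransactionDAO.getTransactionsByUtilisateurId(idUser, filtre, page, size);
    List<TransactionDTO> liste = Lists.transform(resultat, TransactionDTO.transactionToDTO);
    return Response.ok(liste).build();
}
The client would then request successive pages and concatenate them before building the CSV.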
I'm convinced that the problem comes from the server trying to send a big amount of data at once. Is there a way I can send the HTTP response in several smaller chunks instead of a single big one?
To measure performance, we need to check the complete trace.
There are many ways to do it; here is one of the ways I find easier.
Compress the output to ZIP; this reduces the data transferred over the network (a sketch follows after this list).
Index the columns in the database so that query execution time decreases.
Check the processing time between the different layers of code, if any (REST -> Service -> DAO -> DB and back).
If there won't be many changes in the database, you can introduce a secondary caching mechanism and lower the cache eviction time, or pick the cache eviction policy that fits your requirement.
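For the compression point, one standard JAX-RS option is a WriterInterceptor that wraps the response stream in GZIP; a minimal sketch follows (the class name is illustrative, the endpoint's existing @GZIP annotation may already provide this if RESTEasy's GZIP support is registered, and, as the earlier answer notes, compression may not help much when client and server share a machine):
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.ext.Provider;
import javax.ws.rs.ext.WriterInterceptor;
import javax.ws.rs.ext.WriterInterceptorContext;

@Provider
public class GzipWriterInterceptor implements WriterInterceptor {
    @Override
    public void aroundWriteTo(WriterInterceptorContext context) throws IOException, WebApplicationException {
        // A production version would first check the request's Accept-Encoding header.
        context.getHeaders().putSingle("Content-Encoding", "gzip");
        GZIPOutputStream gzip = new GZIPOutputStream(context.getOutputStream());
        context.setOutputStream(gzip);
        context.proceed();
        gzip.finish();
    }
}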
To find the exact reason:
Collect the thread dump from a single run of the process. From that thread dump we can check the exact time consumption of the layers and pinpoint the problem.
Hope that helps!
[EDIT]
You should analyse the stack traces in the dump itself, not the one added in the link.
If the request cannot handle the larger portion of data, consider:
Pagination: a page size with a number of pages might help (only in the case of a non-CSV response).
Limit: the number of lines that can be processed.
Additional query criteria like dates, users, etc.
Sample REST URLs:
http://localhost:8080/App/{id_user}/transactionCSV?limit=1000
http://localhost:8080/App/{id_user}/transactionCSV?fromDate=2011-08-01&toDate=2016-08-01
http://localhost:8080/App/{id_user}/transactionCSV?user=Admin

Java Elastic: limit post execution time

I post like this:
Settings settings = Settings.settingsBuilder()
        .put("cluster.name", "cluster-name")
        .build();
client = TransportClient.builder()
        .settings(settings)
        .build();
client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("my.elastic.server"), 9300));

IndexResponse response = client
        .prepareIndex("myindex", "info")
        .setSource(data) // here data is stored in a Map
        .get();
But the data could be about 2 MB or more, and I care about the speed at which it is posted to Elastic. What is the best way to limit that time? Is there an Elastic Java API feature for this, should I run the posting in a separate thread, or is there something else? Thanks.
You could utilize Spring Data Elasticsearch and Spring Batch to create an index batch job. This way you can break the data up into smaller chunks, for more frequent but smaller writes to your index.
If your job is big enough (millions of records), you can utilize a multi-threaded batch job and significantly reduce the time it takes to generate your index. This may be overkill for a smaller index, though.
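If you prefer to stay with the TransportClient shown in the question, a hedged alternative sketch is to split the payload into smaller documents and send them in one bulk request; how you split the map into 'chunks' below is an assumption left to you:
// Sketch: index several smaller documents with a single bulk round-trip.
BulkRequestBuilder bulk = client.prepareBulk();
for (Map<String, Object> chunk : chunks) {            // 'chunks' = your data split into smaller maps
    bulk.add(client.prepareIndex("myindex", "info").setSource(chunk));
}
BulkResponse bulkResponse = bulk.get();
if (bulkResponse.hasFailures()) {
    System.err.println(bulkResponse.buildFailureMessage());
}
If the main concern is bounding how long the call blocks, executing asynchronously and waiting with a timeout, e.g. client.prepareIndex("myindex", "info").setSource(data).execute().actionGet(TimeValue.timeValueSeconds(5)), is another option, though the indexing itself still takes as long as it takes.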

Is it possible to check what the provisioned throughput is for DynamoDB?

Is it possible for my (Java) app to check DynamoDB's provisioned throughput for reads/writes? For stability reasons it would be useful if I could get these numbers programmatically.
I am aware that if I get a ProvisionedThroughputExceededException then I have exceeded my limit, but is there a way to find out what my read/write limits are before that happens?
I have also found some docs referring to describing limits, but this doesn't seem to correspond to anything I can use in code.
This is the first time I've used DynamoDB, so if this is fundamentally bad practice please say so!
Cheers
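Regarding the "describing limits" documentation mentioned in the question: newer versions of the aws-java-sdk expose a DescribeLimits call, but note that it returns the account- and table-level maximums you are allowed to provision, not what a given table currently has provisioned. A hedged sketch, assuming your SDK version includes it:
AmazonDynamoDB dynamoClient = new AmazonDynamoDBClient();
// DescribeLimits reports provisioning ceilings, not a table's current throughput.
DescribeLimitsResult limits = dynamoClient.describeLimits(new DescribeLimitsRequest());
System.out.println("Account max read capacity:  " + limits.getAccountMaxReadCapacityUnits());
System.out.println("Account max write capacity: " + limits.getAccountMaxWriteCapacityUnits());
System.out.println("Table max read capacity:    " + limits.getTableMaxReadCapacityUnits());
System.out.println("Table max write capacity:   " + limits.getTableMaxWriteCapacityUnits());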
The aws-java-sdk allows you to do that. Similar to http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPIWorkingWithTables.html you can do either
AmazonDynamoDB dynamoClient = new AmazonDynamoDBClient();
DescribeTableResult result = dynamoClient.describeTable("MyTable");
Long readCapacityUnits = result.getTable()
.getProvisionedThroughput().getReadCapacityUnits();
or
AmazonDynamoDB dynamoClient = new AmazonDynamoDBClient();
DynamoDB dynamoDB = new DynamoDB(dynamoClient);
Table table = dynamoDB.getTable("MyTable");
Long readCapacityUnits = table.describe()
.getProvisionedThroughput().getReadCapacityUnits();
DynamoDB is the higher-level wrapper, which sometimes has simpler APIs; AmazonDynamoDBClient is a rather direct implementation of the HTTP API.
For more on autoscaling DynamoDB, see:
How to auto scale Amazon DynamoDB throughput?
You want to use the AWS SDK for Java, like this:
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
client.describeTable("tableName").getTable().getProvisionedThroughput();
Here is the AWS CLI command to get the provisioned throughput for a table.
aws dynamodb describe-table --table-name <table name>

Kinesis: getting data from multiple shards

I am trying to build a simple application that reads data from AWS Kinesis. I have managed to read data using a single shard, but I want to get data from 4 different shards.
The problem is, I have a while loop which iterates as long as the shard is active, which prevents me from reading data from the other shards. So far I couldn't find an alternative algorithm, nor was I able to implement a KCL-based solution.
Many thanks in advance.
Many thanks in advance
public static void DoSomething() {
    AmazonKinesisClient client = new AmazonKinesisClient();
    //noinspection deprecation
    client.setEndpoint(endpoint, serviceName, regionId);

    /** get shards from the stream using the describe stream method */
    DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest();
    describeStreamRequest.setStreamName(streamName);
    List<Shard> shards = new ArrayList<>();
    String exclusiveStartShardId = null;
    do {
        describeStreamRequest.setExclusiveStartShardId(exclusiveStartShardId);
        DescribeStreamResult describeStreamResult = client.describeStream(describeStreamRequest);
        shards.addAll(describeStreamResult.getStreamDescription().getShards());
        if (describeStreamResult.getStreamDescription().getHasMoreShards() && shards.size() > 0) {
            exclusiveStartShardId = shards.get(shards.size() - 1).getShardId();
        } else {
            exclusiveStartShardId = null;
        }
    } while (exclusiveStartShardId != null);
    /** shards obtained */

    String shardIterator;
    GetShardIteratorRequest getShardIteratorRequest = new GetShardIteratorRequest();
    getShardIteratorRequest.setStreamName(streamName);
    getShardIteratorRequest.setShardId(shards.get(0).getShardId());
    getShardIteratorRequest.setShardIteratorType("LATEST");

    GetShardIteratorResult getShardIteratorResult = client.getShardIterator(getShardIteratorRequest);
    shardIterator = getShardIteratorResult.getShardIterator();

    GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
    while (shardIterator != null) { // was: !shardIterator.equals(null), which can never be a useful check
        getRecordsRequest.setShardIterator(shardIterator);
        getRecordsRequest.setLimit(250);
        GetRecordsResult getRecordsResult = client.getRecords(getRecordsRequest);
        List<Record> records = getRecordsResult.getRecords();
        shardIterator = getRecordsResult.getNextShardIterator();
        if (records.size() != 0) {
            for (Record r : records) {
                System.out.println(r.getPartitionKey());
            }
        }
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // ignored in the original; consider restoring the interrupt flag
        }
    }
}
It is recommended that you do not read from multiple shards in a single process/worker. First, as you can see, it adds to the complexity of your code, but more importantly, you will have problems scaling up.
The "secret" of scalability is to have small and independent workers or other such units. You can see such a design in Hadoop, DynamoDB or Kinesis in AWS. It allows you to build small systems (micro-services) that can easily scale up and down as needed. You can easily add more units of work/data as your service becomes more successful, or as its usage fluctuates.
As you can see in these AWS services, you sometimes get this scalability automatically, as in DynamoDB, and sometimes you need to add shards to your Kinesis streams. But for your application you need to control your scalability somehow.
In the case of Kinesis, you can scale up and down using AWS Lambda or the Kinesis Client Library (KCL). Both of them listen to the status of your streams (number of shards and events) and use that to add or remove workers and deliver the events to them for processing. In both of these solutions you should build a worker that works against a single shard.
If you need to align events from multiple shards, you can do that using some state service such as Redis or DynamoDB.
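That said, if you only need the low-level loop from the question to cover all shards, a rough sketch is to run one polling loop per shard on its own thread; the executor sizing and error handling here are illustrative only, and the KCL remains the more robust route:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One polling task per shard, reusing the client, streamName and shards list from the question.
ExecutorService pool = Executors.newFixedThreadPool(shards.size());
for (Shard shard : shards) {
    pool.submit(() -> {
        GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest();
        iteratorRequest.setStreamName(streamName);
        iteratorRequest.setShardId(shard.getShardId());
        iteratorRequest.setShardIteratorType("LATEST");
        String shardIterator = client.getShardIterator(iteratorRequest).getShardIterator();

        GetRecordsRequest recordsRequest = new GetRecordsRequest();
        while (shardIterator != null) {
            recordsRequest.setShardIterator(shardIterator);
            recordsRequest.setLimit(250);
            GetRecordsResult result = client.getRecords(recordsRequest);
            for (Record r : result.getRecords()) {
                System.out.println(shard.getShardId() + " -> " + r.getPartitionKey());
            }
            shardIterator = result.getNextShardIterator();
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    });
}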
For a simpler and neater solution where you only have to worry about providing your own message processing code, I would recommend using the KCL Library.
Quoting from the documentation
The KCL acts as an intermediary between your record processing logic and Kinesis Data Streams. The KCL performs the following tasks:
Connects to the data stream
Enumerates the shards within the data stream
Uses leases to coordinate shard associations with its workers
Instantiates a record processor for every shard it manages
Pulls data records from the data stream
Pushes the records to the corresponding record processor
Checkpoints processed records
Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
