Amazon Kinesis getRecords() mismatched results - Java

I just started using Kinesis, whose API is available here.
I've used this to push 100 records to Kinesis:
for (int j = 0; j < 100; j++) {
PutRecordRequest putRecordRequest = new PutRecordRequest();
putRecordRequest.setStreamName(myStreamName);
putRecordRequest.setData(ByteBuffer.wrap(data.getBytes()));
putRecordRequest.setPartitionKey(String.format("partitionKey-%d", j));
PutRecordResult putRecordResult = kinesisClient.putRecord(putRecordRequest);
System.out.println("Successfully putrecord, partition key : " + putRecordRequest.getPartitionKey()
+ ", ShardID : " + putRecordResult.getShardId() + ", Sequence No : "+ putRecordResult.getSequenceNumber());
}
Now I want to get the number of records that were pushed. For that I am using this:
Iterator<Shard> shardIterator = getTotalShardsIterator(); // Implemented; returns all the shards correctly
Now, using the above iterator, I am getting the count as:
.....
while (shardIterator.hasNext()) {
Shard shard = shardIterator.next();
String shardId = shard.getShardId();
int datacount = getDataCount(shardId, myStreamName);
totalStreamDataCount+= datacount;
System.out.println("Data Count for Shard " + shardId + " is : " + datacount);
}
.....
Here is my function getDataCount(shardId, myStreamName)
public static int getDataCount(String shardId, String streamName) {
int dataCount = 0;
String shardIterator;
GetShardIteratorRequest getShardIteratorRequest = new GetShardIteratorRequest();
getShardIteratorRequest.setStreamName(streamName);
getShardIteratorRequest.setShardId(shardId);
getShardIteratorRequest.setShardIteratorType(ShardIteratorType.TRIM_HORIZON);
GetShardIteratorResult getShardIteratorResult = kinesisClient.getShardIterator(getShardIteratorRequest);
shardIterator = getShardIteratorResult.getShardIterator();
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
getRecordsRequest.setShardIterator(shardIterator);
getRecordsRequest.setLimit(1000);
GetRecordsResult getRecordsResult = kinesisClient.getRecords(getRecordsRequest);
List<Record> records = getRecordsResult.getRecords();
if (!records.isEmpty()) {
dataCount = records.size();
Iterator<Record> iterator = records.iterator();
while(iterator.hasNext()) {
Record record = iterator.next();
byte[] bytes = record.getData().array();
String recordData = new String(bytes);
System.out.println("Shard Id. :"+shardId+"Seq. No. is : "+" Record data :"+recordData);
}
}
return dataCount;
}
But this code gives mismatched results every time I run it; sometimes it shows 81, sometimes 91.
Please shed some light on this. :)

Records pushed to Kinesis are subject to a throughput limit (data per second), so puts may fail if you exceed it (the limit depends on the number of shards used).
You can get the failed record count from the Kinesis API and push those records again.
List<CompletionStage<PutRecordsResponse>> putRecordResponseCompletionStage = <PutRecords call with stream name, partition key and data>;
AtomicInteger failedCount = new AtomicInteger();
AtomicInteger recordsCount = new AtomicInteger();
int loopCount = 0;
for (CompletionStage<PutRecordsResponse> stage : putRecordResponseCompletionStage) {
loopCount++;
try {
PutRecordsResponse response = stage.toCompletableFuture().get();
failedCount.addAndGet(response.failedRecordCount());
recordsCount.addAndGet(response.records().size());
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
}
Check example [here](https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html)
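If you are on the v1 SDK that the question uses, a similar check is available through the synchronous batch call. Below is a minimal sketch (not the poster's code), assuming a kinesisClient of type AmazonKinesis and an already prepared list of PutRecordsRequestEntry objects; it re-sends whatever entries the PutRecordsResult reports as failed:
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.model.*;

// Minimal sketch: send a batch with PutRecords, then re-send whatever failed.
// In real code you would also add a backoff and a retry cap.
static void putWithRetry(AmazonKinesis kinesisClient, String streamName,
                         List<PutRecordsRequestEntry> entries) {
    PutRecordsRequest request = new PutRecordsRequest()
            .withStreamName(streamName)
            .withRecords(entries);
    PutRecordsResult result = kinesisClient.putRecords(request);
    while (result.getFailedRecordCount() > 0) {
        List<PutRecordsRequestEntry> failed = new ArrayList<PutRecordsRequestEntry>();
        // The result entries line up positionally with the request entries,
        // so a non-null error code marks the original record that was rejected.
        for (int i = 0; i < result.getRecords().size(); i++) {
            if (result.getRecords().get(i).getErrorCode() != null) {
                failed.add(request.getRecords().get(i));
            }
        }
        System.out.println("Retrying " + failed.size() + " failed records");
        request = new PutRecordsRequest().withStreamName(streamName).withRecords(failed);
        result = kinesisClient.putRecords(request);
    }
}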

Related

Subtracting values of two maps whenever there is a key match?

I'll explain the logic: I am reading an XML file which contains many requests and responses in SOAP format, and I'm storing the requests and responses in two HashMaps. In the first HashMap I store the transaction Id (unique) as the key and the request time and til-name as the value. In the second HashMap I store the transaction Id (unique) as the key and the response time as the value. In both maps the keys are the same but the values are different. By iterating over the two maps I need to get the time difference between the response time and the request time.
e.g. request time: 2020-01-30T11:07:08.351Z and response time: 2020-01-30T11:07:10.152Z
public class MapTimeDiff {
public static void main(String[] args) throws ParseException {
File file =new File("C:\\Users\\gsanaulla\\Documents\\My Received Files\\ecarewsframework.xml");
Scanner in = null;
String tilname = null;
String transactionId = null;
String requesttime = null;
String responsetime = null;
Date dateOne = null;
Date dateTwo = null;
double timeDiff;
DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
Map<String,ArrayList<String>> request=new HashMap<String,ArrayList<String>>();
ArrayList<String> req=new ArrayList<String>();
Map<String,ArrayList<String>> response=new HashMap<String,ArrayList<String>>();
ArrayList<String> res=new ArrayList<String>();
try {
in = new Scanner(file);
while(in.hasNext())
{
String line=in.nextLine();
if(line.contains("</S:Envelope>")) {
System.out.println(line);
tilname=line.split("StartRecord><")[1].split("><")[0].split(":")[1];
System.out.println("tilname :: "+tilname);
transactionId = line.split("transactionId>")[1].split("<")[0];
System.out.println("transactio id :: "+transactionId);
requesttime=line.split("sourceTimestamp>")[1].split("<")[0];
System.out.println("request time is :: "+requesttime);
dateOne = df.parse(requesttime);
}
req.add(tilname);
req.add(dateOne.toString());
System.out.println("req is==== " +req);
request.put(transactionId,req);
System.out.println("request is==== " +request.get(transactionId));
if(line.contains("</SOAP-ENV:Envelope>")) {
//System.out.println(line);
if(line.contains("transactionId"))
{
responsetime=line.split("sourceTimestamp>")[1].split("<")[0];
transactionId = line.split("transactionId>")[1].split("<")[0];
System.out.println("responsetime :: "+responsetime);
System.out.println("transaction id "+transactionId);
dateTwo = df.parse(responsetime);
}
res.add(dateTwo.toString());
System.out.println("res is===== "+res);
response.put(transactionId,res);
System.out.println("response is===== "+response.get(transactionId));
for (Entry<String, ArrayList<String>> entry : request.entrySet()) {
for (Entry<String, ArrayList<String>> entry1 : response.entrySet()) {
System.out.println("Key = " + entry.getKey() +
", Value = " + entry.getValue());
System.out.println("Key = " + entry1.getKey() +
", Value = " + entry1.getValue());
if(request.keySet().equals(response.keySet())) {
timeDiff = (dateTwo.getTime() - dateOne.getTime());
}
}
}
}
}
}
catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I'm not sure if I understood your question correctly, but maybe you can do something similar to the following:
Map<String, List<String>> requests = Map.of("1", List.of("10,13,12"), "2", List.of("8,7,9"), "3", List.of("11"));
Map<String, List<String>> responses = Map.of("1", List.of("9,10,14"), "2", List.of("8,9,6,12"));
for(Map.Entry<String, List<String>> requestEntry : requests.entrySet()) {
String transactionId = requestEntry.getKey();
if(responses.containsKey(transactionId)) {
System.out.println("Transaction Id: " + transactionId);
for(int i = 0; i < min(requestEntry.getValue().size(), responses.get(transactionId).size()); i++) {
List<String> requestTimes = asList(requestEntry.getValue().get(i).split(","));
List<String> responseTimes = asList(responses.get(transactionId).get(i).split(","));
for(int j = 0; j < min(requestTimes.size(), responseTimes.size()); j++) {
int requestTime = parseInt(requestTimes.get(j));
int responseTime = parseInt(responseTimes.get(j));
System.out.println("Difference: " + abs(requestTime - responseTime));
}
}
}
}
As you can see, there are no responses for transactionId 3, so it is ignored.
If the lists for a key differ in size (transactionId 2), the surplus elements are also ignored.
Transaction Id: 1
Difference: 1
Difference: 3
Difference: 2
Transaction Id: 2
Difference: 0
Difference: 2
Difference: 3
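If the values really are ISO-8601 timestamps as in the question (request time 2020-01-30T11:07:08.351Z, response time 2020-01-30T11:07:10.152Z), the per-transaction difference can be computed with java.time instead of SimpleDateFormat. A small self-contained sketch with illustrative data:
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

public class TimestampDiff {
    public static void main(String[] args) {
        // Request and response times keyed by transaction Id (illustrative data).
        Map<String, String> requestTimes = Map.of("tx1", "2020-01-30T11:07:08.351Z");
        Map<String, String> responseTimes = Map.of("tx1", "2020-01-30T11:07:10.152Z");

        for (Map.Entry<String, String> entry : requestTimes.entrySet()) {
            String responseTime = responseTimes.get(entry.getKey());
            if (responseTime != null) {
                // Instant.parse accepts the trailing 'Z' directly, unlike the
                // "yyyy-MM-dd'T'HH:mm:ss.SSS" pattern used in the question.
                Duration diff = Duration.between(
                        Instant.parse(entry.getValue()), Instant.parse(responseTime));
                System.out.println(entry.getKey() + " took " + diff.toMillis() + " ms");
            }
        }
    }
}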

Overall count for substrings in a string java

I have a program which takes tweets from Twitter containing a specific word and searches through each tweet to count the occurrences of another word that relates to the topic (e.g. in this case the main word is cameron and it's searching for tax and panama). I have it working so that it counts within a specific tweet, but I can't seem to work out how to get a cumulative count over all the tweets. I've played around with incrementing a variable when the word occurs, but it doesn't seem to work. The code is below; I've taken out my Twitter API keys for obvious reasons.
public class TwitterWordCount {
public static void main(String[] args) {
ConfigurationBuilder configBuilder = new ConfigurationBuilder();
configBuilder.setOAuthConsumerKey(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthConsumerSecret(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthAccessToken(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthAccessTokenSecret(XXXXXXXXXXXXXXXXXX);
//create instance of twitter for searching etc.
TwitterFactory tf = new TwitterFactory(configBuilder.build());
Twitter twitter = tf.getInstance();
//build query
Query query = new Query("cameron");
//number of results pulled each time
query.setCount(100);
//set the language of the tweets that we want
query.setLang("en");
//Execute the query
QueryResult result;
try {
result = twitter.search(query);
//Get the results
List<Status> tweets = result.getTweets();
//Print out the information
for (Status tweet : tweets) {
//get information about the tweet
String userName = tweet.getUser().getName();
long userId = tweet.getUser().getId();
Date creationDate = tweet.getCreatedAt();
String tweetText = tweet.getText();
//print out the information
System.out.println();
System.out.println("Tweeted by " + userName + "(" + userId + ") on date " + creationDate);
System.out.println("Tweet: " + tweetText);
// System.out.println();
String s = tweetText;
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.print(matcher.group() + " ");
}
String str = s;
String findStr = "tax";
int lastIndex = 0;
int count = 0;
//int countall = 0;
while (lastIndex != -1) {
lastIndex = str.indexOf(findStr, lastIndex);
if (lastIndex != -1) {
count++;
lastIndex += findStr.length();
//countall++;
}
}
System.out.println();
System.out.println(findStr + " = " + count);
String two = tweetText;
String str2 = two;
String findStr2 = "panama";
int lastIndex2 = 0;
int count2 = 0;
while (lastIndex2 != -1) {
lastIndex2 = str2.indexOf(findStr2, lastIndex2);
if (lastIndex2 != -1) {
count++;
lastIndex2 += findStr.length();
}
System.out.println(findStr2 + " = " + count2);
}
}
}
catch (TwitterException ex) {
ex.printStackTrace();
}
}
}
I'm also aware that this definitely isn't the cleanest of programs; it's a work in progress!
You must define your count variables outside of the for-loop.
int countKeyword1 = 0;
int countKeyword2 = 0;
for (Status tweet : tweets) {
//increase count variables in you while loops
}
System.out.Println("Keyword1 occurrences : " + countKeyword1 );
System.out.Println("Keyword2 occurrences : " + countKeyword2 );
System.out.Println("All occurrences : " + (countKeyword1 + countKeyword2) );

Using Kafka low-level API, should I commit the offset when finished fetching data?

public void run() {
// find the meta data about the topic and partition we are interested in
PartitionMetadata metadata = findLeader(a_seedBrokers, a_port, a_topic, a_partition);
if (metadata == null) {
System.out.println("Can't find metadata for Topic and Partition. Exiting");
return;
}
if (metadata.leader() == null) {
System.out.println("Can't find Leader for Topic and Partition. Exiting");
return;
}
String leadBroker = metadata.leader().host();
String clientName = "Client_" + a_topic + "_" + a_partition;
SimpleConsumer consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName);
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
//long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.LatestTime(), clientName);
int numErrors = 0;
while (a_maxReads > 0) {
if (consumer == null) {
consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName);
}
FetchRequest req = new FetchRequestBuilder()
.clientId(clientName)
.addFetch(a_topic, a_partition, readOffset, 100000) // Note: this fetchSize of 100000 might need to be increased if large batches are written to Kafka
.build();
FetchResponse fetchResponse = consumer.fetch(req);
if (fetchResponse.hasError()) {
numErrors++;
// Something went wrong!
short code = fetchResponse.errorCode(a_topic, a_partition);
System.out.println("Error fetching data from the Broker:" + leadBroker + " Reason: " + code);
if (numErrors > 5) break;
if (code == ErrorMapping.OffsetOutOfRangeCode()) {
// We asked for an invalid offset. For simple case ask for the last element to reset
readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.LatestTime(), clientName);
continue;
}
consumer.close();
consumer = null;
try {
leadBroker = findNewLeader(leadBroker, a_topic, a_partition, a_port);
} catch (Exception e) {
e.printStackTrace();
}
continue;
}
numErrors = 0;
long numRead = 0;
for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(a_topic, a_partition)) {
long currentOffset = messageAndOffset.offset();
if (currentOffset < readOffset) {
System.out.println("Found an old offset: " + currentOffset + " Expecting: " + readOffset);
continue;
}
readOffset = messageAndOffset.nextOffset();
ByteBuffer payload = messageAndOffset.message().payload();
byte[] bytes = new byte[payload.limit()];
payload.get(bytes);
try {
dataPoints.add(simpleAPIConsumer.parse(simpleAPIConsumer.deserializing(bytes)));//add data to List
} catch (Exception e) {
e.printStackTrace();
}
numRead++;
a_maxReads--;
}
if (numRead == 0) {
try {
Thread.sleep(1000);
} catch (InterruptedException ie) {
}
}
}
simpleAPIConsumer.dataHandle(dataPoints);//Handel Data
if (consumer != null) consumer.close();
}
I found this method in Kafka source. Should I use it?
/**
* Commit offsets for a topic to Zookeeper
* @param request a [[kafka.javaapi.OffsetCommitRequest]] object.
* @return a [[kafka.javaapi.OffsetCommitResponse]] object.
*/
def commitOffsets(request: kafka.javaapi.OffsetCommitRequest):kafka.javaapi.OffsetCommitResponse = {
import kafka.javaapi.Implicits._
underlying.commitOffsets(request.underlying)
}
The purpose of committing an offset after every fetch is to achieve exactly-once message processing.
You need to make sure that you commit the offset once you have processed the message (where "process" means whatever you do with a message after you pull it out of Kafka). Think of it as wrapping message processing and the offset commit into a transaction where either both succeed or both fail.
This way, if your client crashes, you'll be able to start from the correct offset after you restart.
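As a rough sketch only: committing after processing with the javaapi SimpleConsumer from the question could look like the fragment below. The group name and correlation id are made up, and the OffsetCommitRequest/OffsetAndMetadata constructors vary between 0.8.x releases, so check them against your Kafka version.
// Classes involved: kafka.common.TopicAndPartition, kafka.common.OffsetAndMetadata,
// kafka.javaapi.OffsetCommitRequest / OffsetCommitResponse, java.util.LinkedHashMap.
simpleAPIConsumer.dataHandle(dataPoints);          // 1. process the fetched messages first

Map<TopicAndPartition, OffsetAndMetadata> offsets =
        new LinkedHashMap<TopicAndPartition, OffsetAndMetadata>();
offsets.put(new TopicAndPartition(a_topic, a_partition),
        new OffsetAndMetadata(readOffset, "no metadata", System.currentTimeMillis()));

OffsetCommitRequest commitRequest = new OffsetCommitRequest(
        "my-consumer-group",                       // made-up group id for illustration
        offsets,
        1,                                         // correlation id
        clientName,
        (short) 1);                                // version 0 = ZooKeeper, 1+ = Kafka-backed offsets
OffsetCommitResponse commitResponse = consumer.commitOffsets(commitRequest);  // 2. then commit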

How to retrieve more than 100 results using Twitter4j

I'm using the Twitter4j library to retrieve tweets, but I'm not getting nearly enough for my purposes. Currently, I'm getting the maximum of 100 from one page. How do I implement maxId and sinceId in the code below in Processing in order to retrieve more than 100 results from the Twitter search API? I'm totally new to Processing (and programming in general), so any bit of direction on this would be awesome! Thanks!
void setup() {
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setOAuthConsumerKey("xxxx");
cb.setOAuthConsumerSecret("xxxx");
cb.setOAuthAccessToken("xxxx");
cb.setOAuthAccessTokenSecret("xxxx");
Twitter twitter = new TwitterFactory(cb.build()).getInstance();
Query query = new Query("#peace");
query.setCount(100);
try {
QueryResult result = twitter.search(query);
ArrayList tweets = (ArrayList) result.getTweets();
for (int i = 0; i < tweets.size(); i++) {
Status t = (Status) tweets.get(i);
GeoLocation loc = t.getGeoLocation();
if (loc!=null) {
tweets.get(i++);
String user = t.getUser().getScreenName();
String msg = t.getText();
Double lat = t.getGeoLocation().getLatitude();
Double lon = t.getGeoLocation().getLongitude();
println("USER: " + user + " wrote: " + msg + " located at " + lat + ", " + lon);
}
}
}
catch (TwitterException te) {
println("Couldn't connect: " + te);
};
}
void draw() {
}
Unfortunately you can't, at least not in a direct way such as doing
query.setCount(101);
As the javadoc says, it will only allow up to 100 tweets.
In order to overcome this, you just have to ask for them in batches, and in every batch set the maximum ID of the query to be 1 less than the lowest ID you got from the previous batch. To wrap this up, you gather every tweet from the process into an ArrayList (which, by the way, should not stay generic, but have its type defined as ArrayList<Status> - an ArrayList that carries Status objects) and then print everything. Here's an implementation:
void setup() {
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setOAuthConsumerKey("xxxx");
cb.setOAuthConsumerSecret("xxxx");
cb.setOAuthAccessToken("xxxx");
cb.setOAuthAccessTokenSecret("xxxx");
Twitter twitter = new TwitterFactory(cb.build()).getInstance();
Query query = new Query("#peace");
int numberOfTweets = 512;
long lastID = Long.MAX_VALUE;
ArrayList<Status> tweets = new ArrayList<Status>();
while (tweets.size() < numberOfTweets) {
if (numberOfTweets - tweets.size() > 100)
query.setCount(100);
else
query.setCount(numberOfTweets - tweets.size());
try {
QueryResult result = twitter.search(query);
tweets.addAll(result.getTweets());
println("Gathered " + tweets.size() + " tweets");
for (Status t: tweets)
if(t.getId() < lastID) lastID = t.getId();
}
catch (TwitterException te) {
println("Couldn't connect: " + te);
};
query.setMaxId(lastID-1);
}
for (int i = 0; i < tweets.size(); i++) {
Status t = (Status) tweets.get(i);
GeoLocation loc = t.getGeoLocation();
String user = t.getUser().getScreenName();
String msg = t.getText();
String time = "";
if (loc!=null) {
Double lat = t.getGeoLocation().getLatitude();
Double lon = t.getGeoLocation().getLongitude();
println(i + " USER: " + user + " wrote: " + msg + " located at " + lat + ", " + lon);
}
else
println(i + " USER: " + user + " wrote: " + msg);
}
}
Note: The line
ArrayList<Status> tweets = new ArrayList<Status>();
should properly be:
List<Status> tweets = new ArrayList<Status>();
because you should always use the interface in case you want to add a different implementation. This of course, if you are on Processing 2.x will require this in the beginning:
import java.util.List;
Here's the function I made for my app based on the past answers. Thank you everybody for your solutions.
List<Status> tweets = new ArrayList<Status>();
void getTweets(String term)
{
int wantedTweets = 112;
long lastSearchID = Long.MAX_VALUE;
int remainingTweets = wantedTweets;
Query query = new Query(term);
try
{
while(remainingTweets > 0)
{
remainingTweets = wantedTweets - tweets.size();
if(remainingTweets > 100)
{
query.count(100);
}
else
{
query.count(remainingTweets);
}
QueryResult result = twitter.search(query);
tweets.addAll(result.getTweets());
Status s = tweets.get(tweets.size()-1);
lastSearchID = s.getId();
query.setMaxId(lastSearchID - 1);
remainingTweets = wantedTweets - tweets.size();
}
println("tweets.size() "+tweets.size() );
}
catch(TwitterException te)
{
System.out.println("Failed to search tweets: " + te.getMessage());
System.exit(-1);
}
}
From the Twitter search API doc:
At this time, users represented by access tokens can make 180 requests/queries per 15 minutes. Using application-only auth, an application can make 450 queries/requests per 15 minutes on its own behalf without a user context.
You can wait for 15 min and then collect another batch of 400 Tweets, something like:
if(tweets.size() % 400 == 0 ) {
try {
Thread.sleep(900000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
Just keep track of the lowest Status id and use that to set the max_id for subsequent search calls. This will allow you to step back through the results 100 at a time until you've got enough, e.g.:
boolean finished = false;
while (!finished) {
final QueryResult result = twitter.search(query);
final List<Status> statuses = result.getTweets();
long lowestStatusId = Long.MAX_VALUE;
for (Status status : statuses) {
// do your processing here and work out if you are 'finished' etc...
// Capture the lowest (earliest) Status id
lowestStatusId = Math.min(status.getId(), lowestStatusId);
}
// Subtracting one here because 'max_id' is inclusive
query.setMaxId(lowestStatusId - 1);
}
See Twitter's guide on Working with Timelines for more information.

Download a Large Number of Files Using the Java SDK for Amazon S3 Bucket

I have a large number of files that need to be downloaded from an S3 bucket. My problem is similar to this article except I am trying to run it in Java.
public static void main(String args[]) {
AWSCredentials myCredentials = new BasicAWSCredentials("key","secret");
TransferManager tx = new TransferManager(myCredentials);
File file = <thefile>
try{
MultipleFileDownload myDownload = tx.downloadDirectory("<bucket>", null, file);
System.out.println("Transfer: " + myDownload.getDescription());
System.out.println(" - State: " + myDownload.getState());
System.out.println(" - Progress: " + myDownload.getProgress().getBytesTransfered());
while (myDownload.isDone() == false) {
System.out.println("Transfer: " + myDownload.getDescription());
System.out.println(" - State: " + myDownload.getState());
System.out.println(" - Progress: " + myDownload.getProgress().getBytesTransfered());
try {
// Do work while we wait for our upload to complete...
Thread.sleep(500);
} catch (InterruptedException ex) {
ex.printStackTrace();
}
}
} catch(Exception e){
e.printStackTrace();
}
}
This was adapted from the TransferManager class example for multiple uploads. There are well over 100,000 objects in this bucket. Any help would be great.
Please use the list() method to get a list of your files, then use the get() method to get each file.
class S3 extends AmazonS3Client {
final String bucket;
S3(String u, String p, String Bucket) {
super(new BasicAWSCredentials(u, p));
bucket = Bucket;
}
String get(String k) {
try {
final S3Object f = getObject(bucket, k);
final BufferedInputStream i = new BufferedInputStream(f.getObjectContent());
final StringBuilder s = new StringBuilder();
final byte[] b = new byte[1024];
for (int n = i.read(b); n != -1; n = i.read(b)) {
s.append(new String(b, 0, n));
}
return s.toString();
} catch (Exception e) {
log("Cannot get " + bucket + "/" + k + " from S3 because " + e);
}
return null;
}
String[] list(String d) {
try {
final ObjectListing l = listObjects(bucket, d);
final List<S3ObjectSummary> L = l.getObjectSummaries();
final int n = L.size();
final String[] s = new String[n];
for (int i = 0; i < n; ++i) {
final S3ObjectSummary k = L.get(i);
s[i] = k.getKey();
}
return s;
} catch (Exception e) {
log("Cannot list " + bucket + "/" + d + " on S3 because " + e);
}
return new String[]{};
}
}
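One caveat for a bucket of this size: listObjects returns at most 1,000 keys per call, so with well over 100,000 objects the listing must be paginated. A sketch of a method that could be added to the S3 class above to walk every page (the loop is illustrative, not part of the original answer, and needs java.util.ArrayList/List imports):
// Collect every key under a prefix, following the truncated listing pages.
String[] listAll(String d) {
    final List<String> keys = new ArrayList<String>();
    ObjectListing listing = listObjects(bucket, d);
    while (true) {
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            keys.add(summary.getKey());
        }
        if (!listing.isTruncated()) {
            break;                                  // last page reached
        }
        listing = listNextBatchOfObjects(listing);  // fetch the next page of up to 1,000 keys
    }
    return keys.toArray(new String[0]);
}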
TransferManager internally uses a CountDownLatch, which makes me believe it does concurrent downloads (which seems the right way to do it). Does it make sense to use it rather than getting one file after another sequentially?
