Only half of the BinaryDocument(s) are getting inserted during bulk insert - java

I am having a weird problem during insertion. I have two types of documents: JSON and BinaryDocument. I am performing a bulk insert operation restricted to a batch size.
The operation works fine for JSON documents. But if I upload, say, 100 documents, only 50 get uploaded in the case of BinaryDocument. Every time, only half of the documents end up in the database.
Here is my code for JSON document insertion:
public void createMultipleCustomerDocuments(String docId, Customer myCust, long numDocs, int batchSize) {
    Gson gson = new GsonBuilder().create();
    JsonObject content = JsonObject.fromJson(gson.toJson(myCust));
    JsonDocument document = JsonDocument.create(docId, content);
    jsonDocuments.add(document);
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get());
        Observable
            .from(jsonDocuments)
            .flatMap(new Func1<JsonDocument, Observable<JsonDocument>>() {
                public Observable<JsonDocument> call(final JsonDocument docToInsert) {
                    return theBucket.async().upsert(docToInsert);
                }
            })
            .last()
            .toList()
            .toBlocking()
            .single();
        jsonDocuments.clear();
        documentCounter.set(0);
    }
}
This works completely fine; I have no problems with insertion.
Here is the code for my BinaryDocument insertion:
public void createMultipleCustomerDocuments(final String docId, ByteBuffer myCust, long numDocs, int batchSize) throws BackpressureException, InterruptedException {
    ByteBuf buffer = Unpooled.wrappedBuffer(myCust);
    binaryDocuments.add(buffer);
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get() + " Binary Document list size: " + binaryDocuments.size());
        Observable
            .from(binaryDocuments)
            .flatMap(new Func1<ByteBuf, Observable<BinaryDocument>>() {
                public Observable<BinaryDocument> call(final ByteBuf docToInsert) {
                    //docToInsert.retain();
                    return theBucket.async().upsert(BinaryDocument.create(docId, docToInsert));
                }
            })
            .last()
            .toList()
            .toBlocking()
            .single();
        binaryDocuments.clear();
        documentCounter.set(0);
    }
}
This fails. Exactly half the number of documents get inserted. The console output is exactly the same as for the JSON document function, and documentCounter shows the correct number, but the number of documents that actually end up in the database is only half of that.
Can someone please help me with this?

You seem to be using the same document id (i.e. the docId of the last member of the batch) to create all documents in the same batch, in
BinaryDocument.create(docId, docToInsert)
You should build up your list of BinaryDocument objects outside the if statement (like you did with the JsonDocument version). Something like
public void createMultipleCustomerDocuments(final String docId, ByteBuffer myCust, int batchSize) throws BackpressureException, InterruptedException {
    // numDocs is redundant
    ByteBuf buffer = Unpooled.wrappedBuffer(myCust);
    binaryDocuments.add(BinaryDocument.create(docId, buffer)); // ArrayList<BinaryDocument> type
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get() + " Binary Document list size: " + binaryDocuments.size());
        Observable
            .from(binaryDocuments)
            .flatMap(new Func1<BinaryDocument, Observable<BinaryDocument>>() {
                public Observable<BinaryDocument> call(final BinaryDocument docToInsert) {
                    return theBucket.async().upsert(docToInsert);
                }
            })
            .last()
            .toBlocking()
            .single();
        binaryDocuments.clear();
        documentCounter.set(0);
    }
}
should work.
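For completeness, here is one hedged sketch of how a caller might drive the batched method so that every document in a batch gets a unique key. The loadCustomers and toByteBuffer helpers, the id scheme, and the batch size are assumptions, not part of the original code, and the surrounding method would also need to declare the checked exceptions thrown by createMultipleCustomerDocuments:

// Hypothetical caller: each document gets its own id, so no two
// BinaryDocuments in the same batch collide on the same key.
List<Customer> customers = loadCustomers();               // assumed helper
int batchSize = 50;
for (int i = 0; i < customers.size(); i++) {
    String docId = "customer::" + i;                       // unique id per document
    ByteBuffer payload = toByteBuffer(customers.get(i));   // assumed serializer
    createMultipleCustomerDocuments(docId, payload, batchSize);
}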

Related

Akka references increasing constantly with Play Framework

A few weeks ago I changed all the multi-threaded work in my application over to Akka.
However, it seems that I am starting to run out of heap space (after a week or so).
Basically, looking at all actors with
ActorSelection selection = getContext().actorSelection("/*");
the number of actors seems to increase all the time. After an hour of running I have more than 2200. They have names like:
akka://application/user/$Aic
akka://application/user/$Alb
akka://application/user/$Alc
akka://application/user/$Am
akka://application/user/$Amb
I also noticed that when opening (and closing) websockets there are these:
akka://application/system/Materializers/StreamSupervisor-2/flow-21-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-2-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-27-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-23-0-unnamed
Is there something specific that I need to do to close them and let them be cleaned?
I am not sure the memory issue is related, but given how many of them there are after just an hour on the production server, it could be.
[EDIT: added the code to analyse/count the actors]
public class RetrieveActors extends AbstractActor {

    private String identifyId;
    private List<String> list;

    public RetrieveActors(String identifyId) {
        Logger.debug("Actor retriever identity: " + identifyId);
        this.identifyId = identifyId;
    }

    @Override
    public Receive createReceive() {
        Logger.info("RetrieveActors");
        return receiveBuilder()
            .match(String.class, request -> {
                //Logger.info("Message: " + request + " " + new Date());
                if (request.equalsIgnoreCase("run")) {
                    list = new ArrayList<>();
                    ActorSelection selection = getContext().actorSelection("/*");
                    selection.tell(new Identify(identifyId), getSelf());
                    //ask(selection, new Identify(identifyId), 1000).thenApply(response -> (Object) response).toCompletableFuture().get();
                } else if (request.equalsIgnoreCase("result")) {
                    //Logger.debug("Run list: " + list + " " + new Date());
                    sender().tell(list, self());
                } else {
                    sender().tell("Wrong command: " + request, self());
                }
            }).match(ActorIdentity.class, identity -> {
                if (identity.correlationId().equals(identifyId)) {
                    ActorRef ref = identity.getActorRef().orElse(null);
                    if (ref != null) { // to avoid NullPointerExceptions
                        // Log or store the identity of the actor who replied
                        //Logger.info("The actor " + ref.path().toString() + " exists and has replied!");
                        list.add(ref.path().toString());
                        // We want to discover all children of the received actor (recursive traversal)
                        ActorSelection selection = getContext().actorSelection(ref.path().toString() + "/*");
                        selection.tell(new Identify(identifyId), getSelf());
                    }
                }
                sender().tell(list.toString(), self());
            }).build();
    }
}
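For context, here is a minimal, hypothetical sketch of how such a counting actor could be driven from outside. The actor system name, the use of Props.create, the sleep, and the Patterns.ask timeout are assumptions, not taken from the original code, and Patterns.ask with java.time.Duration requires a reasonably recent Akka version:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.pattern.Patterns;

import java.time.Duration;

public class ActorCountCheck {
    public static void main(String[] args) throws Exception {
        // Assumed: the actor system is called "application", matching the paths above
        ActorSystem system = ActorSystem.create("application");
        ActorRef retriever = system.actorOf(
                Props.create(RetrieveActors.class, "count-1"), "retrieveActors");

        retriever.tell("run", ActorRef.noSender()); // start the Identify sweep
        Thread.sleep(5000);                         // crude wait for replies (illustration only)

        // Ask for the list of actor paths collected so far
        Patterns.ask(retriever, "result", Duration.ofSeconds(2))
                .thenAccept(paths -> System.out.println("Known actors: " + paths));
    }
}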

Commit Offsets to Kafka on Spark Executors

I am getting events from Kafka, enriching/filtering/transforming them in Spark and then storing them in ES. I am committing the offsets back to Kafka.
I have two questions/problems:
(1) My current Spark job is VERY slow
I have 50 partitions for a topic and 20 executors. Each executor has 2 cores and 4g of memory. My driver has 8g of memory. I am consuming 1000 events/partition/second and my batch interval is 10 seconds, which means I am consuming 500,000 events per 10-second batch.
My ES cluster is as follows:
20 shards / index
3 master instances c5.xlarge.elasticsearch
12 instances m4.xlarge.elasticsearch
disk / node = 1024 GB so 12 TB in total
And I am getting huge scheduling and processing delays.
(2) How can I commit offsets on executors?
Currently, I enrich/transform/filter my events on the executors and then send everything to ES using a BulkRequest. It's a synchronous process. If I get positive feedback, I send the offset list to the driver; if not, I send back an empty list. On the driver, I commit the offsets to Kafka. I believe there should be a way to commit offsets on the executors, but I don't know how to pass the Kafka stream to the executors:
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges, this::onComplete);
This is the code that commits offsets back to Kafka, and it requires the Kafka stream, which lives on the driver.
Here is my overall code:
kafkaStream.foreachRDD( // kafka topic
    rdd -> { // runs on driver
        rdd.cache();

        String batchIdentifier =
            Long.toHexString(Double.doubleToLongBits(Math.random()));

        LOGGER.info("## [" + batchIdentifier + "] Starting batch ...");
        Instant batchStart = Instant.now();

        List<OffsetRange> offsetsToCommit =
            rdd.mapPartitionsWithIndex( // kafka partition
                (index, eventsIterator) -> { // runs on worker
                    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                    LOGGER.info(
                        "## Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
                    );

                    if (!eventsIterator.hasNext()) {
                        return Collections.emptyIterator();
                    }

                    // get single ES documents
                    List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);

                    // build request wrappers
                    List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);

                    LOGGER.info(
                        "## Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
                    );

                    BulkResponse bulkItemResponses = elasticSearchRepository.addElasticSearchDocumentsSync(requestWrapperList);

                    if (!bulkItemResponses.hasFailures()) {
                        return Arrays.asList(offsetRanges).iterator();
                    }

                    elasticSearchRepository.close();

                    return Collections.emptyIterator();
                },
                true
            ).collect();

        LOGGER.info(
            "## [" + batchIdentifier + "] Collected all offsets in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
        );

        OffsetRange[] offsets = new OffsetRange[offsetsToCommit.size()];

        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = offsetsToCommit.get(i);
        }

        try {
            offsetManagementMapper.commit(offsets);
        } catch (Exception e) {
            // ignore
        }

        LOGGER.info(
            "## [" + batchIdentifier + "] Finished batch of " + offsetsToCommit.size() + " messages " +
                "in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
        );

        rdd.unpersist();
    });
You can move the offset logic above the RDD loop. I am using the template below for better offset handling and performance:
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

kafkaStream.foreachRDD(kafkaStreamRDD -> {
    // fetch kafka offsets for manually committing them later
    OffsetRange[] offsetRanges = ((HasOffsetRanges) kafkaStreamRDD.rdd()).offsetRanges();

    // filter unwanted data
    kafkaStreamRDD.filter(
        new Function<ConsumerRecord<String, String>, Boolean>() {
            @Override
            public Boolean call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
                if (kafkaRecord != null) {
                    if (!StringUtils.isAnyBlank(kafkaRecord.key(), kafkaRecord.value())) {
                        return Boolean.TRUE;
                    }
                }
                return Boolean.FALSE;
            }
        }).foreachPartition(kafkaRecords -> {
            // init connections here
            while (kafkaRecords.hasNext()) {
                ConsumerRecord<String, String> kafkaConsumerRecord = kafkaRecords.next();
                // work here
            }
        });

    // commit offsets
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
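To connect the template back to the question, here is a hedged sketch of how the enrich-and-bulk-index step could sit inside foreachPartition, with the offsets still committed once on the driver afterwards. The chunk size and the toDocument/bulkIndex helpers are assumptions, not part of either code sample above:

kafkaStreamRDD.foreachPartition(kafkaRecords -> {
    // Assumed helpers: toDocument(...) transforms one record, bulkIndex(...) sends a synchronous bulk request to ES
    List<SingleEventBaseDocument> buffer = new ArrayList<>();
    while (kafkaRecords.hasNext()) {
        ConsumerRecord<String, String> record = kafkaRecords.next();
        buffer.add(toDocument(record));
        if (buffer.size() >= 1000) { // flush in chunks to keep bulk requests bounded
            bulkIndex(buffer);
            buffer.clear();
        }
    }
    if (!buffer.isEmpty()) {
        bulkIndex(buffer); // flush the tail
    }
});

// offsetRanges were captured on the driver before any transformation,
// so committing them here gives at-least-once delivery into Elasticsearch
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);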

Java ExecutorService Runnable doesn't update value

I'm using Java to download the HTML contents of websites whose URLs are stored in a database. I'd like to put their HTML into the database, too.
I'm using Jsoup for this purpose:
public String downloadHTML(String byLink) {
    String htmlInPage = "";
    try {
        Document doc = Jsoup.connect(byLink).get();
        htmlInPage = doc.html();
    } catch (org.jsoup.UnsupportedMimeTypeException e) {
        // process this and some other exceptions
    }
    return htmlInPage;
}
I'd like to download websites concurrently and use this function:
public void downloadURL(int websiteId, String url,
                        String categoryName, ExecutorService executorService) {
    executorService.submit((Runnable) () -> {
        String htmlInPage = downloadHTML(url);
        System.out.println("Category: " + categoryName + " " + websiteId + " " + url);
        String insertQuery =
            "INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
        dbUtils.query(insertQuery, websiteId, htmlInPage);
    });
}
dbUtils is my class based on Apache Commons DbUtils. Details are here: http://pastebin.com/iAKXchbQ
And I'm using everything mentioned above as follows (the List<Object[]> details are explained on pastebin, too):
public static void main(String[] args) {
    DbUtils dbUtils = new DbUtils("host", "db", "driver", "user", "pass");
    List<String> categoriesList =
        Arrays.asList("weapons", "planes", "cooking", "manga");
    String sql = "SELECT lw.id, lw.website_url, category_name " +
        "FROM list_of_websites AS lw JOIN list_of_categories AS lc " +
        "ON lw.category_id = lc.id " +
        "where category_name = ? ";
    ExecutorService executorService = Executors.newFixedThreadPool(10);

    for (String category : categoriesList) {
        List<Object[]> sitesInCategory = dbUtils.select(sql, category);
        for (Object[] entry : sitesInCategory) {
            int websiteId = (int) entry[0];
            String url = (String) entry[1];
            String categoryName = (String) entry[2];
            downloadURL(websiteId, url, categoryName, executorService);
        }
    }
    executorService.shutdown();
}
I'm not sure if this solution is correct, but it works. Now I want to modify the code to save the HTML not for all websites in my database, but only for a fixed number of them in each category.
For example, download and save the HTML of 50 websites from the "weapons" category, 50 from "planes", etc. I don't think SQL alone is enough for this purpose: selecting 50 sites per category doesn't mean we save all 50 of them, because of possibly incorrect syntax and connection problems.
I've tried to create a separate class implementing Runnable with counter and maxWebsitesPerCategory fields, but these variables aren't updated. Another idea was to create a Map<String, Integer> sitesInCategory field instead of the counter, put each category in as a key and increment its value until it reaches maxWebsitesPerCategory, but that didn't work either. Please help me!
P.S.: I'd also be grateful for any recommendations about my implementation of concurrent downloading (I haven't worked with concurrency in Java before and this is my first attempt).
How about this?
for (String category : categoriesList) {
    dbUtils.select(sql, category).stream()
        .limit(50)
        .forEach(entry -> {
            int websiteId = (int) entry[0];
            String url = (String) entry[1];
            String categoryName = (String) entry[2];
            downloadURL(websiteId, url, categoryName, executorService);
        });
}
sitesInCategory has been replaced with a stream of at most 50 elements, then your code is run on each entry.
EDIT
In regard to the comments: I've gone ahead and restructured a bit; you can modify/implement the content of the methods I've suggested.
public void werk(Queue<Object[]> q, ExecutorService executorService) {
    executorService.submit(() -> {
        try {
            Object[] o = q.remove();
            try {
                String html = downloadHTML(o); // this takes one of your object arrays and returns the text of an html page
                insertIntoDB(html);            // this is the code in the latter half of your downloadURL method
            } catch (/*narrow exception type indicating download failure*/Exception e) {
                werk(q, executorService);
            }
        } catch (NoSuchElementException e) {}
    });
}
^^^ This method does most of the work.
for (String category : categoriesList) {
    Queue<Object[]> q = new ConcurrentLinkedQueue<>(dbUtils.select(sql, category));
    IntStream.range(0, 50).forEach(i -> werk(q, executorService));
}
^^^ this is the for loop in your main
Now each category tries to download 50 pages; upon failure to download a page, it moves on and tries another one. In this way you will either download 50 pages or have attempted to download every page in the category.
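Since werk leaves downloadHTML(Object[]) and insertIntoDB(...) as placeholders, here is one hedged way they could be filled in, reusing the Jsoup and DbUtils pieces from the question. The exact exception type is an assumption, and insertIntoDB here also receives the entry so the website id can be stored, a small deviation from the sketch above:

// Assumed: a download failure surfaces as an IOException from Jsoup
private String downloadHTML(Object[] entry) throws IOException {
    String url = (String) entry[1]; // same column layout as in main()
    return Jsoup.connect(url).get().html();
}

// Assumed: html_data rows also need the website id, so the entry is passed along too
private void insertIntoDB(Object[] entry, String html) {
    int websiteId = (int) entry[0];
    String insertQuery =
        "INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
    dbUtils.query(insertQuery, websiteId, html);
}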

twitter4j - Count the number of tweets within 24 hours, return an integer

I'm trying to retrieve, as a single integer, the number of tweets containing a certain keyword within 24 hours.
So say the keyword is "traffic": I want to count the number of tweets with the word "traffic" within the past 24 hours and store it as a number, to be used to generate other things.
Right now I can provide a specific number using query.setCount and retrieve an arbitrary number (1024) of tweets from the past 24 hours, but I have no way of telling whether this is ALL the tweets within 24 hours. All I really want is a number; I don't need the actual text or other information of the tweets. Also, as new tweets come in, I'd like that number to update.
How could I go about doing this?
Here's my getNewTweets method so far:
void getNewTweets() {
    SimpleDateFormat sdf = new SimpleDateFormat("y-M-d");
    Calendar calendar = Calendar.getInstance();
    calendar.add(Calendar.HOUR_OF_DAY, -24);
    String yesterday = sdf.format(calendar.getTime());

    Query query = new Query("traffic");
    query.setSince(yesterday);
    int numberOfTweets = 1024;
    long lastID = Long.MAX_VALUE;

    while (tweets.size() < numberOfTweets) {
        if (numberOfTweets - tweets.size() > 100)
            query.setCount(100);
        else
            query.setCount(numberOfTweets - tweets.size());
        try {
            QueryResult result = twitter.search(query);
            tweets.addAll(result.getTweets());
            println("Gathered " + tweets.size() + " tweets");
            for (Status t : tweets)
                if (t.getId() < lastID) lastID = t.getId();
        }
        catch (TwitterException te) {
            println("Couldn't connect: " + te);
        }
        query.setMaxId(lastID - 1);
    }
}
You cannot tell the exact count of tweets for a specific filter/search query; both APIs are rate limited.
You would have to use the firehose to get all of the tweet data, and that is paid.
Below is an excerpt from the Twitter dev docs:
Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead
Please read the following link for more understanding of the Streaming API's rate limiting:
https://twittercommunity.com/t/how-much-data-returned-when-using-streaming-api/8407
That said (see @mbaxi's answer), I think that for a word that is not hugely popular the Streaming API would be suitable for this task. I've been running this code for 5 minutes with the very popular word "love" and have got no warnings so far, and about 25000 tweets in love...
I made a very simple and imprecise timer just for the example's sake. Although you said you don't want the text, it is being printed to the console...
Here is an example:
import twitter4j.util.*;
import twitter4j.*;
import twitter4j.management.*;
import twitter4j.api.*;
import twitter4j.conf.*;
import twitter4j.json.*;
import twitter4j.auth.*;
int startTime;
int tweetNumber;
PFont f ;
String theWord = "love";
TwitterStream twitterStream;
void setup() {
    size(800, 100);
    background(0);
    f = createFont("SourceCodePro-Regular", 25);
    textFont(f);
    openTwitterStream();
    startTime = minute();
}

void draw() {
    background(0);
    int passedTime = minute() - startTime;
    text("Received " + nf(tweetNumber, 5) + " tweets with the word: " + theWord, 30, height - 50);
    text("in last " + nf(passedTime, 3) + " minutes", 30, height - 25);
}

// Stream it
void openTwitterStream() {
    ConfigurationBuilder cb = new ConfigurationBuilder();
    cb.setOAuthConsumerKey("-----FILL-----");
    cb.setOAuthConsumerSecret("-----FILL-----");
    cb.setOAuthAccessToken("-----FILL-----");
    cb.setOAuthAccessTokenSecret("-----FILL-----");

    TwitterStream twitterStream = new TwitterStreamFactory(cb.build()).getInstance();

    FilterQuery filtered = new FilterQuery();
    // if you enter keywords here it will filter, otherwise it will sample
    String keywords[] = {
        theWord
    };

    filtered.track(keywords);
    twitterStream.addListener(listener);

    if (keywords.length == 0) {
        // sample() method internally creates a thread which manipulates TwitterStream
        twitterStream.sample(); // and calls these adequate listener methods continuously.
    } else {
        twitterStream.filter(filtered);
    }
    println("connected");
}
// Implementing StatusListener interface
StatusListener listener = new StatusListener() {

    //@Override
    public void onStatus(Status status) {
        tweetNumber++;
        System.out.println("@" + status.getUser().getScreenName() + " - " + status.getText());
    }

    //@Override
    public void onDeletionNotice(StatusDeletionNotice statusDeletionNotice) {
        System.out.println("Got a status deletion notice id:" + statusDeletionNotice.getStatusId());
    }

    //@Override
    public void onTrackLimitationNotice(int numberOfLimitedStatuses) {
        System.out.println("Got track limitation notice:" + numberOfLimitedStatuses);
    }

    //@Override
    public void onScrubGeo(long userId, long upToStatusId) {
        System.out.println("Got scrub_geo event userId:" + userId + " upToStatusId:" + upToStatusId);
    }

    //@Override
    public void onStallWarning(StallWarning warning) {
        System.out.println("Got stall warning:" + warning);
    }

    //@Override
    public void onException(Exception ex) {
        ex.printStackTrace();
    }
};
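If the goal is specifically a rolling 24-hour count rather than a count since the sketch started, one hedged option is to timestamp each incoming status and prune old entries before reading the number. The queue-based approach below is an assumption, not part of the original answer; recordTweet() would be called from onStatus() instead of tweetNumber++, and tweetsInLast24Hours() from draw():

// Sketch: store arrival times and drop anything older than 24 hours
java.util.Deque<Long> arrivalTimes = new java.util.concurrent.ConcurrentLinkedDeque<Long>();

void recordTweet() {
    arrivalTimes.addLast(System.currentTimeMillis());
}

int tweetsInLast24Hours() {
    long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
    while (!arrivalTimes.isEmpty() && arrivalTimes.peekFirst() < cutoff) {
        arrivalTimes.pollFirst(); // prune entries outside the 24-hour window
    }
    return arrivalTimes.size();
}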

Created documents are not versionable

I use OpenCmis in-memory for testing. But when I create a document I am not allowed to set the versioningState to anything other than VersioningState.NONE.
The document that gets created is somehow not versionable... I used the code from http://chemistry.apache.org/java/examples/example-create-update.html
The test method:
public void test() {
    String filename = "test123";
    Folder folder = this.session.getRootFolder();

    // Create a doc
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
    properties.put(PropertyIds.NAME, filename);

    String docText = "This is a sample document";
    byte[] content = docText.getBytes();
    InputStream stream = new ByteArrayInputStream(content);
    ContentStream contentStream = this.session.getObjectFactory().createContentStream(filename, Long.valueOf(content.length), "text/plain", stream);

    Document doc = folder.createDocument(
        properties,
        contentStream,
        VersioningState.MAJOR);
}
The exception I get:
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The versioning state flag is imcompatible to the type definition.
What am I missing?
I found the reason...
By executing the following code I discovered that the OBJECT_TYPE_ID 'cmis:document' doesn't allow versioning.
Code (Groovy) to view all available OBJECT_TYPE_IDs (source):
boolean includePropertyDefintions = true;

for (t in session.getTypeDescendants(
        null, // start at the top of the tree
        -1,   // infinite depth recursion
        includePropertyDefintions // include prop defs
)) {
    printTypes(t, "");
}

static void printTypes(Tree tree, String tab) {
    ObjectType objType = tree.getItem();
    println(tab + "TYPE:" + objType.getDisplayName() +
            " (" + objType.getDescription() + ")");

    // Print some of the common attributes for this type
    print(tab + "  Id:" + objType.getId());
    print("  Fileable:" + objType.isFileable());
    print("  Queryable:" + objType.isQueryable());
    if (objType instanceof DocumentType) {
        print("  [DOC Attrs->] Versionable:" +
                ((DocumentType) objType).isVersionable());
        print("  Content:" +
                ((DocumentType) objType).getContentStreamAllowed());
    }
    println(""); // end the line

    for (t in tree.getChildren()) {
        // there are more - call self for next level
        printTypes(t, tab + "  ");
    }
}
This resulted in a list like this:
TYPE:CMIS Folder (Description of CMIS Folder Type) Id:cmis:folder
Fileable:true Queryable:true
TYPE:CMIS Document (Description of CMIS Document Type)
Id:cmis:document Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:My Type 1 Level 1 (Description of My Type 1 Level 1 Type)
Id:MyDocType1 Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:VersionedType (Description of VersionedType Type)
Id:VersionableType Fileable:true Queryable:true [DOC Attrs->]
Versionable:true Content:ALLOWED
As you can see, the last OBJECT_TYPE_ID has Versionable:true, and when I use that one it does work.
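For reference, here is a hedged version of the original test with only the type id swapped to the versionable type from the listing above (assuming the in-memory repository really exposes it under the id 'VersionableType', as the listing suggests):

Map<String, Object> properties = new HashMap<String, Object>();
properties.put(PropertyIds.OBJECT_TYPE_ID, "VersionableType"); // the versionable type discovered above
properties.put(PropertyIds.NAME, "test123");

byte[] content = "This is a sample document".getBytes();
ContentStream contentStream = session.getObjectFactory()
        .createContentStream("test123", Long.valueOf(content.length), "text/plain",
                new ByteArrayInputStream(content));

// VersioningState.MAJOR is now accepted because this type definition allows versioning
Document doc = folder.createDocument(properties, contentStream, VersioningState.MAJOR);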
