Out of memory in ElasticSearch - java

I'm trying to index some data in ES and I'm getting an out-of-memory exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.jackson.core.util.BufferRecycler.balloc(BufferRecycler.java:155)
at org.elasticsearch.common.jackson.core.util.BufferRecycler.allocByteBuffer(BufferRecycler.java:96)
at org.elasticsearch.common.jackson.core.util.BufferRecycler.allocByteBuffer(BufferRecycler.java:86)
at org.elasticsearch.common.jackson.core.io.IOContext.allocWriteEncodingBuffer(IOContext.java:152)
at org.elasticsearch.common.jackson.core.json.UTF8JsonGenerator.<init>(UTF8JsonGenerator.java:123)
at org.elasticsearch.common.jackson.core.JsonFactory._createUTF8Generator(JsonFactory.java:1284)
at org.elasticsearch.common.jackson.core.JsonFactory.createGenerator(JsonFactory.java:1016)
at org.elasticsearch.common.xcontent.json.JsonXContent.createGenerator(JsonXContent.java:68)
at org.elasticsearch.common.xcontent.XContentBuilder.<init>(XContentBuilder.java:96)
at org.elasticsearch.common.xcontent.XContentBuilder.builder(XContentBuilder.java:77)
at org.elasticsearch.common.xcontent.json.JsonXContent.contentBuilder(JsonXContent.java:38)
at org.elasticsearch.common.xcontent.XContentFactory.contentBuilder(XContentFactory.java:122)
at org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder(XContentFactory.java:49)
at EsController.importProductEs(EsController.java:60)
at Parser.fromCsvToJson(Parser.java:120)
at CsvToJsonParser.parseProductFeeds(CsvToJsonParser.java:43)
at MainParser.main(MainParser.java:49)
This is how I instantiate the ES client:
System.out.println("Elastic search client is instantiated");
Settings settings = ImmutableSettings.settingsBuilder().put("cluster.name", "elasticsearch_brew").build();
client = new TransportClient(settings);
String hostname = "localhost";
int port = 9300;
((TransportClient) client).addTransportAddress(new InetSocketTransportAddress(hostname, port));
bulkRequest = client.prepareBulk();
and then I run the bulk request:
// for each product in the list, we need to include the fields in the bulk request
for(HashMap<String, String> productfields : products)
try {
bulkRequest.add(client.prepareIndex(index,type,productfields.get("Product_Id"))
.setSource(jsonBuilder()
.startObject()
.field("Name",productfields.get("Name") )
.field("Quantity",productfields.get("Quantity"))
.field("Make", productfields.get("Make"))
.field("Price", productfields.get("Price"))
.endObject()
)
);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
//execute the bulk request
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
// process failures by iterating through each bulk response item
}
I am trying to index products from various shops. Each shop is a different index. When I reach the 6th shop, containing around 60,000 products, I get the above exception. I split the bulk request into chunks of 10,000 to try to avoid the out-of-memory problem.
I can't understand where exactly the bottleneck is. Would it help if I somehow flushed the bulk request or restarted the client?
I've seen similar posts, but none of them works for me.
EDIT
When I instantiate a new client every time I process a new bulk request, I don't get the out-of-memory exception. But instantiating a new client each time doesn't seem right...
Thank you

So I figured out what was wrong.
Every new bulk request was being added on top of the previous one, and eventually that led to the out-of-memory error.
So now, before I start a new bulk request, I run
bulkRequest = client.prepareBulk();
which starts a fresh request instead of piling onto the previous one.
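For completeness, here is a minimal sketch of the indexing loop with the reset in place (reusing client, index, type and products from the question; the chunk size and failure handling are only illustrative, and the enclosing method is assumed to declare throws IOException):
int chunkSize = 10000; // same order of magnitude as in the question
BulkRequestBuilder bulkRequest = client.prepareBulk();

for (HashMap<String, String> productfields : products) {
    bulkRequest.add(client.prepareIndex(index, type, productfields.get("Product_Id"))
            .setSource(jsonBuilder()
                    .startObject()
                    .field("Name", productfields.get("Name"))
                    .field("Quantity", productfields.get("Quantity"))
                    .field("Make", productfields.get("Make"))
                    .field("Price", productfields.get("Price"))
                    .endObject()));

    if (bulkRequest.numberOfActions() >= chunkSize) {
        BulkResponse response = bulkRequest.execute().actionGet();
        if (response.hasFailures()) {
            // inspect the failed items here
        }
        bulkRequest = client.prepareBulk(); // fresh request, so old documents are not kept around
    }
}

if (bulkRequest.numberOfActions() > 0) {
    bulkRequest.execute().actionGet(); // send whatever is left over
}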
Thank you guys for your comments

Related

Get last JSON array element with HTTP request

Is it possible to get the last / penultimate element of a JSON array file hosted on GitHub without downloading the entire file?
The file is 10 MB and I only need the last two elements of the array; every time I fetch the information it takes a long time to load because of the size of the file.
File link: https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json
I retrieve the information with this code:
RequestQueue queue = Volley.newRequestQueue(getApplicationContext());
JsonArrayRequest jsonArrayRequest = new JsonArrayRequest(Request.Method.GET, JSON_OLD, null, response -> {
try{
for(int i = response.length() - 42; i < response.length() - 21; i++){
JSONObject jsonObject = response.getJSONObject(i);
oldRegionData.add(new RegionData(Integer.parseInt(jsonObject.getString("dimessi_guariti")),Integer.parseInt(jsonObject.getString("deceduti")), jsonObject.getString("denominazione_regione"), Integer.parseInt(jsonObject.getString("nuovi_positivi"))));
}
getNewData(queue);
} catch (JSONException e) {
e.printStackTrace();
}
}, error -> Log.println(Log.ERROR,"Error", "Error while performing this action"));
queue.add(jsonArrayRequest);
As far as I know, there is no way to "jump" to the end of the response.
Still, I would approach it this way:
If the data changes constantly:
Receive the data at a fixed interval, not every time you need it.
Get the data and save it somewhere (in an object, or cache the part of the file you need in a temporary place).
If the data does not change:
Fetch the data when the application starts and keep the data you need in an object.
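For example, the "fetch once and keep it in an object" idea could look roughly like this (the RegionDataCache class and the max-age value are made up for illustration; RegionData is the model class from the question):
import java.util.List;

public class RegionDataCache {
    private static List<RegionData> cached;   // parsed data kept in memory
    private static long lastFetchMillis;      // when it was last refreshed

    public static boolean isStale(long maxAgeMillis) {
        return cached == null || System.currentTimeMillis() - lastFetchMillis > maxAgeMillis;
    }

    public static void update(List<RegionData> fresh) {
        cached = fresh;
        lastFetchMillis = System.currentTimeMillis();
    }

    public static List<RegionData> get() {
        return cached;
    }
}
Before creating the JsonArrayRequest, check RegionDataCache.isStale(...) and only hit the network when the cached copy is too old; after parsing the response, call RegionDataCache.update(...).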

How to catch bulk response with bulk processor? (Java high-level REST client)

I'm new to the Elasticsearch Java API. I know that there are two ways to do bulk operations:
construct a bulk request and use the client object;
construct a bulk processor and add requests to it.
I simulated a large batch of mock data (about 1M documents) and indexed it into Elasticsearch (5.6.3) with the Java high-level REST client.
However, if I use a bulk request to index a large batch, then
java.lang.OutOfMemoryError: null
happens when I use the client.bulk() method.
Then I tried the bulk processor, and it works.
Here is the code:
RestHighLevelClient client = initESclient();
BulkProcessor bulkProcessor = initES(client);
logger.info("Use bulk request to load data:");
logger.info("start to generate random data...");
String bossMockIndex = customSetting.getMockIndex();
String soapMockType = customSetting.getSoapType();
Long start = System.currentTimeMillis();
Integer batch = customSetting.getMockBatch();
logger.info("Batch:"+batch);
List<IndexRequest> indexRequesList = bossMockDataService.indexRequestGenerator(batch, bossMockIndex, soapMockType);
Long endCreateData = System.currentTimeMillis();
logger.info("Consumption for creating "+batch+" pieces of mock data:"+(endCreateData-start)/1000.0d+"s");
for (IndexRequest indexRequest : indexRequesList) {
bulkProcessor.add(indexRequest);
}
What I wonder is how I can get the bulk response with BulkProcessor, just like when I use BulkRequest. I need to update some data in the batch.
BulkResponse bulkResponse = bulkProcessor.execute().actionGet();
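For reference, BulkProcessor has no execute() method that returns a BulkResponse; the responses are delivered to the BulkProcessor.Listener supplied when the processor is built (presumably inside initES(client) in the code above; the exact BulkProcessor.builder(...) arguments vary between client versions). A minimal sketch of such a listener, with the per-item handling left as comments:
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

BulkProcessor.Listener listener = new BulkProcessor.Listener() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        // called just before a batch is sent
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        // this is where the BulkResponse for each flushed batch arrives
        for (BulkItemResponse item : response) {
            if (item.isFailed()) {
                // inspect item.getFailureMessage(), collect the id for a retry, etc.
            }
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        // the whole batch failed, e.g. because of connection problems
    }
};
So the per-batch responses show up in afterBulk rather than as a return value of bulkProcessor.add(...).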

JEST Bulk Request Issue

I am trying to run a bulk request through JEST and want to append my data (say "bills") one at a time and then execute everything at once. However, when I run the following code on 10 bills, only the last bill gets executed. Can someone please correct this code so that all 10 bills are executed (by executing outside the for loop, i.e. as one bulk request)?
for(JSONObject bill : bills) {
bulkRequest = new Bulk.Builder()
.addAction(new Index.Builder(bill.toString()).index(index).type(type).id(id).build())
.build();
}
bulkResponse = Client.execute(bulkRequest);
You need to build the Bulk.Builder outside the loop and then use it to add all the bills:
bulkRequest = new Bulk.Builder();
for (JSONObject bill : bills) {
    bulkRequest.addAction(new Index.Builder(bill.toString()).index(index).type(type).id(id).build());
}
bulkResponse = Client.execute(bulkRequest.build());
I know it's an old question, but just in case someone stumbles across this, here is a Java 8 (lambdas/streams) way of doing the same thing.
Client.execute(new Bulk.Builder()
        .addAction(
                bills.stream()
                        .map(bill -> new Index.Builder(bill.toString())
                                .index(index).type(type).id(id).build())
                        .collect(Collectors.toList()))
        .build());

Google Appengine Datastore Timeout Exception

We are fetching the list of namespaces from the Datastore, which amounts to about 30k entries.
The cron job that fetches the namespaces runs daily. Some days it works fine, and on other days it throws a Datastore timeout exception.
com.google.appengine.api.datastore.DatastoreTimeoutException: The
datastore operation timed out, or the data was temporarily
unavailable.
Related code:
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
FetchOptions options = FetchOptions.Builder.withChunkSize(150);
Query q = new Query(Entities.NAMESPACE_METADATA_KIND);
for (Entity e : ds.prepare(q).asIterable(options)){
// A nonzero numeric id denotes the default namespace;
// see Namespace Queries, below
if (e.getKey().getId() != 0){
continue;
}else{
namespaces.add(e.getKey().getName());
}
}
What could be the issue?
According to official documentation:
DatastoreTimeoutException is thrown when a datastore operation times
out. This can happen when you attempt to put, get, or delete too many
entities or an entity with too many properties, or if the datastore is
overloaded or having trouble.
This means that the Datastore is having trouble with your request. Try to handle that error, for example:
import com.google.appengine.api.datastore.DatastoreTimeoutException;
try {
// Code that could result in a timeout
} catch (DatastoreTimeoutException e) {
// Display a timeout-specific error page
}
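If the timeout is only occasional, a simple retry around the query can be enough. A minimal sketch, reusing ds, q, options and namespaces from the question (the attempt count and backoff are illustrative):
int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        for (Entity e : ds.prepare(q).asIterable(options)) {
            if (e.getKey().getId() == 0) {
                namespaces.add(e.getKey().getName());
            }
        }
        break; // success, stop retrying
    } catch (DatastoreTimeoutException e) {
        if (attempt == maxAttempts) {
            throw e; // give up and let the cron run fail
        }
        namespaces.clear(); // start over with a clean list on the next attempt
        try {
            Thread.sleep(1000L * attempt); // simple linear backoff
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw e;
        }
    }
}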

use of multithreading for downloading in java

I'm trying to concurrently download the HTML code of websites whose URLs are stored in a database (about 3 million entries).
It's obvious that I should use multithreading, but I'm having trouble figuring out how to do it in Java.
Here's how I used to do it without multithreading:
final Connection c = dbConnect(); // register jdbc-driver and establish connection
checkRequiredDbAndTables(); // here we check the existence of the Db and necessary tables
try {
// now get list of urls from the db
String sql = "select id, website_url, category_id from list_of_websites";
PreparedStatement ps = c.prepareStatement(sql);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
// column numeration in ResultSet is from 1 !
final long id = rs.getInt(1); // get website id
final String url = rs.getString(2); // get website url
System.out.println("Category: " + rs.getString(3) + " " + id + " " + url);
if ( isValidURL(url) && connectionOK(url) ) {
// checked url syntax and connection
String htmlInPage = downloadHTML(url);
if (!htmlInPage.equals("")) {
// add result to db
insertDataToDb( c, id, htmlInPage);
}
}
}
rs.close();
} catch (SQLException e) {
e.printStackTrace();
}
closeConnection(c); // database connection closed
The function downloadHTML uses the JSoup library to do the main work.
It feels like my task is a kind of "producer-consumer problem". I suppose it can be represented this way: there is a buffer containing N links; some processes take links from it and download the HTML; and one process's job is to load new URLs from the database into the buffer as it empties.
But I have no idea how to implement it. I've heard of Threads and of ExecutorService providing thread pools, but it's really confusing to me.
You may want to use a thread pool with a fixed number of threads. Your program will first create the thread pool and then read URLs from the database. When a URL is read, the program starts a new task to download its content.
Your program can maintain a queue. When a task finishes downloading the HTML, it pushes the URL and the result together into the queue. When the main thread has finished reading URLs and starting tasks, it waits on the queue. Once the queue has any responses, it takes them out and writes them to the database. The main thread can count how many responses have been received; when the count reaches the number of URLs, all tasks have finished.
Your program can use a class that stores the response together with the URL, for example:
class response {
public String URL;
public String result;
public response(String u, String r) { this.URL = u; this.result = r; }
}
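For illustration, here is a minimal sketch that ties the thread pool and the response queue together (the thread count is just an example; downloadHTML is the helper from the question, rs is the ResultSet from the original loop, the response class is the one defined above, and ExecutorService, Executors, BlockingQueue and LinkedBlockingQueue come from java.util.concurrent):
ExecutorService pool = Executors.newFixedThreadPool(8);      // fixed number of download threads
BlockingQueue<response> results = new LinkedBlockingQueue<>();
int submitted = 0;

// inside the existing try/catch (SQLException) block from the question
while (rs.next()) {
    final String url = rs.getString(2);
    submitted++;
    pool.submit(() -> {
        try {
            results.put(new response(url, downloadHTML(url))); // URL and HTML pushed together
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}

// main thread: wait for exactly as many responses as tasks were submitted
try {
    for (int i = 0; i < submitted; i++) {
        response r = results.take();
        // write r.result to the database here; if insertDataToDb needs the numeric id,
        // add an id field to the response class as well
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
pool.shutdown();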
If you still have any problem implementing or understanding this (I may not have explained it clearly enough; it is 00:40 now and I will probably go to sleep soon), please leave comments. If you want code, please also leave comments.
Main thread:
Start X "downloading" threads
Run the query shown in the question; for each record:
    add the data from the query to an ArrayBlockingQueue
Add an end-of-data marker to the queue
Wait for the threads to stop (optional)
Return from main

Download thread:
Get data from the queue; while it is not the end-of-data marker:
    download the HTML
    insert the HTML into the database
Put the end-of-data marker back into the queue for the other threads to find
Exit the thread
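A minimal sketch of this design, assuming the downloadHTML, insertDataToDb, dbConnect and closeConnection helpers from the question are available as static methods, and using ArrayBlockingQueue from java.util.concurrent (the UrlRecord holder, queue capacity and thread count are illustrative):
class UrlRecord {
    final long id;
    final String url;
    UrlRecord(long id, String url) { this.id = id; this.url = url; }
}

final UrlRecord END_OF_DATA = new UrlRecord(-1, "");   // end-of-data marker
BlockingQueue<UrlRecord> queue = new ArrayBlockingQueue<>(1000);

// start X "downloading" threads
for (int i = 0; i < 8; i++) {
    new Thread(() -> {
        Connection workerConn = dbConnect();           // one connection per thread
        try {
            while (true) {
                UrlRecord record = queue.take();
                if (record == END_OF_DATA) {
                    queue.put(END_OF_DATA);            // put the marker back for the other threads
                    return;
                }
                String html = downloadHTML(record.url);
                if (!html.isEmpty()) {
                    insertDataToDb(workerConn, record.id, html);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();                       // covers InterruptedException and helper errors
        } finally {
            closeConnection(workerConn);
        }
    }).start();
}

// main thread: run the query from the question and feed the queue
try {
    while (rs.next()) {
        queue.put(new UrlRecord(rs.getInt(1), rs.getString(2)));
    }
    queue.put(END_OF_DATA);                            // tell the workers there is no more data
} catch (SQLException | InterruptedException e) {
    e.printStackTrace();
}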
