How can I optimize my for loop:
public List<XYZ.Users> getUsers(
List<ResponseDocument> responseDocuments
) {
List<XYZ.Users> users = new ArrayList<>();
for (ResponseDocument solrUser : responseDocuments) {
Map<String, Object> field = solrUser.getFields();
String name = field.getOrDefault(Constants.NAME, "").toString();
int classId = (int) field.getOrDefault(Constants.ID, -1);
XYZ.Users user = XYZ.Users.newBuilder()
.setUser(name)
.setId(classId)
.setDir(Constants.DIR)
.build();
users.add(user);
}
return users;
}
I track the time taken by this for loop in production (~500 requests/sec) and I have seen some random spikes in the time taken. I am not able to figure out why that might happen.
What optimizations would you recommend?
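One common micro-optimization, as a hedged sketch only (it assumes responseDocuments is non-null and changes nothing else about the loop), is to pre-size the list so it never has to grow while users are added:
// Sketch: pre-sizing avoids the intermediate array copies ArrayList makes as it grows.
List<XYZ.Users> users = new ArrayList<>(responseDocuments.size());
for (ResponseDocument solrUser : responseDocuments) {
    Map<String, Object> field = solrUser.getFields();
    String name = field.getOrDefault(Constants.NAME, "").toString();
    int classId = (int) field.getOrDefault(Constants.ID, -1);
    users.add(XYZ.Users.newBuilder()
            .setUser(name)
            .setId(classId)
            .setDir(Constants.DIR)
            .build());
}
The random spikes may well come from something outside the loop body itself (for example GC pauses under load), so measuring allocation and GC activity is worth doing before micro-optimizing further.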
I have around 27,000 documents that I would like to write into my Firestore collection.
In order to do so, I use WriteBatch and I break it every 500 documents.
I get the following error:
java.lang.IllegalStateException: A write batch can no longer be used after commit() has been called.
The code I'm using is:
private void addData2DB(ArrayList<String> titles, ArrayList<String> authors, ArrayList<String> publishers, ArrayList<String> genres, ArrayList<String> pages) {
db = FirebaseFirestore.getInstance();
WriteBatch batch = db.batch();
double counter = 1.0;
for (int i = 0; i <= titles.size(); i++) {
Map<String, Object> data = new HashMap<>();
String title = titles.get( i );
String[] TitleWords = title.split( "\\s+" );
if (!title.isEmpty()) {
String docID = db.collection( "GeneratingID" ).document().getId();
data.put( "bookAuthor", authors.get( i ) );
data.put( "bookTitle", titles.get( i ) );
data.put( "genre", genres.get( i ) );
data.put( "pages", pages.get( i ) );
data.put( "publishedBy", publishers.get( i ) );
DocumentReference dr = db.collection( "BooksDB" ).document( docID );
batch.set( dr, data );
}
if ((double) i / 500 == counter) {
batch.commit();
counter = counter + 1;
}
}
}
Is there a way to fix this?
Thank you
Move the WriteBatch batch = db.batch(); inside your for loop.
You are getting the following error:
java.lang.IllegalStateException: A write batch can no longer be used after commit() has been called.
Most likely this is because your code keeps adding writes to the same batch object after committing it. Writing 500 documents to Firestore takes some time, as commit() is an asynchronous operation, and you get the above error because a batch object can't be modified again after you call commit().
So you need to start a new batch (and ideally wait until the previous commit completes) before adding new write operations and calling commit() again.
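Here is a minimal sketch of that idea (untested, and it assumes the five lists from the question all have the same length): commit every 500 writes and create a brand-new batch afterwards instead of reusing the committed one.
WriteBatch batch = db.batch();
int writesInBatch = 0;
for (int i = 0; i < titles.size(); i++) {
    if (titles.get(i).isEmpty()) {
        continue;
    }
    Map<String, Object> data = new HashMap<>();
    data.put("bookAuthor", authors.get(i));
    data.put("bookTitle", titles.get(i));
    data.put("genre", genres.get(i));
    data.put("pages", pages.get(i));
    data.put("publishedBy", publishers.get(i));
    // auto-generated document ID; the original generated one via a separate collection
    DocumentReference dr = db.collection("BooksDB").document();
    batch.set(dr, data);
    writesInBatch++;
    if (writesInBatch == 500) {
        batch.commit();     // send this batch off
        batch = db.batch(); // never reuse a committed batch
        writesInBatch = 0;
    }
}
if (writesInBatch > 0) {
    batch.commit();         // commit the remainder
}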
I am working with JPA. My web application is taking 60 seconds to execute this method, and I want it to execute faster. How can I achieve that?
public boolean evaluateStudentTestPaper (long testPostID, long studentID, long howManyTimeWroteExam) {
Gson uday = new Gson();
Logger custLogger = Logger.getLogger("StudentDao.java");
// custLogger.info("evaluateTestPaper test paper for testPostID: " +
// testPostID);
long subjectID = 0;
// checking in table
EntityManagerFactory EMF = EntityManagerFactoryProvider.get();
EntityManager em = EMF.createEntityManager();
List<StudentExamResponse> studentExamResponses = null;
try {
studentExamResponses = em
.createQuery(
"SELECT o FROM StudentExamResponse o where o.studentId=:studentId And o.testPostID=:testPostID and o.howManyTimeWroteExam=:howManyTimeWroteExam")
.setParameter("studentId", studentID).setParameter("testPostID", testPostID)
.setParameter("howManyTimeWroteExam", howManyTimeWroteExam).getResultList();
System.out.println("studentExamResponses--------------------------------------------------"
+ uday.toJson(studentExamResponses) + "---------------------------------------");
} catch (Exception e) {
custLogger.info("exception at getting student details:" + e.toString());
studentExamResponses = null;
}
int studentExamResponseSize = studentExamResponses.size();
if (AppConstants.SHOWLOGS.equalsIgnoreCase("true")) {
custLogger.info("student questions list:" + studentExamResponseSize);
}
// Get all questions based on student id and test post id
List<ExamPaperRequest> examPaperRequestList = new ArrayList<ExamPaperRequest>();
List<Questions> questionsList = new ArrayList<Questions>();
// StudentExamResponse [] studentExamResponsesArgs =
// (StudentExamResponse[]) studentExamResponses.toArray();
// custLogger.info("Total questions to be evaluated: " +
// examPaperRequestList.size());
List<StudentTestResults> studentTestResultsList = new ArrayList<StudentTestResults>();
StudentTestResults studentTestResults = null;
StudentResults studentResults = null;
String subjectnames = "", subjectMarks = "";
int count = 0;
boolean lastIndex = false;
if (studentExamResponses != null && studentExamResponseSize > 0) {
// studentExamResponses.forEach(studentExamResponses->{
for (StudentExamResponse o : studentExamResponses) {
// ~900 lines of code in here, including further database queries
}
}
As @Nikos Paraskevopoulos mentioned, it is probably the ~900 * N database iterations inside that for loop.
I'd say to avoid DB calls as much as you can, especially inside a loop like that.
You can try to expand your current StudentExamResponse query to include more clauses - mainly those you're using inside your for loop - which could even reduce the number of items you iterate over.
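As a rough illustration only (the actual entity relationships are not shown in the question, so the "questions" association below is hypothetical), the idea is to fetch everything the loop needs in a single query, for example with a fetch join, instead of issuing more queries per row:
// Sketch: pull the related data in the same query instead of querying per row inside the loop.
// "o.questions" is a hypothetical association on StudentExamResponse; adjust it to your model.
List<StudentExamResponse> responses = em.createQuery(
        "SELECT DISTINCT o FROM StudentExamResponse o "
      + "JOIN FETCH o.questions q "
      + "WHERE o.studentId = :studentId "
      + "AND o.testPostID = :testPostID "
      + "AND o.howManyTimeWroteExam = :howManyTimeWroteExam",
        StudentExamResponse.class)
    .setParameter("studentId", studentID)
    .setParameter("testPostID", testPostID)
    .setParameter("howManyTimeWroteExam", howManyTimeWroteExam)
    .getResultList();
for (StudentExamResponse o : responses) {
    // evaluate using o.getQuestions() - no extra queries needed inside the loop
}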
My guess would be that your select query is taking the time.
If possible, set the query timeout to less than 60 seconds and confirm this.
Ways of setting a query timeout can be found here - How to set the timeout period on a JPA EntityManager query
If this is because of the query, then you may need to work on making the select query optimal.
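For example, a minimal sketch of a per-query timeout using the standard JPA hint (the value is in milliseconds, and not every JPA provider honours it):
// Sketch: fail fast if the query takes longer than 10 seconds.
// "javax.persistence.query.timeout" is the standard JPA query hint, in milliseconds.
studentExamResponses = em
        .createQuery(
                "SELECT o FROM StudentExamResponse o where o.studentId=:studentId And o.testPostID=:testPostID and o.howManyTimeWroteExam=:howManyTimeWroteExam")
        .setParameter("studentId", studentID).setParameter("testPostID", testPostID)
        .setParameter("howManyTimeWroteExam", howManyTimeWroteExam)
        .setHint("javax.persistence.query.timeout", 10000)
        .getResultList();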
I'm trying to download photos posted with a specific tag in real time. I found the real-time API pretty useless, so I'm using a long-polling strategy. Below is pseudocode, with comments describing the subtle bugs in it:
newMediaCount = getMediaCount();
delta = newMediaCount - mediaCount;
if (delta > 0) {
// if mediaCount has changed by now, realDelta > delta, so realDelta - delta photos won't be grabbed; on the next poll, if mediaCount doesn't change again, realDelta - delta photos will be duplicated instead ...
// if a photo is posted from a private account, the last photo will be duplicated, since the counter changes but nothing is added to the recent media
recentMedia = getRecentMedia(delta);
// persist recentMedia
mediaCount = newMediaCount;
}
The second issue can be addressed with a Set of some sort, I guess. But the first really bothers me. I've moved the two calls to the Instagram API as close together as possible, but is this enough?
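For the deduplication part, a minimal sketch of that Set idea (the class and method names here are illustrative, not from the original code) would be to remember which media IDs have already been persisted:
import java.util.HashSet;
import java.util.Set;
// Sketch: keep the IDs of already-persisted media so repeats are skipped.
class MediaDeduplicator {
    private final Set<String> seenIds = new HashSet<>();
    /** Returns true the first time an id is seen, false on any repeat. */
    boolean markSeen(String mediaId) {
        return seenIds.add(mediaId); // Set.add() returns false when the element was already present
    }
}
Guarding the persistence step with something like if (dedup.markSeen(d.getId())) { ... } then drops duplicates regardless of what the counters do.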
Edit
As Amir suggested, I've rewritten the code to use min/max_tag_ids. But it still skips photos. I couldn't find a better way to test this than to save images to disk for a while and compare the result to instagram.com/explore/tags/.
public class LousyInstagramApiTest {
@Test
public void testFeedContinuity() throws Exception {
Instagram instagram = new Instagram(Settings.getClientId());
final String TAG_NAME = "portrait";
String id = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();
HashtagEndpoint endpoint = new HashtagEndpoint(instagram, TAG_NAME, id);
for (int i = 0; i < 10; i++) {
Thread.sleep(3000);
endpoint.recentFeed().forEach(d -> {
try {
URL url = new URL(d.getImages().getLowResolution().getImageUrl());
BufferedImage img = ImageIO.read(url);
ImageIO.write(img, "png", new File("D:\\tmp\\" + d.getId() + ".png"));
} catch (Exception e) {
e.printStackTrace();
}
});
}
}
}
class HashtagEndpoint {
private final Instagram instagram;
private final String hashtag;
private String minTagId;
public HashtagEndpoint(Instagram instagram, String hashtag, String minTagId) {
this.instagram = instagram;
this.hashtag = hashtag;
this.minTagId = minTagId;
}
public List<MediaFeedData> recentFeed() throws InstagramException {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null);
List<MediaFeedData> dataList = feed.getData();
if (dataList.size() == 0) return Collections.emptyList();
String maxTagId = feed.getPagination().getNextMaxTagId();
if (maxTagId != null && maxTagId.compareTo(minTagId) > 0) dataList.addAll(paginateFeed(maxTagId));
Collections.reverse(dataList);
// dataList.removeIf(d -> d.getId().compareTo(minTagId) < 0);
minTagId = feed.getPagination().getMinTagId();
return dataList;
}
private Collection<? extends MediaFeedData> paginateFeed(String maxTagId) throws InstagramException {
System.out.println("pagination required");
List<MediaFeedData> dataList = new ArrayList<>();
do {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, null, maxTagId);
maxTagId = feed.getPagination().getNextMaxTagId();
dataList.addAll(feed.getData());
} while (maxTagId.compareTo(minTagId) > 0);
return dataList;
}
}
When you use the Tag endpoints to get the recent media with a desired tag, the response includes a min_tag_id in its pagination info, which is tied to the most recently tagged media at the time of your call. As the API also accepts a min_tag_id parameter, you can pass the value from your last query to receive only the media tagged after that query.
So, based on whatever polling mechanism you have, you just call the API with the last received min_tag_id to get any new recent media.
You will also need to pass a large count parameter and follow the pagination of the response to receive all data without losing anything when the speed of tagging is faster than your polling.
Update:
Based on your updated code:
public List<MediaFeedData> recentFeed() throws InstagramException {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null, 100000);
List<MediaFeedData> dataList = feed.getData();
if (dataList.size() == 0) return Collections.emptyList();
// follow the pagination
MediaFeed recentMediaNextPage = instagram.getRecentMediaNextPage(feed.getPagination());
while (recentMediaNextPage.getPagination() != null) {
dataList.addAll(recentMediaNextPage.getData());
recentMediaNextPage = instagram.getRecentMediaNextPage(recentMediaNextPage.getPagination());
}
Collections.reverse(dataList);
minTagId = feed.getPagination().getMinTagId();
return dataList;
}
I want Solr to always retrieve all documents found by a search (I know Solr wasn't built for that, but anyway), and I am currently doing this with the following code:
...
QueryResponse response = solr.query(query);
int offset = 0;
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>(totalResults);
query.setRows(FETCH_SIZE);
while(offset < totalResults) {
//requires an int? wtf?
query.setStart((int) offset);
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
response = solr.query(query);
List<Article> current = response.getBeans(Article.class);
offset += current.size();
ret.addAll(current);
}
...
This works, but is pretty slow if a query gets over 1000 hits (I've read about that on here; it is caused by Solr because I am setting the start every time, which - for some reason - takes some time). What would be a nicer (and faster) way to do this?
To improve on the suggested answer, you could use a streamed response. This was added especially for the case where you fetch all results. As you can see in Solr's Jira, that user wanted to do the same thing as you do. It has been implemented for Solr 4.
This is also described in Solrj's javadoc.
Without streaming, Solr packs the response and creates a whole XML/JSON document before it starts sending the response. Then your client is required to unpack all of that and offer it to you as a list. By using streaming and parallel processing, which you can do with such a queued approach, the performance should improve further.
Yes, you will lose the automatic bean mapping, but as performance is a factor here, I think this is acceptable.
Here is a sample unit test:
public class StreamingTest {
@Test
public void streaming() throws SolrServerException, IOException, InterruptedException {
HttpSolrServer server = new HttpSolrServer("http://your-server");
SolrQuery tmpQuery = new SolrQuery("your query");
tmpQuery.setRows(Integer.MAX_VALUE);
final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
SolrDocument tmpDoc;
do {
tmpDoc = tmpQueue.take();
} while (!(tmpDoc instanceof PoisonDoc));
}
private class PoisonDoc extends SolrDocument {
// marker to finish queuing
}
private class MyCallbackHander extends StreamingResponseCallback {
private BlockingQueue<SolrDocument> queue;
private long currentPosition;
private long numFound;
public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
queue = aQueue;
}
@Override
public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
// called before start of streaming
// probably use for some statistics
currentPosition = aStart;
numFound = aNumFound;
if (numFound == 0) {
queue.add(new PoisonDoc());
}
}
@Override
public void streamSolrDocument(SolrDocument aDoc) {
currentPosition++;
System.out.println("adding doc " + currentPosition + " of " + numFound);
queue.add(aDoc);
if (currentPosition == numFound) {
queue.add(new PoisonDoc());
}
}
}
}
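To actually get the parallel processing mentioned above, one hedged sketch (processDoc is a hypothetical handler, the variables are the ones from the test, and java.util.concurrent imports are needed) is to drain the queue on a worker thread instead of the do-while loop in the test, while the streaming callback keeps filling it:
// Sketch: consume documents on a worker thread while queryAndStreamResponse fills the queue.
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<?> consumer = executor.submit(() -> {
    try {
        SolrDocument doc;
        while (!((doc = tmpQueue.take()) instanceof PoisonDoc)) {
            processDoc(doc); // hypothetical per-document work, running in parallel with the download
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
try {
    consumer.get(); // wait until the poison document has been processed
} catch (ExecutionException e) {
    e.printStackTrace();
}
executor.shutdown();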
You might improve performance by increasing FETCH_SIZE. Since you are getting all the results, pagination doesn't make sense unless you are concerned with memory or some such. If 1000 results are liable to cause a memory overflow, I'd say your current performance seems pretty outstanding though.
So I would try getting everything at once, simplifying this to something like:
//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);
If you need to keep the pagination though:
You are performing this first query:
QueryResponse response = solr.query(query);
and are populating the number of found results from it, but you are not pulling any results with the response. Even if you keep pagination here, you could at least eliminate one extra query here.
This:
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
is unnecessary. setRows specifies a maximum number of rows to return, so asking for more than are available won't cause any problems.
Finally, apropos of nothing, but I have to ask: what argument would you expect setStart to take if not an int?
Use the logic below to fetch Solr data in batches, to optimize the performance of the data-fetch query:
public List<Map<String, Object>> getData(int id,Set<String> fields){
final int SOLR_QUERY_MAX_ROWS = 3;
long start = System.currentTimeMillis();
SolrQuery query = new SolrQuery();
String queryStr = "id:" + id;
LOG.info(queryStr);
query.setQuery(queryStr);
query.setRows(SOLR_QUERY_MAX_ROWS);
QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
List<Map<String, Object>> mapList = null;
if (rsp != null) {
long total = rsp.getResults().getNumFound();
System.out.println("Total count found: " + total);
// Solr query batch
mapList = new ArrayList<Map<String, Object>>();
if (total <= SOLR_QUERY_MAX_ROWS) {
addAllData(mapList, rsp,fields);
} else {
int marker = SOLR_QUERY_MAX_ROWS;
do {
if (rsp != null) {
addAllData(mapList, rsp,fields);
}
query.setStart(marker);
rsp = server.query(query, SolrRequest.METHOD.POST);
marker = marker + SOLR_QUERY_MAX_ROWS;
} while (marker <= total);
}
}
long end = System.currentTimeMillis();
LOG.debug("SOLR Performance: getData: " + (end - start));
return mapList;
}
private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp,Set<String> fields) {
for (SolrDocument sdoc : rsp.getResults()) {
Map<String, Object> map = new HashMap<String, Object>();
for (String field : fields) {
map.put(field, sdoc.getFieldValue(field));
}
mapList.add(map);
}
}
I have a database in Elasticsearch and want to get all records on my web site page. I wrote a bean which connects to the Elasticsearch node, searches for records and returns some response. My simple Java code, which does the searching, is:
SearchResponse response = getClient().prepareSearch(indexName)
.setTypes(typeName)
.setQuery(queryString("*:*"))
.setExplain(true)
.execute().actionGet();
But Elasticsearch sets the default size to 10, so I only have 10 hits in the response. There are more than 10 records in my database. If I set the size to Integer.MAX_VALUE, my search becomes very slow, and that is not what I want.
How can I get all the records in one action in an acceptable amount of time, without setting the size of the response?
public List<Map<String, Object>> getAllDocs(){
int scrollSize = 1000;
List<Map<String,Object>> esData = new ArrayList<Map<String,Object>>();
SearchResponse response = null;
int i = 0;
while( response == null || response.getHits().hits().length != 0){
response = client.prepareSearch(indexName)
.setTypes(typeName)
.setQuery(QueryBuilders.matchAllQuery())
.setSize(scrollSize)
.setFrom(i * scrollSize)
.execute()
.actionGet();
for(SearchHit hit : response.getHits()){
esData.add(hit.getSource());
}
i++;
}
return esData;
}
The current highest-ranked answer works, but it requires loading the whole list of results in memory, which can cause memory issues for large result sets, and is in any case unnecessary.
I created a Java class that implements a nice Iterator over SearchHits and allows you to iterate through all results. Internally, it handles pagination by issuing queries that include the from: field, and it keeps only one page of results in memory.
Usage:
// build your query here -- no need for setFrom(int)
SearchRequestBuilder requestBuilder = client.prepareSearch(indexName)
.setTypes(typeName)
.setQuery(QueryBuilders.matchAllQuery());
SearchHitIterator hitIterator = new SearchHitIterator(requestBuilder);
while (hitIterator.hasNext()) {
SearchHit hit = hitIterator.next();
// process your hit
}
Note that, when creating your SearchRequestBuilder, you don't need to call setFrom(int), as this will be done internally by the SearchHitIterator. If you want to specify the size of a page (i.e. the number of search hits per page), you can call setSize(int); otherwise ElasticSearch's default value is used.
SearchHitIterator:
import java.util.Iterator;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;
public class SearchHitIterator implements Iterator<SearchHit> {
private final SearchRequestBuilder initialRequest;
private int searchHitCounter;
private SearchHit[] currentPageResults;
private int currentResultIndex;
public SearchHitIterator(SearchRequestBuilder initialRequest) {
this.initialRequest = initialRequest;
this.searchHitCounter = 0;
this.currentResultIndex = -1;
}
@Override
public boolean hasNext() {
if (currentPageResults == null || currentResultIndex + 1 >= currentPageResults.length) {
SearchRequestBuilder paginatedRequestBuilder = initialRequest.setFrom(searchHitCounter);
SearchResponse response = paginatedRequestBuilder.execute().actionGet();
currentPageResults = response.getHits().getHits();
if (currentPageResults.length < 1) return false;
currentResultIndex = -1;
}
return true;
}
@Override
public SearchHit next() {
if (!hasNext()) return null;
currentResultIndex++;
searchHitCounter++;
return currentPageResults[currentResultIndex];
}
}
In fact, realizing how convenient it is to have such a class, I wonder why ElasticSearch's Java client does not offer something similar.
You could use the scroll API.
The other suggestion of using a SearchHit iterator would also work great, but only when you don't want to update those hits.
import static org.elasticsearch.index.query.QueryBuilders.*;
QueryBuilder qb = termQuery("multi", "test");
SearchResponse scrollResp = client.prepareSearch(test)
.addSort(FieldSortBuilder.DOC_FIELD_NAME, SortOrder.ASC)
.setScroll(new TimeValue(60000))
.setQuery(qb)
.setSize(100).execute().actionGet(); //max of 100 hits will be returned for each scroll
//Scroll until no hits are returned
do {
for (SearchHit hit : scrollResp.getHits().getHits()) {
//Handle the hit...
}
scrollResp = client.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
} while(scrollResp.getHits().getHits().length != 0); // Zero hits mark the end of the scroll and the while loop.
It has been a long time since you asked this question, but I would like to post my answer for future readers.
As mentioned in the above answers, it is better to load documents with size and from when you have thousands of documents in the index. In my project, the search loads 50 results as the default size, starting from index zero; if the user wants to load more data, the next 50 results are loaded. Here is what I have done in the code:
public List<CourseDto> searchAllCourses(int startDocument) {
final int searchSize = 50;
final SearchRequest searchRequest = new SearchRequest("course_index");
final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
if (startDocument != 0) {
startDocument += searchSize;
}
searchSourceBuilder.from(startDocument);
searchSourceBuilder.size(searchSize);
// sort the document
searchSourceBuilder.sort(new FieldSortBuilder("publishedDate").order(SortOrder.ASC));
searchRequest.source(searchSourceBuilder);
List<CourseDto> courseList = new ArrayList<>();
try {
final SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
final SearchHits hits = searchResponse.getHits();
// Do you want to know how many documents (results) were returned? Here is how you get it:
TotalHits totalHits = hits.getTotalHits();
long numHits = totalHits.value;
final SearchHit[] searchHits = hits.getHits();
final ObjectMapper mapper = new ObjectMapper();
for (SearchHit hit : searchHits) {
// convert json object to CourseDto
courseList.add(mapper.readValue(hit.getSourceAsString(), CourseDto.class));
}
} catch (IOException e) {
logger.error("Cannot search by all mach. " + e);
}
return courseList;
}
Info:
- Elasticsearch version 7.5.0
- Java High Level REST Client is used as the client.
I hope this will be useful for someone.
You will have to trade off the number of returned results vs the time you want the user to wait and the amount of available server memory. If you've indexed 1,000,000 documents, there isn't a realistic way to retrieve all those results in one request. I'm assuming your results are for one user. You'll have to consider how the system will perform under load.
To query all, you could build a CountRequestBuilder to get the total number of records (via CountResponse), and then set that number as the size of your search request.
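A minimal sketch of that idea with the old transport client (prepareCount and CountResponse existed up to Elasticsearch 2.x and were removed in later versions), assuming the getClient(), indexName and typeName from the question:
// Sketch: count first, then issue a single search whose size equals the count.
// Note this still loads every hit into one response, so it only suits moderate result sets.
CountResponse countResponse = getClient().prepareCount(indexName)
        .setQuery(QueryBuilders.matchAllQuery())
        .execute().actionGet();
int size = (int) countResponse.getCount();
SearchResponse response = getClient().prepareSearch(indexName)
        .setTypes(typeName)
        .setQuery(QueryBuilders.matchAllQuery())
        .setSize(size)
        .execute().actionGet();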
If your primary focus is on exporting all records you might want to go for a solution which does not require any kind of sorting, as sorting is an expensive operation.
You could use the scan and scroll approach with ElasticsearchCRUD as described here.
For version 6.3.2, the following worked:
public List<Map<String, Object>> getAllDocs(String indexName, String searchType) throws FileNotFoundException, UnsupportedEncodingException{
int scrollSize = 1000;
List<Map<String,Object>> esData = new ArrayList<>();
SearchResponse response = null;
int i=0;
response = client.prepareSearch(indexName)
.setScroll(new TimeValue(60000))
.setTypes(searchType) // The document types to execute the search against. Defaults to be executed against all types.
.setQuery(QueryBuilders.matchAllQuery())
.setSize(scrollSize).get(); //up to scrollSize (1000) hits will be returned for each scroll
//Scroll until no hits are returned
do {
for (SearchHit hit : response.getHits().getHits()) {
++i;
esData.add(hit.getSourceAsMap()); // collect each document's source so the method actually returns data
System.out.println(i + " " + hit.getId());
writer.println(i + " " + hit.getId()); // writer: a PrintWriter opened elsewhere (not shown in this snippet)
}
System.out.println(i);
response = client.prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
} while(response.getHits().getHits().length != 0); // Zero hits mark the end of the scroll and the while loop.
return esData;
}
SearchResponse response = restHighLevelClient.search(new SearchRequest("Index_Name"), RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
1. Set a large max size, e.g.:
private static final int MAXSIZE=1000000;
@Override
public List getAllSaleCityByCity(int cityId) throws Exception {
List<EsSaleCity> list=new ArrayList<EsSaleCity>();
Client client=EsFactory.getClient();
SearchResponse response= client.prepareSearch(getIndex(EsSaleCity.class)).setTypes(getType(EsSaleCity.class)).setSize(MAXSIZE)
.setQuery(QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), FilterBuilders.boolFilter()
.must(FilterBuilders.termFilter("cityId", cityId)))).execute().actionGet();
SearchHits searchHits=response.getHits();
SearchHit[] hits=searchHits.getHits();
for(SearchHit hit:hits){
Map<String, Object> resultMap=hit.getSource();
EsSaleCity saleCity=setEntity(resultMap, EsSaleCity.class);
list.add(saleCity);
}
return list;
}
2. Count the documents in ES before you search:
CountResponse countResponse = client.prepareCount(getIndex(EsSaleCity.class)).setTypes(getType(EsSaleCity.class)).setQuery(queryBuilder).execute().actionGet();
int size = (int) countResponse.getCount(); // this is the size you want
Then you can run the search with that size:
SearchResponse response = client.prepareSearch(getIndex(EsSaleCity.class)).setTypes(getType(EsSaleCity.class)).setQuery(queryBuilder).setSize(size).execute().actionGet();