ElasticSearch Multiple Scrolls Java API

I want to get all data from an index. Since the number of items is too large for memory I use the Scroll (nice function):
client.prepareSearch(index)
        .setTypes(myType)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setSize(amountPerCall)
        .setQuery(QueryBuilders.matchAllQuery())
        .execute().actionGet();
Which works nicely when calling:
client.prepareSearchScroll(scrollId)
        .setScroll(new TimeValue(600000))
        .execute().actionGet()
But when I call the former method multiple times, I get the same scrollId each time, so I cannot run multiple scrolls in parallel.
I found http://elasticsearch-users.115913.n3.nabble.com/Multiple-scrolls-simultanious-td4024191.html which states that it is possible, though I don't know the author's affiliation with ES.
Am I doing something wrong?

After searching some more, I got the impression that this (the same scrollId) is by design: the scroll is only released after the timeout has expired (and the timeout is reset after each call; see Elasticsearch scan and scroll - add to new index).
So you can only have one open scroll per index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html states:
Scrolling is not intended for real time user requests, but rather for
processing large amounts of data, e.g. in order to reindex the
contents of one index into a new index with a different configuration.
So it appears that what I wanted is deliberately not an option, possibly for optimization reasons.
Update
As stated, creating multiple scrolls cannot be done, but this only holds when the query you use for scrolling is the same. If you scroll over, for instance, another type, another index, or just a different query, you can have multiple scrolls.

You can scroll the same index at the same time; this is what elasticsearch-hadoop does.
Just don't forget that, under the hood, an index is composed of multiple shards that own the data, so you can scroll each shard in parallel by using:
.setPreference("_shards:1")

Related

Dataflow Distinct transform example

In my Dataflow pipeline I am trying to use a Distinct transform to reduce duplicates. I would like to try applying this to fixed 1-minute windows initially and use another method to deal with duplicates across windows. This latter point will likely work best if the 1-minute windows are real/processing time.
I expect a few thousand elements, text strings of a few KiB each.
I set up the Window and Distinct transforms like this (input being the incoming PCollection<String>):
input.apply("Deduplication global window", Window.<String>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply("Deduplicate URLs in window", Distinct.<String>create());
But when I run this on GCP, I see that the Distinct transform appears to emit more elements than it receives.
(So by definition they cannot be distinct unless it makes something up!)
More likely, I guess, I am not setting it up properly. Does anybody have an example of how to do it (I didn't really find much apart from the javadoc)? Thank you.
Since you want to remove duplicates within a 1-minute window, you can make use of fixed windows with the default trigger rather than a global window with a processing-time trigger:
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60)))
This, followed by the Distinct transform, will remove any repeated elements within the 1-minute window, based on event time.
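Put together, a minimal sketch assuming the Beam Java SDK and an existing PCollection<String> named urls (the variable name is an assumption for illustration):

import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Fixed 1-minute event-time windows with the default trigger,
// then Distinct removes repeats within each window.
PCollection<String> deduped = urls
        .apply("Fixed 1-minute windows",
                Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("Deduplicate URLs in window", Distinct.<String>create());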

How can I load more items when scrolling down in listview?

First I want to say: this question is not a duplicate!
I have already read a lot of questions about loading more items when scrolling down; this one was the most helpful for me.
But none of the questions I read explain the basic principle.
So, what I mean is:
My app gets data from JSON and displays it in a ListView, but it's not possible to load all items from the database with one request; the app crashes.
The solution is to load only 10 items and, on scrolling down, another 10 items.
But what is the basic principle to do this?
I thought about these 2 different options:
Set a LIMIT in my PHP file and, for every 10 items, send a new request from Android that increases the LIMIT by 10.
Send one request from Android, get all data from the JSON, but only display 10 items at a time.
Code isn't necessary, because I want to know the principle of doing this.
The approach I use in my apps is:
Step 1: Load the first set of data from the server with a limit, then store the ID of the last record in a variable.
Step 2: Implement an on-scroll-end listener on your ListView or RecyclerView.
Step 3: Inside the on-scroll-end listener, start another request to load the new records (and again store the last record ID in the variable).
For Example:
Start the request when the activity starts and do what I explained earlier.
GetBusinesses("&isOpen=2&cat="+Prefrences.getSelectedCategoryId());
Then inside your on-scroll-end listener:
GetBusinesses("&isOpen=2&cat="+Prefrences.getSelectedCategoryId()+"&limit=10&lastDataId="+BusinessItems[index].mBusinessId)
Edit: To avoid duplicate API calls inside GetBusinesses(), check whether a request was started previously. The idea is to create a boolean that is initially false, set it to true in GetBusinesses() before starting the request, and set it back to false once the data is loaded and the request has finished.
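For example, a minimal sketch of the listener, assuming a plain ListView; isLoading is assumed to be a boolean field of the activity (reset to false in GetBusinesses()'s completion callback), and lastLoadedBusinessId stands in for the last record ID variable described above:

listView.setOnScrollListener(new AbsListView.OnScrollListener() {
    @Override
    public void onScrollStateChanged(AbsListView view, int scrollState) { }

    @Override
    public void onScroll(AbsListView view, int firstVisibleItem,
                         int visibleItemCount, int totalItemCount) {
        boolean reachedEnd = totalItemCount > 0
                && firstVisibleItem + visibleItemCount >= totalItemCount;
        if (reachedEnd && !isLoading) {
            isLoading = true;   // prevent a duplicate request while this one runs
            GetBusinesses("&isOpen=2&cat=" + Prefrences.getSelectedCategoryId()
                    + "&limit=10&lastDataId=" + lastLoadedBusinessId);
        }
    }
});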
I hope this helps.
Usually you would load 10 new items from the server each time. Use a page-parameter to identify which 10 items you need and where to place them.
Loading all items at once could be far too expensive: the delay could be long, and the user's data plan won't be happy either, obviously depending on how many items there are and what content they have.
You will have to find the trade-off. Based on your data size, sometimes it makes sense to parse and save it all in a local database.
For just 200-300 records, you don't want to make another API call after every 50 records in the list. Remember that with a mobile app the user scrolls up and down very often; you might be unnecessarily sending multiple requests to your server, which could be an overload (depending on the user count).
If you go with option 2, you can make use of something like JobIntentService to silently fetch the data and save it locally.
This approach will also let your user interact with the app in no-internet (offline) scenarios.
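A minimal sketch of that idea, assuming androidx.core.app.JobIntentService; fetchAndSaveItems() is a hypothetical helper that downloads the JSON and writes it to the local database:

import android.content.Context;
import android.content.Intent;
import androidx.annotation.NonNull;
import androidx.core.app.JobIntentService;

public class SyncItemsService extends JobIntentService {
    private static final int JOB_ID = 1001;

    // Call this (e.g. from onCreate) to kick off a background sync.
    public static void enqueue(Context context) {
        enqueueWork(context, SyncItemsService.class, JOB_ID,
                new Intent(context, SyncItemsService.class));
    }

    @Override
    protected void onHandleWork(@NonNull Intent intent) {
        // Runs on a background thread; safe to do network and DB work here.
        fetchAndSaveItems();   // hypothetical helper: download JSON, save locally
    }

    private void fetchAndSaveItems() {
        // download the JSON page(s) and insert them into the local database
    }
}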

How to refresh jsp page?

I'm new to Google App Engine and I'm having a little problem that I can't seem to be able to find the solution to.
Whenever I create/delete/update something from the Datastore, in the end I do this:
resp.sendRedirect("/view_list.jsp");
And the page doesn't get updated.
For instance, if I have a page with a list of 2 items, and I create another item and redirect to that page, instead of showing 3 items it still shows 2, until I navigate away and come back.
So how can I make sure that the page refreshes after my changes to the Datastore?
A couple of points that are relevant:
The datastore is HRD (High Replication Datastore) and, as per the documentation, the delay from the time a write is committed until it becomes visible in all data centers means that queries across multiple entity groups (non-ancestor queries) can only guarantee eventually consistent results. Consequently, the results of such queries may sometimes fail to reflect recent changes to the underlying data. Please refer to the documentation for more details.
In short, to get consistent reads, use get() by key as much as you can. If you use a query, there could be a delay due to indexing.
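A minimal sketch of the difference, assuming the App Engine low-level Datastore Java API; the "Item" and "ItemList" kinds and the itemId/listId variables are assumptions for illustration:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// Strongly consistent: a get() by key sees the write you just committed.
Key itemKey = KeyFactory.createKey("Item", itemId);
try {
    Entity item = datastore.get(itemKey);
} catch (EntityNotFoundException e) {
    // handle a missing entity
}

// Ancestor queries are also strongly consistent within one entity group;
// a non-ancestor query like new Query("Item") is only eventually consistent.
Key listKey = KeyFactory.createKey("ItemList", listId);
PreparedQuery pq = datastore.prepare(new Query("Item").setAncestor(listKey));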
Hope this helps. I also suggest framing the question title better; the question is a good one but could get lost when it says "refresh the jsp page".

Find and delete duplicates in a Lotus Notes database

I am very new to Lotus Notes. Recently my teammates were facing a problem with duplicates in Lotus Notes, as shown below in Case A and Case B.
So we bought an app named scanEZ (link about scanEZ). Using this tool we can remove either the first occurrence or the second occurrence. In Case A and Case B the second items are considered redundant because they do not have children, so we can remove all the second items as shown below and thus remove the duplicates.
But in Case C the order is changed: the child item comes first and the parent item comes second, so I am unable to use the scanEZ app.
Is there any better way, software, or script to accomplish this task? As I am new to this field I have no idea. Kindly help me.
Thanks in advance.
Probably the easiest way to approach this would be to force the view to always display documents with children first. That way the tool you have purchased will behave consistently for you. You would do this by adding a hidden sorted column to the right of the column that you have circled. The formula in this column would be @DocChildren, and the sort options for the column would be set to 'Descending'. (Note that if you are uncomfortable making changes in this view, you can make a copy of it, make your changes in the copy, and run ScanEZ against the copy as well. You can also do all of this in a local replica of the database, and only replicate it back to the server when you are satisfied that you have the right results.)
The other way would be to write your own code in LotusScript or Java, using the Notes classes. There are many different ways that you could write that code.
I agree with Richard's answer. If you want more details on how to go through the document collection, you could isolate the documents into a view that shows only the duplicates. Then write an agent that looks at the UNID of the document, the date modified, and other data elements to ensure you are keeping the last updated document. I would add a field to the document, e.g. FLAG="keep", and then delete the documents that don't have the flag with a second agent. If you take this approach you can often reuse the same agents in other databases.
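As an illustration only, a minimal sketch of the first ("flagging") agent in the Notes Java API, assuming the Subject field identifies a duplicate set and that you would really run it against a collection restricted to the suspected duplicates (both assumptions of this sketch):

import java.util.HashMap;
import java.util.Map;
import lotus.domino.AgentBase;
import lotus.domino.Database;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;
import lotus.domino.Session;

public class FlagNewestAgent extends AgentBase {
    public void NotesMain() {
        try {
            Session session = getSession();
            Database db = session.getAgentContext().getCurrentDatabase();
            // Ideally a collection restricted to the suspected duplicates;
            // getAllDocuments() is used here only to keep the sketch short.
            DocumentCollection docs = db.getAllDocuments();

            // Remember the most recently modified document per Subject.
            Map<String, Document> newest = new HashMap<String, Document>();
            Document doc = docs.getFirstDocument();
            while (doc != null) {
                String key = doc.getItemValueString("Subject");
                Document best = newest.get(key);
                if (best == null || doc.getLastModified().toJavaDate()
                        .after(best.getLastModified().toJavaDate())) {
                    newest.put(key, doc);
                }
                doc = docs.getNextDocument(doc);
            }

            // Flag the keepers; a second agent can then delete unflagged documents.
            for (Document keep : newest.values()) {
                keep.replaceItemValue("FLAG", "keep");
                keep.save(true, false);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}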
Since you are new to Notes, keep in mind that Notes is a document database. There are several different kinds of conflicts, such as save conflicts and replication conflicts, and you also need to look at the database settings for how duplicates can be handled. I would read up on these topics so you can explain them to your co-workers/project manager.
Eventually, in your heavily used databases, you might be able to automate this process once you have tracked down the source of the duplicates.
These are clearly not duplicates.
The definition of a duplicate is that the documents are identical, so it does not matter which one is kept and which one is removed. To you, the fact that one has children makes it more important, which means that they are not pure duplicates.
What you have not stated is what you want to do if multiple documents with similar dates/subjects have children (a case D if you will).
To me this appears as three separate problems.
The first problem is to sort out the cases where more than one document in a set has children.
Then sort out the cases where only one document in a set has children.
Then sort out the cases where none of the documents in a set has children.
The approach in each case will be different. The article from Ytria only really covers the last of these cases.

Adding lots of items into a ListModel without blocking the UI

I've implemented a search over lots of items (hundreds) in a JList using Lucene: when someone types in the search box it performs a search and displays the results in the JList. It does this by adding and removing items from the underlying list model as each character is typed, but this approach blocks the UI (because adding something to a ListModel has to be performed on the EDT). The search itself is very quick; it's the adding and removing of items that takes the time.
How would I approach the problem to not block the EDT while the model is being modified?
The length of the lag isn't huge - it's definitely usable at the moment, just not as snappy as I'd like (for want of a better word). I'm expecting people on less powerful machines than mine to run the software, though, hence my interest in sorting out the issue.
Other details:
I have profiled the application; the lag is definitely caused by adding/removing lots of items. A typical keystroke could see any number of items added or removed, from a few to hundreds. For instance, if I search for the letter "x" in the text box then most of the items will be removed, since few contain that letter. If I then delete the letter, all the items will be added again. If I search for a more common term, "the" for instance, then only a few items may be removed, since the bulk of them contain that term.
I'm not dealing with strings directly; the items are relatively simple objects made up of just a few strings (Song objects, to be precise, with things like title, author, lyrics, etc.), and they're all cached using SoftReferences where possible (so assume none of these objects are being created or destroyed; they shouldn't be for a typical user).
This may not be the answer you're looking for, but I wonder if your best solution is simply not to add hundreds of items. There's no way that the user will be able to, or want to, scroll through that many items in a JList, so perhaps your smartest move is to limit how many items are added to a reasonable number, say 20 or so.
I think of this as similar to a word processor displaying a document on the screen, or other immediate "look-up" components I've used in the past. If the document is large, often the whole thing isn't loaded into memory but rather cached to disk. If you have no choice but to load a lot of items, then perhaps you can take this portion of the model "offline", show a modal wait dialog, load the items off the EDT, bring the model back online, and then release the modal dialog.
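A minimal sketch of that idea with a SwingWorker, assuming a DefaultListModel<Song> and the question's Song type; the search() stub stands in for the actual Lucene lookup:

import java.util.Collections;
import java.util.List;
import javax.swing.DefaultListModel;
import javax.swing.SwingWorker;

class SearchWorker extends SwingWorker<Void, Song> {
    private final DefaultListModel<Song> model;
    private final String query;

    SearchWorker(DefaultListModel<Song> model, String query) {
        this.model = model;
        this.query = query;
    }

    @Override
    protected Void doInBackground() {
        // Runs off the EDT; hand results over in batches via publish().
        for (Song song : search(query)) {
            publish(song);
        }
        return null;
    }

    @Override
    protected void process(List<Song> chunk) {
        // Runs on the EDT with small batches, so the UI stays responsive.
        for (Song song : chunk) {
            model.addElement(song);
        }
    }

    // Placeholder for the real Lucene search (an assumption of this sketch).
    private List<Song> search(String q) {
        return Collections.emptyList();
    }
}

// Usage, on the EDT: clear the old results, then start the worker.
// model.clear();
// new SearchWorker(model, searchBox.getText()).execute();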
I think the easiest way would be to use a JTable instead of a JList and add a RowFilter to the JTable; then there is no reason to add/remove/modify large numbers of items.
For adding/removing/modifying large numbers of items in the XxxModel in the background, there is SwingWorker.
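A minimal sketch of the RowFilter approach, assuming a DefaultTableModel of songs and a JTextField search box (both assumptions for illustration):

import java.util.regex.Pattern;
import javax.swing.JTable;
import javax.swing.JTextField;
import javax.swing.RowFilter;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableRowSorter;

DefaultTableModel model = new DefaultTableModel(new Object[] { "Title", "Author" }, 0);
JTable table = new JTable(model);
TableRowSorter<DefaultTableModel> sorter = new TableRowSorter<>(model);
table.setRowSorter(sorter);

JTextField searchBox = new JTextField(20);
searchBox.getDocument().addDocumentListener(new DocumentListener() {
    private void filter() {
        String text = searchBox.getText();
        // Case-insensitive "contains" filter; no rows are added or removed,
        // the sorter just hides the non-matching ones.
        sorter.setRowFilter(text.isEmpty()
                ? null
                : RowFilter.regexFilter("(?i)" + Pattern.quote(text)));
    }
    public void insertUpdate(DocumentEvent e) { filter(); }
    public void removeUpdate(DocumentEvent e) { filter(); }
    public void changedUpdate(DocumentEvent e) { filter(); }
});

Because filtering leaves the model untouched, the EDT only has to repaint the filtered view on each keystroke.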
