object search optimization in memory - java

I have an ArrayList of MyObjects. The MyObjects class has more than 10 properties, but I need to search on only 4 of them. The user presses a button and can select values for property1.
Let's say the user selects property1Value1, property1Value2 and property1Value4, then presses a button and makes a selection of property2 values: property2Value1, property2Value5, property2Value7 and so on. Those are filter1 and filter2.
property2Value2, property2Value3 and property2Value4 are not visible to the user because filter1 has already filtered them out. It is like running a search before he enters a new filter screen.
I need to store what he has selected in each filter, because when he navigates back I must show him the selected values.
I think it is easier to understand with pictures, since something similar is implemented at eBay:
No filters at beginning: user able to select all values for each property:
The user selected "Tablet" for type property. - a search is done and some property values aren't visible anymore:
The second filter value is selected:
Pressing search (which happens automatically) should do something like this in SQL:
SELECT * FROM MyObjects WHERE
( (property1 = property1Value1) OR (property1 = property1Value2) OR (property1 = property1Value4) )
AND
( (property2 = property2Value1) OR (property2 = property2Value5) )
Since I have the objects in memory, I don't think it is a good idea to create an SQLite database, write the objects out and then select. In the iOS implementation I wrote a very complex caching algorithm that caches the filter values separately. It uses a lot of auxiliary index holders (at least 20), because each filter needs some extra processing not mentioned here, and the data is stored only once.
I am scared of porting that iOS algorithm to Android; there must be something easier.
Edit:
Basically, I need to rewrite that SQL search as a search over Java objects.
Edit2:
Based on the answer with Multimap:
The Multimap is no better than a HashMap<String, ArrayList<Integer>>
where the key is the property value (property2Value3) and the value is a list of indexes into my ArrayList<MyObjects> (1,2,3,4,5...100).
I would need to build a HashMap<String, ArrayList<Integer>> for each filter and each filter value, and then I am exactly where I was on iOS... maybe with a few fewer auxiliary collections.
Any idea?

What you're talking about is basically indexing. A similar approach to what you describe is perfectly manageable in Java, it just takes the same careful coding it would in Objective C.
You haven't specified much about questions like whether multiple items are allowed to have the same values in their fields, so I'll presume they are. In that case, here's how I'd start:
Use Guava's Multimap, probably HashMultimap, where the key is the value of the property being indexed and each object being indexed gets put into the map under that key.
When you're trying to search on multiple fields, call multimap.get(property) to get a Collection of all of the objects that match that property and keep only the objects that match all the properties:
Set<Item> items = new HashSet<Item>(typeMultimap.get("tablet"));
items.retainAll(productLineMultimap.get("Galaxy Tab"));
// your results are now in "items"
If your property list is stable, write a wrapper Indexer class that has fields for all of the Multimaps and ensures that objects are inserted into and removed from all of the property indexes, and maybe has convenience wrappers for the map getters.
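A minimal sketch of such a wrapper, using only java.util maps instead of Guava's Multimap so that it is self-contained. The class name PropertyIndex, its method names and the idea of registering each property with a getter function are assumptions made for illustration, not part of the original answer. It treats several selected values for one property as OR and combines different properties with AND, as in the SQL from the question.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical indexer: one value-to-items index per searchable property. */
public class PropertyIndex<T> {
    // property name -> function that reads that property from an item
    private final Map<String, Function<T, String>> getters = new HashMap<>();
    // property name -> (property value -> items having that value)
    private final Map<String, Map<String, Set<T>>> indexes = new HashMap<>();
    // every item currently stored, used when no filter is active
    private final Set<T> all = new LinkedHashSet<>();

    /** Registers a searchable property; call this before adding items. */
    public void addProperty(String name, Function<T, String> getter) {
        getters.put(name, getter);
        indexes.put(name, new HashMap<>());
    }

    /** Inserts the item into every property index. */
    public void add(T item) {
        all.add(item);
        for (Map.Entry<String, Function<T, String>> e : getters.entrySet()) {
            String value = e.getValue().apply(item);
            indexes.get(e.getKey())
                   .computeIfAbsent(value, v -> new LinkedHashSet<>())
                   .add(item);
        }
    }

    /** Removes the item from every property index. */
    public void remove(T item) {
        if (!all.remove(item)) {
            return;
        }
        for (Map.Entry<String, Function<T, String>> e : getters.entrySet()) {
            Set<T> bucket = indexes.get(e.getKey()).get(e.getValue().apply(item));
            if (bucket != null) {
                bucket.remove(item);
            }
        }
    }

    /**
     * filters maps a property name to the values selected for it.
     * Values of one property are OR-ed, properties are AND-ed.
     * An empty value set means "no filter on this property".
     */
    public Set<T> search(Map<String, Set<String>> filters) {
        Set<T> result = new LinkedHashSet<>(all);
        for (Map.Entry<String, Set<String>> f : filters.entrySet()) {
            if (f.getValue().isEmpty()) {
                continue;
            }
            Map<String, Set<T>> index =
                    indexes.getOrDefault(f.getKey(), Collections.emptyMap());
            Set<T> matches = new LinkedHashSet<>();
            for (String value : f.getValue()) {
                matches.addAll(index.getOrDefault(value, Collections.emptySet()));
            }
            result.retainAll(matches);
        }
        return result;
    }
}
```

The filters map passed to search doubles as the record of what the user selected on each filter screen, so it can be reused to restore the selection when navigating back.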

How does MySQL execute that SQL behind the scenes? For a MyISAM table there is one file that holds the data and another file that holds the ID positions.
SELECT * FROM mytable puts all IDs into the result set, because there is no filter.
Because of the *, all fields are copied into the result. This is equivalent to:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for(int i=0; i < listMyObjects.size(); i++){
if(true == true){// SELECT * FROM has a hidden WHERE 1, which is always true
result.add(listMyObjects.get(i));
}
}
When there is a filter, it needs a list of accepted values:
ArrayList<String> filterByProperty1 = new ArrayList<String> ();
In the filter interface I add some Strings (property1Value1, property1Value2, ...). The search algorithm then becomes:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for (int i = 0; i < listMyObjects.size(); i++) {
    MyObject curMyObject = listMyObjects.get(i);
    // An empty filter list means "no filter on this property", so the object passes.
    boolean property1Allow = filterByProperty1.isEmpty()
            || filterByProperty1.contains(curMyObject.getProperty1());
    // filterByProperty2 and getProperty2() are assumed to mirror their property1 counterparts
    boolean property2Allow = filterByProperty2.isEmpty()
            || filterByProperty2.contains(curMyObject.getProperty2());
    // do the same with property3 and property4
    if (property1Allow && property2Allow) {
        // add the object itself, not the property value
        result.add(curMyObject);
    }
}
I'm not sure whether this is a lot slower, but at least I have escaped from tens or hundreds of auxiliary collections and indexes. After this I will add the extra processing that is required, and it is done.
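The same linear scan can be written once for any number of properties. The following is only a sketch under assumptions: the class name LinearFilter and the idea of passing each filter as a pair of property getter and selected-value set are made up for illustration. Using a Set for the selected values keeps each contains check cheap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class LinearFilter {
    /**
     * Keeps the items that pass every filter. Each map entry pairs a property
     * getter with the set of values the user selected for that property;
     * an empty set means "no filter on this property".
     */
    public static <T> List<T> filter(List<T> items,
                                     Map<Function<T, String>, Set<String>> filters) {
        List<T> result = new ArrayList<>();
        for (T item : items) {
            boolean allowed = true;
            for (Map.Entry<Function<T, String>, Set<String>> f : filters.entrySet()) {
                Set<String> selected = f.getValue();
                if (!selected.isEmpty() && !selected.contains(f.getKey().apply(item))) {
                    allowed = false; // fails this property, no need to check the rest
                    break;
                }
            }
            if (allowed) {
                result.add(item);
            }
        }
        return result;
    }
}
```

Each search is O(items × filters), which is often acceptable for a list that already fits comfortably in memory; whether it is fast enough for a given data set would need measuring.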

Related

How to get first or last item from cqengine IndexedCollection with NavigableIndex

I have com.googlecode.cqengine.IndexedCollection object with NavigableIndex configured. I need to get first or last item from the index or iterator of the index in general.
I suppose this should be trivial. I know I can create Query object with queryOptions object, use it to retrieve iterator from IndexedCollection and get first object, but I'm not sure if it's optimal for performance. Surely it's not elegant.
With help of miradham I figured out that I need to remember indexes, since it's hard to pick up the right one if we have more of them. It will only work with NavigableIndex, we can't iterate base class Index
collection = new ConcurrentIndexedCollection<Data>();
indexUniqueTimestamp = NavigableIndex.onAttribute(Data.UNIQUE_TIMESTAMP);
collection.addIndex(indexUniqueTimestamp);
when I have the index:
try (CloseableIterator<KeyValue<String, Data>> iterator = indexUniqueTimestamp.getKeysAndValuesDescending(null).iterator()) {
if (iterator.hasNext())
return iterator.next().getValue();
}
return null;
One trick to retrieve the min or max (i.e. first or last) object according to one of its attributes is to use an all() query (which matches all objects in the collection), and to request that results be returned in ascending or descending order of your attribute.
For example, if you had a collection of Car objects, you could use the following code to retrieve the car which has the highest (i.e. the max) price:
try (ResultSet<Car> results = cars.retrieve(
all(Car.class),
queryOptions(
orderBy(descending(Car.PRICE)),
applyThresholds(
threshold(INDEX_ORDERING_SELECTIVITY, 1.0)
)
))) {
results.stream()
.limit(1)
.forEach(System.out::println);
}
You can also change the limit to something other than 1, in case you want the top n most expensive cars to be returned.
The code above will work regardless of whether or not you actually have a NavigableIndex on the price. The bit about INDEX_ORDERING_SELECTIVITY is to actually request CQEngine to leverage the index (more details here).
Regarding "or iterator of the index in general":
You can use the getIndexes() API of the QueryEngine interface to retrieve the set of indexes.
Example code:
IndexedCollection<Car> indexedCollection = new ConcurrentIndexedCollection<Car>();
indexedCollection.addIndex(HashIndex.onAttribute(Car.CAR_ID), noQueryOptions());
List<Index<Car>> indexes = new ArrayList<Index<Car>>();
for (Index<Car> index : indexedCollection.getIndexes()) {
indexes.add(index);
}
NavigableIndex stores objects in a map, with the attribute value as the key and a set of objects as the value.
NavigableIndex does not maintain insertion order, so the first element of the index could be anything.
CQEngine is designed for random access to objects in a collection, not sequential access.
Normal Java collections are better suited to sequential access by index.
One elegant way of accessing the first element is to create a SequentialIndex class and add it to the concurrent collection, then retrieve the element using the index as the query.
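To illustrate the "attribute value as key, set of objects as value" layout outside CQEngine, here is a plain-Java sketch built on ConcurrentSkipListMap. It is not CQEngine API; the class SortedIndexSketch and its methods are hypothetical. It shows that a map sorted by key gives the first or last object by attribute value, independent of insertion order.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sorted index: attribute value -> objects having that value. */
public class SortedIndexSketch<K extends Comparable<K>, V> {
    private final ConcurrentSkipListMap<K, Set<V>> index = new ConcurrentSkipListMap<>();

    public void add(K key, V value) {
        index.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(value);
    }

    /** Returns some object stored under the smallest key, or null if the index is empty. */
    public V first() {
        return anyOf(index.firstEntry());
    }

    /** Returns some object stored under the largest key, or null if the index is empty. */
    public V last() {
        return anyOf(index.lastEntry());
    }

    private V anyOf(Map.Entry<K, Set<V>> entry) {
        if (entry == null) {
            return null;
        }
        for (V value : entry.getValue()) {
            return value; // several objects may share a key; pick any one
        }
        return null;
    }
}
```

When several objects share the same key, which of them is returned is unspecified, so a unique attribute (such as the unique timestamp in the question) gives a well-defined first and last.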

jOOQ: returning list with join,groupby and count in single object

Core question: how do you properly fetch information from a query into objects?
Idea
I am creating functions in my DAO, which comes down to the following query:
select A.*, count(*)
from A
left join B on B.aId = A.aId
group by A.*
I'm looking for a way to create a jOOQ expression that just gives me a list (or something I can loop over) of pairs of an A object (POJO) and an Integer.
Concrete case
In my case: A = Volunteer and B = VolunteerMatch, where I store several matches for each volunteer. B has (volunteerId, volunteerMatchId) as primary key. Thus this query returns both the information from the Volunteer and the number of matches. Clearly this can be done in two separate queries, but I want to do it as one!
Problem
I cannot find a single object to return from my function. I am trying to get something like List<VolunteerPojo, Integer>. Let me explain this better using examples and why they don't fit for me.
What I tried 1
SelectHavingStep<Record> query = using(configuration())
.select(Volunteer.VOLUNTEER.fields())
.select(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count())
.from(Volunteer.VOLUNTEER)
.leftJoin(Volunteermatch.VOLUNTEERMATCH).on(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.eq(Volunteer.VOLUNTEER.VOLUNTEERID))
.groupBy(Volunteer.VOLUNTEER.fields());
Map<VolunteerPojo, List<Integer>> map = query.fetchGroups(
r -> r.into(Volunteer.VOLUNTEER).into(VolunteerPojo.class),
r -> r.into(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count()).into(Integer.class)
);
The problem with this is that I create a List of integers for each volunteer. That is not what I want; I want a single integer (the count will always return one row). Note: I don't want the solution "just create your own map without list", since my gut says there is a solution inside jOOQ. I'm here to learn!
What I tried 2
SelectHavingStep<Record> query = using(configuration())
.select(Volunteer.VOLUNTEER.fields())
.select(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count())
.from(Volunteer.VOLUNTEER)
.leftJoin(Volunteermatch.VOLUNTEERMATCH).on(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.eq(Volunteer.VOLUNTEER.VOLUNTEERID))
.groupBy(Volunteer.VOLUNTEER.fields());
Result<Record> result = query.fetch();
for (Record r : result) {
VolunteerPojo volunteerPojo = r.into(Volunteer.VOLUNTEER).into(VolunteerPojo.class);
Integer count = r.into(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count()).into(Integer.class);
}
However, I do not want to return the Result object from my function. Everywhere I call this function, I would have to call r.into(...).into(...), and the compiler cannot check whether that yields an integer or a real POJO. I want to avoid this to prevent future errors. But at least it doesn't give me a List, I suppose.
Reasoning
Either option is probably fine, but I have the feeling there is something better that I missed in the documentation. Maybe I can adapt (1) to not get a list of integers. Maybe I can change Result<Record> into something like Result<VolunteerPojo, Integer> to indicate what objects really are returned. A solution for each problem would be nice, since I am using jOOQ more and more and this would be a good learning experience!
So close! Don't use ResultQuery.fetchGroups(). Use ResultQuery.fetchMap() instead:
Map<VolunteerPojo, Integer> map =
using(configuration())
.select(VOLUNTEER.fields())
.select(VOLUNTEERMATCH.VOLUNTEERID.count())
.from(VOLUNTEER)
.leftJoin(VOLUNTEERMATCH)
.on(VOLUNTEERMATCH.VOLUNTEERID.eq(VOLUNTEER.VOLUNTEERID))
.groupBy(VOLUNTEER.fields())
.fetchMap(
r -> r.into(VOLUNTEER).into(VolunteerPojo.class),
r -> r.get(VOLUNTEERMATCH.VOLUNTEERID.count())
);
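The difference between the two result shapes can be seen without a database. The sketch below is a plain-Java analogy using the streams API, not jOOQ code: Collectors.groupingBy yields one list per key, like the Map<VolunteerPojo, List<Integer>> from fetchGroups(), while Collectors.toMap yields one value per key, like the Map<VolunteerPojo, Integer> from fetchMap(). The Row class and method names are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupsVsMap {
    /** One row of the grouped query: a volunteer name and its match count. */
    public static final class Row {
        public final String volunteer;
        public final int matches;

        public Row(String volunteer, int matches) {
            this.volunteer = volunteer;
            this.matches = matches;
        }
    }

    /** Shape of a grouping fetch: every key maps to a list, even if it holds one element. */
    public static Map<String, List<Integer>> asGroups(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r.volunteer,
                Collectors.mapping(r -> r.matches, Collectors.toList())));
    }

    /** Shape of a map fetch: every key maps to exactly one value. */
    public static Map<String, Integer> asMap(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(r -> r.volunteer, r -> r.matches));
    }
}
```

Collectors.toMap throws an IllegalStateException when two rows share a key. In this query the GROUP BY produces one row per volunteer, so a one-value-per-key map is the natural fit.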

Cull all duplicates in a set

I'm using Set to isolate the unique values of a List (in this case, I'm getting a set of points):
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
This will return a set of unique points, but for every item in listToCull, I'd like to test the following: if there is a duplicate, cull all of the duplicate items. In other words, I want pointSet to represent the set of items in listToCull which are already unique (every item in pointSet had no duplicate in listToCull). Any ideas on how to implement?
EDIT - I think my first question needs more clarification. Below is some code which will execute what I'm asking for, but I'd like to know if there is a faster way. Assuming listToCull is a list of PVectors with duplicates:
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
List<PVector> uniqueItemsInListToCull = new ArrayList<PVector>();
for(PVector pt : pointSet){
int counter=0;
for(PVector ptCheck : listToCull){
if(pt.equals(ptCheck)){
counter++;
}
}
if(counter<2){
uniqueItemsInListToCull.add(pt);
}
}
uniqueItemsInListToCull will be different from pointSet. I'd like to do this without loops if possible.
You will have to do some programming yourself: Create two empty sets; one will contain the unique elements, the other the duplicates. Then loop through the elements of listToCull. For each element, check whether it is in the duplicate set. If it is, ignore it. Otherwise, check if it is in the unique element set. If it is, remove it from there and add it to the duplicates set. Otherwise, add it to the unique elements set.
If your PVector class has a good hashCode() method, HashSets are quite efficient, so the performance of this will not be too bad.
Untested:
Set<PVector> uniques = new HashSet<>();
Set<PVector> duplicates = new HashSet<>();
for (PVector p : listToCull) {
if (!duplicates.contains(p)) {
if (uniques.contains(p)) {
uniques.remove(p);
duplicates.add(p);
}
else {
uniques.add(p);
}
}
}
Alternatively, you may use a third-party library which offers a Bag or MultiSet. This allows you to count how many occurrences of each element are in the collection, and then at the end discard all elements where the count is different than 1.
What you are looking for is the intersection:
Assuming that PVector (terrible name by the way) implements hashCode() and equals() correctly, a Set will eliminate duplicates.
If you want an intersection of the List and an existing Set, create a Set from the List and then use Sets.intersection() from Guava to get the elements common to both sets.
public static <E> Sets.SetView<E> intersection(Set<E> set1, Set<?> set2)
Returns an unmodifiable view of the intersection of two sets. The returned set contains all
elements that are contained by both backing sets. The iteration order
of the returned set matches that of set1. Results are undefined if
set1 and set2 are sets based on different equivalence relations (as
HashSet, TreeSet, and the keySet of an IdentityHashMap all are).
Note: The returned view performs slightly better when set1 is the
smaller of the two sets. If you have reason to believe one of your
sets will generally be smaller than the other, pass it first.
Unfortunately, since this method sets the generic type of the returned
set based on the type of the first set passed, this could in rare
cases force you to make a cast, for example:
Set<Object> aFewBadObjects = ...
Set<String> manyBadStrings = ...
// impossible for a non-String to be in the intersection
@SuppressWarnings("unchecked")
Set<String> badStrings = (Set) Sets.intersection(
aFewBadObjects, manyBadStrings);
This is unfortunate, but should come up only very rarely.
You can also do union, complement, difference and cartesianProduct as well as filtering very easily.
So you want pointSet to hold the items in listToCull which have no duplicates? Is that right?
I would be inclined to create a Map, then iterate twice over the list, the first time putting a value of zero in for each PVector, the second time adding one to the value for each PVector, so at the end you have a map with counts. Now you're interested in the keys of the map for which the value is exactly equal to one.
It's not perfectly efficient - you're operating on list items more times than absolutely necessary - but it's quite clean and simple.
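A minimal sketch of the counting-map idea, written generically so it does not depend on PVector. The class and method names are hypothetical. It uses Map.merge, so a single pass over the list is enough to build the counts, and a LinkedHashMap so the result keeps the order of first appearance. It relies on the element type implementing equals() and hashCode() consistently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountingCull {
    /** Returns the elements that occur exactly once in items, in order of first appearance. */
    public static <T> List<T> keepNonDuplicated(List<T> items) {
        Map<T, Integer> counts = new LinkedHashMap<>();
        for (T item : items) {
            counts.merge(item, 1, Integer::sum); // first sighting stores 1, later ones add 1
        }
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

For the question's data this would be called as CountingCull.keepNonDuplicated(listToCull).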
OK, here's the solution I've come up with, I'm sure there are better ones out there but this one's working for me. Thanks to all who gave direction!
To get unique items, you can run a Set, where listToCull is a list of PVectors with duplicates:
List<PVector> culledList = new ArrayList<PVector>();
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
culledList.addAll(pointSet);
To go further, suppose you want a list where you've removed all items in listToCull which have a duplicate. You can iterate through the list and test whether each item is already in the set. This lets us do one loop, rather than a nested loop:
Set<PVector> pointSet = new HashSet<PVector>(); // start empty; filled while iterating
Set<PVector> removalList = new HashSet<PVector>(); // items seen more than once
for (PVector pt : listToCull) {
    if (pointSet.contains(pt)) {
        removalList.add(pt);
    }
    else {
        pointSet.add(pt);
    }
}
pointSet.removeAll(removalList);
List<PVector> onlyUniquePts = new ArrayList<PVector>();
onlyUniquePts.addAll(pointSet);

Creating a HashMap as an index for title keywords to improve search efficiency

I have a custom class Disks which stores various information of CDs such as their Title, Length, Artist etc. These Disks objects are stored in an ArrayList which can only have elements of Disks added. I am using a method to search for these objects based on matching their title. It takes a user input and then goes through each element of the list and compares the user keyword and the Title of the CD. If it is a complete match, its information is then returned to the user.
I want to change this search mechanism slightly by incorporating a HashMap. I am looking to tokenize each Disks title and then create a mapping entry for each keyword.
Here is an example: The word "Cars" appears in the titles of the ArrayList elements at position 0,5,7. I want to be able to create a mapping entry for "Cars" which will be a list [0,5,7]. If another element is added to the ArrayList at position 10 with "Cars" in the title, how would I amend the old mapping entry so the new list would be [0,5,7,10]?
In the end I want the user to search for title keywords “Loud Cars”. I will first find "loud" in the index to get a list of [0,7,5] (for example), and then find "cars" to get a list of [0,5,7,10]. Then, I will find where these lists intersect and return the ArrayList elements that correspond to these locations.
My current HashMap declaration looks like this: public HashMap<String, ArrayList<Integer>> map = new HashMap<>(); however, even when the key is different, the values stored in the ArrayList are the same because there is only one ArrayList.
My Disks ArrayList is: public ArrayList<Disks> items; Would there be a way to incorporate this ArrayList into the Value of the HashMap?
Add a new value to the index entry for "Cars"
map.get("Cars").add(10);
Safe way to do this (key = "Cars", index = 10):
ArrayList<Integer> entry = map.get(key);
if (entry == null) {
entry = new ArrayList<Integer>();
map.put(key, entry);
}
entry.add(index);
Instead of using
HashMap<String, ArrayList<Integer>>
I'd recommend
HashMap<String, HashSet<Integer>>
which automatically avoids duplicates.
When you search for multiple words, use retainAll to build the intersection of multiple sets (but copy the first set because retainAll is destructive):
Set<Integer> resultSet = new HashSet<Integer>();
resultSet.addAll(map.get("Cars"));
resultSet.retainAll(map.get("Loud"));
You would need to create a new ArrayList of Integer for every string that maps to a value. The first time an entry is used, you create a new list (you must check whether the string maps to null) and add the index at which the new Disks entry will be stored in your ArrayList of Disks to your ArrayList of Integer. Any time the string already maps to a list, you just add the index (its position in the Disks ArrayList) to that ArrayList of Integer.
Honestly, I think the best way for you to scale your solution is by using bloom filters or something sophisticated like this. This would require you to create complex hash codes, manage false positives, among other things.
That said, based on your design, I think you can simply have a hash map pointing to the Disks objects that are also stored in the array list.
public HashMap<String, ArrayList<Disks>> map
For the keyword "cars", you have a list of Disks objects. For the keyword "loud" you have another list of Disks objects. Just take both lists and find the intersection, using the retainAll() method.
Make sure to override hashCode() and equals() in Disks so all collections will work fine.
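Putting these pieces together, a keyword index could look like the sketch below. It is only an illustration under assumptions: it stores plain title strings instead of Disks objects, the class TitleIndex and its methods are hypothetical, and tokenizing means lower-casing and splitting on whitespace. Each keyword gets its own set of positions via computeIfAbsent, and a multi-word query is answered by intersecting those sets with retainAll on a copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TitleIndex {
    private final List<String> titles = new ArrayList<>();
    // keyword -> positions in "titles" whose title contains that keyword
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Appends a title and records its position under each of its keywords. */
    public void add(String title) {
        int position = titles.size();
        titles.add(title);
        for (String word : tokenize(title)) {
            // creates a separate set for each new keyword, then adds the position
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(position);
        }
    }

    /** Returns the titles that contain every keyword of the query. */
    public List<String> search(String query) {
        Set<Integer> positions = null;
        for (String word : tokenize(query)) {
            Set<Integer> hits = index.getOrDefault(word, Collections.emptySet());
            if (positions == null) {
                positions = new TreeSet<>(hits); // copy, because retainAll is destructive
            } else {
                positions.retainAll(hits);
            }
        }
        List<String> result = new ArrayList<>();
        if (positions != null) {
            for (int p : positions) {
                result.add(titles.get(p));
            }
        }
        return result;
    }

    private static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Because positions are stored, this sketch only works while titles are appended and never removed from the middle of the list; removing an element would shift the positions and require rebuilding the index.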

Java - WEKA - Add new categories to training set

I'm using WEKA to train a categorization Java program. There are initially several categories, let's say 10, and the system must work with those initial categories and start training. In order to do that...:
String [] categories = {"cat1", "cat2", ..., "cat10"};
public SomeClassifier(String[] categories) {
// Creates a FastVector of attributes.
FastVector attributes = new FastVector(3);
// Add attribute for holding property one.
attributes.addElement(new Attribute(P1_ATTRIBUTE, (FastVector) null));
// Add attribute for holding property two.
attributes.addElement(new Attribute(P2_ATTRIBUTE, (FastVector) null));
// Add values attribute.
FastVector values = new FastVector(categories.length);
for (int i = 0; i < categories.length; i++) {
values.addElement(categories[i]);
}
attributes.addElement(new Attribute(CATEGORY_ATTRIBUTE, values));
// Create dataset with initial capacity of 25, and set index
Instances myInstances = new Instances(SOME_NAME, attributes, 25);
myInstances.setClassIndex(myInstances.numAttributes() - 1);
}
OK, now, time goes by and I want to add a new category to my training set (let's say, "cat11"), which is already being trained with some success. How can I accomplish this? WEKA documentation says "Once an attribute has been created, it can't be changed".
So, maybe I can take out the Attribute from the Instances object, recreate the Attribute and then insert it again... or will that mess everything up?
Thanks in advance.
OK, apparently there is no way to do such a thing using this implementation of Naïve Bayes. This is because when the classifier is initialized, the probabilities of all categories given to it must sum to 1, and once the classifier is being trained, a new category with probability != 0 would cause it to behave strangely, with a sum > 1. Moreover, the classifier may initialize its algorithm (the calculation of conditional probabilities and the iterations) in a way that depends on the number of categories, so adding a new one after creation would mean rebuilding the algorithm in some way.
So, that leaves a question open... what classification mechanism can I use that allows me to introduce new categories over time?
