Result of semantic search on Java Lucene

I've implemented Latent Semantic Analysis (LSA) on Lucene.
The result of the algorithm is a matrix of 2 columns, where the first column is the index of the document and the second is the similarity.
I want to write this result into the org.apache.lucene.search.Collector that is passed to the search method of Searcher, but I don't know how to set the result in the collector object.
The code for the search method is:
public void search(Weight weight, Filter filter, Collector collector) throws IOException
{
    String textQuery = weight.getQuery().toString("contents");
    System.out.println(textQuery);
    double[][] ind = lsa.searchOnDoc(textQuery);
    // ind contains the document index and the similarity
    if (ind != null)
    {
        // construct the collector object
        for (int i = 0; i < ind.length; i++)
        {
            int doc = (int) ind[i][0];
            double simi = ind[i][1];
            //collector.collect(doc);
            //collector.setScorer(simi);
            //This is the problem
        }
    }
    else
    {
        collector = null;
    }
}
I don't know the right steps to copy the values of ind into the collector object.
Can you help me?

I don't quite get why you decided to shove LSI into Searcher.
And getting your text query back out of the Weight looks especially shady - why not use the original query instead and skip all the (broken) conversions?
But the Collector is handled as follows.
For each segment in your index:
Supply it the corresponding SegmentReader with collector.setNextReader(reader, base). You can get these with ir.getSequentialSubReaders() and ir.getSubReaderStarts() on the top-level reader. So,
reader may be used by the collector to load sort fields/caches during collection, and additional fields to augment the search result when collection is done;
base is the number added to segment-local docIDs (they start from 0 for each segment) to convert them to index-wide/global docIDs.
Supply it a Scorer implementation with collector.setScorer(scorer).
The collector may use it during the next phase to get the score for the documents. Though if the collector only counts the results, or sorts on some stored field, or just feels like it - the scorer will be ignored.
The only method collectors invoke on a Scorer instance is scorer.score(), which should return the score (I kid you not) for the current document being collected.
Repeatedly call collector.collect(id) with a monotonically increasing sequence of segment-local docIDs that match your query.
Going back to your code - make a wrapper that implements Scorer, use a single instance with a field that you update with simi on each iteration, have the wrapper's score() method return that field, and shove this instance into the collector with setScorer() before the loop.
You also need lsa.searchOnDoc to return per-segment results.
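A minimal sketch of that wrapper against the Lucene 3.x API (the class name SimilarityScorer is mine, and it assumes lsa.searchOnDoc already returns segment-local docIDs in increasing order):

// Hypothetical wrapper: collectors only ever call score() on it,
// so the DocIdSetIterator methods can be stubs.
class SimilarityScorer extends Scorer {
    float current;
    SimilarityScorer() { super((Similarity) null); } // 3.0.x signature; newer 3.x takes a Weight
    @Override public float score() { return current; }
    @Override public int docID() { return -1; }
    @Override public int nextDoc() { return NO_MORE_DOCS; }
    @Override public int advance(int target) { return NO_MORE_DOCS; }
}

// Inside search(), before the loop. A full implementation would also call
// collector.setNextReader(segmentReader, base) once per segment, as above.
SimilarityScorer scorer = new SimilarityScorer();
collector.setScorer(scorer);
for (int i = 0; i < ind.length; i++) {
    scorer.current = (float) ind[i][1]; // score() will return this value
    collector.collect((int) ind[i][0]); // docID must be segment-local
}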

Related

Java: Is the get method of an ArrayList cached?

Does the ArrayList object store the last requested value in memory to access it faster the next time? Or do I need to do this myself?
Or more concretely, in terms of performance, is it better to do this:
for (int i = 0; i < myArray.size(); i++) {
    int value = myArray.get(i);
    int result = value + 2 * value - 5 / value;
}
Instead of doing this:
for (int i = 0; i < myArray.size(); i++)
    int result = myArray.get(i) + 2 * myArray.get(i) - 5 / myArray.get(i);
In terms of performance, it doesn't matter one bit. No, ArrayList doesn't cache anything, although the JITted end result could be a different issue.
If you're wondering which version to use, use the first one. It's clearer.
You can answer your (first) question yourself by looking into the actual source:
public E get(int index) {
    rangeCheck(index);
    return elementData(index);
}
So: No, there is no caching taking place, but you can also see that there is not much of an impact in terms of performance, because the get method is essentially just an access to an array.
But it's still good to avoid multiple calls for several reasons:
int result = value + 2 * value - 5 / value is easier to understand (i.e. realizing that you use the same value three times in your calculation)
If you later decide to change the underlying list (e.g. to a LinkedList) you might end up with an impact on performance and then have to change your code to get around it.
As long as you don't synchronize the access to the list, repeated calls of get(index) might actually return different values if a call of set(index, value) takes place between two of them (even in small source blocks like this, it's possible to happen - BTST)
The second point also has a consequence for how to access all values of a list, which leads to the decision to avoid list.get(i) altogether if you're going to iterate over all elements in a list. In that case it's better to use the Iterator or streams:
Your code would then look like this:
Iterator<Integer> it = myArray.iterator();
while (it.hasNext()) {
    int value = it.next();
    int result = value + 2 * value - 5 / value;
}
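For completeness, a stream-based equivalent (a sketch, assuming Java 8+ and that myArray is a List<Integer>):

// Each element is visited exactly once; no index-based get(i) calls.
myArray.stream()
       .mapToInt(value -> value + 2 * value - 5 / value)
       .forEach(result -> System.out.println(result));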
LinkedList is very slow when trying to access elements by a specific index, but it can iterate quite fast from one element to the next, so the Iterator returned by LinkedList makes use of that, while the Iterator returned by ArrayList simply accesses the internal array (without the need for the repeated range check calls you can see in the get method above).

Java 8 Stream.findAny() vs finding a random element in the stream

In my Spring application, I have a Couchbase repository for a document type of QuoteOfTheDay. The document is very basic: it just has an id field of type UUID, a value field of type String, and a created date field of type Date.
In my service class, I have a method that returns a random quote of the day. Initially I tried simply doing the following, which returns an Optional<QuoteOfTheDay>, but it would seem that findAny() would pretty much always return the same element in the stream. There are only about 10 elements at the moment.
public Optional<QuoteOfTheDay> random() {
    return StreamSupport.stream(repository.findAll().spliterator(), false).findAny();
}
Since I wanted something more random, I implemented the following which just returns a QuoteOfTheDay.
public QuoteOfTheDay random() {
    int count = Long.valueOf(repository.count()).intValue();
    if (count > 0) {
        Random r = new Random();
        List<QuoteOfTheDay> quotes = StreamSupport.stream(repository.findAll().spliterator(), false)
            .collect(toList());
        return quotes.get(r.nextInt(count));
    } else {
        throw new IllegalStateException("No quotes found.");
    }
}
I'm just curious how the findAny() method of Stream actually works since it doesn't seem to be random.
Thanks.
The purpose of findAny() is to provide a more flexible alternative to findFirst(). If you are not interested in getting a specific element, this gives the implementing stream more flexibility in case it is a parallel stream.
No effort will be made to randomize the returned element; it just doesn't give the same guarantees as findFirst(), and might therefore be faster.
This is what the Javadoc says on the subject:
The behavior of this operation is explicitly nondeterministic; it is free to select any element in the stream. This is to allow for maximal performance in parallel operations; the cost is that multiple invocations on the same source may not return the same result. (If a stable result is desired, use findFirst() instead.)
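A quick way to see the difference (an illustrative sketch, not from the question):

import java.util.stream.IntStream;

// findFirst() on this ordered parallel stream is always 1; findAny() is free
// to return whichever element a worker thread reaches first, and often varies.
int first = IntStream.rangeClosed(1, 1_000_000).parallel().findFirst().getAsInt();
int any   = IntStream.rangeClosed(1, 1_000_000).parallel().findAny().getAsInt();
System.out.println(first + " / " + any); // e.g. "1 / 437501" - 'any' may differ per run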
Don’t collect into a List when all you want is a single item. Just pick one item from the stream. By picking the item via Stream operations you can even handle counts bigger than Integer.MAX_VALUE and don’t need the “interesting” way of hiding the fact that you are casting a long to an int (that Long.valueOf(repository.count()).intValue() thing).
public Optional<QuoteOfTheDay> random() {
    long count = repository.count();
    if (count == 0) return Optional.empty();
    Random r = new Random();
    long randomIndex = count <= Integer.MAX_VALUE ? r.nextInt((int) count)
        : r.longs(1, 0, count).findFirst().orElseThrow(AssertionError::new);
    return StreamSupport.stream(repository.findAll().spliterator(), false)
        .skip(randomIndex).findFirst();
}

Efficient search in datastructure ArrayList

I have an ArrayList which contains my nodes. A node has a source, a target, and a cost. I have to iterate over the whole ArrayList, which takes a while for over 1000 nodes. Therefore I tried sorting my list by source, and then using binary search to find the corresponding pair in the list. Unfortunately that only works if I compare either source or target, but I have to compare both to get the right pair. Is there another possibility to search an ArrayList efficiently?
Unfortunately, no. ArrayLists are not made to be efficiently searched; they are used to store data, not to search it. If you merely want to know whether an item is contained, I would suggest you use a HashSet, as the lookup will have a time complexity of O(1) instead of O(n) for the ArrayList (assuming that you have implemented a functioning equals method for your objects).
If you want to do fast searches for objects, I recommend using a dictionary-like implementation such as HashMap. If you can afford the space requirement, you can have multiple maps, each with different keys, to have a fast lookup of your object no matter what key you have to search for. Keep in mind that the lookup also requires implementing a correct equals method. Unfortunately, this requires that each key be unique, which may not be a brilliant idea in your case.
However, you can use a HashMap to store, for each source, a list of the nodes that have the keyed source as a source. You can do the same for cost and target. That way you can substantially reduce the number of nodes you need to iterate over. This should prove to be a good solution for a sparsely connected network.
private HashMap<Source, ArrayList<Node>> sourceMap = new HashMap<Source, ArrayList<Node>>();
private HashMap<Target, ArrayList<Node>> targetMap = new HashMap<Target, ArrayList<Node>>();
private HashMap<Cost, ArrayList<Node>> costMap = new HashMap<Cost, ArrayList<Node>>();

/** Look for a node with a given source */
for (Node node : sourceMap.get(keySource))
{
    /** Test the node for equality with a given node. Equals method below */
    if (node.equals(nodeYouAreLookingFor)) { return node; }
}
In order to be sure that your code will work, be sure to override the equals method. I know I have said so already, but this is a very common mistake.
@Override
public boolean equals(Object object)
{
    if (object instanceof Node)
    {
        Node node = (Node) object;
        return source.equals(node.getSource()) && target.equals(node.getTarget());
    }
    return false;
}
If you don't, the test will simply compare references which may or may not be equal depending on how you handle your objects.
Edit: Just read what you base your equality upon. The equals method should be implemented in your Node class. However, for it to work, you need to implement and override the equals method for the source and target too, that is, if they are objects. Be watchful though: if they are Nodes too, this may result in quite some tests spanning all of the network.
Update: Added code to reflect the purpose of the code in the comments.
ArrayList<Node> matchingNodes = new ArrayList<Node>(sourceMap.get(desiredSource));
matchingNodes.retainAll(targetMap.get(desiredTarget));
Now you have a list of all nodes that match the source and target criteria. Provided that you are willing to sacrifice a bit of memory, the lookup above will have a complexity of O(|sourceMap| * (|sourceMap| + |targetMap|)) [1]. While this is superior to a linear lookup over all nodes, O(|allNodeList|), if your network is big enough - and with 1000 nodes I think it is - you could benefit a lot. If your network resembles a naturally occurring network then, as Albert-László Barabási has shown, it is likely scale-free. This means that splitting your network into lists by at least source and target will likely (I have no proof for this) result in a scale-free size distribution of these lists. Therefore, I believe the complexity of looking up source and target will be substantially reduced, as |sourceMap| and |targetMap| should be substantially lower than |allNodeList|.
You'll need to combine the source and target into a single comparator, e.g.
public int compare(T o1, T o2) {
    if (o1.source < o2.source) { return -1; }
    else if (o1.source > o2.source) { return 1; }
    // else o1.source == o2.source
    else if (o1.target < o2.target) { return -1; }
    else if (o1.target > o2.target) { return 1; }
    else { return 0; }
}
You can use the compareTo() method to compare your nodes.
You can create two ArrayLists. The first sorted by source, the second sorted by target.
Then you can search by source or target using binarySearch on the corresponding List.
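A sketch of that idea (it assumes Node exposes an int source field and a no-argument probe constructor; both are illustrative):

// Sort a copy by source once, then binary-search it per lookup: O(log n).
List<Node> bySource = new ArrayList<Node>(nodes);
Comparator<Node> bySourceCmp = new Comparator<Node>() {
    public int compare(Node a, Node b) { return Integer.compare(a.source, b.source); }
};
Collections.sort(bySource, bySourceCmp);

Node probe = new Node();          // hypothetical probe carrying only the key
probe.source = desiredSource;
int i = Collections.binarySearch(bySource, probe, bySourceCmp);
if (i >= 0) { /* bySource.get(i) has the desired source */ }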
You can make a helper class to store source-target pairs:
class SourceTarget {
    public final Source source; // public fields are OK when they're final and immutable.
    public final Target target; // you can use getters but I'm lazy
    // (don't give this object setters. Map keys should ideally be immutable)

    public SourceTarget( Source s, Target t ){
        source = s;
        target = t;
    }

    @Override
    public boolean equals( Object other ){
        // Only equal when both source and target are equal
        if( !(other instanceof SourceTarget) ){ return false; }
        SourceTarget that = (SourceTarget) other;
        return source.equals(that.source) && target.equals(that.target);
    }

    @Override
    public int hashCode(){
        // Consistent with equals: equal pairs have equal hash codes
        return 31 * source.hashCode() + target.hashCode();
    }
}
Then store your things in a HashMap<SourceTarget, List<Node>>, with each source-target pair mapped to the list of nodes that have exactly that source-target pair.
To retrieve just use
List<Node> results = map.get( new SourceTarget( node.source, node.target ) );
Alternatively to making a helper class, you can use the comparator in Zim-Zam's answer and a TreeMap<Node,List<Node>> with a representative Node object acting as the SourceTarget pair.
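A sketch of that alternative (bySourceThenTarget being a comparator like the one above; the representative node is illustrative):

// Nodes grouped by (source, target); any node carrying the right pair works as a key.
TreeMap<Node, List<Node>> map = new TreeMap<Node, List<Node>>(bySourceThenTarget);
// ... populate map ...
List<Node> results = map.get(representativeNode);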

Updating both a ConcurrentHashMap and an AtomicInteger safely

I have to store words and their corresponding integer indices in a hash map. The hash map will be updated concurrently.
For example: let's say the wordList is {a,b,c,a,d,e,a,d,e,b}
The hash map will then contain the following key-value pairs:
a:1
b:2
c:3
d:4
e:5
The code for this is as follows:
public class Dictionary {
    private ConcurrentMap<String, Integer> wordToIndex;
    private AtomicInteger maxIndex;

    public Dictionary( int startFrom ) {
        wordToIndex = new ConcurrentHashMap<String, Integer>();
        this.maxIndex = new AtomicInteger(startFrom);
    }

    public void insertAndComputeIndices( List<String> words ) {
        Integer index;
        // iterate over the list of words
        for ( String word : words ) {
            // check if the word exists in the Map;
            // if it does not exist, increment the maxIndex and put it in the
            // Map if it is still absent;
            // set the maxIndex to the newly inserted index
            if (!wordToIndex.containsKey(word)) {
                index = maxIndex.incrementAndGet();
                index = wordToIndex.putIfAbsent(word, index);
                if (index != null)
                    maxIndex.set(index);
            }
        }
    }
}
My question is whether the above class is thread-safe or not.
Basically, the atomic operation in this case should be to increment the maxIndex and then put the word in the hash map if it is absent.
Is there a better way to achieve concurrency in this situation?
Clearly another thread can see maxIndex incrementing and then getting clobbered.
Assuming this is all that is going on with the map (in particular, no removes), then you could try putting the word in the map and only incrementing if that succeeds.
Integer oldIndex = wordToIndex.putIfAbsent(word, -1);
if (oldIndex == null) {
    wordToIndex.put(word, maxIndex.incrementAndGet());
}
(Alternatively for a single put, use some sort of mutable type in place of Integer.)
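On Java 8 and later there is also a one-liner alternative (a sketch, assuming Java 8+ is available): ConcurrentHashMap.computeIfAbsent runs the mapping function at most once per absent key, atomically, so the placeholder trick isn't needed.

// Each new word claims exactly one fresh index; existing words are left untouched.
wordToIndex.computeIfAbsent(word, w -> maxIndex.incrementAndGet());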
No, it is not. If you have two methods A and B, both thread-safe, this of course does not mean that calling A and B in sequence is also thread-safe, since a thread can interleave with another one between the two calls. This is what happens here:
if (!wordToIndex.containsKey(word)) {
    index = maxIndex.incrementAndGet();
    index = wordToIndex.putIfAbsent(word, index);
    if (index != null)
        maxIndex.set(index);
}
Thread A verifies that wordToIndex does not contain the word "dog" and proceeds inside the if. Before it can add the word "dog", thread B also finds that "dog" is not in the map (A did not add it yet) so it also proceeds inside the if. Now you have the word "dog" trying to be inserted twice.
Of course, putIfAbsent will guarantee that only one thread can add it, but I think that your goal is to not have two threads enter the if at the same time with the same key.
AtomicInteger is something you should consider using.
And you should wrap all the code that needs to happen as a transaction in a synchronized(this) block.
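A sketch of what that could look like here (assuming the Dictionary class from the question; the lock serializes the whole check-then-act at the cost of contention):

public void insertAndComputeIndices( List<String> words ) {
    for ( String word : words ) {
        synchronized (this) {
            // containsKey + increment + put now execute as one atomic unit
            if (!wordToIndex.containsKey(word)) {
                wordToIndex.put(word, maxIndex.incrementAndGet());
            }
        }
    }
}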
The other answers are correct --- the combination of operations in your class is not thread-safe. To start, here is how to implement the threading:
1) Make sure everything internal is private, although this is not a requirement of thread-safe code.
2) Find all of your accessor methods and make sure they are synchronized whenever the state of the global object is modified (OR AT LEAST THE IF BLOCK IS SYNCHRONIZED).
3) Test for deadlocks or bad counts; this can be done in a unit test by making sure the value of maxIndex is correct after 10000 threaded inserts, for example...

What is a data structure kind of like a hash table, but infrequently-used keys are deleted?

I am looking for a data structure that operates similarly to a hash table, but where the table has a size limit. When the number of items in the hash reaches the size limit, a culling function should be called to get rid of the least-retrieved key/value pairs in the table.
Here's some pseudocode of what I'm working on:
class MyClass {
    private Map<Integer, Integer> cache = new HashMap<Integer, Integer>();

    public int myFunc(int n) {
        if (cache.containsKey(n))
            return cache.get(n);
        int next = . . . ; // some complicated math. guaranteed next != n.
        int ret = 1 + myFunc(next);
        cache.put(n, ret);
        return ret;
    }
}
What happens is that there are some values of n for which myFunc() will be called lots of times, but many other values of n which will only be computed once. So the cache could fill up with millions of values that are never needed again. I'd like to have a way for the cache to automatically remove elements that are not frequently retrieved.
This feels like a problem that must be solved already, but I'm not sure what the data structure is that I would use to do it efficiently. Can anyone point me in the right direction?
Update I knew this had to be an already-solved problem. It's called an LRU Cache and is easy to make by extending the LinkedHashMap class. Here is the code that incorporates the solution:
class MyClass {
    private final static int SIZE_LIMIT = 1000;

    private Map<Integer, Integer> cache =
        new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest)
            {
                return size() > SIZE_LIMIT;
            }
        };

    public int myFunc(int n) {
        if (cache.containsKey(n))
            return cache.get(n);
        int next = . . . ; // some complicated math. guaranteed next != n.
        int ret = 1 + myFunc(next);
        cache.put(n, ret);
        return ret;
    }
}
You are looking for an LRU list/map. Check out LinkedHashMap:
The removeEldestEntry(Map.Entry) method may be overridden to impose a policy for removing stale mappings automatically when new mappings are added to the map.
Googling "LRU map" and "I'm feeling lucky" gives you this:
http://commons.apache.org/proper/commons-collections//javadocs/api-release/org/apache/commons/collections4/map/LRUMap.html
A Map implementation with a fixed maximum size which removes the least recently used entry if an entry is added when full.
Sounds pretty much spot on :)
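Usage is a one-liner (a sketch, assuming commons-collections4 is on the classpath):

import org.apache.commons.collections4.map.LRUMap;

// Capped at 1000 entries; the least recently accessed entry is evicted when full.
Map<Integer, Integer> cache = new LRUMap<Integer, Integer>(1000);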
WeakHashMap will probably not do what you expect it to... read the documentation carefully and make sure you know exactly what you get from weak and strong references.
I would recommend you have a look at java.util.LinkedHashMap and use its removeEldestEntry method to maintain your cache. If your math is very resource intensive, you might want to move entries to the front whenever they are used to ensure that only unused entries fall to the end of the set.
The Adaptive Replacement Cache policy is designed to keep one-time requests from polluting your cache. This may be fancier than you're looking for, but it does directly address your "filling up with values that are never needed again".
Take a look at WeakHashMap
You probably want to implement a Least-Recently Used policy for your map. There's a simple way to do it on top of a LinkedHashMap:
http://www.roseindia.net/java/example/java/util/LRUCacheExample.shtml
