Which data structure should be chosen? [Android Dictionary] - java

After several hours of searching the internet, I'm still not sure about anything. My problem is: I want to implement a dictionary on Android devices (Java-based). My requirements are speed first, then memory efficiency, but I can't decide which data structure to use for searching.
I have a list of data structures; help me understand them and choose one:
Ternary tree
Trie
Aho–Corasick tree
[...your suggested DS...]
And it would be very kind if somebody could guide me on retrieving a word's results (many fields: pronunciation, meaning, example sentences...) after we find it. Should we save this info in another data file?

You need to list the major concerns of your design before searching for data structures. What functions does this dictionary offer? What are its major features? Fast search? Space compactness? Insertion/deletion friendliness? Cross-referencing friendliness? Only when you have these in mind can you measure how good a candidate structure is.

It can be implemented in several ways; one of them is a trie. The route is represented by the digits, and the nodes point to collections of words. The usage of tries is explained here.
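For example, a minimal trie sketch in Java (illustrative, not a definitive design): each node stores its children keyed by character and, where a complete word ends, an offset into a separate entry file holding the pronunciation, meaning, and example sentences, which also addresses the multi-field storage part of the question.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal trie sketch. A terminal node stores an offset into a
    // separate entry file with pronunciation, meaning, examples.
    class Trie {
        private static class Node {
            Map<Character, Node> children = new HashMap<>();
            long entryOffset = -1; // -1 means "no word ends here"
        }

        private final Node root = new Node();

        void insert(String word, long entryOffset) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.entryOffset = entryOffset;
        }

        // Returns the entry-file offset for the word, or -1 if absent.
        long lookup(String word) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.get(c);
                if (cur == null) return -1;
            }
            return cur.entryOffset;
        }
    }

Lookup cost is O(length of the word) regardless of how many words are stored, which is why tries are popular for dictionaries.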

I agree with Hunter Mcmillen's comment. If you need the words sorted alphabetically like a regular dictionary, you can use Java's TreeMap, which is a SortedMap.
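A quick sketch of that approach (the entry values here are plain strings; a real dictionary would use a richer entry type):

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class TreeMapDictionary {
        public static void main(String[] args) {
            // TreeMap keeps keys sorted, so iteration is alphabetical.
            TreeMap<String, String> dict = new TreeMap<>();
            dict.put("apple", "a round fruit");
            dict.put("apply", "to put to use");
            dict.put("banana", "a long yellow fruit");

            System.out.println(dict.get("apple")); // exact lookup, O(log n)

            // All words starting with "app": keys in ["app", "app\uffff").
            SortedMap<String, String> prefix = dict.subMap("app", "app\uffff");
            System.out.println(prefix.keySet()); // [apple, apply]
        }
    }

The subMap trick gives cheap prefix listings, which a plain HashMap cannot do.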

Related

How to search a list of coordinates efficiently

I'm trying to find an efficient way to check whether a Coordinate (which may not be in the array) is present in an array of coordinates. I don't want to scan the entire array; I'd like a better solution. Can anyone help me?
As Chief Two Pencils commented, this only works if you have some kind of ordering principle on your array. There are lots of good data types you could use to help you with this: Range trees, Quadtrees, and k-d Trees are a few that spring to mind.
If you can't change the structure of your data, you still have options. I can imagine an algorithm where you filter by x-coordinate, and then filter those by y-coordinate, and the performance wouldn't even be terrible.
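As a hedged sketch of that idea (the Coordinate type is assumed): if you keep the array sorted by x, the first filter becomes a binary search, and only the matching x-run needs its y values checked.

    import java.util.List;

    class Coordinate {
        final int x, y;
        Coordinate(int x, int y) { this.x = x; this.y = y; }
    }

    class CoordinateSearch {
        // Assumes 'points' is sorted by x: binary-search to the first
        // point with the target x, then scan only that run checking y.
        static boolean contains(List<Coordinate> points, Coordinate target) {
            int lo = 0, hi = points.size();
            while (lo < hi) { // lower bound: first index with x >= target.x
                int mid = (lo + hi) >>> 1;
                if (points.get(mid).x < target.x) lo = mid + 1; else hi = mid;
            }
            for (int i = lo; i < points.size() && points.get(i).x == target.x; i++) {
                if (points.get(i).y == target.y) return true;
            }
            return false;
        }
    }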

Which Java data object to use for multidimensional range matching?

Project Background:
I am writing a map tile overlay class for Java that can use gdal2tile.py tiles. Basically I will end up with thousands of jpg files in a file structure like
"Zoom Level/X coordinate/Y coordinate"
The coordinates are ints but will not necessarily start at 0 or 1.
I will have to search for tiles that are within a certain range to find out which ones I need to render.
My Problem:
I tried iterating using the file structure itself but it is wicked slow (not surprising).
I tried iterating using an ArrayList of strings of the file structure and .contains() but it seems to be even slower (not too surprising).
Optimally I would like to use a data structure that would let me choose a range on multiple dimensions so that I can call something like:
Tiles.getWhere(zoomLevel, minX, maxX, minY, maxY);
I assume that some sort of Collection or TreeMap would be the right choice but I'm not experienced enough with Java to know for sure and I'd prefer not to have to benchmark a lot of different approaches.
I could use SQLite to do it but that seems like overkill.
My Question:
What is the most efficient way to check for the existence of datasets given multiple dimensional constraints?
Maybe you are looking for a map with multiple keys.
Commons-collections provides a map with multiple lookup keys:
http://commons.apache.org/collections/apidocs/org/apache/commons/collections/map/MultiKeyMap.html
A hash-based map gives O(1) average insertion and lookup times.
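A quick usage sketch against the commons-collections 3.x API (note that a hash-based MultiKeyMap answers exact-key lookups only; the range part of the question would still need separate handling):

    import org.apache.commons.collections.map.MultiKeyMap;

    public class TileIndex {
        public static void main(String[] args) {
            // commons-collections 3.x is pre-generics; keys are (zoom, x, y).
            MultiKeyMap tiles = new MultiKeyMap();
            tiles.put(10, 512, 384, "tiles/10/512/384.jpg");

            // Exact lookup by the composite key:
            String path = (String) tiles.get(10, 512, 384);
            System.out.println(path);
        }
    }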
Thinking about your problem, I can see three directions in which you could aim your search next (this is not a hand-by-hand guide but rather an out-of-the-box brain opener for the stuck situation you've faced):
1) Use Java's built-in structures. Yes, indeed, a list is the worst case for searching. A Map, as the name suggests, is far more convenient for maps. And it is not only the name: indexing into a Map is significantly less time-consuming than scanning a List. You can imagine your data as a cube, where you have to handle about half of the dots inside it if you use a List, and probably only a narrow layer of it when you search by indexing a Map. There is a magnitude of difference. So, my answer here: Map is a keyword pointing in the correct direction (assuming you want to do it this way after reading the rest of my answer).
2) Use a map-server solution. This is probably too far from your approach, but entire frameworks are made for solving your type of question. An example is GeoServer. It has a ready-made solution for the entire problem and is a stable answer to the potentially big problem in your hands: showing a map to a user from a source.
3) Sticking with the GDAL framework you were using, you could pick a slightly different py-file, like gdal_proximity.py, and - wow! - you have a search capability in hand! This particular one searches by a center point and a distance, but it will do the stuff you need =)
That's a starting point for how I would do it. Could this serve for something?
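To make direction 1 concrete, here is a hedged sketch of the getWhere-style call from the question using nested TreeMaps (the zoom -> x -> y -> path layout is an assumption about how the tiles would be indexed):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class Tiles {
        // zoom -> x -> y -> path of the jpg tile
        private final Map<Integer, TreeMap<Integer, TreeMap<Integer, String>>> index =
                new TreeMap<>();

        public void put(int zoom, int x, int y, String path) {
            index.computeIfAbsent(zoom, z -> new TreeMap<>())
                 .computeIfAbsent(x, xx -> new TreeMap<>())
                 .put(y, path);
        }

        // Range query: TreeMap.subMap prunes each dimension in O(log n)
        // instead of scanning everything.
        public List<String> getWhere(int zoom, int minX, int maxX, int minY, int maxY) {
            List<String> result = new ArrayList<>();
            TreeMap<Integer, TreeMap<Integer, String>> byX = index.get(zoom);
            if (byX == null) return result;
            for (TreeMap<Integer, String> byY : byX.subMap(minX, true, maxX, true).values()) {
                result.addAll(byY.subMap(minY, true, maxY, true).values());
            }
            return result;
        }
    }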
Sounds to me like you are looking for something like an Interval Tree.
http://en.wikipedia.org/wiki/Interval_tree
I have implemented one of these in the past but only in one dimension. The Wikipedia reference mentions extensions to more dimensions.
Paul

Tagging of names using lucene/java

I have the names of all the employees of my company (5000+). I want to write an engine which can find names on the fly in online articles (blogs/wikis/help documents) and tag them with a "mailto" tag linking to the user's email.
As of now I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even then I see a lot of queries hitting the index; for example, if an article has 2000 words and only two references to people's names, there will most probably be 1000 Lucene queries.
Is there a way to reduce these queries? Or a completely different way of achieving the same?
Thanks in advance
If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
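A sketch of that in-memory approach (indexing a couple of key variants, as suggested; the details are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class NameDirectory {
        private final Map<String, String> nameToEmail = new HashMap<>();

        // Index each employee under several normalized key variants.
        public void add(String first, String last, String email) {
            nameToEmail.put((first + " " + last).toLowerCase(), email); // first-last
            nameToEmail.put((last + " " + first).toLowerCase(), email); // last-first
        }

        // Returns the email for a candidate name, or null if unknown.
        public String lookup(String candidate) {
            return nameToEmail.get(candidate.toLowerCase());
        }
    }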
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way it works is that you first compile the entire list of names into a giant finite state machine (which would probably take a while), but once that state machine is built, you can run as many documents as you want through it and detect names pretty efficiently.
It looks at every character of each document only once, so it should be much more efficient than tokenizing the document and comparing each word against a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
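A compact sketch of the algorithm in Java (illustrative; case folding and word-boundary checks are left out):

    import java.util.*;

    class AhoCorasick {
        static class Node {
            Map<Character, Node> next = new HashMap<>();
            Node fail;                                // failure link
            List<String> output = new ArrayList<>();  // patterns ending here
        }

        private final Node root = new Node();

        void add(String pattern) {
            Node cur = root;
            for (char c : pattern.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.output.add(pattern);
        }

        // Build failure links breadth-first, as in the classic construction.
        void build() {
            Deque<Node> queue = new ArrayDeque<>();
            for (Node child : root.next.values()) {
                child.fail = root;
                queue.add(child);
            }
            while (!queue.isEmpty()) {
                Node node = queue.poll();
                for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                    Node child = e.getValue();
                    Node f = node.fail;
                    while (f != null && !f.next.containsKey(e.getKey())) f = f.fail;
                    child.fail = (f == null) ? root : f.next.get(e.getKey());
                    child.output.addAll(child.fail.output);
                    queue.add(child);
                }
            }
        }

        // One pass over the text; reports (end position -> matched names).
        Map<Integer, List<String>> search(String text) {
            Map<Integer, List<String>> hits = new TreeMap<>();
            Node cur = root;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
                cur = cur.next.getOrDefault(c, root);
                if (!cur.output.isEmpty()) hits.put(i, new ArrayList<>(cur.output));
            }
            return hits;
        }
    }

Build it once over the 5000 names, then every document is a single linear scan.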

Optimal on-disk data structure for searching a file?

I've spent a couple of hours reading related posts in a bid to come up with a solution, but I wasn't really successful.
So here goes: I was once asked in an interview which data structure I would use to check whether a particular word exists in a file. The file is supposedly too big to fit in memory, and the interviewer was really looking for an on-disk solution.
Is the B-Tree an on-disk data structure?
A Binary Search Tree is an in-memory data structure, isn't it?
There are really two different possible questions here:
Given a massive file, and a word, how do you check if the word exists in the file?
Given a massive file, how do you build an index so that you can efficiently check if an arbitrary word exists in the file?
The first problem is efficiently solved with Boyer-Moore and a linear search through the file. If you're only searching once, building an index is a complete waste of time.
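A minimal streaming scan in Java (String.contains stands in for a tuned Boyer-Moore; buffered reading keeps memory use constant no matter how big the file is):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class FileScan {
        // Stream the file line by line; only one buffer is in memory at a time.
        static boolean containsWord(String path, String word) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Substring check; a real solution might tokenize
                    // to avoid partial-word matches.
                    if (line.contains(word)) return true;
                }
                return false;
            }
        }
    }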
Regarding the second problem, it sounds like the interviewer is really pushing B-Trees.
Both are just data structures and can be either on-disk or in-memory; it depends on how you choose to use them.
By the way, B-Trees were motivated by the need for on-disk structures. Binary search trees are, in one way, just a special case of B-trees.
You want to use a data structure that maps one node to one page of disk space. This will minimize disk activity.
A B-Tree is typically used for this. See http://en.wikipedia.org/wiki/B-tree, specifically the section "Time to search a sorted file".
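As a back-of-the-envelope illustration (all numbers assumed): with 4 KB disk pages and roughly 16 bytes per key/pointer pair, one B-Tree node holds on the order of 256 entries, so an index over a million words needs only about log256(1,000,000) ≈ 3 page reads per lookup.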

Text processing / comparison engine

I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into sets of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality differs. This sort of technique seems to be used in full-text search, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an implementation available to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and Gate projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like, and just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
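For instance, a hedged sketch using Lucene's EnglishAnalyzer, which tokenizes, drops stop words, and applies a Porter-style stemmer (class names and constructors vary across Lucene versions):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class Stems {
        // Reduce a text to its set of stemmed, stop-word-filtered tokens.
        static Set<String> stems(String text) throws IOException {
            Set<String> result = new HashSet<>();
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    result.add(term.toString()); // "running", "runs" -> "run"
                }
                ts.end();
            }
            return result;
        }
    }

Comparing the two stem sets then gives a tense- and plurality-insensitive match percentage.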
Another thing to consider: just comparing the non-stop-words of the two documents might not be the best approach for good similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms that let you compare many documents at the same time this way, far more efficiently than doing a pairwise comparison between every pair of documents.
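A toy version of that chunking idea (the chunk size n and the Jaccard measure are illustrative choices, not a prescribed method):

    import java.util.*;

    public class Shingles {
        // Break a token list into overlapping n-token chunks ("shingles").
        static Set<String> shingles(List<String> tokens, int n) {
            Set<String> out = new HashSet<>();
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
            return out;
        }

        // Jaccard similarity of two shingle sets as a match percentage.
        static double similarity(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }
    }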
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code that does what you want), searching for "Porter Stemmer" should get you started on an algorithm that strips (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.
