I've spent a couple of hours reading related posts trying to come up with a solution, but without success.
So here goes: I was once asked in an interview which data structure I would use to check whether a particular word exists in a file. The file is supposedly too big to fit in memory, and the interviewer was really looking for an on-disk solution.
Is the B-Tree an on-disk data structure?
A Binary Search Tree is an in-memory data structure isn't it?
There are really two different possible questions here:
Given a massive file, and a word, how do you check if the word exists in the file?
Given a massive file, how do you build an index so that you can efficiently check if an arbitrary word exists in the file?
The first problem is solved efficiently with Boyer-Moore and a single linear scan through the file. If you're only searching once, building an index is a complete waste of time.
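A minimal sketch of that single-pass approach (using String.contains rather than an explicit Boyer-Moore implementation — the point is the bounded-memory linear scan, and it assumes the word never spans a line break):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// One-pass search for a word in a file too large to load whole.
// Reading line by line keeps memory bounded by the longest line.
public class GrepOnce {
    public static boolean contains(String path, String word) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(word)) return true; // stop at the first hit
            }
        }
        return false;
    }
}
```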
Regarding the second problem, it sounds like the interviewer is really pushing B-Trees.
Both are just data structures, and either can live on disk or in memory; it depends on how you choose to use them.
By the way, B-Trees were motivated by the need for on-disk structures. In one sense, binary search trees are just a special case of B-Trees.
You want to use a data structure that maps one node to one page of disk space, which minimizes disk activity; this is exactly what a B-Tree does. See http://en.wikipedia.org/wiki/B-tree, specifically the section "Time to search a sorted file".
Related
I'm creating a task to parse two large XML files and find a 1-1 relation between their elements. I can't keep a whole file in memory, and I have to "jump" around in the file to check up to n^2 combinations.
I'm wondering what approach I could take to navigate between nodes without killing my machine. I did some reading on StAX and liked the idea, but its cursor moves one way only, and I will have to go back to check different possibilities.
Could you suggest any other option? I need one that allows commercial use.
I'd probably read the first file into some sort of structured cache, then read the second XML document, checking each element against that cache (the cache could actually be a DB; it doesn't need to be in memory).
Otherwise there's no real solution that I know of, short of reading the whole file into memory. This should also perform better than going back and forth across the DOM of an XML document.
One solution would be an XML database. These usually have good join optimizers so as well as saving memory they may be able to avoid the O(n^2) elapsed time.
Another solution would be XSLT, using xsl:key to do "manual" optimization of the join logic.
If you explain the logic in more detail there may turn out to be other solutions using XSLT 3.0 streaming.
I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then read them back multiple times in non-sequential order.
The source records are generated into simple Java value objects, and I would like to read them back in the same format.
For now my best guess is to serialize these objects with a fast library such as Kryo into a flat file, then use a Java FileChannel for direct random access to read records at specific positions. While storing the data, I will keep a hashmap (also to be saved on disk) mapping each record to its position in the file, so that I know where to read it.
Also, there is no need to optimize disk space. My key concern is read performance, while keeping write performance reasonable (that, again, will happen only once).
One last precision: while the records are all of the same type (the same Java value object), their size in bytes is variable (e.g. they contain strings).
Is there a better approach than what I described? Any hint or suggestion would be greatly appreciated!
Many thanks,
Thomas
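For what it's worth, the offset-index idea described in the question might be sketched like this. Plain UTF-8 bytes stand in for Kryo-serialized records, and RecordStore and the long[] {offset, length} layout are illustrative choices, not a library API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Append variable-length records to a flat file, remember each record's
// offset and length in a map, then read any record back with positioned
// reads (no shared file-pointer state, so reads can be concurrent).
public class RecordStore {
    private final FileChannel channel;
    private final Map<Long, long[]> index = new HashMap<>(); // id -> {offset, length}

    public RecordStore(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    public void write(long id, byte[] record) throws IOException {
        long offset = channel.size();
        ByteBuffer src = ByteBuffer.wrap(record);
        long p = offset;
        while (src.hasRemaining()) p += channel.write(src, p); // handle short writes
        index.put(id, new long[]{offset, record.length});
    }

    public byte[] read(long id) throws IOException {
        long[] pos = index.get(id);
        ByteBuffer buf = ByteBuffer.allocate((int) pos[1]);
        long p = pos[0];
        while (buf.hasRemaining()) {                           // handle short reads
            int n = channel.read(buf, p);
            if (n < 0) break;
            p += n;
        }
        return buf.array();
    }
}
```

Persisting the index itself (the question's "hashmap also to be saved on disk") would be a separate serialization step.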
You can use Apache Lucene; it will take care of everything you mentioned above. It is super fast, and you can search results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps, and it is super fast.
You could just use an embedded Derby database. It's written in Java, and you can run it embedded within your process, so there is no overhead of inter-process or network communication. It will store the data and let you query it, handling all the complexity and indexing for you.
After several hours of searching the internet, I'm still not sure about anything. My problem: I want to implement a dictionary on Android devices (Java-based). My requirements are speed first, then memory efficiency, but I can't decide which data structure to use for searching.
I have a list of data structures; help me understand them and choose one:
Ternary tree
TRIE
Aho–Corasick tree
[...your suggested DS...]
It would also be very kind if somebody could advise me on retrieving the details of a word (many fields: pronunciation, meaning, example sentences...) after we find it. Should we save this info in a separate data file?
You need to list the major concerns of your design before shopping for data structures. What functions does this dictionary offer? What are its major features? Fast search? Space compactness? Insertion/deletion friendly? Cross-referencing friendly? Only with these in mind can you measure how good a candidate structure is.
A dictionary can be implemented in several ways; one of them is a trie. The path from the root is spelled out by the characters of the key, and the nodes point to collections of words. Usage of a trie is explained here.
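A minimal trie sketch along these lines, with each complete word carrying an offset into a separate entry file — the offset field is one assumption about how the extra fields (pronunciation, meaning, examples) could be linked:

```java
import java.util.HashMap;
import java.util.Map;

// Each node holds its children; nodes that end a complete word store an
// offset into a separate data file where the full entry lives.
public class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        long entryOffset = -1; // -1 means "not a complete word"
    }

    private final Node root = new Node();

    public void insert(String word, long entryOffset) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        cur.entryOffset = entryOffset;
    }

    /** Returns the entry's file offset, or -1 if the word is absent. */
    public long lookup(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return -1;
        }
        return cur.entryOffset;
    }
}
```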
I agree with Hunter Mcmillen's comment. If you need the words sorted alphabetically like a regular dictionary, you can use Java's TreeMap, which is a SortedMap.
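A quick illustration of the TreeMap route; subMap gives alphabetical prefix lookups for free (the String values here stand in for whatever entry object holds the word's details):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// TreeMap keeps keys sorted, so iteration is already alphabetical and
// range views double as prefix queries.
public class TreeDict {
    private final TreeMap<String, String> map = new TreeMap<>();

    public void put(String word, String entry) {
        map.put(word, entry);
    }

    /** All entries whose key starts with the given prefix. */
    public SortedMap<String, String> byPrefix(String prefix) {
        // Everything from "prefix" up to the last possible key with that prefix.
        return map.subMap(prefix, prefix + Character.MAX_VALUE);
    }
}
```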
I have the names of all the employees of my company (5000+). I want to write an engine that can find names in online articles (blogs/wikis/help documents) on the fly and tag them with a "mailto" link to the user's email.
As of now I plan to remove all stop words from the article and then look up each remaining word in a Lucene index. But even then I see a lot of queries hitting the index: for an article with 2000 words and only two references to people's names, there will most probably be around 1000 Lucene queries.
Is there a way to reduce these queries? Or a completely different way of achieving the same thing?
Thanks in advance
If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
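A sketch of that in-memory approach — single-token names only, for simplicity; hashing nicknames and first-last/last-first variants works the same way:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Load the ~5000 names into a HashSet once, then tokenize each article
// and test every token with an O(1) lookup instead of issuing per-word
// index queries.
public class NameFinder {
    private final Set<String> names = new HashSet<>();

    public NameFinder(Iterable<String> employeeNames) {
        for (String n : employeeNames) names.add(n.toLowerCase());
    }

    /** Returns each token of the article that matches a known name. */
    public List<String> findNames(String article) {
        List<String> found = new ArrayList<>();
        for (String token : article.split("\\W+")) { // split on non-word chars
            if (names.contains(token.toLowerCase())) found.add(token);
        }
        return found;
    }
}
```

Wrapping each hit in a mailto link is then a straightforward string substitution on the matched tokens.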
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way it works is that you first compile the entire list of names into a big finite state machine (which may take a while), but once that machine is built, you can run as many documents as you want through it and detect names pretty efficiently.
It looks at every character of each document only once, so it should be much more efficient than tokenizing the document and comparing each word against a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
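For reference, a compact Aho-Corasick sketch in Java — a textbook construction, not a tuned library implementation; build once, then scan each document in a single pass:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Trie of all patterns plus BFS-computed failure links; scanning then
// advances one state per character and reports every pattern ending there.
public class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> matches = new ArrayList<>();
    }

    private final Node root = new Node();

    public AhoCorasick(Collection<String> patterns) {
        for (String p : patterns) {                    // 1. build the trie
            Node cur = root;
            for (char c : p.toCharArray())
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.matches.add(p);
        }
        Deque<Node> queue = new ArrayDeque<>();        // 2. BFS failure links
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.matches.addAll(child.fail.matches); // inherit suffix matches
                queue.add(child);
            }
        }
    }

    /** Every pattern occurrence in the text, in order of ending position. */
    public List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            found.addAll(cur.matches);
        }
        return found;
    }
}
```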
I'm working on a Java project for class that stores workout information in a flat file. Each file will have the information for one exercise (BenchPress.data) that holds the time (milliseconds since epoch), weight and repetitions.
Example:
1258355921365:245:12
1258355921365:245:10
1258355921365:245:8
What's the most efficient way to store and retrieve this data? It will be graphed and searched through to limit exercises to specific dates or date ranges.
One idea I had was to write the most recent information at the top of the file instead of appending to the end. That way, when I start reading from the top, I'll get the most recent information first, which should match most searches (an assumption).
There's no guarantee on the order of the dates, though. A user could enter today's exercises and then go back and enter last week's, for whatever reason. Should I take the hit at save time and sort all of the information by date?
Or should I go in a completely different direction? I know a database would be ideal, but this is a group project, and managing a database installation and data sync among us all would not be ideal. The others have no experience with databases, and it would make grading difficult.
So thanks for any advice or suggestions.
-John
Don't overcomplicate things. Unless you are dealing with millions of records, you can just read the whole thing into memory and sort it any way you like. And always append records at the end; that way you are less likely to damage your file.
For simple projects, using an embedded database like JavaDB / Apache Derby may be a good idea. Configuration is absolutely minimal, and in your case you'd need at most two tables (User and Workout). Exporting data to a file for sync between team members is also fairly simple.
As yu_sha pointed out, though, unless you expect a large dataset (for a PC, say > 50,000 records), you can just use the file and read everything into memory.
Read in each line via BufferedReader and parse it with StringTokenizer. Looking at the data, I'd likely store the fields of each record in a List that can be iterated and sorted according to your preference.
If you must store the file in this format, you're likely best off just reading the entire thing into memory at startup and storing it in a TreeMap or some other sorted, searchable map. Then you can use TreeMap's convenience methods such as ceilingKey or the similar floorKey to find matches near certain dates/times.
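That approach might look roughly like this, keyed by the epoch-millisecond timestamp so subMap handles date ranges regardless of the order lines appear in the file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

// Parse each "time:weight:reps" line into a TreeMap keyed by timestamp.
// A List per key allows several sets logged at the same millisecond, and
// the map stays sorted no matter what order the file was written in.
public class WorkoutLog {
    public static class SetEntry {
        public final int weight, reps;
        SetEntry(int weight, int reps) { this.weight = weight; this.reps = reps; }
    }

    private final TreeMap<Long, List<SetEntry>> byTime = new TreeMap<>();

    public void load(Path file) throws IOException {
        for (String line : Files.readAllLines(file)) {
            String[] f = line.split(":");
            byTime.computeIfAbsent(Long.parseLong(f[0]), k -> new ArrayList<>())
                  .add(new SetEntry(Integer.parseInt(f[1]), Integer.parseInt(f[2])));
        }
    }

    /** All sets whose timestamps fall in [from, to), grouped by timestamp. */
    public Collection<List<SetEntry>> between(long from, long to) {
        return byTime.subMap(from, to).values();
    }
}
```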
Use flatworm, a Java library for parsing and creating flat files. Describe the format with a simple XML definition file, and there you go.