I have read a related question here: link text
It was suggested there to work with a giant file and then use RandomAccessFile.
My problem is that a matrix (consisting of "0"s and "1"s, not sparse) could be really huge. For example, a row could have 10^10000 entries. I need an efficient way to store such a matrix. I also need to work with such a file (if I store my matrix in it) in the following way:
Say I have a giant file which contains sequences of numbers. Numbers in a sequence are separated by ",": the first number gives the row number, and the remaining numbers give the positions in that row where the "1"s are. Sequences are separated by the symbol "|". In addition, the symbol "||" divides all of the sequences into two groups. (That is a representation of two matrices. Maybe it is not efficient, but I don't know a better way. Do you have any ideas? =) ) I have to read, for example, 100 numbers from each row of the first group (extract a submatrix) and use them to determine which rows I need to read from the second group.
So I need the seek() function. Would it work with such a giant file?
I am a newbie. Maybe there are some efficient ways to store and read such data?
There are about 10^80 atoms in the observable universe. So even if you could store one bit in each atom, you would need about 10^9920 universes the same size as ours. That's just to store one row.
How many rows were you considering? You will need 10^9920 universes per row.
Hopefully you mean 10,000 entries and not 10^10000. Then you could use the BitSet class to store it all in RAM (or you could use something like Hadoop).
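At 10,000 entries per row, a BitSet fits comfortably in memory. A minimal sketch (class and method names are mine) of storing a row by the positions of its "1"s:

```java
import java.util.BitSet;

public class RowBits {
    // Build a 0/1 row of n entries; only the positions of the "1"s are given.
    public static BitSet rowFromOnes(int n, int... onePositions) {
        BitSet row = new BitSet(n);
        for (int p : onePositions) {
            row.set(p); // mark a "1" at position p
        }
        return row;
    }
}
```

A BitSet packs 64 entries per long, so a 10,000-entry row costs roughly 1.25 KB.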
The lines are as follows
A1;B1;C1
A2;B2;C2
How do I find the set of unique strings and break it into non-intersecting groups by the following criterion: if two lines have matching non-empty values in one or more columns, they belong to the same group? For example, the lines
1,2,3
4,5,6
1,5,7
belong to one group.
Initially I thought of using three HashSets (one per column) to quickly check whether a value is already in the list of unique values, then adding the line either to the list of already-grouped rows or to the list of unique rows. But the algorithm has a performance bottleneck in this case: when you want to merge groups, you must go through every group in the list. On a large amount of data (> 1 million records) with a large number of merges, it works slowly; if there are few merges (about thousands), it works quickly. I am stuck at this point and do not know how to optimize this bottleneck, or whether I should use other data structures and algorithms. Can someone tell me which direction to dig in? I will be grateful for any thoughts on this matter.
I'd suggest the following approach:
Create a Set<String> ungroupedLines, initially containing all the lines. You'll remove the lines as you assign them to groups.
Build three Map<String, Collection<String>> maps, as you've suggested, one per column.
Initialize an empty Collection<Collection<String>> result.
While ungroupedLines is not empty:
Create a new Collection<String> group.
Remove an element, add it to group.
Perform "depth-first search" from that element, using your three maps.
Ignore (skip) any elements that have already been removed from your ungroupedLines.
For the rest, remove them from ungroupedLines and add them to group before recursing on them.
Alternatively, you can use breadth-first search.
Add group to result.
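The steps above could be sketched like this (class and method names are mine; I split columns on "," or ";" to cover both sample formats, and use an iterative stack instead of recursion):

```java
import java.util.*;

public class LineGrouper {
    // Groups lines so that lines sharing a non-empty value in the same
    // column end up in the same group (transitively).
    public static List<List<String>> group(List<String> lines) {
        // One map per column: value -> lines containing it in that column.
        List<Map<String, List<String>>> index = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("[,;]", -1);
            while (index.size() < cols.length) index.add(new HashMap<>());
            for (int c = 0; c < cols.length; c++) {
                if (!cols[c].isEmpty()) {
                    index.get(c).computeIfAbsent(cols[c], k -> new ArrayList<>()).add(line);
                }
            }
        }
        Set<String> ungrouped = new LinkedHashSet<>(lines);
        List<List<String>> result = new ArrayList<>();
        while (!ungrouped.isEmpty()) {
            String seed = ungrouped.iterator().next();
            ungrouped.remove(seed);
            List<String> group = new ArrayList<>();
            group.add(seed);
            Deque<String> stack = new ArrayDeque<>();
            stack.push(seed);
            while (!stack.isEmpty()) { // depth-first search over shared values
                String[] cols = stack.pop().split("[,;]", -1);
                for (int c = 0; c < cols.length; c++) {
                    if (cols[c].isEmpty()) continue;
                    for (String next : index.get(c).getOrDefault(cols[c], List.of())) {
                        if (ungrouped.remove(next)) { // skip already-grouped lines
                            group.add(next);
                            stack.push(next);
                        }
                    }
                }
            }
            result.add(group);
        }
        return result;
    }
}
```

Each line is visited once and each (column, value) pair is scanned a bounded number of times, so this avoids the repeated group-merging pass that was the bottleneck.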
Imagine you have a huge cache of data that is to be searched in 4 ways:
exact match
prefix%
%suffix
%infix%
I'm using a Trie for the first 3 types of search, but I can't figure out how to approach the fourth one, other than sequentially processing a huge array of elements.
If your dataset is huge, consider using a search platform like Apache Solr so that you don't end up in a performance mess.
You can construct a navigable map or set (e.g. TreeMap or TreeSet) for case 2 (with keys in normal order) and case 3 (with keys reversed).
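For cases 2 and 3, a prefix search on a NavigableSet is a subSet from the prefix up to the prefix followed by the largest character; suffix search is the same trick on reversed keys. A sketch (class name is mine):

```java
import java.util.*;

public class AffixIndex {
    private final NavigableSet<String> forward = new TreeSet<>();  // for prefix%
    private final NavigableSet<String> backward = new TreeSet<>(); // reversed keys, for %suffix

    public void add(String word) {
        forward.add(word);
        backward.add(new StringBuilder(word).reverse().toString());
    }

    // All words starting with the given prefix, in O(log n) + output size.
    public SortedSet<String> byPrefix(String prefix) {
        return forward.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Count of words ending with the given suffix (a reversed-prefix lookup).
    public int countBySuffix(String suffix) {
        String r = new StringBuilder(suffix).reverse().toString();
        return backward.subSet(r, r + Character.MAX_VALUE).size();
    }
}
```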
For option 4 you can construct a collection keyed by every starting letter. You can simplify this depending on your requirements. This can lead to more space being used, but gives O(log n) lookup times.
For #4, if you pre-compute the number of occurrences of each character, you can then look up in that table the entries that have at least as many occurrences of each character as the search string.
How efficient this algorithm is will probably depend on the nature of the data and the search string. It might be useful to give some examples of both here to get better answers.
I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that would process my file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from consideration, and then produce output grouping those duplicate records. I would add an extra value at the end with a group number and write the output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function. For example, reading all lines and storing a hash value with each line number, the hash calculated from the set of columns I provide as input.
Any ideas?
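The hashing idea could be sketched like this (all names are mine). For brevity this version keeps the full keys in memory; a truly streaming variant would store only hash codes and resolve collisions on a second pass over the file.

```java
import java.util.*;

public class DuplicateKeys {
    // Key for one CSV line: the non-excluded columns joined with a separator
    // that cannot appear in the data. Lines with equal keys are duplicates.
    public static String key(String line, Set<Integer> excludedColumns) {
        String[] cols = line.split(",", -1); // -1 keeps trailing empty (missing) values
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (!excludedColumns.contains(i)) sb.append(cols[i]).append('\u0001');
        }
        return sb.toString();
    }

    // One pass over the lines: map each key to the line numbers holding it.
    public static Map<String, List<Integer>> groupLines(List<String> lines,
                                                        Set<Integer> excludedColumns) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int n = 0; n < lines.size(); n++) {
            groups.computeIfAbsent(key(lines.get(n), excludedColumns),
                                   k -> new ArrayList<>()).add(n);
        }
        return groups;
    }
}
```

Assigning group numbers is then just enumerating the map entries in order.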
OK, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan, "Finding duplicates in a data stream".
All,
I am wondering what's the most efficient way to check whether a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair (as well as other fields, which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foos are not guaranteed to be in the same order in each Set (so they would have to be pre-sorted into a consistent order somehow, e.g. with a TreeSet).
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks
If you use TreeSets for your Sets can't you just do list.contains(set) since a TreeSet will handle the equals check?
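In fact, Set.equals() is defined to ignore element order, so the ordering concern goes away as long as Foo implements equals() and hashCode(). A quick sketch (Map.Entry stands in for Foo's key/value pair here; names are mine), which also wraps the list in a HashSet for O(1) repeated lookups:

```java
import java.util.*;

public class RowLookup {
    // Map.Entry<String,Integer> stands in for the key/value part of Foo.
    // Set.equals() compares elements regardless of order, so contains()
    // matches rows however the Foos happen to be ordered.
    public static boolean containsRow(List<Set<Map.Entry<String, Integer>>> rows,
                                      Set<Map.Entry<String, Integer>> candidate) {
        // For many lookups, build this HashSet once and reuse it.
        return new HashSet<>(rows).contains(candidate);
    }
}
```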
Also, consider using Guava's Multiset class.
I would recommend you use a less unusual data structure. As for finding things: generally hashes, sorting plus binary search, or trees are the way to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: if this is not a purely academic question, loop through the lists and do the comparison. Most likely, that is acceptably fast. Even 100'000 entries will take a fraction of a second, and therefore won't matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.
I'm a 2nd-year MCS student. I'm doing a project in Java in which I have different images. To store the description of, say, IMAGE-1, I have an ArrayList named IMAGE-1; similarly for IMAGE-2 an ArrayList IMAGE-2, and so on.
Now I need to develop a search engine in which I need to find all images whose descriptions match a word entered in the search engine.
For example, if I enter "computer", then I should be able to find all images whose descriptions contain "computer".
So my questions are:
How should I do this efficiently?
How should I maintain all those ArrayLists, since I can have hundreds of them? Or should I use another data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for (String token : tokenize(description)) map.computeIfAbsent(token, k -> new ArrayList<>()).add(item);
(A collection is needed because multiple items can be found for a token; computeIfAbsent creates the collection the first time a token is seen. The idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The general-purpose HashMap implementation is not the most efficient in this case. When you start running into memory problems, you can look into a tree implementation that is more space-efficient (like a radix tree).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
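Putting the token-index idea above into runnable form (class and method names are mine; the tokenizer lower-cases and splits on non-word characters, which is one simple choice among many):

```java
import java.util.*;

public class ImageIndex {
    // token -> names of images whose description contains that token
    private final Map<String, Collection<String>> index = new HashMap<>();

    // Tokenize a description into lower-case words.
    private static String[] tokenize(String description) {
        return description.toLowerCase().split("\\W+");
    }

    public void add(String imageName, String description) {
        for (String token : tokenize(description)) {
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(imageName);
        }
    }

    public Collection<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(), List.of());
    }
}
```

This replaces the per-image ArrayLists with one map, so adding the hundredth image needs no new top-level structure.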
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also match partial words (using "comp" to find "Computer" or "compare").
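That linear scan is only a few lines (names are mine; it returns the indices of matching descriptions):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSearch {
    // One array entry per complete image description; case-folded substring match.
    public static List<Integer> find(String[] descriptions, String query) {
        List<Integer> hits = new ArrayList<>();
        String q = query.toLowerCase();
        for (int i = 0; i < descriptions.length; i++) {
            if (descriptions[i].toLowerCase().indexOf(q) >= 0) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```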
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest you use the Hashtable class or organize your content into a tree to optimize searching.