I have the data: number[M][N]. It is read in through a stream, so I can put it into whatever data structure I want.
I have to search through it many times using different pairs of short values, i.e. I need to find the row numbers by the values in two columns.
I could create an additional array and binary-search it to find positions in the input data, something like an index in a database, but are there standard libraries for a task like this?
You can put it into more than one data structure if the searching warrants it. You could have the data in a HashMap, a TreeMap, and another Map that has the key-value mapping reversed (if that makes sense in your case).
What's the data like, and how do you need to search it?
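If a plain lookup by the two values is what you need, here is a minimal sketch of that index idea in Java, assuming the two searched columns are the first two and the values fit in a short (the class name and column positions are placeholders):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RowIndex {
        // Records generate equals/hashCode for us (Java 16+); the pair of
        // column values acts as the composite lookup key.
        record Key(short a, short b) {}

        // Build the index once while reading the stream; afterwards each
        // (value1, value2) lookup is O(1) on average.
        public static Map<Key, List<Integer>> buildIndex(short[][] data) {
            Map<Key, List<Integer>> index = new HashMap<>();
            for (int row = 0; row < data.length; row++) {
                // Columns 0 and 1 are placeholders for the two searched columns.
                Key key = new Key(data[row][0], data[row][1]);
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
            return index;
        }
    }

Looking up index.get(new Key(v1, v2)) then returns the matching row numbers directly, with no binary search needed.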
The repartitionAndSortWithinPartitions method works great.
But I don't really want to re-partition. I am happy with the way data is partitioned naturally.
I do want to sort the content of each partition.
I am not interested in a total sort.
Essentially, I want to avoid reshuffling the data; I just need each partition's content sorted.
This sorts the data within each partition, without shuffling across partitions:
df.sortWithinPartitions('<sort_column>').show()
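If you're on the Java Dataset API, a hedged sketch of the same call looks like this (the sample data and the column name "value" are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SortWithinPartitionsDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sortWithinPartitions demo")
                    .master("local[2]")
                    .getOrCreate();

            // Placeholder data: 100 rows spread over 4 partitions.
            Dataset<Row> df = spark.range(0, 100).toDF("value").repartition(4);

            // Sorts each partition independently; the existing partitioning
            // is preserved and no shuffle is triggered.
            df.sortWithinPartitions("value").show();

            spark.stop();
        }
    }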
I need to read a large text file, parse each line, and store the parsed content in a map from String keys to objects I will create. For large maps this consumes memory quickly. I need to sort the map before it is output to a file; otherwise I wouldn't need to keep all the key-value pairs in memory.
I searched, and some suggested map-reduce and some suggested a database; in particular, Berkeley DB was said to be a good choice. Is it straightforward to sort large numbers of key-value pairs in Berkeley DB from Java, and is it convenient to use?
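For reference, Berkeley DB Java Edition keeps keys in sorted byte order on disk, so writing entries and then iterating with a cursor yields them in key order without holding the whole map in memory. A rough sketch, assuming com.sleepycat.je is on the classpath (paths, names, and the String values standing in for your objects are placeholders):

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import com.sleepycat.je.Cursor;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class SortedStoreSketch {
        public static void main(String[] args) {
            File dir = new File("bdb-env");
            dir.mkdirs(); // JE requires the environment directory to exist

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(dir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "parsedLines", dbConfig);

            // Insert parsed lines; the key bytes determine the sort order.
            db.put(null,
                   new DatabaseEntry("someKey".getBytes(StandardCharsets.UTF_8)),
                   new DatabaseEntry("someValue".getBytes(StandardCharsets.UTF_8)));

            // Iterating with a cursor returns entries in key order.
            Cursor cursor = db.openCursor(null, null);
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(key.getData(), StandardCharsets.UTF_8));
            }
            cursor.close();

            db.close();
            env.close();
        }
    }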
I've found questions similar to this one, but none that have helped with my problem. I have an ArrayList<ArrayList<String>>. This basically creates a table of user input: you can add columns, and each column can hold a different number of items. I need to cycle through all the combinations that can be created without comparing objects in the same column. Ideally I could send it through a nested for loop and access each element with an if statement to separate things as needed, but since the size is dynamic I haven't found a way to do this that doesn't also compare within the same column. Thank you in advance for your help.
If I'm understanding your problem correctly, you have a List of Lists, where the outer List acts as a kind of key and each slot holds a List of the data you need. I ran into a very similar problem and was able to use a Map to hold the values. If order matters, you'll want a TreeMap.
I mention Maps because you want to manipulate what sounds like the rows of a table, rather than the columns. With a TreeMap the keys stay in order, the value for each key is like a row in the table, and the index into each inner List is the column.
Without a concrete example of your data I can't really go into how to compare the "combinations", but I assume they can be handled by the Lists in the values of the Map.
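For what it's worth, here is a minimal sketch of one way to walk every cross-column combination of a List of Lists of arbitrary size, so that no two items from the same column are ever paired (the class and method names are hypothetical):

    import java.util.ArrayList;
    import java.util.List;

    public class Combinations {
        // Visit every combination that takes exactly one item from each column.
        static void visit(List<List<String>> columns, int col, List<String> current) {
            if (col == columns.size()) {
                System.out.println(current); // one complete combination
                return;
            }
            for (String item : columns.get(col)) {
                current.add(item);
                visit(columns, col + 1, current);
                current.remove(current.size() - 1); // backtrack
            }
        }

        public static void main(String[] args) {
            List<List<String>> table = new ArrayList<>();
            table.add(List.of("a1", "a2"));       // column 0
            table.add(List.of("b1", "b2", "b3")); // column 1
            visit(table, 0, new ArrayList<>());
        }
    }

The recursion replaces the fixed nesting depth of hand-written for loops, which is what makes it work for a dynamic number of columns.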
I have the following issue.
I'm connecting to some source using an API and getting the data as an InputStream.
The goal is to save the data after removing duplicate lines.
Duplication is defined by columns 10, 15, and 22.
I'm getting the data using several threads.
Currently I first save the data into a CSV file and then remove duplicates.
I want to do it while I'm reading the data.
The volume of the data is about 10 million records.
I have limited memory that I can use.
The machine has 32 GB of memory, but I am limited since there are other applications using it.
I read here about using hash maps, but I'm not sure I have enough memory to use one.
Does anyone have a suggestion for how to solve this issue?
A HashMap will use at least as much memory as your raw data, so it is probably not feasible for a data set this size (you should check, though, because if it does fit, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is really identical, since hashes of different values can collide.
The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.
If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
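A single-threaded, file-based sketch of that idea, assuming CSV input, a dedup key built from columns 10, 15, and 22, and the byte offset in the output file as the reference (file names are placeholders, and String.hashCode stands in for a stronger hash):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class HashOffsetDedup {
        public static void main(String[] args) throws IOException {
            // Maps the key hash to the byte offset of the first line with
            // that hash in the output file.
            Map<Long, Long> hashToOffset = new HashMap<>();

            try (BufferedReader in = new BufferedReader(new FileReader("input.csv"));
                 RandomAccessFile out = new RandomAccessFile("deduped.csv", "rw")) {
                String line;
                while ((line = in.readLine()) != null) {
                    String key = keyOf(line);
                    long hash = key.hashCode() & 0xffffffffL;

                    Long offset = hashToOffset.get(hash);
                    if (offset != null && key.equals(keyAt(out, offset))) {
                        continue; // verified duplicate, skip it
                    }
                    // Note: on a hash collision with a *different* key, the new
                    // key is not indexed here; a production version would keep
                    // a small list of offsets per hash.
                    hashToOffset.putIfAbsent(hash, out.length());
                    out.seek(out.length());
                    out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
        }

        // Builds the dedup key from columns 10, 15, and 22 (1-based).
        private static String keyOf(String csvLine) {
            String[] cols = csvLine.split(",");
            return cols[9] + "|" + cols[14] + "|" + cols[21];
        }

        // Re-reads the stored line at the given offset to confirm that a hash
        // match is a real duplicate rather than a collision.
        private static String keyAt(RandomAccessFile f, long offset) throws IOException {
            f.seek(offset);
            return keyOf(f.readLine());
        }
    }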
The solution depends on how big your data in columns 10, 15, and 22 is.
Assuming it's not too big (say, around 1 KB per record), you can implement an in-memory solution:
Implement a Key class that stores the values from columns 10, 15, and 22, and carefully implement its equals and hashCode methods. (You could also use a normal ArrayList instead, since its equals/hashCode are element-based.)
Create a Set that will contain the keys of all records you have read.
For each record you read, check whether its key is already in that set. If it is, skip the record; if not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe manner.
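A sketch of those three steps, with hypothetical field names and ConcurrentHashMap.newKeySet() as the thread-safe set:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class DedupKey {
        private final String col10, col15, col22;

        public DedupKey(String col10, String col15, String col22) {
            this.col10 = col10;
            this.col15 = col15;
            this.col22 = col22;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof DedupKey)) return false;
            DedupKey k = (DedupKey) o;
            return col10.equals(k.col10) && col15.equals(k.col15) && col22.equals(k.col22);
        }

        @Override
        public int hashCode() {
            int h = col10.hashCode();
            h = 31 * h + col15.hashCode();
            h = 31 * h + col22.hashCode();
            return h;
        }

        // One set shared by all reader threads.
        static final Set<DedupKey> SEEN = ConcurrentHashMap.newKeySet();

        // Set.add() returns false if the key was already present, so the
        // membership check and the insert are a single atomic step.
        static boolean isNewRecord(DedupKey key) {
            return SEEN.add(key);
        }
    }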
In the worst case you'll need on the order of (number of records × key size) memory. For 10,000,000 records and the assumed <1 KB per key, this should work with around 10 GB.
If the key size is still too large, you'll probably need a database to store the set of keys.
Another option is storing hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions, which lead to "false positives": records flagged as duplicates that aren't actually duplicates. To completely rule this out you'd need a database, or to re-check the full values on a hash match, as described in the previous answer.
You can use a concurrent set such as ConcurrentHashMap.newKeySet() (the JDK does not actually ship a ConcurrentHashSet class). A Set rejects duplicate elements automatically, and this one is thread-safe.
Problem Description
I'm writing an Android application that works with a fairly large amount of data: I have a database (15 MB) and my application shows data from it. I have queries that return the data already sorted, for example alphabetically or by some parameters I provide.
Question
Since I store the data in an array and then show it to the user, I want to know which way is faster: sorting the data while making the query, or just putting the data into the array and then sorting it?
I also faced this situation in my application, and I resolved the performance problem in the following way.
First, I created an index on my table based on the primary key.
Then I used ORDER BY to sort the elements.
To search locally, I kept the whole content in one object and performed searches on that object.
In my case, these steps together improved performance by roughly 200%.
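A rough sketch of that indexed ORDER BY approach, assuming a table named items with a name column (all names are placeholders):

    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;
    import java.util.ArrayList;
    import java.util.List;

    public final class SortedQueryHelper {
        public static List<String> loadSortedNames(SQLiteDatabase db) {
            // One-time setup; IF NOT EXISTS makes it safe to re-run.
            db.execSQL("CREATE INDEX IF NOT EXISTS idx_items_name ON items(name)");

            // Let SQLite return the rows already sorted instead of sorting
            // the array afterwards in Java.
            List<String> names = new ArrayList<>();
            Cursor c = db.rawQuery("SELECT name FROM items ORDER BY name", null);
            try {
                while (c.moveToNext()) {
                    names.add(c.getString(0));
                }
            } finally {
                c.close();
            }
            return names;
        }
    }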