How to keep keys in Aerospike effectively? - java

For a not very big amount of data we store all keys in one bin as a List.
But there are limitations on the size of a bin.
The scanAll function with ScanCallback in the Java client actually works very slowly, so we cannot afford it in our project. Aerospike works fast when you give it the Key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them fast without scanAll?

Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error; it just comes back as a null entry at that position in the results.
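A minimal sketch of that batch-read pattern with the Aerospike Java client, assuming a known integer key space of 0..999 and placeholder namespace/set names (test / myset):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class BatchReadExample {
    public static void main(String[] args) {
        // Placeholder host and port for illustration.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Build the batch of keys you expect to exist (here: ids 0..999).
        Key[] keys = new Key[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new Key("test", "myset", i);
        }

        // One batch call; keys that don't exist simply come back as null entries.
        Record[] records = client.get(null, keys);
        for (int i = 0; i < records.length; i++) {
            if (records[i] != null) {
                System.out.println(keys[i].userKey + " -> " + records[i].bins);
            }
        }
        client.close();
    }
}

Larger key ranges can be split into several such batches and fetched from multiple threads in parallel.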
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
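And a minimal sketch of the secondary-index route, assuming a string index has already been created on a setname bin; the bin, set and namespace names here are placeholders, and older client versions use setFilters instead of setFilter:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class SetNameQueryExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Assumes a string secondary index already exists on the "setname" bin,
        // e.g. created with client.createIndex(..., IndexType.STRING).
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("myset");
        stmt.setFilter(Filter.equal("setname", "XYZ")); // setFilters(...) on older clients

        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getKey() + " -> " + rs.getRecord().bins);
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}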

Related

How can I run an outer join on two large PostgreSQL tables in batches?

I have two tables with millions of rows. They share a common email address. They don't share any other fields.
I have a join operation that works fine.
select r.*,l.* from righttable r full outer join lefttable l on r.email=l.email
However, the result set contains millions of rows, which overwhelms my server's memory. How can I run consecutive queries that only pull a limited number of rows from each table at a time and ultimately visit all of the rows in the two tables?
Furthermore, after receiving a result set, our server may make some inserts into one or both of the tables. I'm afraid this may complicate keeping track of the offset in each consecutive query. Maybe it's not a problem; I can't wrap my head around it.
I don't think you can do this in batches, because it won't know what rows to fabricate to fulfill the "FULL OUTER" without seeing all of the data. You might be able to get around that if you know that no one is making changes to the tables while you work, by selecting the left-only tuples, right-only tuples, and inner tuples in separate queries.
But, it should not consume all your memory (assuming you mean RAM, not disk space) on the server, because it should use temp files instead of RAM for the bulk of the storage needed (though there are some problems with memory usage for huge hash joins, so you might try set enable_hashjoin=off).
The client might use too much memory, as it might try to read the entire result set into the client's RAM at once. There are ways around this, but they probably do not involve manipulating the JOIN itself. You can use a cursor to read in batches from a single result stream, or you could just spool the results out to disk using \copy and then use something like GNU split on it.
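For the cursor route from Java, a minimal sketch with the PostgreSQL JDBC driver (connection details are placeholders); the driver only streams rows when autocommit is off and a fetch size is set, otherwise it buffers the whole result set in client memory:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamJoinExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            conn.setAutoCommit(false); // required for cursor-based streaming
            try (PreparedStatement ps = conn.prepareStatement(
                    "select r.*, l.* from righttable r full outer join lefttable l on r.email = l.email")) {
                ps.setFetchSize(10_000); // rows pulled from the server per round trip
                try (ResultSet rs = ps.executeQuery()) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++; // process one row at a time; only ~10k rows sit in client memory
                    }
                    System.out.println("processed " + rows + " rows");
                }
            }
        }
    }
}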

Removing duplicates in Java on large-scale data

I have the following issue.
I'm connecting to some place using an API and getting the data as an InputStream.
The goal is to save the data after removing duplicate lines.
Duplication is defined by columns 10, 15 and 22.
I'm getting the data using several threads.
Currently I first save the data into a CSV file and then remove duplicates.
I want to do it while I'm reading the data.
The volume of the data is about 10 million records.
I have limited memory that I can use.
The machine has 32 GB of memory, but I am limited since there are other applications using it.
I read here about using hash maps.
But I'm not sure I have enough memory to use one.
Does anyone have a suggestion how to solve this issue?
A HashMap will use up at least as much memory as your raw data. Therefore, it is probably not feasible for the size of your data set (however, you should check that, because if it is, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte offset of where the original value is in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is identical (as hashes for different values may collide).
The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.
If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
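A minimal sketch of the "hashes in memory, originals on disk" idea, assuming CSV-like records whose three key columns together stay well under the 64 KB limit of writeUTF; the column indices and spill-file layout are assumptions for illustration:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashDeduper {
    private final RandomAccessFile store;                        // spill file holding the raw key strings
    private final Map<Integer, List<Long>> hashToOffsets = new HashMap<>();

    public HashDeduper(String spillPath) throws IOException {
        this.store = new RandomAccessFile(spillPath, "rw");
    }

    // Returns true if this combination of columns 10, 15 and 22 has been seen before.
    public synchronized boolean isDuplicate(String[] columns) throws IOException {
        String key = columns[10] + '\u0001' + columns[15] + '\u0001' + columns[22];
        int hash = key.hashCode();

        List<Long> offsets = hashToOffsets.get(hash);
        if (offsets != null) {
            // Hash match: re-read the stored originals to rule out a collision.
            for (long offset : offsets) {
                store.seek(offset);
                if (key.equals(store.readUTF())) {
                    return true;
                }
            }
        } else {
            offsets = new ArrayList<>(1);
            hashToOffsets.put(hash, offsets);
        }

        // First time we see this key: append it to the spill file and remember its offset.
        long offset = store.length();
        store.seek(offset);
        store.writeUTF(key);
        offsets.add(offset);
        return false;
    }
}

Only an int hash and a long file offset are held in memory per record, so the heap footprint is bounded by roughly 10 million map entries (on the order of a gigabyte with Java object overhead), independent of how large the key columns themselves are.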
The solution depends on how big the data in columns 10, 15 and 22 is.
Assuming it's not too big (say, about 1 KB), you can actually implement an in-memory solution.
Implement a Key class to store the values from columns 10, 15 and 22, and carefully implement its equals and hashCode methods. (You could also use a plain List of the three values instead, since lists define element-based equals and hashCode.)
Create a Set which will contain the keys of all the records you have read.
For each record you read, check if its key is already in that set. If yes, skip the record. If not, write the record to the output and add the key to the set. Make sure you work with the set in a thread-safe manner, as sketched below.
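A minimal sketch of the Key class and the shared set described above (the column indices are the ones from the question; the surrounding reading/writing code is omitted):

import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupFilter {

    static final class Key {
        final String c10, c15, c22;

        Key(String[] columns) {
            this.c10 = columns[10];
            this.c15 = columns[15];
            this.c22 = columns[22];
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return c10.equals(k.c10) && c15.equals(k.c15) && c22.equals(k.c22);
        }

        @Override
        public int hashCode() {
            return Objects.hash(c10, c15, c22);
        }
    }

    // Thread-safe set backed by ConcurrentHashMap, so all reader threads can share it.
    private final Set<Key> seen = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per distinct key, even when called from several threads.
    public boolean firstTimeSeen(String[] columns) {
        return seen.add(new Key(columns));
    }
}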
In the worst case you'll need number of records × size of key of memory. For 10,000,000 records and the assumed <1 KB per key this should stay within about 10 GB.
If the key size is still too large, you'll probably need a database to store the set of keys.
Another option would be storing hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions. These can lead to "false positives", i.e. records flagged as duplicates that aren't actually duplicates. To completely avoid this you'll need a database.
You can use a concurrent hash set (in the JDK, the one returned by ConcurrentHashMap.newKeySet()). It will automatically reject duplicate elements and it is thread safe.

Is ConcurrentHashMap good for rapidly updating concurrent data structures?

I'm designing an application that maps an IP address to certain info about that IP address. Currently, I have the information stored in a ConcurrentHashMap. The list of keys could change frequently, so I grab the latest copy of the list and update it once every minute.
However, I could possibly be querying this data structure a few thousand times a minute. Does it make sense to use a ConcurrentHashMap? Would there be a significant delay (larger than 1ms) when the list is being updated? There could be up to 1000 items in the list.
Thanks for your help!
If you look at the ConcurrentHashMap documentation, you will see that retrieval operations (such as get) generally do not block, so they behave much like in a normal HashMap, while updates such as put and remove only lock a small portion of the map. You said that the list of keys could change frequently, so yes, I recommend you use ConcurrentHashMap.
Here is the documentation: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html
With so few items, it's very unlikely that there will be any delay. As a rough baseline, with a generous margin for error, a hash map can handle on the order of 10 million operations per second.
Given that you are only querying a few thousand times a minute and a hash map lookup is O(1), there shouldn't be any problem.
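A minimal sketch of that usage pattern; IpInfo and fetchLatestSnapshot() are hypothetical placeholders for your own value type and refresh logic:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IpInfoCache {
    private final Map<String, IpInfo> info = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Refresh once a minute: add/replace current entries, then drop the ones that disappeared.
        refresher.scheduleAtFixedRate(() -> {
            Map<String, IpInfo> latest = fetchLatestSnapshot();
            info.putAll(latest);
            info.keySet().retainAll(latest.keySet());
        }, 0, 1, TimeUnit.MINUTES);
    }

    // Called a few thousand times a minute from request threads; never blocks on the refresh.
    public IpInfo lookup(String ip) {
        return info.get(ip);
    }

    private Map<String, IpInfo> fetchLatestSnapshot() {
        return Map.of(); // placeholder: load the current IP -> info mapping from wherever it lives
    }

    static final class IpInfo { /* fields describing the IP address */ }
}

Readers may briefly see a mix of old and new entries during a refresh, but they never block, which matches the "updated once a minute, queried thousands of times a minute" workload.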

Distributed multimap based on HBase and Hadoop MapReduce

I'm sorry that I haven't deeply understood HBase and Hadoop MapReduce yet, but I think you can help me find a way to use them, or maybe you could propose frameworks I need.
Part I
There is a 1st stream of records that I have to store somewhere. They should be accessible by keys derived from them. Several records could have the same key. There are quite a lot of them. I have to delete old records by timeout.
There is also a 2nd stream of records, which is very intensive too. For each record (argument-record) I need to: get all records from the 1st stream with that argument-record's key, find the first corresponding record, delete it from the 1st stream's storage, and return the result (res1) of merging these two records.
Part II
The 3rd stream of records is like the 1st. Records should be accessible by keys (different from those of Part I). Several records will, as usual, have the same key. There are not as many of them as in the 1st stream. I have to delete old records by timeout.
For each res1 (argument-record) I have to: get all records from the 3rd stream with that record's other key, map these records with res1 as a parameter, and reduce them into a result. The 3rd stream's records should stay unmodified in storage.
Records with the same key should preferably be stored on the same node, and the procedures that get records by key and act on a given argument-record should preferably run on the node where those records are.
Are HBase and Hadoop MapReduce applicable in my case? And what should such an app look like (the basic idea)? If the answer is no, are there frameworks to build such an app?
Please ask questions if you can't tell what I want.
I am referring to the storage backend technologies; the front end accepting records can be stateless and therefore trivially scalable.
We have streams of records and we want to join them on the fly. Some of the records should be persisted, while some (as far as I understood, the 1st stream) are transient.
If we take scalability and persistence out of the equation, this could be implemented in a single Java process using a HashMap for randomly accessible data and a TreeMap for data we want to keep sorted.
Now let's see how it can be mapped onto NoSQL technologies to gain the scalability and performance we need.
HBase is a distributed sorted map, so it can be a good candidate for stream 2. If we use our key as the HBase row key, we gain data locality for the records with the same key.
MapReduce on top of HBase is also available.
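A minimal sketch of a row-key layout that gives you that locality; the table name, column family and the "key#timestamp" composite row key are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("stream_records"))) {

            // Several records can share the same logical key: append a timestamp
            // (or sequence number) so each record gets its own row, while rows with
            // the same key prefix stay sorted together and land in the same region.
            byte[] rowKey = Bytes.toBytes("some-key#" + System.currentTimeMillis());
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("record body"));
            table.put(put);

            // All records for one key are then a single prefix scan.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("some-key#"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}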
Stream 1 looks like transient, randomly accessed data. I think it does not make sense to pay the price of persistence for those records, so a distributed in-memory hashtable should do, for example http://memcached.org/ . The storage element there would probably be the list of records with the same key.
I'm still not 100% sure about the 3rd stream's requirements, but the need for a secondary index (if it is known beforehand) can be implemented at the application level as another distributed map.
In a nutshell, my suggestion is to pick HBase for the data you want to persist and keep sorted, and to consider some more lightweight solution for the transient (but still considerably big) data.

Is it faster to access a Java list (ArrayList) compared to accessing the same data in a MySQL database?

I have the MySQL database on the local machine where I'm running the Java program.
I plan to create an array list of all the entries of a particular table. From this point onwards I will not access the database to get a particular entry in the table; instead I will use the array list created. Is this going to be faster or slower compared to accessing the database to grab a particular entry in the table?
Please note that the table I'm interested in has about 2 million entries.
Thank you.
More info: I need only two fields, one of type Long and one of type String. The index of the table is Long, not int.
No, it's going to be much slower, because to find an element in an ArrayList you have to scan the ArrayList sequentially until your element is found.
It can be faster for a few hundred entries, because you don't have the connection overhead, but with two million entries MySQL is going to win, provided that you create the correct indexes. Only retrieve the rows that you actually need each time.
Why are you thinking of doing this? Are you experiencing slow queries?
To find out, activate the slow query log in your my.cnf by uncommenting (or adding) the following lines.
# Here you can see queries with especially long duration
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 1
Then see which queries take a long time, run them with EXPLAIN in front, and consider adding an index where EXPLAIN tells you none is being used, or just post a new question with your CREATE TABLE statement and the example query you want to optimize.
This question is too vague, and can easily go either way depending on:
How many fields in each record, how big are the fields?
What kind of access are you going to perform? Text search? Sequential?
For example, if each record consists of a couple of bytes of data, it's much faster to store them all in memory (not necessarily in an ArrayList, though). You may want to put them into a TreeSet, for example.
It depends on what you will do with the data. If you just want a few rows, only those should be fetched from the DB. If you know that you need ALL the data, go ahead and load the whole table into Java if it can fit in memory. What will you do with it afterwards? Sequential or random reading? Will the data be changed? A Map or Set could be a faster alternative depending on how the collection will be used.
Whether it is faster or slower is measurable. Time it. It is definitely faster to work with structures stored in memory than it is to work with data tables located on the disk. That is if you have enough memory and if you do not have 20 users running the same process at the same time.
How do you access the data? Do you have an integer index?
First, accessing an array list is much, much faster than accessing a database, because accessing memory is much faster than accessing a hard disk.
If the number of entries in the array is big, and I guess it is, then you should consider using a "direct access" data structure such as a HashMap, which will act like a database table where values are referenced by their keys.
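If the access pattern is always "look up the String by its Long id", a minimal sketch of loading the two columns once into a HashMap is below; the table/column names and connection details are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TableCache {
    public static Map<Long, String> load() throws Exception {
        Map<Long, String> cache = new HashMap<>(4_000_000); // room for ~2M entries without rehashing
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select id, name from mytable")) {
            while (rs.next()) {
                cache.put(rs.getLong(1), rs.getString(2));
            }
        }
        return cache;
    }

    public static void main(String[] args) throws Exception {
        Map<Long, String> cache = load();
        // O(1) in-memory lookup by key instead of a round trip to MySQL.
        System.out.println(cache.get(42L));
    }
}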
