I'm sorry that I haven't deeply understood HBase and Hadoop MapReduce yet, but I hope you can help me find a way to use them, or perhaps suggest other frameworks that fit.
Part I
There is a 1st stream of records that I have to store somewhere. They should be accessible by a key derived from the record itself. Several records can have the same key. There are quite a lot of them, and I have to delete old records by timeout.
There is also a 2nd stream of records, which is very intensive too. For each record (the argument-record) I need to: get all records from the 1st stream with that argument-record's key, find the first corresponding record, delete it from the 1st stream's storage, and return the result (res1) of merging those two records.
Part II
The 3rd stream of records is like the 1st. Records should be accessible by keys (different from those of Part I). As before, several records will have the same key, but there are not as many of them as in the 1st stream. I have to delete old records by timeout.
For each res1 (argument-record) I have to: get all records from the 3rd stream with that record's other key, map those records with res1 as a parameter, and reduce them into a result. The 3rd stream's records should stay unmodified in storage.
Records with the same key should preferably be stored on the same node, and the procedures that fetch records by key and act on a given argument-record should preferably run on the node where those records live.
Are HBase and Hadoop MapReduce applicable in my case? And what should such an app look like (the basic idea)? If the answer is no, are there frameworks for building such an app?
Please ask questions if anything about what I want is unclear.
I am referring to the storage backend technologies. The front end accepting records can be stateless and therefore trivially scalable.
We have streams of records and we want to join them on the fly. Some of the records should be persisted, while some (as far as I understood, the 1st stream) are transient.
If we take scalability and persistence out of the equation, this could be implemented in a single Java process using a HashMap for randomly accessible data and a TreeMap for data we want to keep sorted.
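For illustration only, a minimal single-process sketch of that idea (the record type, class, and merge rule here are all made up):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class JoinSketch {
        // hypothetical record type: a key plus an opaque payload
        record Rec(String key, String payload) {}

        // stream 1: transient, randomly accessed by key, several records per key
        private final Map<String, Deque<Rec>> stream1 = new HashMap<>();
        // stream 3: records we want to keep sorted by key (Part II would range-scan this)
        private final NavigableMap<String, List<Rec>> stream3 = new TreeMap<>();

        void acceptStream1(Rec r) {
            stream1.computeIfAbsent(r.key(), k -> new ArrayDeque<>()).add(r);
        }

        // for each stream-2 record: find the first match, remove it, merge the two
        Rec acceptStream2(Rec arg) {
            Deque<Rec> matches = stream1.get(arg.key());
            if (matches == null || matches.isEmpty()) return null; // no match yet
            Rec first = matches.pollFirst();
            return new Rec(arg.key(), arg.payload() + "+" + first.payload()); // "merge"
        }
    }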
Now let's see how it can be mapped onto NoSQL technologies to gain the scalability and performance we need.
HBase is a distributed sorted map, so it is a good candidate for the persistent stream (stream 3). If we use our key as the HBase row key, we gain data locality for records with the same key.
MapReduce on top of HBase is also available.
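As a hedged sketch with the standard HBase Java client (table name, column family, and key layout are my assumptions; since several records share a key, the row key below is the key plus a unique suffix, and exception handling is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("stream3"))) {

        // write: rows sharing the "myKey-" prefix are stored sorted together,
        // i.e. in the same region, which gives the data locality mentioned above
        Put put = new Put(Bytes.toBytes("myKey-0001"));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
        table.put(put);

        // read all records for one key with a prefix scan
        Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("myKey-"));
        try (ResultScanner rs = table.getScanner(scan)) {
            for (Result r : rs) {
                // process r ...
            }
        }
    }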
Stream 1 looks like transient, randomly accessed data. I think it does not make sense to pay the price of persistence for those records, so a distributed in-memory hashtable should do; for example http://memcached.org/. The storage element there would probably be the list of records sharing the same key.
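A rough sketch of that with the spymemcached client (key naming and TTL are assumptions; the TTL doubles as the delete-old-records-by-timeout requirement, exception handling is omitted, and the get-then-set below is racy, so real code would use CAS via gets/cas):

    import java.net.InetSocketAddress;
    import java.util.ArrayList;
    import net.spy.memcached.MemcachedClient;

    MemcachedClient mc = new MemcachedClient(new InetSocketAddress("localhost", 11211));

    // the value stored per key is the whole list of records sharing that key;
    // the TTL (here one hour) implements "delete old records by timeout"
    ArrayList<String> records = new ArrayList<>();
    records.add("record-1");
    mc.set("stream1:myKey", 3600, records);

    // for an incoming stream-2 record: fetch the list, take the first match,
    // write the shrunken list back (racy without CAS, see note above)
    @SuppressWarnings("unchecked")
    ArrayList<String> found = (ArrayList<String>) mc.get("stream1:myKey");
    if (found != null && !found.isEmpty()) {
        String first = found.remove(0);
        mc.set("stream1:myKey", 3600, found);
        // merge 'first' with the argument-record here ...
    }
    mc.shutdown();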
I'm still not 100% sure about the 3rd stream's requirements, but the need for a secondary index (if it is known beforehand) can be implemented at the application level as another distributed map.
In a nutshell, my suggestion is to pick HBase for the data you want to persist and keep sorted, and to consider more lightweight solutions for the transient (but still considerably big) data.
Related
For a not very big amount of data, we store all the keys in one bin with a List.
But there are limitations on the size of a bin.
The scanAll function with a ScanCallback in the Java client actually works very slowly, so we cannot afford it in our project. Aerospike is fast when you give it the key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them fast and without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch read does not cause an error; it just comes back with no value at that index in the result.
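A hedged sketch of such a batch read with the Aerospike Java client (namespace, set, and key pattern are assumptions):

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;

    AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

    // if the key space is known (here: "id-0" .. "id-999"), batch-read it directly
    Key[] keys = new Key[1000];
    for (int i = 0; i < keys.length; i++) {
        keys[i] = new Key("test", "smallSet", "id-" + i);
    }

    // one batch call; keys that don't exist simply yield null at that index
    Record[] records = client.get(null, keys);
    for (int i = 0; i < records.length; i++) {
        if (records[i] != null) {
            // the key existed; process records[i] ...
        }
    }
    client.close();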
Alternatively, you can add a bin that holds the set name, create a secondary index over that bin, and then query for all the records WHERE setname=XYZ. For a small set, this comes back much faster than the scan.
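A sketch of that approach, assuming a string bin named setname that already has a secondary index on it (client is the AerospikeClient from the previous sketch; depending on client version the call may be setFilters rather than setFilter):

    import com.aerospike.client.query.Filter;
    import com.aerospike.client.query.RecordSet;
    import com.aerospike.client.query.Statement;

    Statement stmt = new Statement();
    stmt.setNamespace("test");
    stmt.setSetName("smallSet");
    stmt.setFilter(Filter.equal("setname", "XYZ"));  // served by the secondary index

    RecordSet rs = client.query(null, stmt);
    try {
        while (rs.next()) {
            // rs.getKey() / rs.getRecord() for each matching record ...
        }
    } finally {
        rs.close();
    }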
If the result set is large, then holding the entire result set in memory (in a server-side cache, e.g. Hazelcast) will not be feasible; with large result sets, you cannot afford to keep them in memory. In such a case, you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that there will be multiple calls to the database for multiple page requests.
Can anyone suggest how to implement a hybrid of the two approaches?
I haven't put any sample code here since I think the question is more about the logic than about specific code. Still, if you need sample code, I can add it.
Thanks in advance.
The most effective solution is to use the primary key as the paging criterion. This lets us rely on first-class constructs like a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process: first retrieve the collection of primary keys and generate the intervals that identify proper subsets of the data, then run the actual queries against the data.
This approach is almost as fast as the brute-force version, while its memory consumption is about one tenth. By selecting an appropriate page size for this implementation, you can alter the ratio between execution time and memory consumption. This version is also stateless: it does not keep references to resources the way the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResults.
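A hedged JDBC sketch of that keyset-style paging (table, columns, page size, and URL are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    static void pageThroughItems(String jdbcUrl) throws SQLException {
        long lastId = 0;            // highest primary key seen so far
        final int pageSize = 1000;
        try (Connection con = DriverManager.getConnection(jdbcUrl)) {
            while (true) {
                int rows = 0;
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, pageSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            // process the row ...
                            rows++;
                        }
                    }
                }
                if (rows < pageSize) break;   // last page reached
            }
        }
    }

Because each call only needs the last id it saw, this stays stateless between page requests, which is exactly the property the paragraph above describes.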
Effective pagination using Hibernate
Given an example of the following select in CQL:
SELECT * FROM tickets WHERE ID IN (1,2,3,4)
Given that ID is the partition key, is using the IN relation better than running multiple queries, or is there no difference?
I remember seeing someone answer this question on the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers...points #13 and #22). But the answer is "yes": in some cases multiple, parallel queries are faster than using IN. The underlying reason can be found in the DataStax SELECT documentation.
When not to use IN
...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
So based on that, it would seem that this becomes more of a problem as your cluster gets larger.
Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) that are shared by ticket IDs 1, 2, 3, and 4. Maybe using something like level or group (if tickets are for a particular venue) or maybe even an event (id), instead.
Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you could come up with a different column to partition your data on, that would certainly help you in this case. Regardless, creating a new, specific column family (table) to handle queries for those rows is going to be a better approach than using IN or multiple queries.
Yes, it's better to query individually than to use IN in Cassandra.
For this query, the coordinator has to get the data from 4 different partitions, and if each partition is very big, all that data piles up in the coordinator's JVM, which can cause problems.
Querying with multiple individual queries is better instead, since each query stands alone and does not have to wait for the other partitions' data before sending its result back to the user.
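A hedged sketch with the 3.x-era DataStax Java driver, firing one lookup per key in parallel instead of a single IN (keyspace and contact point are assumptions):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("mykeyspace");   // keyspace name assumed

    PreparedStatement ps = session.prepare("SELECT * FROM tickets WHERE id = ?");

    // one async query per partition key; each goes straight to its replicas
    List<ResultSetFuture> futures = new ArrayList<>();
    for (int id : new int[] {1, 2, 3, 4}) {
        futures.add(session.executeAsync(ps.bind(id)));
    }
    for (ResultSetFuture f : futures) {
        for (Row row : f.getUninterruptibly()) {
            // process row ...
        }
    }
    cluster.close();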
I want to store different kinds of counters for my users.
Platform: Java
E.g. I have identified:
currentNumRecords
currentNumSteps
currentNumFlowsInterval1440
currentNumFlowsInterval720
currentNumFlowsInterval240
currentNumFlowsInterval60
currentNumFlowsInterval30
etc.
Each of the counters above needs to be reset at the beginning of each month, for each user. The value of each counter can get unpredictably high, with peaks etc. (I mean that a lot of things get counted, so I want to think about a scalable solution.)
Now my question is which approach to take:
a) Should I have separate columns for each counter on the user table, and do things like 'UPDATE ... SET counterColumn = counterColumn + 1'?
b) Put all the values in some kind of JSON/XML and store it in a single column? (In this case I always have to update all values at once.)
The disadvantage I see is row locking on the user table every time a single counter is incremented.
c) Have a separate counter table with 3 columns (userid, name, counter) and do one INSERT for each count, plus a background job doing aggregates which are written to the user table? In that case, would it be OK to store the aggregated counters as JSON inside a column in the user table?
d) Do everything in MySQL, or also use another technology? I have also thought about using another solution for storing the counters and only keeping the aggregates in MySQL. E.g. I have experimented with Apache Cassandra's distributed counters. My concern is about transactions, which Cassandra does not have.
I need the counters to be exact because they are used for billing, so I don't know whether Cassandra is a good fit here, although its scalability seems tempting.
What about Redis for storing the counters and writing the aggregates to MySQL? Does Redis have anything that helps me here? Or should I just store everything in a simple in-memory Java HashMap with an aggregation background thread, and not use another technology?
In summary I am concerned about:
reducing row locking
having exact counters (transactions?)
Thanks for your ideas :)
You're sort of saying contradictory things.
The number of counts can be huge or at least unpredictable per user.
To me this means they must be uniform, like an array. It is not possible to handle an unbounded amount of heterogeneous data, unless you have an unbounded amount of code and an unbounded number of developer hours to expend.
If they are uniform, they should be flattened into a table user_counter where each row has the form (user_id, counter_name, counter_value). However, you will need to think carefully about what sort of indices you will need, etc. Resetting them all to zero or some default value at the beginning of the month is then one SQL query.
Basically (c). Options (a) and (b) are rather absurd, and MySQL is still a suitable technology for this.
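A minimal sketch of that counter table with a MySQL upsert (schema and names are made up; con is an open JDBC Connection, userId a known id, and exception handling is omitted):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // assumed schema:
    //   CREATE TABLE user_counter (
    //     user_id       BIGINT      NOT NULL,
    //     counter_name  VARCHAR(64) NOT NULL,
    //     counter_value BIGINT      NOT NULL DEFAULT 0,
    //     PRIMARY KEY (user_id, counter_name));

    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO user_counter (user_id, counter_name, counter_value) "
          + "VALUES (?, ?, 1) "
          + "ON DUPLICATE KEY UPDATE counter_value = counter_value + 1")) {
        ps.setLong(1, userId);
        ps.setString(2, "currentNumRecords");
        ps.executeUpdate();   // locks only this counter row, not the user row
    }

    // the monthly reset is then a single statement:
    //   UPDATE user_counter SET counter_value = 0;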
Your requirement is not so unusual. In general, this is statistical session/user/...-bound write data.
The first thing is to split things, if that has not already been done: make a mostly read-only database and collect this data separately, i.e. a separate user table for the normal properties.
The statistical data could be held in an in-memory table. You could also use means other than a database: a message queue, or session attributes.
I have a MySQL table with some data (over a million rows). I have a requirement to sort the data based on the criteria below:
1) Newest
2) Oldest
3) top rated
4) least rated
What is the recommended solution for developing the sort functionality?
1) For every sort request, execute a DB query with the required joins and ORDER BY conditions, and return the sorted data.
2) Get all the (unsorted) data from the table, put it in a cache, and write custom comparators (Java) to sort it.
I am leaning towards #2, as the load on the DB happens only once. Moreover, I feel the application code is a better place for this logic than a DB query.
Please share your thoughts.
Thanks,
Karthik
Do as much in the database as you can. Note that if you have 1,000,000 rows, returning all million is nearly useless. Are you going to display this on a web site? I think not. Do you really care about the 500,000th least popular post? Again, I think not.
So do the sorts in the database and return the top 100, 500, or 1000 rows.
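For example (with a hypothetical posts table), each sort order becomes one indexed, limited query instead of a full-table transfer:

    // hypothetical schema: posts(id, title, created_at, rating)
    String newest   = "SELECT id, title FROM posts ORDER BY created_at DESC LIMIT 100";
    String oldest   = "SELECT id, title FROM posts ORDER BY created_at ASC  LIMIT 100";
    String topRated = "SELECT id, title FROM posts ORDER BY rating     DESC LIMIT 100";
    // with indexes on created_at and rating, MySQL can serve these without
    // sorting the whole million-row table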
It's much faster to do it in the database:
1) the database is optimized for I/O operations, and can use indices and other DB optimizations to improve the response time
2) pulling the data from the database into the application loads all of it into memory, and the app then has to walk all the data to reorder it, without optimized algorithms
3) the database only takes the minimum necessary data into memory, which can be much less than all the data that has to be moved to Java
4) you can always create extra indices on the database to improve the query performance.
I would say that the operation on the DB will always be faster. You should ensure that caching on the DB is on and working properly. Ensure that you are not using NOW() in your query, because it will disable the MySQL query cache. Take a look here at how the MySQL query cache works. Basically, a query is cached based on its exact string, so if the query string differs every time you fetch, no cache is used.
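To illustrate with a hypothetical query (the cache key is the exact query text):

    // NOW() is non-deterministic, so the query cache skips this entirely:
    String uncacheable =
        "SELECT * FROM posts WHERE created_at > NOW() - INTERVAL 1 DAY";

    // computing the cutoff in the application (e.g. rounded to the hour) yields
    // a repeatable query string that the cache can actually hit:
    String cutoff = "2015-06-01 13:00:00";   // recomputed once per hour
    String cacheable =
        "SELECT * FROM posts WHERE created_at > '" + cutoff + "'";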
AFAIK it should usually run faster if you let the DB sort your data.
And regarding application-level vs DB-level code: I would agree in the case of stored procedures, but sorting in SELECTs is fine IMHO.
If you want to show the data to the user, also consider paging (in which case you're better off sorting at the DB level anyway).
Fetching a million rows from the database sounds like a terrible idea. It will generate a lot of network traffic and take quite some time to transfer all the data, not to mention the amount of memory your application would need to allocate to store millions of objects.
So if you can fetch only a subset with a query, do that. Overall, do as much filtering as you can in the database.
And I do not see any problem with doing the ordering in a single query; you can always use UNION if you can't do it as one SELECT.
You do not have four tasks, you have two:
sorting by newest IS EQUAL TO sorting by oldest (the same sort, read in reverse)
AND
sorting by top rated IS EQUAL TO sorting by least rated.
So you only need two kinds of calls to the DB. Yes, sort in the DB. Then, instead of calling out to sort every time, do this (a sketch follows the list):
1] track the timestamp of the latest record in the DB
2] before calling to sort and retrieve the entire list, check whether that timestamp has changed
3] if the timestamp has not changed, use the list you have in memory
4] if it has changed, update the list
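A hedged sketch of that check-before-refetch logic (the posts schema and the 100-row limit are assumptions):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;

    class NewestCache {
        private List<String> cachedNewest;   // last sorted result we fetched
        private Timestamp cachedAsOf;        // MAX(created_at) at fetch time

        List<String> getNewest(Connection con) throws SQLException {
            Timestamp latest;
            // 1] cheap probe: timestamp of the latest record
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT MAX(created_at) FROM posts")) {
                rs.next();
                latest = rs.getTimestamp(1);
            }
            // 2]+3] nothing new since the last fetch: serve the in-memory list
            if (cachedNewest != null && latest != null && latest.equals(cachedAsOf)) {
                return cachedNewest;
            }
            // 4] something changed: let the DB sort, then cache the result
            List<String> fresh = new ArrayList<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT title FROM posts ORDER BY created_at DESC LIMIT 100")) {
                while (rs.next()) fresh.add(rs.getString("title"));
            }
            cachedNewest = fresh;
            cachedAsOf = latest;
            return cachedNewest;
        }
    }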
I know this is an old thread, but it comes up in my search, so I'd like to post my opinion.
I'm a bit old school, but for that many rows I would consider dumping the data from your database (each RDBMS has its own method; it looks like the mysqldump command for MySQL: Link).
You can then process the dump with sorting algorithms or tools available in your Java libraries or operating system.
Be careful about the work you're asking your database to do. Remember that it has to be available to service other requests. Don't "bring it to its knees" servicing only one request, unless it's a nightly-batch-cycle type of scenario and you're certain it won't be asked to do anything else.