What data structure should I use for this pricing table - java

I have a JTable that has 50,000 rows.
Each row contains 3 columns.
The middle column contains a double (price), and reads like this.
col1      col2      col3
          1.0031
          1.0032
          1.0033
          1.0034
          1.0035
I then have a constantly updating array of about 10-20 prices that gets updated every 20 ms.
I'm currently iterating over that array and checking it against the 50,000 rows to find the row each price belongs to, and inserting it there.
Then on the next update, I'm clearing those cells and repeating.
This is extremely costly though: on each update I iterate over 20 prices, and each of those iterates over 50,000 rows to find the row it belongs to.
There's got to be a better way to do this...
I really want to be able to just insert the price at a certain row, based on the price (so each price is mapped to an index):
if price = 1.0035, insert into row X
Instead I have to do something like:
if price is one of 50,000 values, find the value's index and insert.
Any ideas as to the best way to achieve this?
A Hashtable? A quadtree for localized search? Anything faster, because the way I'm doing it is far too slow for the application's needs.

It sounds like you could let your TableModel manage a SortedMap, such as TreeMap<Double, …>, which "provides guaranteed log(n) time cost for the containsKey, get, put and remove operations." This related example manages a Map<String, String>.
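A minimal sketch of that idea, assuming the 50,000 price levels are fixed and loaded once (class and method names here are illustrative, not from the question):

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import javax.swing.table.AbstractTableModel;

// Sketch of a TableModel backed by a TreeMap: the 50,000 price levels are
// loaded once into a map from price to row index, so each incoming tick is
// resolved to its row in O(log n) instead of a 50,000-element scan.
public class PriceLadderModel extends AbstractTableModel {

    private final List<Object[]> rows = new ArrayList<>();            // col1, price, col3
    private final NavigableMap<Double, Integer> rowByPrice = new TreeMap<>();

    public PriceLadderModel(double[] priceLevels) {
        for (double p : priceLevels) {
            rowByPrice.put(p, rows.size());
            rows.add(new Object[] { null, p, null });
        }
    }

    /** Called for each of the 10-20 updated prices every 20 ms. */
    public void update(double price, Object valueForCol3) {
        Integer row = rowByPrice.get(price);           // log(n) lookup, no scan
        if (row == null) {
            return;                                    // price is not on the ladder
        }
        rows.get(row)[2] = valueForCol3;
        fireTableCellUpdated(row, 2);                  // repaint only that cell
    }

    @Override public int getRowCount()    { return rows.size(); }
    @Override public int getColumnCount() { return 3; }
    @Override public Object getValueAt(int r, int c) { return rows.get(r)[c]; }
}

In practice you would probably key the map on a scaled integer (e.g. Math.round(price * 10000)) rather than a raw double, to avoid floating-point equality surprises.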

A tree seems like the most logical data structure to me; however, if your values are in a known range you could have an index that corresponds to each possible price, with a flag to show whether the price is present. Your searches and updates would then be O(1) for each entry, the drawback being an increased memory footprint. In essence this is a hashtable, although your hashing function could be very simple.
As for a tree, you would have to do some experimentation (or calculation) to determine the number of values in each node for your needs.
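If the prices do form a known, evenly spaced ladder, the fixed-range idea above can be a plain array lookup; the tick size and bounds below are illustrative assumptions, not from the question:

// O(1) mapping from price to row, assuming an evenly spaced price ladder.
public class DirectPriceIndex {

    private static final double MIN_PRICE = 1.0000;    // assumed lower bound
    private static final double TICK      = 0.0001;    // assumed tick size
    private static final int    LEVELS    = 50_000;

    // rowForLevel[i] holds the JTable row for price MIN_PRICE + i * TICK
    private final int[] rowForLevel = new int[LEVELS];

    public DirectPriceIndex() {
        for (int i = 0; i < LEVELS; i++) {
            rowForLevel[i] = i;                         // here the ladder is simply in row order
        }
    }

    /** Returns the row for a price, or -1 if the price is off the ladder. */
    public int rowFor(double price) {
        long level = Math.round((price - MIN_PRICE) / TICK);
        if (level < 0 || level >= LEVELS) {
            return -1;
        }
        return rowForLevel[(int) level];
    }
}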

Related

EclipseLink add vs. update performance

Can anybody give me an intuition for the following situation?
We are modifying our Microsoft SQL Server database with EclipseLink. Our application contains an import feature, so these imports generate a lot of insertions into a table (say table A). On the other hand, when we delete one of these table entries, we do not actually delete it but set a "deleted" flag to true. There are relations among different tables (say tables X, Y, Z), so that if we delete an element from X, we also delete some elements of A. Importantly, we currently insert and delete every element separately. Given a lot of imports and a lot of such bulk updates (our deletions), table A grows a lot; it is currently about 500,000 rows.
Using the EclipseLink PerformanceProfiler, I found out that we get a slow-down in update times, but no visible slow-down for the insertion time. Since we have an index on the composite primary key of table A, I expect that insertion needs O(1) time for index insertion (assuming that the index is similar to a HashMap) and something like O(1) to insert into the table. For the update, I get O(1) for retrieval, and O(1) for the write.
In reality we get an insertion time of 0.6 ms for one element, but an update time of 800ms (!). Does anybody have an explanation for this? Also, if anybody knows good measures to improve this situation, I am happy to hear those as well.
So far I am only aware of bulk-updating together those elements that meet a certain condition, such as: update all elements of table A that are related to element x of table X. But since we have a lot more hierarchy over multiple tables, I am not sure how much I would gain from this.
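As an aside on the bulk-update idea mentioned above, a hedged sketch using a standard JPA bulk UPDATE (entity and field names are hypothetical, not from the question):

import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

// Flips the "deleted" flag for all related rows in one statement instead of
// loading and updating each entity separately.
public class BulkSoftDelete {

    public int softDeleteRelatedToX(EntityManager em, long xId) {
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        int updated = em.createQuery(
                "UPDATE A a SET a.deleted = true WHERE a.x.id = :xId")
                .setParameter("xId", xId)
                .executeUpdate();
        tx.commit();
        return updated;
    }
}

Note that JPQL bulk updates bypass the persistence context, so cached entities may need to be refreshed afterwards.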

How to process huge data in hbase by modifying org.apache.hadoop.hbase.mapreduce.RowCounter?

My HBase table contains millions of rows. If we do a scan, it takes at least an hour to show all the records. We are storing dates as row keys. I need to get the min and max values of the date. I saw a utility, org.apache.hadoop.hbase.mapreduce.RowCounter, which counts millions of rows in 5 minutes. Is there any way to do my job in the same way? FYI: I am using Java.
If you are using HBase 0.98, your problem should be easy. All you have to do is obtain the first and the last row in your table (since the entries are ordered by row key):
The first row you obtain by performing a scan with a limit of 1.
The last row you obtain by performing a reverse scan with a limit of 1.
You can find more information about the reverse scan here: https://issues.apache.org/jira/browse/HBASE-4811
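A rough sketch of that approach with the HBase client API (the table name is a placeholder); each scan is stopped after the first row it returns:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Finds the minimum and maximum row keys with one forward and one reverse scan.
public class MinMaxRowKey {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events_by_date"))) {
            System.out.println("min key: " + firstKey(table, false));
            System.out.println("max key: " + firstKey(table, true));
        }
    }

    private static String firstKey(Table table, boolean reversed) throws Exception {
        Scan scan = new Scan();
        scan.setReversed(reversed);                    // reverse scans need HBase 0.98+
        scan.setCaching(1);                            // we only want a single row
        try (ResultScanner scanner = table.getScanner(scan)) {
            Result first = scanner.next();             // read just the first row
            return first == null ? null : Bytes.toString(first.getRow());
        }
    }
}

(The Connection/Table classes shown are from the 1.x client; on 0.98 itself the equivalent is the older HTable API, with the same Scan.setReversed call.)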
If you are using a previous version of HBase, then you should consider using some model/convention for your table. The first row is easy to obtain (again, just a scan on the table with a limit of 1), but for the last row you unfortunately do not have the reverse-scan feature. Some options:
1. Design an "upside-down" table as described here: http://staltz.blogspot.com/2012/05/first-and-last-rows-in-hbase-table.html
2. Since you are using a date as the row key, there is a high chance that you will not receive the data in descending order (see the blog post in item 1), so you can keep a secondary table in which you always store the minimum and maximum date values (which also implies that, for every record you insert or delete, your code has to check whether the secondary table needs updating).
3. Redesign the way you store the data. A suggestion would be to keep your initial table plus a reverse-index table, and in the reverse-index table build the row key as MAX_INTEGER - dateTimestamp, so that the latest date becomes the first entry in the reverse table and you can retrieve it with a scan (with a limit of 1).
Since the solution for HBase 0.98 is very simple and needs no workarounds, if you do not have that version I would recommend migrating.
You are on the right track. RowCounter is the efficient way to count HBase rows when the table has millions of records. You can get the source code of RowCounter and tweak it a bit to achieve your requirement.
RowCounter performs a scan internally; the reason it runs fast is the parallelism of MapReduce. Since it already builds a Scan, you can identify that piece of code and add a filter to it.
With that change, your RowCounter will count only the rows that match the filter criteria. To extend it, you could parameterize the column family, column qualifier, value, operator, etc. (see the sketch below).
I hope it helps your cause
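As an illustration of the filter idea from the answer above, a sketch of how the Scan that RowCounter builds could be restricted (family, qualifier and value are placeholders):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Builds the Scan to hand to TableMapReduceUtil.initTableMapperJob, in the
// same place RowCounter builds its own; only rows matching the filter get counted.
public class FilteredRowCountScan {

    public static Scan buildScan(byte[] family, byte[] qualifier, byte[] value) {
        Scan scan = new Scan();
        scan.setCaching(500);                          // larger batches for a full scan
        scan.setCacheBlocks(false);                    // don't pollute the block cache
        // RowCounter normally sets a FirstKeyOnlyFilter; replace or extend it
        // with whatever criterion you need, e.g. a value match on one column:
        scan.setFilter(new SingleColumnValueFilter(
                family, qualifier, CompareOp.EQUAL, value));
        return scan;
    }

    public static void main(String[] args) {
        Scan scan = buildScan(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                              Bytes.toBytes("someValue"));
        // from here, wire the job with TableMapReduceUtil.initTableMapperJob(...)
        // the same way RowCounter's createSubmittableJob() does
    }
}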

How DB Index Works?

I am trying to find out how a DB index works and when it should be used. I read some articles on that, and one important one I found is at How does database indexing work?.
How it works:-
Advantage 1:- After reading the discussion at the above link, one thing an index helps with is reducing the number of data blocks to iterate through, as explained in example 1.
Advantage 2:- But again, one question came to my mind: even after introducing the index, the database still has to search the index itself (which any data store maintains internally), which should again take time. After further reading I found out that indexes are stored in an efficient way, usually using a data structure like a B-tree, through which we can drill down to any value quickly; once we reach the node, it gives us the exact location of the record for the value given in the WHERE or JOIN condition. Correct? So basically the index stores the value of the column on which we create the index, plus the location of the actual record.
When it should be used:- As we know, if we create an index on a column and then insert/update/delete values in that column, the index needs to be updated as well, so CUD operations take a bit of extra time and memory. So when should it be used? Imagine we create customers one at a time from a user screen, so the total number of customers at the end of the day is 1 million. Now if we want to search for the customers who belong to New York, an index will help a lot. Agreed, it will slow down inserting a customer a bit, but only fractionally, while the performance we get when retrieving the New York customers will be exceptionally good.
Please confirm or correct the above findings.
Your general conclusions are pretty much ok.
Yes, for some queries, an index means less data blocks need to be read.
Yes, the default index type in Oracle is implemented internally using a B-Tree.
Yes, there is some overhead for Create/Update/Delete operations on a table with indexes - both in terms of performance and space used - but this overhead is usually negligible, and easily justified when the improvement to the performance of queries is considered.
I heartily recommend reading the Oracle Concepts Guide on indexes.
The previous responses (and your conclusions) are correct. With regard to when to use indexes, it might be easier to discuss when not to use them. Here are a couple of scenarios in which an index might not be appropriate:
A table in which you do a high-rate of inserts, but never or rarely select from it. An example of such a table might be some type of logging table.
A very small table whose rows all fit into one or a couple of blocks.
Indexes speed up selects.
They do this by reducing the number of rows to check.
Example
I have a table with 1,000,000,000 rows.
id is a primary key.
gender can be either male or female
city can be one of 50 options.
street can be lots of different options.
When I'm looking for a unique value, using the index it will take about 30 lookups on a fully balanced tree (log2 of 1,000,000,000 ≈ 30).
Without the index it will take 500,000,000 lookups on average (half the table).
However, putting an index on gender is pointless, because it will not reduce the search time enough to justify the extra time needed to use the index, look up the items and then get the data from the rows.
For city it is a borderline case. If I have 50 different cities an index is useful; if I have only 5, the index has low cardinality and will not get used.
Indexes slow down inserts and updates.
More stuff to consider
MySQL can only use one index per (sub) select per table.
If you want to use an index on:
SELECT * FROM table1 WHERE city = 'New York' AND Street = 'Hoboken'
You will have to declare a compound index:
ALTER TABLE table1 ADD INDEX index_name (city, street)
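A small JDBC sketch of the same thing, using the table1/city/street names from the example above (connection URL and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Creates the compound index once, then runs the two-column query as a
// parameterized statement that can use that index.
public class CompoundIndexDemo {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password")) {

            try (Statement st = conn.createStatement()) {
                st.execute("ALTER TABLE table1 ADD INDEX index_name (city, street)");
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM table1 WHERE city = ? AND street = ?")) {
                ps.setString(1, "New York");
                ps.setString(2, "Hoboken");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id"));
                    }
                }
            }
        }
    }
}

Column order matters in a compound index: (city, street) can also serve a query that filters on city alone, but not one that filters only on street.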

Avoiding for loop and try to utilize collection APIs instead (performance)

I have a piece of code from an old project.
The logic (in a high level) is as follows:
The user sends a series of {id,Xi} where id is the primary key of the object in the database.
The aim is that the database is updated but the series of Xi values is always unique.
I.e. if the user sends {1,X1} and in the database we have {1,X2},{2,X1}, the input should be rejected; otherwise we end up with duplicates, i.e. {1,X1},{2,X1}, where X1 appears twice in different rows.
At a lower level, the user sends a series of custom objects that encapsulate this information.
Currently the implementation uses "brute force", i.e. repeated for-loops over the input and the JDBC ResultSet to ensure uniqueness.
I do not like this approach, and moreover the actual implementation has subtle bugs, but that is another story.
I am searching for a better approach, both in terms of coding and performance.
What I was thinking is the following:
Create a Set from the user's input list. If the Set has a different size than the list, then the user's input has duplicates. Stop there.
Load data from jdbc.
Create a HashMap<Long,String> with the user's input. The key is the primary key.
Loop over the result set. If the HashMap does not contain a key equal to the ResultSet row's id, then add it to the HashMap.
In the end, get the HashMap's values as a List. If it contains duplicates, reject the input.
This is the algorithm I came up with.
Is there a better approach than this? (I assume that I am not mistaken about the algorithm itself.)
Purely from a performance point of view, why not let the database figure out that there are duplicates (like {1,X1},{2,X1})? Have a unique constraint in place on the table, and when the update statement fails by throwing an exception, catch it and deal with it however you want under these input conditions. You may also want to run this as a single transaction in case you need to roll back any partial updates. Of course this is assuming that you don't have any other business rules driving the updates that you haven't mentioned here.
With your algorithm, you are spending too much time iterating over HashMaps and Lists to remove duplicates, IMHO.
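A sketch of that constraint-backed approach with plain JDBC, assuming a UNIQUE constraint already exists on the value column (table and column names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.Map;

// Applies the whole batch in one transaction; the first UNIQUE violation
// rolls everything back and the input is rejected as a whole.
public class ConstraintBackedUpdate {

    public boolean applyUpdates(Connection conn, Map<Long, String> input) throws Exception {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE tbl SET value = ? WHERE id = ?")) {
            for (Map.Entry<Long, String> e : input.entrySet()) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.executeUpdate();                    // throws if the UNIQUE constraint is hit
            }
            conn.commit();
            return true;
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            conn.rollback();                           // undo any partial updates
            return false;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}

Some drivers report the violation as a generic SQLException with a vendor-specific error code rather than this subclass, so check what your driver actually throws.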
Since you can't change the database, as stated in the comments, I would probably extend your Set idea. Create a HashMap<Long, String> and put all of the items from the database into it, then also create a HashSet<String> with all of the values from your database.
Then, as you go through the user input, check each key against the HashMap and see whether the values are the same. If they are, great: you don't have to do anything, because that exact input is already in your database.
If they aren't the same, check the value against the HashSet to see if it already exists. If it does, then you have a duplicate.
Should perform much better than a loop.
Edit:
For multiple updates, apply all of the updates to the HashMap created from your database, then once again check whether the Map's value set has a different size from its key set.
There might be a better way to do this, but this is the best I got.
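A minimal sketch of that Map/Set check (it assumes duplicates within the input itself have already been rejected by the Set-size check from the question):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// dbRows is the current {id -> value} content loaded via JDBC,
// input is the user's {id -> value} batch.
public class DuplicateValueCheck {

    public static boolean isValid(Map<Long, String> dbRows, Map<Long, String> input) {
        Set<String> existingValues = new HashSet<>(dbRows.values());
        for (Map.Entry<Long, String> e : input.entrySet()) {
            String current = dbRows.get(e.getKey());
            if (e.getValue().equals(current)) {
                continue;                              // unchanged row, nothing to check
            }
            if (existingValues.contains(e.getValue())) {
                return false;                          // value already used by another row
            }
        }
        return true;
    }
}

(Like the answer's approach, this rejects "swaps" where two rows exchange values in the same batch; handle that case separately if it matters.)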
I'd opt for a database-side solution. Assuming a table with the columns id and value, you should make a list with all the "values", and use the following SQL:
select count(*) from tbl where value in (:values);
binding the :values parameter to the list of values however is appropriate for your environment. (Trivial when using Spring JDBC and a database that supports the in operator, less so for lesser setups. As a last resort you can generate the SQL dynamically.) You will get a result set with one row and one column of a numeric type. If it's 0, you can then insert the new data; if it's 1, report a constraint violation. (If it's anything else you have a whole new problem.)
If you need to check for every item in the user input, change the query to:
select value from tbl where value in (:values)
store the result in a set (called e.g. duplicates), and then loop over the user input items and check whether the value of the current item is in duplicates.
This should perform better than snarfing the entire dataset into memory.
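For the Spring JDBC case mentioned above, a sketch with NamedParameterJdbcTemplate, which expands the :values collection into the IN (...) list (tbl/value as in the answer):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

// Returns the subset of candidate values that already exist in tbl.
public class DbSideDuplicateCheck {

    private final NamedParameterJdbcTemplate jdbc;

    public DbSideDuplicateCheck(DataSource dataSource) {
        this.jdbc = new NamedParameterJdbcTemplate(dataSource);
    }

    public Set<String> findDuplicates(List<String> values) {
        List<String> existing = jdbc.queryForList(
                "select value from tbl where value in (:values)",
                new MapSqlParameterSource("values", values),
                String.class);
        return new HashSet<>(existing);
    }
}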

How to optimize retrieval of most occurring values (hundreds of millions of rows)

I'm trying to retrieve the most frequently occurring values from a SQLite table containing a few hundred million rows.
The query so far may look like this:
SELECT value, COUNT(value) AS count FROM table GROUP BY value ORDER BY count DESC LIMIT 10
There is an index on the value field.
However, with the ORDER BY clause, the query takes so much time I've never seen the end of it.
What could be done to drastically improve such queries on such big amount of data?
I tried to add a HAVING clause (e.g. HAVING count > 100000) to lower the number of rows to be sorted, without success.
Note that I don't care much about the time required for insertion (it still needs to be reasonable, but priority is given to selection), so I'm open to solutions that do some of the computation at insertion time...
Thanks in advance,
1) Create a new table where you store one row per unique "value" together with its "count", and put a descending index on the count column.
2) Add a trigger to the original table that maintains this new table (insert and update) as necessary to increment/decrement the count.
3) Run your query off this new table; it will run fast because of the descending count index.
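A sketch of steps 1) and 2) for SQLite over JDBC (table and column names are placeholders, and the sqlite-jdbc driver is assumed; only the INSERT trigger is shown, a DELETE trigger decrementing the count would be symmetrical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Keeps a small pre-aggregated value_counts table up to date via a trigger,
// so the top-10 query never touches the big table.
public class ValueCountSetup {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
             Statement st = conn.createStatement()) {

            st.execute("CREATE TABLE IF NOT EXISTS value_counts ("
                     + "  value TEXT PRIMARY KEY,"
                     + "  cnt   INTEGER NOT NULL DEFAULT 0)");
            st.execute("CREATE INDEX IF NOT EXISTS idx_counts_cnt "
                     + "ON value_counts(cnt DESC)");

            st.execute("CREATE TRIGGER IF NOT EXISTS trg_count_insert "
                     + "AFTER INSERT ON big_table BEGIN "
                     + "  INSERT OR IGNORE INTO value_counts(value, cnt) VALUES (NEW.value, 0); "
                     + "  UPDATE value_counts SET cnt = cnt + 1 WHERE value = NEW.value; "
                     + "END");

            // Step 3: the top-10 query now reads the small table.
            try (ResultSet rs = st.executeQuery(
                    "SELECT value, cnt FROM value_counts ORDER BY cnt DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("value") + " -> " + rs.getLong("cnt"));
                }
            }
        }
    }
}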
Your query forces you to look at every row in the table; that is what is taking the time.
I almost never recommend this, but in this case you could maintain the count in a denormalized fashion in an external table.
Place the value and count into another table during insert, update, and delete via triggers.
