I am trying to find out how a DB index works and when it should be used. I read some articles on the topic, and one important one I found is How does database indexing work?.
How it works:
Advantage 1: After reading the discussion at the above link, one thing an index helps with is reducing the number of data blocks to iterate through, as explained in example 1.
Advantage 2: But then another question came to my mind: after introducing the index, the database still has to search the index itself (which any data store maintains internally), which should again take time. After further reading I found that indexes are stored in an efficient way, usually using a data structure like a B-tree, through which we can drill down to any value quickly, and once we reach its node it gives us the exact location of the record for the value given in the WHERE or JOIN condition. Correct? So basically the index stores the value of the column on which we are creating the index, plus the location of the actual record.
When it should be used: As we know, if we create an index on a column and then insert/update/delete a value in that column, the index has to be updated as well, so CUD operations take a bit of extra time and memory. So when should it be used? Imagine we create customers one at a time from a user screen, so the total number of customers at the end of the day is 1 million. Now if we want to search for the customers who belong to New York, an index will help a lot. Agreed, it will slow down inserting a customer a little, but only fractionally, while the performance we get when retrieving the New York customers will be exceptionally good.
Please correct me if I am wrong: do you agree or disagree with the above findings?
Your general conclusions are pretty much ok.
Yes, for some queries, an index means fewer data blocks need to be read.
Yes, the default index type in Oracle is implemented internally using a B-Tree.
Yes, there is some overhead for Create/Update/Delete operations on a table with indexes - both in terms of performance and space used - but this overhead is usually negligible, and easily justified when the improvement to the performance of queries is considered.
I heartily recommend reading the Oracle Concepts Guide on indexes.
Previous responses (and your conclusions) are correct. With regard to when to use indexes, it might be easier to discuss when not to use indexes. Here are a couple of scenarios in which it might not be appropriate to use an index.
A table into which you do a high rate of inserts, but from which you never or rarely select. An example of such a table might be some type of logging table.
A very small table whose rows all fit into one or a couple of blocks.
Indexes speed up selects.
They do this by reducing the number of rows to check.
Example
I have a table with 1,000,000,000 rows.
id is a primary key.
gender can be either male or female
city can be one of 50 options.
street can be lots of different options.
When I'm looking for a unique value, using an index will take about 30 lookups in a fully balanced tree (log2 of 1,000,000,000 is roughly 30).
Without the index it will take 500,000,000 lookups on average.
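Here's a tiny sketch that just reproduces the two numbers above (a full scan checks half the rows on average, a balanced tree needs about log2(n) steps); it is purely illustrative:

public class LookupCost {
    public static void main(String[] args) {
        long rows = 1_000_000_000L;

        // Height of a fully balanced binary tree over 1,000,000,000 keys: log2(n), about 30.
        long indexedLookups = (long) Math.ceil(Math.log(rows) / Math.log(2));

        // Without an index, a full scan checks n/2 rows on average before hitting a unique match.
        long unindexedLookups = rows / 2;

        System.out.println("with index:    ~" + indexedLookups + " lookups");   // ~30
        System.out.println("without index: ~" + unindexedLookups + " lookups"); // 500,000,000
    }
}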
However, putting an index on gender is pointless, because it will not reduce the search time enough to justify the extra time needed to use the index, look up the items and then get the data from the rows.
For city it is a borderline case. If I have 50 different cities an index is useful; if you have only 5, the index has low cardinality and will not get used.
Indexes slow down inserts and updates.
More stuff to consider
MySQL can only use one index per (sub) select per table.
If you want to use an index on:
SELECT * FROM table1 WHERE city = 'New York' AND Street = 'Hoboken'
You will have to declare a compound index:
ALTER TABLE table1 ADD INDEX index_name (city, street)
Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my searches correctly.
My question is,
I have a couple of tables which will hold anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether and how I should handle archiving the data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method, because when we want to search for old data our application will need to search in a different database, and it'll be a hassle for me to add more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use a status on the rows to improve the performance (see method 3). This is something I'm leaning towards as an 'optimal' solution, where the code is not as complex to write but my DB is also kept relatively clean.
Method 3
Just have a status for each row (e.g. A = active, R = archived) to possibly improve the performance of the query. Just having a "select * from table where status = 'A'" reduces the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.
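To illustrate the index suggestion above, here is a minimal sketch assuming plain JDBC, Postgres 9.5+ (for IF NOT EXISTS) and a hypothetical table named orders with a created date column; none of these names come from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class AddDateIndex {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
             Statement st = con.createStatement()) {
            // Index the existing date column; queries filtering on a date range can then use it,
            // so no separate "year" column is needed.
            st.execute("CREATE INDEX IF NOT EXISTS idx_orders_created ON orders (created)");
        }
    }
}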
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use case, I would further suggest having a view over only the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
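Here's a hedged sketch of what list partitioning on the status flag could look like, assuming Postgres 10 or later (which has declarative partitioning; older versions used inheritance-based partitioning) and hypothetical table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class PartitionByStatus {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
             Statement st = con.createStatement()) {
            // Parent table partitioned on the status flag (A = active, R = archived).
            st.execute("CREATE TABLE records (id bigint, status char(1) NOT NULL, created date, payload text) "
                     + "PARTITION BY LIST (status)");
            st.execute("CREATE TABLE records_active   PARTITION OF records FOR VALUES IN ('A')");
            st.execute("CREATE TABLE records_archived PARTITION OF records FOR VALUES IN ('R')");
            // View over the active data only; queries through it touch just the active partition.
            st.execute("CREATE VIEW active_records AS SELECT * FROM records WHERE status = 'A'");
        }
    }
}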
My HBase table contains millions of rows. If we do a scan it takes at least an hour to show all the records. We are storing dates as row keys. I need to get the min and max values of the date. I saw a utility, org.apache.hadoop.hbase.mapreduce.RowCounter, which counts millions of rows in 5 minutes. Is there any way to do my job in the same way? FYI: I am using Java.
If you are using HBase 0.98, your problem should be easy. All you have to do is to obtain the first and the last row in your table (since the entries are ordered):
The first row you obtain by performing a scan with a limit of 1.
The last row you obtain by performing a reverse scan with a limit of 1.
You can find more information about the reverse scan here: https://issues.apache.org/jira/browse/HBASE-4811
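Here's a minimal sketch of both scans with the 0.98 Java client; the table name is an assumption and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MinMaxRowKey {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // hypothetical table name

        // Min: the very first row of a forward scan.
        Scan forward = new Scan();
        forward.setCaching(1);
        ResultScanner fs = table.getScanner(forward);
        Result first = fs.next();
        fs.close();

        // Max: the very first row of a reverse scan (HBASE-4811, available from 0.98).
        Scan reverse = new Scan();
        reverse.setReversed(true);
        reverse.setCaching(1);
        ResultScanner rs = table.getScanner(reverse);
        Result last = rs.next();
        rs.close();

        System.out.println("min row key: " + Bytes.toString(first.getRow()));
        System.out.println("max row key: " + Bytes.toString(last.getRow()));
        table.close();
    }
}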
If you are using a previous version of HBase, then you should consider using some model/convention for your table. The first row is easy to obtain (again, just a scan on the table with a limit of 1), but unfortunately you do not have the reverse scan feature for the last row.
You can design to have an "upside-down" table as described here: http://staltz.blogspot.com/2012/05/first-and-last-rows-in-hbase-table.html
Since you are using the date as the row key, there is a high chance you will not receive the data in descending order (see the blog post in item 1). Therefore you can keep a secondary table in which you always store the minimum and maximum values of the date (this also implies that you have to perform a check in your code for every record you insert/delete, and update the secondary table accordingly).
Redesign the way you store the data. A suggestion would be to keep your initial table plus a reverse-index table, and in the reverse-index table store the row key as MAX_INTEGER - dataTimestamp; the latest date will then be the first entry in your reverse table, and you can retrieve it with a scan (with a limit of 1). See the small sketch below.
Since the solution for HBase 0.98 is very simple and needs no workarounds, if you do not have that version I would recommend doing a migration.
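For the reverse-index option (item 3 above), the reversed row key is essentially a one-liner. A small sketch, assuming the date is stored as epoch milliseconds and using Long.MAX_VALUE rather than an int maximum, since millisecond timestamps do not fit in an int:

import org.apache.hadoop.hbase.util.Bytes;

public class ReverseKey {
    // Rows in the reverse-index table are keyed by (Long.MAX_VALUE - timestamp), so the
    // newest date sorts first and a plain forward scan with a limit of 1 returns the maximum.
    static byte[] reverseKey(long dateMillis) {
        return Bytes.toBytes(Long.MAX_VALUE - dateMillis);
    }
}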
You are heading in the right direction. RowCounter is the efficient way to count HBase rows when a table has millions of records. You can get the source code of RowCounter and tweak it a bit to achieve your requirement.
RowCounter performs a scan internally; the reason it runs so fast is the parallelism in MapReduce. Once you have the scan, you can always add a filter, so you can identify that piece of code and add your filter to it.
With the above change, your RowCounter will count the rows that match that filter criteria. To extend it, you could perhaps parameterize the column family, column qualifier, value, operator, etc.
I hope it helps your cause
I have a JTable that has 50,000 rows.
Each row contains 3 columns.
The middle column contains a double (price), and reads like this.
col1      col2      col3
          1.0031
          1.0032
          1.0033
          1.0034
          1.0035
I then have a constantly updating array of about 10-20 prices that gets updated every 20ms.
I'm currently iterating over that array and checking it against the 50,000 rows to find the row each price belongs to, then inserting it.
Then on the next update, I'm clearing those columns and repeating.
This is extremely costly though: on each update I have to iterate over 20 prices, and each of those then iterates over 50,000 rows to find the row it belongs to.
There's gotta be a better way to do this...
I really want to be able to just insert the price at a certain row, based on the price (so each price is mapped to an index).
if price = 1.0035 insert into row X
Instead I have to do something like
If the price is one of the 50,000 values, find that value's index and insert there.
Any ideas on the best way to achieve this?
Hashtable? Quadtree for localized search? Anything faster, because the way I'm doing it is far too slow for the application's needs.
It sounds like you could let your TableModel manage a SortedMap, such as TreeMap<Double, …>, which "provides guaranteed log(n) time cost for the containsKey, get, put and remove operations." This related example manages a Map<String, String>.
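Here's a minimal sketch of that idea, keeping a TreeMap from price to row index alongside the TableModel. The class and method names are mine, not from the question; note that exact double keys only match if the incoming prices are produced the same way as the table's values, otherwise scale them to integer ticks first:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PriceRowIndex {
    private final NavigableMap<Double, Integer> priceToRow = new TreeMap<>();

    // Call once while building the 50,000-row model: remember which row holds each price.
    public void register(double price, int rowIndex) {
        priceToRow.put(price, rowIndex);
    }

    // O(log n) instead of a 50,000-row linear search per incoming price.
    public int rowFor(double price) {
        Integer row = priceToRow.get(price);
        return row != null ? row : -1;
    }

    // If an update can fall between the table's prices, take the nearest row at or below it.
    public int nearestRowAtOrBelow(double price) {
        Map.Entry<Double, Integer> e = priceToRow.floorEntry(price);
        return e != null ? e.getValue() : -1;
    }
}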
A tree seems like the most logical data structure to me; however, if your values are in a known range you could have an index that corresponds to each possible price, with a flag to show whether the price is present. Your searches and updates would then be O(1) per entry, the drawback being an increased memory footprint. In essence this is a hashtable, although your hashing function could be very simple.
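Here's a small sketch of that direct-index variant, assuming the prices sit on a fixed tick grid; the minimum price and tick size below are assumptions, not taken from the question:

public class TickIndex {
    private static final double MIN_PRICE = 1.0000; // assumed lower bound of the table
    private static final double TICK = 0.0001;      // assumed spacing between consecutive prices

    // O(1): each possible price maps straight to an array slot / row number.
    static int rowFor(double price) {
        return (int) Math.round((price - MIN_PRICE) / TICK);
    }
}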
As for a tree, you would have to do some experimentation (or calculation) to determine the number of values in each node for your needs.
I need to insert many rows from many files, which look like:
Identifier NumberValue
For each row I am looking up whether a row with that Identifier already exists in the database; if it exists, I take its NumberValue, add the NumberValue from the arriving row, and update the database.
I have found that this per-row database lookup (a few million records in total) takes a lot of time.
Does it make sense to create a map and look in this map before inserting into the database?
Thanks.
I would get the value, add one hundred rows, and add one hundred to the NumberValue in a single transaction.
You can add an Index to the column you are searching on if it's not the Primary Key by using
@Table(indexes = { @Index( columnList = ".." ) })
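In context, the annotation goes on the entity class. A minimal sketch with hypothetical entity and column names (JPA 2.1+):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.Table;

@Entity
@Table(name = "measurement",
       indexes = { @Index(name = "idx_measurement_identifier", columnList = "identifier") })
public class Measurement {
    @Id
    private Long id;

    private String identifier;   // the column the existence lookups filter on
    private long numberValue;

    // getters and setters omitted
}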
So basically you're asking if it will be faster to check an in-memory map of your entire database in order to potentially save the transaction cost of looking up whether something exists, and if not, performing an insert into the database?
The answer of course is "maybe". Despite what you don't want to hear, it really is going to depend on the details of the database that you haven't explained to us.
Is it a local one with fast access, or perhaps something that's remotely accessed overseas across slow lines?
Are you running on a hefty machine where the amount of memory used really isn't an issue (else you'll end up swapping)?
Does the database have indexes and primary keys in place that can quickly search and reject entries if they are duplicates?
Are these running on one server or does each server need to update what was saved to the DB to keep this in memory cache concurrent?
In general, the in memory map will make things work faster. But as I'm sure others can point out, there are a lot of issues and exceptions you'll have to deal with. Reading in a million rows in one go is probably faster than reading in a million rows one at a time in order to check if that particular identifier exists, but again, it really depends on the balance between quantity and resources and time available.
How's that for a non-answer...
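To make the in-memory map idea concrete anyway, here's a rough sketch assuming plain JDBC and a hypothetical table values_table(identifier, number_value); identifiers that are not yet in the database would still need a separate insert pass, which is left out here:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class BatchAccumulator {
    private final Map<String, Long> totals = new HashMap<>();

    // Called once per row read from the input files: sum values in memory, no DB round trip.
    public void add(String identifier, long numberValue) {
        totals.merge(identifier, numberValue, Long::sum);
    }

    // One batched pass against the database inside a single transaction.
    public void flush(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE values_table SET number_value = number_value + ? WHERE identifier = ?")) {
            for (Map.Entry<String, Long> e : totals.entrySet()) {
                ps.setLong(1, e.getValue());
                ps.setString(2, e.getKey());
                ps.addBatch();
            }
            ps.executeBatch();
        }
        con.commit();
    }
}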
How can I implement several threads with multiple connections (or a shared one), so that a single large table's data can be downloaded quickly?
Actually, in my application I am downloading a table of 12 lakh (1 lakh = 100,000, so 1.2 million) records, which takes at least 4 hours at normal connection speed and even longer on a slow connection.
So there is a need to implement several threads in Java for downloading a single table's data with multiple (or shared) connection objects. But I have no idea how to do this.
How do I position a record pointer in several threads, and then how do I combine the records from all the threads into a single large file?
Thanks in Advance
First of all, it is not advisable to fetch and download such a huge amount of data to the client. If you need the data for display purposes then you don't need more records than fit on your screen; you can paginate the data and fetch one page at a time. If you are fetching it and processing it in memory then you will surely run out of memory on your client.
If you need to do this regardless of that suggestion, then you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (one to many pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. It is again not advisable to have 100 threads with 100 open connections to the DB; this is just an example. Limit the number of threads to some optimal value and also limit the number of records each thread is pulling. You can limit the number of records pulled from the DB on the basis of rownum.
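Here's a rough sketch of that idea with a fixed thread pool, assuming plain JDBC, a database that understands LIMIT/OFFSET (on Oracle you would use ROWNUM ranges instead), and a hypothetical table big_table with an indexed id column; each worker would write its rows to its own chunk file, and the chunks can be concatenated afterwards:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTableDownload {
    private static final String URL = "jdbc:postgresql://localhost/mydb"; // assumption
    private static final int PAGE_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        int totalRows = 1_200_000; // roughly 12 lakh; in practice run SELECT COUNT(*) first
        int pages = (totalRows + PAGE_SIZE - 1) / PAGE_SIZE;
        ExecutorService pool = Executors.newFixedThreadPool(8); // keep the connection count modest

        for (int page = 0; page < pages; page++) {
            final int offset = page * PAGE_SIZE;
            pool.submit(() -> {
                // Each worker uses its own connection and pulls one page of rows.
                try (Connection con = DriverManager.getConnection(URL, "user", "password");
                     PreparedStatement ps = con.prepareStatement(
                         "SELECT * FROM big_table ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, PAGE_SIZE);
                    ps.setInt(2, offset);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // write the row to this worker's chunk file
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}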
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory", then you could try processing one row (or one batch of rows) at a time, then the next batch, and so on, thus avoiding loading the entire table into memory (but it's still single-threaded, so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
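A condensed sketch of this first way, assuming plain JDBC and a hypothetical table big_table: one reader thread streams rows onto a bounded queue and a few workers drain it.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleReaderManyWorkers {
    private static final String[] POISON = new String[0]; // signals "no more rows"

    public static void main(String[] args) throws Exception {
        BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(10_000);
        int workers = 4;

        // Workers: take rows off the queue and process them.
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    for (String[] row = queue.take(); row != POISON; row = queue.take()) {
                        // process/write the row
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // Single reader: stream the whole table once, in one result set, so no row is read twice.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password");
             Statement st = con.createStatement()) {
            st.setFetchSize(1_000); // the fetch-size hint mentioned above
            try (ResultSet rs = st.executeQuery("SELECT id, payload FROM big_table")) {
                while (rs.next()) {
                    queue.put(new String[] { rs.getString("id"), rs.getString("payload") });
                }
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // let each worker finish
        }
    }
}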
Second way: pagination. Basically every thread somehow knows which chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something limit 10 offset XXX". Then each thread basically processes (in this instance) 10 rows at a time [XXX is a shared variable among threads, incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10, followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not, it's possible that two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries, or lock the table and/or rows appropriately. Something like that to avoid possible collisions (Postgres, for instance, requires the special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the database, you can peel off a fraction of it: if you want 100 batches, something like "where x >= 0.01 and x < 0.02", and so on. (The idea was inspired by how Wikipedia is able to get a "random" page: it assigns each row a random float at insert time.)
The thing you really want to avoid is some kind of change in sort order half way through. For instance, if you don't specify a sort order and just query like select * from table_name limit 10 offset XXX from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns rows half way through [for instance, if new data is added], meaning you may skip rows or the like.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also further processing that data in multiple threads. Such approaches do not always call for all the data to be accumulated in memory; it can be processed in groups and/or sliding windows, and you only need to either accumulate a result or pass the data further on (to other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw text, this could be a random-sized cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied to your query to allow paging. This could be something like:
Driver program: split my data into 4 parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.