Table like data structure for query handling and mathematical operations - java

I want to have a table like representation of data with multiple columns. e.g. consider following sample:
---------------------------------------------------------------
col1 col2 col3 col4 col5(numeric) col6(numeric)
---------------------------------------------------------------
val01 val02 val03 val04 05 06
val11 val12 val13 val14 15 16
val21 val22 val23 val24 25 26
val31 val32 val33 val34 35 36
.
.
.
---------------------------------------------------------------
I'd like to query on this table by a value in given col e.g. search for value val32 in column col2 which should return me all rows that could match this query in the same tabular format.
for some columns like say col5 and col6, I'd like to perform mathematical operations/queries like getMax(), getMin(), getSum(), divideAll() etc...
For such requirement can anybody suggest any type of data structure that could best solve my purpose? Any one data structure or combination of them, Considering efficient operations (like mathematical examples above), and querying??
Let me know if anybody need more information.
Edit: Additional requirement
This should be efficient enough to handle hundreds of millions of rows and also easy and efficient to persist.

What you need is a three-part approach:
A Row class that contains fields for each column
A List<Row> to store the rows and provide sequential access
One or more Map<String,Row> or Map<Integer,Row> to provide fast lookup of the rows by various column values. If the column values are not unique then you need a MultiMap<...> implementation (there are several available on the Internet) to allow multiple values for a given key.
The Row objects are first placed in the list, and then you build the index(es) after you have loaded all the rows.

I think below should help:
Map<String,List<Object>>
Search "val32" in "col2", search(cal2,val32):
get the list of the objects associated with cal2(map.get("cal2"),and iterate over them to find if this value exists or not.
getSum(String columnName):
Again just get the list, iterate over it add these values. Return the final sum.
Since you are adding List of Objects, you might want to throw ClassCasteException from these APIs.

Finally I planned to use Mongo Database instead of going through all basic and complicated implementations..
I hope this will solve my problem. Or there is any other db better that this in terms of speed, storage, and availability of required operations (as mentioned in question)?

Related

Most efficient way to compare two datasets and find results

If I have two data sets which come from SQL tables which appear like this. Where table A contains 3 possible values for a given item and Table B containts a full path to a file name,
I have two data sets which come from SQL tables which appear like this.
TABLE A:
Column1 Column2 Column3
Value SecondValue ThirdValue
Value2 SecondValue2 ThirdValue2
Value3 SecondValue3 ThirdValue3
Table B:
Column1
PathToFile1\value.txt
PathToFile2\SecondValue2_ThirdValue.txt
PathToFile3\ThirdValue3_Value3.txt
I can extract any of the tables/columns to text, and I will use Java to find the full path (Table B) which contains any combination of the values in a row from (Table A).
Table B can have values such as c:\directory\file.txt, c:\directory\directory2\filename.txt or c:\filename.txt
What is the most efficient way to search for the paths, given the filename?
I have two ideas from coworkers, but I am not sure if they are the optimal solution.
1.Store the filename and path parsed from Table B in a hash map and then look up the paths using the values from A as the key. Doing this for each column of A.
2.Sort both alphabetically and do a binary-search using alphabetic order.
CLARIFICATION:
The path to the file in Table B can contain any one of the values from the columns in Table A. That is how they relate. The output has to run eventually in Java and I wanted to explore the options in Java, knowing SQL would be faster for relating the data. Also added some info to the table section. Please let me know if more info is needed.
I found this to help along with my answer, although not a specific answer to my question. I think using the info in this article can lead to the optimal practice.
http://www.javacodegeeks.com/2010/08/java-best-practices-vector-arraylist.html

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'
Logically, a DB query from a Java code will be more expensive than a loop within the code because querying the DB involves several steps such as connecting to DB, creating the SQL query, firing the query and getting the results back.
Besides, something can go wrong between firing the first and second query.
With an optimized single query and looping with the code, you can save a lot of time than firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In future if there are more types added to your table, you need not fire an additional query to retrieve it.
It is heavily dependant on the context. If each list is really huge, I would let the database to the hard part of the job with 2 queries. At the opposite, in a web application using a farm of application servers and a central database I would use one single query.
For the general use case, IMHO, I will save database resource because it is a current point of congestion and use only only query.
The only objective argument I can find is that the splitting of the list occurs in memory with a hyper simple algorithm and in a single JVM, where each query requires a bit of initialization and may involve disk access or loading of index pages.
In general, one query performs better.
Also, with issuing two queries you can potentially get inconsistent results (which may be fixed with higher transaction isolation level though ).
In any case I believe you still need to iterate through resultset (either directly or by using framework's methods that return collections).
From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.
It also depends on your dataset volume, for instance if you have a large data set, doing a select * without any condition might take some time, but if you have an index on your 'TYPE' column, then adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, then doing a select * followed with your logic in the java code is a better approach
There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - java must walk the query results gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.
You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an order by keyword:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just do this query, and then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

Compare very large tables in java

I am not able to find any satisfying solution so asking here.
I need to compare data of two large tables(~50M) with the same schema definition in JAVA.
I can not use order by clause while getting the resultset object and records might be not in order in both of the tables.
Can anyone help me what can be the right way to do it?
You could extract the data of the first DB table into a text file, and create a while loop on the resultSet for the 2nd table. As you iterate through the ResultSet do a search/verify against the text file. This solution works if memory is of concern to you.
If not, then just use a HashMap to hold the data for the first table and do the while loop and look up the records of the 2nd table from the HashMap.
This really depends on what you mean by 'compare'? Are you trying to see if they both contain the exact same data? Find rows in one not in the other? Find rows with the same primary keys that have differing values?
Also, why do you have to do this in Java? Regardless of what exactly you are trying to do, it's probably easier to do with SQL.
In Java, you'll want to create an class that represents the primary key for the tables, and a second classthat represents the rest of the data, which also includes the primary key class. If you only have a single column as the primary key, then this is easier.
We'll call P the primary key class, and D the rest.
Map map = new HashMap();
Select all of the rows from the first table, and insert them into the hash map.
Query all of the rows in the second table.
For each row, create a P object.
Use that to see what data was in the first table with the same Key.
Now you know if both tables contained the same row, and you can compare the non-key values from both both.
Like I said, this is much much easier to do in straight SQL.
You basically do a full outer join between the two tables. How exactly that join looks depends on exactly what you are trying to do.

Can astyanax return ordered column names?

Using com.netflix.astyanax, I add entries for a given row as follows:
final ColumnListMutation<String> columnList = m.withRow(columnFamily, key);
columnList.putEmptyColumn(columnName);
Later I retrieve all my columns with:
final OperationResult<ColumnList<String>> operationResult = keyspace
.prepareQuery(columnFamily).getKey(key).execute();
operationResult.getResult().getColumnNames();
The following correctly return all the columns I have added but the columns are not ordered accordingly to when they were entered in the database. Since each column has a timestamp associated to it, there ought to be a way to do exactly this but I don't see it. Is there?
Note: If there isn't, I can always change the code above to:
columnList.putColumn(ip,new Date());
and then retrieve the column values, order them accordingly, but that seems cumbersome, inefficient, and silly since each column already has a timestamp.
I know from PlayOrm that if you do column Slices, it returns those in order. In fact, playorm uses that do enable S-SQL in partitions and basically batches the column slicing which comes back in order or reverse order depending on how requested. You may want to do a column slice from 0 to MAXLONG.
I am not sure about getting the row though. I haven't tried that.
oh, and PlayOrm is just a mapping layer on top of astyanax though not really relational and more noSql'ish really as demonstrated by it's patterns pages
http://buffalosw.com/wiki/Patterns-Page/
Cassandra will never order your columns in "insertion order".
Columns are always ordered lowest first. It also depends on how cassandra interprets your column names. You can define the interpretation with the comparator you set when defining your column family.
From what you gave it looks you use String timestamp values. If you simply serialized your timestamps as e.g. "123141" and "231" be aware that with an UTF8Type comparator "231">"123131".
Better approach: Use Time-based UUIDs as column names, as many examples for Time-series data in Cassandra propose. Then you can use the UUIDType comparator.
CREATE COLUMN FAMILY timeseries_data
WITH comparator = UUIDType
AND key_validation_class=UTF8Type;

insert data into a lot of columns in java JDBC

I have a table with 50 columns and I want to insert all items in a HashMap variable into it (HashMap keys and table column names are the same).
How can I do that without writing 50 lines of code?
Get the key set for the HashMap. Iterate that key set to build a String containing your insert statement. Use the resulting String to create a PreparedStatement. Then iterate that key set again to set parameters by name using the Objects you retrieve from the HashMap.
You might have to write a few extra lines of special-case code if any of your values are of a Class that the JDBC driver isn't sure how to map.
I'd suggest you bite the dust and simply write a method that will do the dirty work for you containing 50 lines of parameter setting code. This isn't so bad, and you only have to write it once. I hope you aren't that lazy ;-)
And by the way, isn't 50 columns in a table a bit much? Perhaps a normalization process could help and lower complexity of your database and the code that will manipulate it.
Another way to go is to use an ORM like Hibernate, or a more lightweight approach like Spring JDBC template.
Call map.keySet() to get the name of all columns.
Create an INSERT statement by iterating the key set.
The column is from an item (a key) in the key set.
The data is from map.get(key).

Categories