Most efficient way to compare two datasets and find results


I have two data sets which come from SQL tables and look like this. Table A contains three possible values for a given item, and Table B contains a full path to a file name.
TABLE A:
Column1 Column2 Column3
Value SecondValue ThirdValue
Value2 SecondValue2 ThirdValue2
Value3 SecondValue3 ThirdValue3
Table B:
Column1
PathToFile1\value.txt
PathToFile2\SecondValue2_ThirdValue.txt
PathToFile3\ThirdValue3_Value3.txt
I can extract any of the tables or columns to text, and I will use Java to find the full paths in Table B that contain any combination of the values in a row of Table A.
Table B can have values such as c:\directory\file.txt, c:\directory\directory2\filename.txt or c:\filename.txt
What is the most efficient way to search for the paths, given the filename?
I have two ideas from coworkers, but I am not sure if they are the optimal solution.
1. Store the filename and path parsed from Table B in a hash map, then look up the paths using the values from A as the key. Do this for each column of A.
2. Sort both alphabetically and do a binary search using alphabetic order.
CLARIFICATION:
The path to the file in Table B can contain any one of the values from the columns in Table A. That is how they relate. The output has to run eventually in Java and I wanted to explore the options in Java, knowing SQL would be faster for relating the data. Also added some info to the table section. Please let me know if more info is needed.

I found this article helpful, although it is not a specific answer to my question. I think the information in it can lead to the optimal approach.
http://www.javacodegeeks.com/2010/08/java-best-practices-vector-arraylist.html
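As a rough illustration of the first idea from the question (the hash map), here is a minimal Java sketch. It assumes that paths use backslashes, that the values inside a file name are separated by underscores, and that matching should ignore case (the sample data has "Value" in Table A but "value.txt" in Table B). The names PathLookup, indexPaths and pathsForRow are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PathLookup {

    // Index every file-name token (lower-cased) to the paths that contain it.
    static Map<String, List<String>> indexPaths(List<String> paths) {
        Map<String, List<String>> index = new HashMap<>();
        for (String path : paths) {
            // Keep only the part after the last backslash, e.g. "SecondValue2_ThirdValue.txt".
            String fileName = path.substring(path.lastIndexOf('\\') + 1);
            int dot = fileName.lastIndexOf('.');
            String baseName = dot >= 0 ? fileName.substring(0, dot) : fileName;
            // Assumption: tokens inside the file name are separated by underscores.
            for (String token : baseName.split("_")) {
                index.computeIfAbsent(token.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                     .add(path);
            }
        }
        return index;
    }

    // Collect the paths that match any column value of one Table A row.
    static Set<String> pathsForRow(Map<String, List<String>> index, String... rowValues) {
        Set<String> matches = new LinkedHashSet<>();
        for (String value : rowValues) {
            List<String> hits = index.get(value.toLowerCase(Locale.ROOT));
            if (hits != null) {
                matches.addAll(hits);
            }
        }
        return matches;
    }
}
```

Building the index is one pass over Table B, and each Table A row then costs one hash lookup per column. The sort plus binary search idea costs O(log n) per lookup instead and still needs the file names parsed out of the paths first.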

Related

How to return non-matching rows in Pentaho Data Integration (Kettle)?

I am looking for a way to perform an SSIS-style lookup in Pentaho Data Integration.
I'll try to explain with an example:
I have two tables, A and B.
Here is the data in table A:
1
2
3
4
5
Here is the data in table B:
3
4
5
6
7
After my process:
All rows in A and not in B ==> will be inserted into B
All rows in B and not in A ==> will be deleted from B
So here is my final table B:
3
4
5
1
2
Can someone help me, please?

There is indeed a step that does this, but it doesn't do it alone. It's the Merge rows(diff) step and it has some requirements. In your case, A is the "compare" table and B is the "reference" table.
First of all, both inputs (rows from A and B in your case, Dev and Prod in mine) need to be sorted by a key value. In the step you specify the key fields to match on, and then the value fields to compare. The step adds a field to the output (by default called 'flagfield'). After comparing each row, this field is given one of four values: "new", "changed", "deleted", or "identical". Note in my example below I have explicit sort steps. That's because the sorting scheme of my database is not compatible with PDI's, and for this step to work, your data must be in PDI's sort order. You may not need these.
You can follow this with a Synchronize after merge step to apply the identified changes. In this step you specify the flagfield and the values that correspond to insert, update, and delete. FYI these are specified on the "Advanced" tab, and they must be filled out for the step to work.
For a very small table like your example, I would favor just a truncate and full load with a Table output step, but if your tables are large and the number of changes relatively small (<= ~25%) and replication is not available, this step is usually the way to go.
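To show why both inputs must be sorted by the key, here is a small Java sketch of the general sorted-merge comparison idea. It is not PDI's implementation, only an illustration under the assumption that rows consist of a single integer key, so the "changed" flag never occurs. The class name SortedDiff and its diff method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SortedDiff {

    // Both lists must already be sorted ascending by key.
    // "reference" plays the role of table B, "compare" the role of table A.
    static List<String> diff(List<Integer> reference, List<Integer> compare) {
        List<String> flagged = new ArrayList<>();
        int r = 0;
        int c = 0;
        while (r < reference.size() || c < compare.size()) {
            if (r == reference.size()) {
                // Reference is exhausted: everything left in compare is new.
                flagged.add(compare.get(c++) + " new");
            } else if (c == compare.size()) {
                // Compare is exhausted: everything left in reference was deleted.
                flagged.add(reference.get(r++) + " deleted");
            } else {
                int cmp = reference.get(r).compareTo(compare.get(c));
                if (cmp == 0) {
                    flagged.add(reference.get(r) + " identical");
                    r++;
                    c++;
                } else if (cmp < 0) {
                    flagged.add(reference.get(r++) + " deleted");
                } else {
                    flagged.add(compare.get(c++) + " new");
                }
            }
        }
        return flagged;
    }
}
```

With the question's data (reference B = 3..7, compare A = 1..5), keys 1 and 2 are flagged "new", 3 to 5 "identical", and 6 and 7 "deleted". Because each list is only walked forward once, an unsorted input would produce wrong flags, which is why the sort order matters.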

Pentaho has no single step that does this directly, but there are several ways to do it.
=> Write SQL to get the result you need. SQL will also execute faster.
=> Use a filter step.
Thank you.

Mapping partial column values using Toplink

Suppose I have a list of IDs as follows:
EmployeeID
-------
ABCD
AECD
ABDF
ACDF
ACDE
I need to read the distinct values from a list of codes, selecting only the first two characters of the column.
In other words, it's similar to using the following query:
SELECT DISTINCT LEFT (EmployeeID,2) FROM TABLE1
My question is how to map such a field in TopLink.
Note: I have created a class for the EmployeeID, but I don't know how to map a partial field.

OK. After looking at many workarounds, I seem to have found a better-suited solution.
I created an object for this particular scenario (the POJO has only the field holding the two-character ID, plus its getter and setter methods).
During the mapping, I mapped that field to the DB column in question (EmployeeID in the table described above).
Then I selected "Custom Queries" for the object and entered the following query on the "Read all" tab.
SELECT DISTINCT LEFT (EmployeeID,2) AS EmployeeID FROM TABLE1
All read-all operations on the object now return the list of distinct first two characters of the IDs.
I welcome anyone's opinion on this.

Compare very large tables in java

I have not been able to find a satisfying solution, so I am asking here.
I need to compare the data of two large tables (~50M rows) with the same schema definition in Java.
I cannot use an ORDER BY clause when getting the ResultSet object, and the records might not be in the same order in both tables.
Can anyone suggest the right way to do this?

You could extract the data of the first DB table into a text file, and create a while loop on the resultSet for the 2nd table. As you iterate through the ResultSet do a search/verify against the text file. This solution works if memory is of concern to you.
If not, then just use a HashMap to hold the data for the first table and do the while loop and look up the records of the 2nd table from the HashMap.

This really depends on what you mean by 'compare'? Are you trying to see if they both contain the exact same data? Find rows in one not in the other? Find rows with the same primary keys that have differing values?
Also, why do you have to do this in Java? Regardless of what exactly you are trying to do, it's probably easier to do with SQL.
In Java, you'll want to create a class that represents the primary key for the tables, and a second class that represents the rest of the data and also includes the primary key class. If you only have a single column as the primary key, then this is easier.
We'll call P the primary key class, and D the rest. P must implement equals() and hashCode() so that it works as a hash map key.
Map<P, D> map = new HashMap<>();
Select all of the rows from the first table, and insert them into the hash map.
Query all of the rows in the second table.
For each row, create a P object.
Use that to see what data was in the first table with the same key.
Now you know if both tables contained the same row, and you can compare the non-key values from both.
Like I said, this is much much easier to do in straight SQL.
You basically do a full outer join between the two tables. How exactly that join looks depends on exactly what you are trying to do.
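The hash map steps described above can be sketched in Java roughly as follows. To keep it short, this sketch assumes a single integer primary key instead of a P class, models the non-key columns as a List<String> instead of a D class, and takes the second table as an in-memory map, whereas in practice you would iterate its ResultSet. It also assumes the first table fits in memory, which may not hold for ~50M rows. TableCompare and compare are names invented for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TableCompare {

    // Each map is: primary key -> non-key column values of that row.
    // Returns key -> description for every row that is not identical in both tables.
    static Map<Integer, String> compare(Map<Integer, List<String>> firstTable,
                                        Map<Integer, List<String>> secondTable) {
        // Copy so matched keys can be removed; whatever remains exists only in the first table.
        Map<Integer, List<String>> unmatched = new HashMap<>(firstTable);
        Map<Integer, String> differences = new TreeMap<>();

        for (Map.Entry<Integer, List<String>> row : secondTable.entrySet()) {
            List<String> firstValues = unmatched.remove(row.getKey());
            if (firstValues == null) {
                differences.put(row.getKey(), "only in second");
            } else if (!firstValues.equals(row.getValue())) {
                differences.put(row.getKey(), "different values");
            }
        }
        for (Integer key : unmatched.keySet()) {
            differences.put(key, "only in first");
        }
        return differences;
    }
}
```

The three outcomes correspond to what a full outer join on the primary key would show: rows missing on one side, rows missing on the other, and rows present in both with differing values.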

Table like data structure for query handling and mathematical operations

I want to have a table like representation of data with multiple columns. e.g. consider following sample:
---------------------------------------------------------------
col1 col2 col3 col4 col5(numeric) col6(numeric)
---------------------------------------------------------------
val01 val02 val03 val04 05 06
val11 val12 val13 val14 15 16
val21 val22 val23 val24 25 26
val31 val32 val33 val34 35 36
.
.
.
---------------------------------------------------------------
I'd like to query this table by a value in a given column, e.g. search for the value val32 in column col2, which should return all rows that match the query in the same tabular format.
For some columns, such as col5 and col6, I'd like to perform mathematical operations/queries like getMax(), getMin(), getSum(), divideAll() etc.
Can anybody suggest a data structure, or a combination of data structures, that would best serve this purpose, considering efficient operations (like the mathematical examples above) and querying?
Let me know if anybody needs more information.
Edit: Additional requirement
This should be efficient enough to handle hundreds of millions of rows and also easy and efficient to persist.

What you need is a three-part approach:
A Row class that contains fields for each column
A List<Row> to store the rows and provide sequential access
One or more Map<String,Row> or Map<Integer,Row> to provide fast lookup of the rows by various column values. If the column values are not unique then you need a MultiMap<...> implementation (there are several available on the Internet) to allow multiple values for a given key.
The Row objects are first placed in the list, and then you build the index(es) after you have loaded all the rows.
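A minimal Java sketch of that three-part approach might look like this. It assumes only three of the columns (col1, col2 and the numeric col5) to stay short, and it uses a Map<String, List<Row>> in place of a MultiMap so that non-unique col2 values are handled with the standard library alone. RowTable, Row, buildIndex, findByCol2 and sumCol5 are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowTable {

    // One field per column; only three columns are modelled in this sketch.
    static class Row {
        final String col1;
        final String col2;
        final int col5;

        Row(String col1, String col2, int col5) {
            this.col1 = col1;
            this.col2 = col2;
            this.col5 = col5;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    // Index on col2; a list per key because col2 values need not be unique.
    private final Map<String, List<Row>> col2Index = new HashMap<>();

    void add(Row row) {
        rows.add(row);
    }

    // Build the index once all rows have been loaded.
    void buildIndex() {
        col2Index.clear();
        for (Row row : rows) {
            col2Index.computeIfAbsent(row.col2, k -> new ArrayList<>()).add(row);
        }
    }

    // Fast lookup by col2 value; returns an empty list when nothing matches.
    List<Row> findByCol2(String value) {
        return col2Index.getOrDefault(value, Collections.<Row>emptyList());
    }

    // Sequential pass over the list for a numeric aggregate.
    long sumCol5() {
        long sum = 0;
        for (Row row : rows) {
            sum += row.col5;
        }
        return sum;
    }
}
```

Lookups go through the index while aggregates such as a sum, minimum or maximum walk the list. At hundreds of millions of rows this in-memory layout is unlikely to be enough on its own.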

I think below should help:
Map<String,List<Object>>
Search "val32" in "col2", search(col2, val32):
Get the list of objects associated with col2 (map.get("col2")) and iterate over them to find whether this value exists or not.
getSum(String columnName):
Again just get the list, iterate over it, and add up the values. Return the final sum.
Since you are storing Lists of Objects, you might want to throw ClassCastException from these APIs.
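For illustration, a column-oriented store along these lines could be sketched in Java as below. The class ColumnStore and the addColumn method are assumptions for this example. search returns matching row indexes rather than whole rows, and getSum simply lets the ClassCastException propagate when a column holds non-numeric values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ColumnStore {

    // Column name -> values of that column, in row order.
    private final Map<String, List<Object>> columns = new HashMap<>();

    void addColumn(String name, List<Object> values) {
        columns.put(name, values);
    }

    // Returns the row indexes whose value in the given column equals the target.
    List<Integer> search(String columnName, Object target) {
        List<Integer> hits = new ArrayList<>();
        List<Object> values = columns.get(columnName);
        if (values == null) {
            return hits;
        }
        for (int i = 0; i < values.size(); i++) {
            if (target.equals(values.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Sums a numeric column; throws ClassCastException if a value is not a Number.
    double getSum(String columnName) {
        double sum = 0;
        for (Object value : columns.getOrDefault(columnName, Collections.<Object>emptyList())) {
            sum += ((Number) value).doubleValue();
        }
        return sum;
    }
}
```

Note that search is a linear scan of one column, so it is simple but not fast for very large tables. The row indexes it returns can be used to read the same position from the other columns.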

In the end I decided to use MongoDB instead of going through all the basic and complicated implementations.
I hope this will solve my problem. Or is there any other database that is better than this in terms of speed, storage, and availability of the required operations (as mentioned in the question)?

Duplication Algorithm in Java

I am looking for a duplicate matching algorithm in Java. My scenario is as follows.
I have two tables. Table 1 contains 25,000 string records in one column, and similarly Table 2 contains 20,000 string records.
I want to find the records that are duplicated between Table 1 and Table 2.
Records are like this format for example:
Table 1
Jhon,voltra
Bruce willis
Table 2
voltra jhon
bruce, willis
I am looking for an algorithm which can find this type of duplicate string match between these two tables, which are in two different files.
Can someone suggest two or more algorithms which can perform such matching in Java?

Read the two files into a normalised form so they can be compared. Use Set of these entries and retainAll() to find the intersection of these two sets. These are the duplicates.
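As a sketch of this approach in Java: the normalisation rule below (lower-case, split on commas and whitespace, sort the words) is an assumption chosen to fit the sample records in the question, and the names DuplicateFinder, normalise and duplicates are made up for the example. The records are passed in as lists, so reading the two files is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class DuplicateFinder {

    // Assumed normal form: lower-case words, sorted, joined by single spaces.
    // "Jhon,voltra" and "voltra jhon" both become "jhon voltra".
    static String normalise(String record) {
        String[] words = record.toLowerCase(Locale.ROOT).trim().split("[,\\s]+");
        Arrays.sort(words);
        return String.join(" ", words).trim();
    }

    // Returns the normalised records that occur in both tables.
    static Set<String> duplicates(List<String> table1, List<String> table2) {
        Set<String> first = new TreeSet<>();
        for (String record : table1) {
            first.add(normalise(record));
        }
        Set<String> second = new HashSet<>();
        for (String record : table2) {
            second.add(normalise(record));
        }
        // retainAll keeps only the entries that are also in the second set.
        first.retainAll(second);
        return first;
    }
}
```

With 25,000 and 20,000 records both sets fit comfortably in memory, and the intersection is roughly linear in the number of records.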

You can use a Map<String, Integer> (e.g. HashMap) and read the files line by line and insert the strings into the map, incrementing the value each time you find an existing entry.
You can then search through your map and find all entries with a count > 1.
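A short Java sketch of the counting approach follows. It assumes the lines have already been brought into a comparable form (the records in the question differ in case, word order and punctuation, so raw lines would not match), and the names DuplicateCounter, countLines and repeated are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateCounter {

    // Count how often each line occurs across all the lines passed in.
    static Map<String, Integer> countLines(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Insert with count 1, or add 1 to the existing count.
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    // Return, in sorted order, every entry that was seen more than once.
    static List<String> repeated(Map<String, Integer> counts) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

One thing to keep in mind with this design is that a count above 1 also flags a record repeated within a single file, not only records shared between the two files.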
