Duplication Algorithm in Java

I am looking for a duplicate-matching algorithm in Java. My scenario is as follows:
I have two tables. Table 1 contains 25,000 string records in one column, and Table 2 similarly contains 20,000 string records.
I want to find the records that appear in both Table 1 and Table 2.
Records are like this format for example:
Table 1
Jhon,voltra
Bruce willis
Table 2
voltra jhon
bruce, willis
I am looking for an algorithm that can find this kind of duplicate string matching between these two tables, stored in two different files.
Can someone suggest two or more algorithms that can perform such queries in Java?

Read the two files into a normalised form so the entries can be compared. Put the entries of each file into a Set and use retainAll() to find the intersection of the two sets. The intersection is the duplicates.
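A minimal sketch of this approach, assuming the sample data above: normalisation here lower-cases each record, splits on whitespace and punctuation, and sorts the tokens, so "Jhon,voltra" and "voltra jhon" map to the same key. The exact normalisation rules would depend on your data.

```java
import java.util.*;
import java.util.stream.*;

public class DuplicateFinder {
    // Normalise a record: lower-case, split on separators, sort the tokens,
    // so "Jhon,voltra" and "voltra jhon" produce the same key.
    static String normalise(String line) {
        return Arrays.stream(line.toLowerCase().split("[\\s,;]+"))
                .filter(s -> !s.isEmpty())
                .sorted()
                .collect(Collectors.joining(" "));
    }

    static Set<String> duplicates(List<String> table1, List<String> table2) {
        Set<String> set1 = table1.stream().map(DuplicateFinder::normalise)
                .collect(Collectors.toCollection(HashSet::new));
        Set<String> set2 = table2.stream().map(DuplicateFinder::normalise)
                .collect(Collectors.toSet());
        set1.retainAll(set2);   // intersection of the two sets = duplicates
        return set1;
    }

    public static void main(String[] args) {
        List<String> t1 = List.of("Jhon,voltra", "Bruce willis");
        List<String> t2 = List.of("voltra jhon", "bruce, willis");
        System.out.println(duplicates(t1, t2)); // both records match
    }
}
```

With ~25,000 records per table this runs comfortably in memory; both set construction and retainAll() are roughly linear in the number of records.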

You can use a Map<String, Integer> (e.g. HashMap) and read the files line by line and insert the strings into the map, incrementing the value each time you find an existing entry.
You can then search through your map and find all entries with a count > 1.
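The counting approach above can be sketched like this; Map.merge() does the insert-or-increment in one call. How you normalise each line before counting is up to you; here it is just trim and lower-case.

```java
import java.util.*;

public class CountDuplicates {
    // Count occurrences of each normalised line across both files.
    static Map<String, Integer> countLines(List<String> allLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : allLines) {
            // merge() inserts 1 the first time and increments afterwards
            counts.merge(line.trim().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "Alpha", "gamma");
        counts: countLines(lines).forEach((line, n) -> {
            if (n > 1) System.out.println("duplicate: " + line);
        });
    }
}
```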

Related

Data Structure choices based on requirements

I'm completely new to programming and to java in particular and I am trying to determine which data structure to use for a specific situation. Since I'm not familiar with Data Structures in general, I have no idea what structure does what and what the limitations are with each.
So I have a CSV file with a bunch of items in it, let's say characters and matching numbers. My list looks like this:
A,1,B,2,B,3,C,4,D,5,E,6,E,7,E,8,E,9,F,10......etc.
I need to be able to read this in, and then:
1) display just the letters or just the numbers, sorted alphabetically or numerically
2) search to see if an element is contained in either list
3) search to see if an element pair (for example A-1 or B-10) is contained in the matching list
Think of it as an excel spreadsheet with two columns. I need to be able to sort by either column while maintaining the relationship and I need to be able to do an IF column A = some variable AND the corresponding column B contains some other variable, then do such and such.
I need to also be able to insert a pair into the original list at any location. So insert A into list 1 and insert 10 into list 2 but make sure they retain the relationship A-10.
I hope this makes sense, and thank you for any help! I am working on purchasing a Data Structures in Java book to work through, and trying to sign up for the class at our local college, but it's only offered every spring...
You could use two sorted Maps such as TreeMap.
One would map characters to numbers (Map<Character, Number> or something similar). The other would perform the reverse mapping (Map<Number, Character>).
Let's look at your requirements:
1) display just the letters or just the numbers sorted alphabetically or numerically
Just iterate over one of the maps. The iteration will be ordered.
2)search to see if an element is contained in either list.
Just check the corresponding map. Looking for a number? Check the Map whose keys are numbers.
3)search to see if an element pair (for example A - 1 or B-10) is
contained in the matching list.
Just get() the value for A from the Character map, and check whether that value is 10. If so, then A-10 exists. If there's no value, or the value is not 10, then A-10 doesn't exist.
When adding or removing elements you'd need to take care to modify both maps to keep them in sync.
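A small sketch of the two-map idea, with the synchronisation done in one put() method. One caveat worth noting for the sample data above: a plain Map keeps a single value per key, so repeated letters like the B,2 / B,3 pairs would need a Map<Character, Set<Integer>> instead.

```java
import java.util.*;

public class PairedMaps {
    // Two sorted maps holding the same pairs in both directions.
    final NavigableMap<Character, Integer> byLetter = new TreeMap<>();
    final NavigableMap<Integer, Character> byNumber = new TreeMap<>();

    void put(char letter, int number) {  // keep both maps in sync
        byLetter.put(letter, number);
        byNumber.put(number, letter);
    }

    // Requirement 3: does the exact pair letter-number exist?
    boolean containsPair(char letter, int number) {
        Integer n = byLetter.get(letter);
        return n != null && n == number;
    }

    public static void main(String[] args) {
        PairedMaps pairs = new PairedMaps();
        pairs.put('C', 4);
        pairs.put('A', 10);
        // Requirement 1: iteration over a TreeMap is already sorted
        System.out.println(pairs.byLetter.keySet());  // [A, C]
        // Requirement 2: containsKey on the appropriate map
        System.out.println(pairs.byNumber.containsKey(4));  // true
        System.out.println(pairs.containsPair('A', 10));    // true
    }
}
```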

How can I split the string elements into disjoint groups in Java?

The lines are as follows
A1;B1;C1
A2;B2;C2
How can I find the set of unique strings and break it into non-intersecting groups by the following criterion: if two lines have matching non-empty values in one or more columns, they belong to the same group? For example, the lines
1,2,3
4,5,6
1,5,7
belong to one group.
Initially I thought to build three HashSets (one per column) to quickly check whether a string is already in the list of unique values, then add it either to the list of already-grouped rows or to the list of unique rows. But this algorithm has a performance bottleneck: whenever groups have to be merged, you must walk through every group in the list.
On a large amount of data (> 1 million records) with a large number of merges it runs slowly; if there are few merges (around a thousand), it runs quickly. I am stuck at this point and do not know how to optimize this bottleneck, or whether different data structures and algorithms are needed. Can someone tell me which direction to dig? I will be grateful for any thoughts on this matter.
I'd suggest the following approach:
Create a Set<String> ungroupedLines, initially containing all the lines. You'll remove the lines as you assign them to groups.
Build three Map<String, Collection<String>> indexes, as you've suggested, one per column, mapping each value to the lines containing it.
Initialize an empty Collection<Collection<String>> result.
While ungroupedLines is not empty:
Create a new Collection<String> group.
Remove an element, add it to group.
Perform "depth-first search" from that element, using your three maps.
Ignore (skip) any elements that have already been removed from your ungroupedLines.
For the rest, remove them from ungroupedLines and add them to group before recursing on them.
Alternatively, you can use breadth-first search.
Add group to result.
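The steps above can be sketched as follows, using an iterative depth-first search and semicolon-separated lines as in the question. Because membership in ungroupedLines is the visited check, each line is expanded at most once, so the whole pass is roughly linear in the input size.

```java
import java.util.*;

public class LineGrouping {
    // Group lines that share a non-empty value in any column.
    // Column indexes map value -> lines containing that value.
    static Collection<Collection<String>> group(List<String> lines, int columns) {
        List<Map<String, List<String>>> index = new ArrayList<>();
        for (int c = 0; c < columns; c++) index.add(new HashMap<>());
        for (String line : lines) {
            String[] cells = line.split(";", -1);
            for (int c = 0; c < columns; c++) {
                if (!cells[c].isEmpty())
                    index.get(c).computeIfAbsent(cells[c], k -> new ArrayList<>())
                                .add(line);
            }
        }
        Set<String> ungrouped = new LinkedHashSet<>(lines);
        Collection<Collection<String>> result = new ArrayList<>();
        while (!ungrouped.isEmpty()) {
            String seed = ungrouped.iterator().next();
            ungrouped.remove(seed);
            List<String> group = new ArrayList<>(List.of(seed));
            Deque<String> stack = new ArrayDeque<>(List.of(seed));
            while (!stack.isEmpty()) {           // iterative depth-first search
                String[] cells = stack.pop().split(";", -1);
                for (int c = 0; c < columns; c++) {
                    if (cells[c].isEmpty()) continue;
                    for (String next : index.get(c).getOrDefault(cells[c], List.of())) {
                        if (ungrouped.remove(next)) { // skip already-grouped lines
                            group.add(next);
                            stack.push(next);
                        }
                    }
                }
            }
            result.add(group);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1;2;3", "4;5;6", "1;5;7");
        // "1;5;7" links the other two: shares "1" with the first
        // and "5" with the second, so all three form one group.
        System.out.println(group(lines, 3).size()); // 1
    }
}
```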

Most efficient way to compare two datasets and find results

I have two data sets which come from SQL tables and appear like this: Table A contains 3 possible values for a given item, and Table B contains a full path to a file name.
TABLE A:
Column1 Column2 Column3
Value SecondValue ThirdValue
Value2 SecondValue2 ThirdValue2
Value3 SecondValue3 ThirdValue3
Table B:
Column1
PathToFile1\value.txt
PathToFile2\SecondValue2_ThirdValue.txt
PathToFile3\ThirdValue3_Value3.txt
I can extract any of the tables/columns to text, and I will use Java to find the full path (Table B) which contains any combination of the values in a row from Table A.
Table B can have values such as c:\directory\file.txt, c:\directory\directory2\filename.txt or c:\filename.txt
What is the most efficient way to search for the paths, given the filename?
I have two ideas from coworkers, but I am not sure if they are the optimal solution.
1. Store the filename and path parsed from Table B in a hash map, then look up the paths using the values from A as the key, doing this for each column of A.
2. Sort both alphabetically and do a binary search using alphabetical order.
CLARIFICATION:
The path to the file in Table B can contain any one of the values from the columns in Table A. That is how they relate. The output has to run eventually in Java and I wanted to explore the options in Java, knowing SQL would be faster for relating the data. Also added some info to the table section. Please let me know if more info is needed.
I found this article helpful, although it is not a specific answer to my question. I think using the info in it can lead to the optimal approach.
http://www.javacodegeeks.com/2010/08/java-best-practices-vector-arraylist.html
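A sketch of option 1 above: index the Table B paths by the lower-cased file name (minus extension), then probe with each value from Table A. Lookups are then O(1) on average, versus O(log n) for the sort-and-binary-search option. File names like "SecondValue2_ThirdValue" combine several values, so depending on your matching rules you might also index each underscore-separated part separately; the helper below indexes the whole name only.

```java
import java.util.*;

public class PathLookup {
    // Build a map from file name (without extension, lower-cased) to full path.
    static Map<String, String> indexPaths(List<String> paths) {
        Map<String, String> byName = new HashMap<>();
        for (String path : paths) {
            String file = path.substring(path.lastIndexOf('\\') + 1);
            int dot = file.lastIndexOf('.');
            String name = (dot < 0 ? file : file.substring(0, dot)).toLowerCase();
            byName.put(name, path);
        }
        return byName;
    }

    public static void main(String[] args) {
        Map<String, String> index = indexPaths(List.of(
                "PathToFile1\\value.txt",
                "PathToFile2\\SecondValue2_ThirdValue.txt"));
        // Probe with a value from a Table A row
        System.out.println(index.get("value")); // PathToFile1\value.txt
    }
}
```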

What is the best way to match over 10000 different elements in database?

Ok here's my scenario:
Programming language: Java
I have a MySQL database which has around 100,000,000 entries.
I have a list of values in memory, say valueList, with around 10,000 entries.
I want to iterate through valueList and check whether each value in this list, has a match in the database.
This means I have to make at least 10,000 database calls, which is highly inefficient for my application.
The other way would be to load the entire database into memory once and do the comparison in memory. This is fast but needs a huge amount of memory.
Could you guys suggest a better approach for this problem?
EDIT :
Suppose valueList consists of values like :
{"New","York","Brazil","Detroit"}
From the database, I'll have a match for Brazil and Detroit, but not for New and York, even though New York would have matched. So the next step is: for any remaining unmatched values, I combine them and check whether they match now. In this case, I combine New and York and then find the match.
In the approach I was following before (one database call per value), this was possible. But with the approach of creating a temp table, this won't be possible.
You could insert the 10k records into a temporary table with a single insert like this:
insert into tmp_table (id_col)
values (1),
(3),
...
(7);
Then join the two tables to get the desired results.
I don't know your table structure, but it could be like this
select s.*
from some_table s
inner join tmp_table t on t.id_col = s.id
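From the Java side, the multi-row INSERT above can be generated with placeholders and bound through a PreparedStatement, so the whole list reaches MySQL in one statement instead of 10,000 calls. The table and column names here mirror the hypothetical ones in the SQL above.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TempTableSql {
    // Build a single multi-row INSERT with one (?) placeholder per value.
    static String multiRowInsert(String table, String column, int rows) {
        String placeholders = IntStream.range(0, rows)
                .mapToObj(i -> "(?)")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + column + ") VALUES " + placeholders;
    }

    public static void main(String[] args) {
        System.out.println(multiRowInsert("tmp_table", "id_col", 3));
        // INSERT INTO tmp_table (id_col) VALUES (?), (?), (?)
    }
}
```

You would then bind each value with setString()/setInt() on the PreparedStatement (or use addBatch()/executeBatch()), run the join query, and read the matches from the ResultSet. MySQL caps statement size via max_allowed_packet, so very large lists may need to be split into chunks.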

java solution for hashing lines that contain variables in .csv

I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that would process my file in an incremental manner without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from consideration, and then produce an output grouping those duplicate records. I would add an additional value at the end with a group number, and output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, with the hash calculated from the set of columns I provide as input.
Any ideas?
Ok, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan "Finding duplicates in a data stream".
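As a simpler in-memory illustration of the hashing idea from the question (not the streaming algorithm from the paper): build a key from only the columns you care about, so rows that agree on those columns collide into the same group. In a single incremental pass you would store only the key (or its hash) and the line number per row, not the full lines.

```java
import java.util.*;

public class RowHasher {
    // Build a grouping key from a chosen subset of columns; rows that agree
    // on those columns get the same key. Excluded columns are simply ignored.
    static String keyFor(String line, int[] columnsToUse) {
        String[] cells = line.split(",", -1);
        StringJoiner key = new StringJoiner("\u0001"); // separator unlikely in data
        for (int c : columnsToUse) {
            key.add(c < cells.length ? cells[c] : ""); // missing values -> empty
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Group on columns 0 and 2, ignoring column 1.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        List<String> rows = List.of("a,x,c", "a,y,c", "b,x,d");
        for (int i = 0; i < rows.size(); i++) {
            groups.computeIfAbsent(keyFor(rows.get(i), new int[]{0, 2}),
                                   k -> new ArrayList<>()).add(i);
        }
        System.out.println(groups.values()); // [[0, 1], [2]]
    }
}
```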
