I am trying to compare two CSV files that have the same data but columns in different orders. When the column orders match, the following code works. How can I tweak it to make it work when the column orders don't match between the CSV files?
Set<String> source = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(sourceFile)));
Set<String> target = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(targetFile)));
return source.containsAll(target) && target.containsAll(source);
For example, the above test passes when the source file and target file look like this:
source file:
a,b,c
1,2,3
4,5,6
target file:
a,b,c
1,2,3
4,5,6
However, with the same source file, it doesn't work if the target file is arranged as follows:
target file:
a,c,b
1,3,2
4,6,5
A Set relies on a properly functioning .equals method for comparison, whether detecting duplicates or comparing its elements to those in another Collection. When I saw this question, my first thought was to create a new class for the Objects in your Set, replacing the String Objects. But, at the time, it was easier and faster to produce the code in my previous answer.
Here is another solution, which is closer to my first thought. To start, I created a Pair class, which overrides .hashCode () and .equals (Object other).
package comparecsv1;
import java.util.Objects;
public class Pair <T, U> {
private final T t;
private final U u;
Pair (T aT, U aU) {
this.t = aT;
this.u = aU;
}
@Override
public int hashCode() {
int hash = 3;
hash = 59 * hash + Objects.hashCode(this.t);
hash = 59 * hash + Objects.hashCode(this.u);
return hash;
}
@Override
public boolean equals(Object obj) {
if (this == obj) { return true; }
if (obj == null) { return false; }
if (getClass() != obj.getClass()) { return false; }
final Pair<?, ?> other = (Pair<?, ?>) obj;
if (!Objects.equals(this.t, other.t)) {
return false;
}
return Objects.equals(this.u, other.u);
} // end equals
} // end class pair
The .equals (Object obj) and the .hashCode () methods were auto-generated by the IDE. As you know, .hashCode() should always be overridden when .equals is overridden. Also, some Collection Objects, such as HashMap and HashSet, rely on proper .hashCode() methods.
After creating class Pair<T,U>, I created class CompareCSV1. The idea here is to use a Set<Set<Pair<String, String>>> where you have Set<String> in your code.
A Pair<String, String> pairs a value from a column with the header for the column in which it appears.
A Set<Pair<String, String>> represents one row.
A Set<Set<Pair<String, String>>> represents all the rows.
package comparecsv1;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public final class CompareCSV1 {
private final Set<Set<Pair<String, String>>> theSet;
private final String [] columnHeader;
private CompareCSV1 (String columnHeadings, String headerSplitRegex) {
columnHeader = columnHeadings.split (headerSplitRegex);
theSet = new HashSet<> ();
}
private Set<Pair<String, String>> createLine
(String columnSource, String columnSplitRegex) {
String [] column = columnSource.split (columnSplitRegex);
Set<Pair<String, String>> lineSet = new HashSet<> ();
int i = 0;
for (String columnValue: column) {
lineSet.add (new Pair<> (columnValue, columnHeader [i++]));
}
return lineSet;
}
public Set<Set<Pair<String, String>>> getSet () { return theSet; }
public String [] getColumnHeaders () {
return Arrays.copyOf (columnHeader, columnHeader.length);
}
public static CompareCSV1 createFromData (List<String> theData
, String headerSplitRegex, String columnSplitRegex) {
CompareCSV1 result =
new CompareCSV1 (theData.get(0), headerSplitRegex);
for (int i = 1; i < theData.size(); ++i) {
result.theSet.add(result.createLine(theData.get(i), columnSplitRegex));
}
return result;
}
public static void main(String[] args) {
String [] sourceData = {"a,b,c,d,e", "6,7,8,9,10", "1,2,3,4,5"
,"11,12,13,14,15", "16,17,18,19,20"};
String [] targetData = {"c,b,e,d,a", "3,2,5,4,1", "8,7,10,9,6"
,"13,12,15,14,11", "18,17,20,19,16"};
List<String> source = Arrays.asList(sourceData);
List<String> target = Arrays.asList (targetData);
CompareCSV1 sourceCSV = createFromData (source, ",", ",");
CompareCSV1 targetCSV = createFromData (target, ",", ",");
System.out.println ("Source contains target? "
+ sourceCSV.getSet().containsAll (targetCSV.getSet())
+ ". Target contains source? "
+ targetCSV.getSet().containsAll (sourceCSV.getSet())
+ ". Are equal? " + targetCSV.getSet().equals (sourceCSV.getSet()));
} // end main
} // end class CompareCSV1
This code has some things in common with the code in my first answer:
Except for the column header lines, which must be first in the "source" and "Target" data, matching lines in one file can be in a different order in the other file.
I used String [] Objects, with calls to Arrays.asList method as substitutes for your data sources.
It does not contain code to guard against errors, such as lines in the file having different numbers of columns from other lines, or no header line.
I hard coded "," as the String split expression in main. But the new methods allow the String split expression to be passed in, with separate split expressions for the column header line and the data lines.
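A guard for the unchecked cases mentioned above (no header line, or lines with a different number of columns) might look like the following sketch. The class name CsvShapeCheck and the method name validate are hypothetical, and the sketch assumes fields can be separated with a plain String.split, with no quoted separators.

```java
import java.util.List;

final class CsvShapeCheck {
    private CsvShapeCheck() { }

    /**
     * Throws IllegalArgumentException if there is no header line or if any
     * data line has a different number of columns than the header line.
     */
    static void validate(List<String> lines, String splitRegex) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("No header line");
        }
        // A limit of -1 keeps trailing empty columns, as in "1,2,"
        int expected = lines.get(0).split(splitRegex, -1).length;
        for (int i = 1; i < lines.size(); ++i) {
            int actual = lines.get(i).split(splitRegex, -1).length;
            if (actual != expected) {
                throw new IllegalArgumentException("Line " + (i + 1)
                        + " has " + actual + " columns, expected " + expected);
            }
        }
    }
}
```

Calling CsvShapeCheck.validate(source, ",") before createFromData would fail fast on ragged input instead of producing an ArrayIndexOutOfBoundsException or a silently wrong comparison.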
Here is some code that could work. It relies on the first line of each file containing column headers.
It's a bit more than a tweak, though. It's an "old dog" approach.
The original code in the question has these lines:
Set<String> source = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(sourceFile)));
Set<String> target = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(targetFile)));
With this solution, the data coming in needs more processing before it will be ready to be put into a Set. Those two lines get changed as follows:
List<String> source = (org.apache.commons.io.FileUtils.readLines(new File(sourceFile)));
List<String> target = (org.apache.commons.io.FileUtils.readLines(new File(targetFile)));
This approach will compare column headers in the target file and the source file. It will use that to build an int [] that indicates the difference in column order.
After the order difference array is filled, the data in the file will be put into a pair of Set<List<String>>. Each List<String> will represent one line from the source and target data files. Each String in the List will be data from one column.
In the following code, main is the test driver. Only for my testing purposes, the data files have been replaced by a pair of String [] and reading the file with org.apache.commons.io.FileUtils.readLines has been replaced with Arrays.asList.
package comparecsv;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class CompareCSV {
private static int [] columnReorder;
private static void headersOrder
(String sourceHeader, String targetHeader) {
String [] columnHeader = sourceHeader.split (",");
List<String> sourceColumn = Arrays.asList (columnHeader);
columnReorder = new int [columnHeader.length];
String [] targetColumn = targetHeader.split (",");
for (int i = 0; i < targetColumn.length; ++i) {
int j = sourceColumn.indexOf(targetColumn[i]);
columnReorder [i] = j;
}
}
private static Set<List<String>> toSet
(List<String> data, boolean reorder) {
Set<List<String>> dataSet = new HashSet<> ();
for (String s: data) {
String [] byColumn = s.split (",");
if (reorder) {
String [] reordered = new String [byColumn.length];
for (int i = 0; i < byColumn.length; ++i) {
reordered[columnReorder[i]] = byColumn [i];
}
dataSet.add (Arrays.asList (reordered));
} else {
dataSet.add (Arrays.asList(byColumn));
}
}
return dataSet;
}
public static void main(String[] args) {
String [] sourceData = {"a,b,c,d,e", "1,2,3,4,5", "6,7,8,9,10"
,"11,12,13,14,15", "16,17,18,19,20"};
String [] targetData = {"c,b,e,d,a", "3,2,5,4,1", "8,7,10,9,6"
,"13,12,15,14,11", "18,17,20,19,16"};
List<String> source = Arrays.asList(sourceData);
List<String> target = Arrays.asList (targetData);
headersOrder (source.get(0), target.get(0));
Set<List<String>> sourceSet = toSet (source, false);
Set<List<String>> targetSet = toSet (target, true);
System.out.println ( sourceSet.containsAll (targetSet)
+ " " + targetSet.containsAll (sourceSet) + " " +
( sourceSet.containsAll (targetSet)
&& targetSet.containsAll (sourceSet)));
}
}
Method headersOrder compares the headers, column by column, and populates the columnReorder array. Method toSet creates the Set<List<String>>, either reordering the columns or not, according to the value of the boolean argument.
For the sake of simplification, this assumes lines are easily split using comma. Data such as dog, "Reginald, III", 3 will cause failure.
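If quoted fields do occur, one option is to replace s.split (",") with a quote-aware splitter. The following is only a minimal hand-rolled sketch, not a complete CSV parser: the class name QuotedCsvSplit is hypothetical, it assumes double quotes as the only quoting character, and it trims whitespace around every field. For real data a dedicated CSV library is probably the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedCsvSplit {
    private QuotedCsvSplit() { }

    /** Splits one line on commas that are outside double quotes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); ++i) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    ++i;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                // Field boundary; surrounding whitespace is trimmed (an assumption)
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }
}
```

With this, the line dog, "Reginald, III", 3 splits into three fields (dog, then Reginald, III, then 3) rather than four.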
In testing this, I found lines in the file can be matched with their counterpart in the other file, regardless of order of the lines. Here is an example:
Source:
a,b,c
1,2,3
4,5,6
7,8,9
Target:
a,b,c
4,5,6
7,8,9
1,2,3
The result would be the contents match.
I believe this would match a result from the O/P question code. However, for this solution to work, the first line in each file must contain column headers.
We have to find all simple words from a bunch of simple and compound words. For example:
Input: chat, ever, snapchat, snap, salesperson, per, person, sales, son, whatsoever, what, so.
Output should be: chat, ever, snap, per, sales, son, what, so
My sample code:
private static String[] find(String[] words) {
// TODO Auto-generated method stub
//System.out.println();
ArrayList<String> alist = new ArrayList<String>();
Set<String> r1 = new HashSet<String>();
for(String s: words){
alist.add(s);
}
Collections.sort(alist,new Comparator<String>() {
public int compare(String o1, String o2) {
return o1.length()-o2.length();
}
});
//System.out.println(alist.toString());
int count= 0;
for(int i=0;i<alist.size();i++){
String check = alist.get(i);
r1.add(check);
for(int j=i+1;j<alist.size();j++){
String temp = alist.get(j);
//System.out.println(check+" "+temp);
if(temp.contains(check) ){
alist.remove(temp);
}
}
}
System.out.println(r1.toString());
String res[] = new String[r1.size()];
for(String i:words){
if(r1.contains(i)){
res[count++] = i;
}
}
return res;
}
I am unable to get a solution with the above code. Any suggestions or ideas?
Compound word = concatenation of two or more words; all remaining words are considered simple words.
We have to remove all the compound words.
Algorithm
Read the input into a set of Strings i.e. Set<String> input
Create an empty set for simple words i.e. Set<String> simpleWords
Create an empty set for compound words i.e. Set<String> compoundWords
Iterate over input. For each element
Let length of element be elemLength
Create a set Set<String> inputs of all Strings from the set input (excluding element) for which the below is true
Length less than element
Not present in compoundWords
Create set of all permutations of inputs (by concatenating) with max length = elemLength i.e. Set<String> currentPermutations
See if any of currentPermutations is = element
If yes, add element into compoundWords
If no, continue with iteration
After the iteration is done place all Strings from input which are not present in compoundWords into simpleWords
That is your answer.
Before you start writing code decide the logic that you are going to use. Use descriptive variable names and you are basically done.
The reason your logic is not working has to do with the way you are checking temp.contains(check). This checks for a substring, not for a compound word as per your definition.
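As a rough illustration of the compound-word test, here is a sketch that swaps the permutation step for a dynamic-programming "word break" check: a word is treated as compound when it can be cut into two or more pieces that are all input words. This is a different technique from generating permutations, chosen because it avoids building every concatenation. The class name SimpleWords and its method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SimpleWords {
    private SimpleWords() { }

    /** Returns the input words that are not concatenations of two or more input words. */
    static List<String> find(String[] words) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(words));
        List<String> simple = new ArrayList<>();
        for (String word : words) {
            if (!isCompound(word, dictionary)) {
                simple.add(word);
            }
        }
        return simple;
    }

    private static boolean isCompound(String word, Set<String> dictionary) {
        int n = word.length();
        if (n == 0) {
            return false;
        }
        // reachable[i] is true when word.substring(0, i) can be built from dictionary words
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; ++end) {
            for (int start = 0; start < end; ++start) {
                // Skip the piece covering the whole word: that is the word itself
                if (start == 0 && end == n) {
                    continue;
                }
                if (reachable[start] && dictionary.contains(word.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }
}
```

For the input in the question, SimpleWords.find returns chat, ever, snap, per, sales, son, what, so. Note that person is dropped because it is per followed by son, which agrees with the expected output.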
Here is what I am trying to do.
I am reading in a list of words with each having a level of complexity. Each line has a word followed by a comma and the level of the word. "watch, 2" for example. I wish to put all of the words of a given level into a set to ensure their uniqueness in that level. There are 5 levels of complexity, so ideally I'd like an array with 5 elements, each of which is a set.
I can then add words to each of the sets as I read them in. Later on, I wish to pull out a random word of a specified level.
I'm happy with everything except how to create an array of sets. I've read several other posts here that seem to agree that this can't be done exactly as I would hope, but I can't find a good work around. (No, I'm not willing to have 5 sets in a switch statement. Goes against the grain.)
Thanks.
You can use a Map. Use the level as the key and the set of words as the value. When a random word is requested for a level, look up the set by its level key and pick a random element from it. This also scales if you increase the number of levels.
public static void main(String[] args) {
Map<Integer, Set<String>> levelSet = new HashMap<>();
//Your code goes here to get the level and word
//
String word="";
int level=0;
addStringToLevel(levelSet,word,level);
}
private static void addStringToLevel(Map<Integer, Set<String>> levelSet,
String word, int level) {
if(levelSet.get(level) == null)
{
// this means this is the first string added for this level
// so create a container to hold the object
levelSet.put(level, new HashSet<>());
}
Set<String> wordContainer = levelSet.get(level);
wordContainer.add(word);
}
private static String getStringFromLevel(Map<Integer, Set<String>> levelSet,
int level) {
if(levelSet.get(level) == null)
{
return null;
}
Set<String> wordContainer = levelSet.get(level);
return "";// return a random string from wordContainer
}
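The random pick left as a placeholder in getStringFromLevel could be filled in along these lines. This is a sketch only: the class name RandomPick and the method name randomWord are hypothetical. Because a Set has no index access, it copies the set into a List and picks an index with java.util.Random.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class RandomPick {
    private static final Random RANDOM = new Random();

    private RandomPick() { }

    /** Returns a uniformly random element of the set, or null if the set is null or empty. */
    static String randomWord(Set<String> words) {
        if (words == null || words.isEmpty()) {
            return null;
        }
        // Copy to a List so an element can be fetched by index
        List<String> asList = new ArrayList<>(words);
        return asList.get(RANDOM.nextInt(asList.size()));
    }
}
```

The copy costs time proportional to the size of the set on every call. If words are drawn often, keeping a List alongside each Set would avoid the repeated copying.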
If you are willing to use Guava, try SetMultimap. It will take care of everything for you.
SetMultimap<Integer, String> map = HashMultimap.create();
map.put(5, "value");
The collection will take care of creating the inner Set instances for you unlike the array or List solutions which require either pre-creating the Sets or checking that they exist.
Consider using a List instead of an array.
Doing so might make your life easier.
List<Set<String>> wordSetLevels = new ArrayList<>();
// ...
for (int i = 0; i < 5; i++) {
wordSetLevels.add(new HashSet<String>());
}
wordSetLevels = Collections.unmodifiableList(wordSetLevels);
// ...
wordSetLevels.get(2).add("watch");
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class Main {
private Set<String>[] process(List<String> words) {
@SuppressWarnings("unchecked")
Set<String>[] arrayOfSets = new Set[5];
for(int i=0; i<arrayOfSets.length; i++) {
arrayOfSets[i] = new HashSet<String>();
}
for(String word: words) {
int index = getIndex(word);
String val = getValue(word);
arrayOfSets[index].add(val);
}
return arrayOfSets;
}
private int getIndex(String str) {
//TODO Implement
return 0;
}
private String getValue(String str) {
//TODO Implement
return "";
}
}
I have a list with some strings in it:
GS_456.java
GS_456_V1.java
GS_456_V2.java
GS_460.java
GS_460_V1.java
And it goes on. I want a list with the strings with the highest value:
GS_456_V2.java
GS_460_V1.java
.
.
.
I'm only thinking of using lots of for statements... but isn't there a more practical way? I'd like to avoid using too many for statements, since I'm using them a lot when I execute some queries.
EDIT: The strings with the V1, V2,.... are the names of recent classes created. When someone creates a new version of GS_456 for example, they'll do it and add its version at the end of the name.
So, GS_456_V2 is the most recent version of the GS_456 java class. And it goes on.
Thanks in advance.
You will want to process the file names in two steps.
Step 1: split the list into sublists, with one sublist per file name (ignoring suffix).
Here is an example that splits the list into a Map:
private static Map<String, List<String>> nameMap = new HashMap<String, List<String>>();
private static void splitEmUp(final List<String> names)
{
for (String current : names)
{
List<String> listaly;
// splitaly[1] is the file number, e.g. "456" for "GS_456_V2.java"
String[] splitaly = current.split("_|\\.");
listaly = nameMap.get(splitaly[1]);
if (listaly == null)
{
listaly = new LinkedList<String>();
nameMap.put(splitaly[1], listaly);
}
listaly.add(current);
}
}
Step 2: find the highest version for each name. Here is an example:
private static List<String> findEmAll()
{
List<String> returnValue = new LinkedList<String>();
Set<String> keySet = nameMap.keySet();
for (String key : keySet)
{
List<String> listaly = nameMap.get(key);
String highValue = null;
if (listaly.size() == 1)
{
highValue = listaly.get(0);
}
else
{
int highVersion = 0;
for (String name : listaly)
{
String[] versions = name.split("_V|\\.");
if (versions.length == 3)
{
int versionNumber = Integer.parseInt(versions[1]);
if (versionNumber > highVersion)
{
highValue = name;
highVersion = versionNumber;
}
}
}
}
returnValue.add(highValue);
}
return returnValue;
}
I guess you don't want simply the lexicographic order (the solution would be obvious).
First, remove the ".java" part and split your string on the character "_".
int dotIndex = string.indexOf(".");
String[] parts = string.substring(0, dotIndex).split("_");
You are interested in parts[1] and parts[2]. The first is easy, it's just a number.
int fileNumber = Integer.parseInt(parts[1]);
The second one is always of the form "VX" with X being a number. But this part may not exist (if it's the base version of the file). In which case we can say that version is 0.
int versionNumber = parts.length < 3 ? 0 : Integer.parseInt(parts[2].substring(1));
Now you can compare based on these two numbers.
To make things simple, build a class FileIdentifier based on this:
class FileIdentifier {
int fileNumber;
int versionNumber;
}
Then write a function that creates a FileIdentifier from a file name, with logic based on what I explained earlier.
FileIdentifier getFileIdentifierFromFileName(String filename){ /* .... */ }
Then you make a comparator on String, in which you get the FileIdentifier for the two strings and compare upon FileIdentifier members.
Then, to get the string with "the highest value", you simply put all your strings in a list, and use Collections.sort, providing the comparator.
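Put together, the steps above might look like the sketch below. It assumes every name follows the GS_<number>[_V<version>].java pattern from the question; anything else will throw. The names fromFileName, FileNameComparator and HighestVersions are invented for the example. Since the question asks for the highest version per file rather than one overall maximum, the sketch sorts and then keeps the last name of each file number.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class FileIdentifier {
    final int fileNumber;
    final int versionNumber;

    FileIdentifier(int fileNumber, int versionNumber) {
        this.fileNumber = fileNumber;
        this.versionNumber = versionNumber;
    }

    /** Parses e.g. "GS_456_V2.java"; a name without a version part counts as version 0. */
    static FileIdentifier fromFileName(String fileName) {
        int dotIndex = fileName.indexOf(".");
        String[] parts = fileName.substring(0, dotIndex).split("_");
        int fileNumber = Integer.parseInt(parts[1]);
        int versionNumber = parts.length < 3 ? 0 : Integer.parseInt(parts[2].substring(1));
        return new FileIdentifier(fileNumber, versionNumber);
    }
}

/** Orders file names by file number, then by version number. */
final class FileNameComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        FileIdentifier x = FileIdentifier.fromFileName(a);
        FileIdentifier y = FileIdentifier.fromFileName(b);
        int byFile = Integer.compare(x.fileNumber, y.fileNumber);
        return byFile != 0 ? byFile : Integer.compare(x.versionNumber, y.versionNumber);
    }
}

final class HighestVersions {
    private HighestVersions() { }

    /** Sorts a copy of the names, then keeps the last (highest) name of each file number. */
    static List<String> highest(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted, new FileNameComparator());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); ++i) {
            boolean isLast = i == sorted.size() - 1;
            if (isLast || FileIdentifier.fromFileName(sorted.get(i)).fileNumber
                    != FileIdentifier.fromFileName(sorted.get(i + 1)).fileNumber) {
                result.add(sorted.get(i));
            }
        }
        return result;
    }
}
```

For the five names in the question, HighestVersions.highest returns GS_456_V2.java and GS_460_V1.java. The comparator re-parses each name on every comparison, which is fine for short lists; for long ones, parsing each name once up front would be cheaper.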