Join csv files based on a common column in java - java

I want to join two csv files based on a common column. My two input files and the desired output file look like this.
Here are the example files. The 1st file looks like:
sno,first name,last name
--------------------------
1,xx,yy
2,aa,bb
2nd file looks like:
sno,place
-----------
1,pp
2,qq
Output:
sno,first name,last name,place
------------------------------
1,xx,yy,pp
2,aa,bb,qq
Code:
CSVReader r1 = new CSVReader(new FileReader("c:/csv/file1.csv"));
CSVReader r2 = new CSVReader(new FileReader("c:/csv/file2.csv"));
HashMap<String,String[]> dic = new HashMap<String,String[]>();
int commonCol = 1;
r1.readNext(); // skip header
String[] line = null;
while ((line = r1.readNext()) != null)
{
dic.put(line[commonCol], line);
}
commonCol = 1;
r2.readNext();
String[] line2 = null;
while ((line2 = r2.readNext()) != null)
{
if (dic.keySet().contains(line2[commonCol]))
{
// append line to existing entry
}
else
{
// create a new entry and pre-pend it with default values
// for the columns of file1
}
}
for (String[] row : dic.values())
{
// write line to the output file.
}
I don't know how to proceed further to get desired output. Any help will be appreciated.
Thanks

First, you need to use zero as your commonCol value as the first column has index zero rather than one.
if (dic.containsKey(line2[commonCol]))
{
// Get the whole line from the first file.
String[] firstPart = dic.get(line2[commonCol]);
// Get the line from the second file, without the common column.
// The end index of copyOfRange is exclusive, so use line2.length.
String[] secondPart = Arrays.copyOfRange(line2, 1, line2.length);
// Join together and put back in the HashMap.
String[] merged = Arrays.copyOf(firstPart, firstPart.length + secondPart.length);
System.arraycopy(secondPart, 0, merged, firstPart.length, secondPart.length);
dic.put(line2[commonCol], merged);
}
else
{
// create a new entry and pre-pend it with default values
// for the columns of file1
String[] firstPart = { line2[commonCol], "default", "values" };
String[] secondPart = Arrays.copyOfRange(line2, 1, line2.length);
String[] merged = Arrays.copyOf(firstPart, firstPart.length + secondPart.length);
System.arraycopy(secondPart, 0, merged, firstPart.length, secondPart.length);
dic.put(line2[commonCol], merged);
}
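For reference, here is a minimal end-to-end sketch of the same idea that runs without OpenCSV. It assumes simple CSV (no quoted fields or embedded commas), that the common column is the first one, and that keys are unique within each file. The class name CsvJoinSketch and the method join are made up for this illustration, and the input is passed as lists of lines rather than read from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvJoinSketch {

    // Joins two CSV tables (header row first) on their first column.
    // Assumes simple CSV: no quoted fields, no embedded commas, unique keys.
    static List<String> join(List<String> left, List<String> right) {
        String[] leftHeader = left.get(0).split(",", -1);
        String[] rightHeader = right.get(0).split(",", -1);

        // Index the left file by key; LinkedHashMap keeps the input order.
        Map<String, List<String>> rows = new LinkedHashMap<>();
        for (String line : left.subList(1, left.size())) {
            String[] cells = line.split(",", -1);
            rows.put(cells[0], new ArrayList<>(Arrays.asList(cells)));
        }

        for (String line : right.subList(1, right.size())) {
            String[] cells = line.split(",", -1);
            List<String> row = rows.get(cells[0]);
            if (row == null) {
                // Key only in the right file: pad the left-hand columns with blanks.
                row = new ArrayList<>();
                row.add(cells[0]);
                for (int i = 1; i < leftHeader.length; i++) {
                    row.add("");
                }
                rows.put(cells[0], row);
            }
            // Append everything except the common column.
            row.addAll(Arrays.asList(cells).subList(1, cells.length));
        }

        int width = leftHeader.length + rightHeader.length - 1;
        List<String> header = new ArrayList<>(Arrays.asList(leftHeader));
        header.addAll(Arrays.asList(rightHeader).subList(1, rightHeader.length));

        List<String> out = new ArrayList<>();
        out.add(String.join(",", header));
        for (List<String> row : rows.values()) {
            // Key only in the left file: pad the right-hand columns with blanks.
            while (row.size() < width) {
                row.add("");
            }
            out.add(String.join(",", row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("sno,first name,last name", "1,xx,yy", "2,aa,bb");
        List<String> file2 = Arrays.asList("sno,place", "1,pp", "2,qq");
        join(file1, file2).forEach(System.out::println);
    }
}
```

With real files, the two lists could come from Files.readAllLines, and the result could be written back with Files.write.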

Related

How to deal with NumberFormatException when reading from a csv file [duplicate]

My task is to read values from a csv file and import each line of information from this file into an object array. I think my issue is the blank data elements in my csv file, which cannot be parsed from String to int, but I have found no way to deal with this. Here is my code:
fileStream = new FileInputStream(pFileName);
rdr = new InputStreamReader(fileStream);
bufRdr = new BufferedReader(rdr);
lineNum = 0;
line = bufRdr.readLine();
while (line != null) {
lineNum++;
String[] Values = new String[13];
Values = line.split(",");
int cumulPos = Integer.parseInt(Values[6]);
int cumulDec = Integer.parseInt(Values[7]);
int cumuRec = Integer.parseInt(Values[8]);
int curPos = Integer.parseInt(Values[9]);
int hosp = Integer.parseInt(Values[10]);
int intenCar = Integer.parseInt(Values[11]);
double latitude = Double.parseDouble(Values[4]);
double longitude = Double.parseDouble(Values[5]);
covidrecordArray[lineNum] = new CovidRecord(Values[0], cumulPos, cumulDec, cumuRec, curPos, hosp,
intenCar, new Country(Values[1], Values[2], Values[3], Values[13], latitude, longitude));
line = bufRdr.readLine();
}
If anyone could help it would be greatly appreciated.
As already suggested, use a proper CSV Parser if you can but if for some unknown reason you can't, this could be one way you can do it. Be sure to read the comments in code:
fileStream = new FileInputStream(pFileName);
rdr = new InputStreamReader(fileStream);
bufRdr = new BufferedReader(rdr);
// Remove the following line if there is no Header line in the CSV file.
String line = bufRdr.readLine();
String csvFileDataDelimiter = ",";
List<CovidRecord> recordsList = new ArrayList<>();
// True value calculated later in code (read comments).
int expectedNumberOfElements = 0; // 0 is default
while ((line = bufRdr.readLine()) != null) {
line = line.trim();
// If for some crazy reason a blank line is encountered...skip it.
if (line.isEmpty()) {
continue;
}
/* Get the expected number of elements within each CSV File Data Line.
This is based off of the number of actual delimiters within a file
data line plus 1. This is only calculated from the very first data
line. */
if (expectedNumberOfElements == 0) {
expectedNumberOfElements = line.replaceAll("[^\\" + csvFileDataDelimiter + "]", "").length() + 1;
}
/* Create and fill (with empty strings) an array of the expected
size of a CSV data line. This is done because if a data line
contains nothing for the last data element on that line then
when the line is split, the array that is created will be short
by one element. This ensures that there will always be an
empty string ("") present within the array when there is nothing
in the CSV data line. This empty string is used in data validations
so as to provide a default value (like 0) if an array element
contains an actual empty string (""). */
String[] csvLineElements = new String[expectedNumberOfElements];
Arrays.fill(csvLineElements, "");
/* Take the array from the split (values) and place the data into
the csvLineElements[] array. */
String[] values = line.split("\\s*,\\s*"); // Takes care of any comma/whitespace combinations (if any).
for (int i = 0; i < values.length && i < csvLineElements.length; i++) {
csvLineElements[i] = values[i];
}
/* Is the csvLineElements[] element a String representation of a signed
or unsigned integer data type value ("-?\\d+"). If so, convert the
String array element into an Integer value. If not, provide a default
value of 0. */
int cumulPos = Integer.parseInt(csvLineElements[6].matches("-?\\d+") ? csvLineElements[6] : "0");
int cumulDec = Integer.parseInt(csvLineElements[7].matches("-?\\d+") ? csvLineElements[7] : "0");
int cumuRec = Integer.parseInt(csvLineElements[8].matches("-?\\d+") ? csvLineElements[8] : "0");
int curPos = Integer.parseInt(csvLineElements[9].matches("-?\\d+") ? csvLineElements[9] : "0");
int hosp = Integer.parseInt(csvLineElements[10].matches("-?\\d+") ? csvLineElements[10] : "0");
int intenCar = Integer.parseInt(csvLineElements[11].matches("-?\\d+") ? csvLineElements[11] : "0");
/* Is the csvLineElements[] element a String representation of a signed
or unsigned integer or floating point value ("-?\\d+(\\.\\d+)?").
If so, convert the String array element into an Double data type value.
If not, provide a default value of 0.0 */
double latitude = Double.parseDouble(csvLineElements[4]
.matches("-?\\d+(\\.\\d+)?") ? csvLineElements[4] : "0.0d");
double longitude = Double.parseDouble(csvLineElements[5]
.matches("-?\\d+(\\.\\d+)?") ? csvLineElements[5] : "0.0d");
/* Create an instance of Country to pass into the constructor of
CovidRecord below. */
Country country = new Country(csvLineElements[1], csvLineElements[2],
csvLineElements[3], csvLineElements[13],
latitude, longitude);
// Create and add an instance of CovidRecord to the recordsList List.
recordsList.add(new CovidRecord(csvLineElements[0], cumulPos, cumulDec,
cumuRec, curPos, hosp, intenCar, country));
// Do what you want with the recordList List....
}
For obvious reasons, the code above was not tested. If you have any problems with it then let me know.
You will also notice that instead of the covidrecordArray[] CovidRecord array I opted to use a List named recordsList. This List can grow dynamically, whereas an array has a fixed size, meaning you would need to determine the number of data lines in the file before initializing it. This is not required with the List.
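You can create one generic method for the null check: if the value is null (or blank), return a default such as "0", or anything else based on your needs, so that Integer.parseInt always receives a parsable string.
int hosp = Integer.parseInt(checkForNull(Values[10]));
public static String checkForNull(String val) {
// Return "0" for null or blank values; a blank string would make parseInt throw.
return (val == null || val.trim().isEmpty()) ? "0" : val.trim();
}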
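A null or blank check alone does not cover values such as "N/A". As a hedged sketch, a small helper can catch the exception and fall back to a default instead. The names SafeParse and parseIntOrDefault are made up for this example, and the sample line is invented data.

```java
class SafeParse {

    // Returns the parsed int, or fallback when the text is null, blank or not a number.
    static int parseIntOrDefault(String text, int fallback) {
        if (text == null || text.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // split(",", -1) keeps trailing empty fields, so the array length stays stable.
        String[] values = "AU,,N/A,42,".split(",", -1);
        for (String v : values) {
            System.out.println("'" + v + "' -> " + parseIntOrDefault(v, 0));
        }
    }
}
```

Whether a silent default of 0 is acceptable depends on the data; a sentinel such as -1 or a logged warning may be safer when a missing value must be distinguishable from a real zero.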

How to select random text value from specific row using java

I have three input fields.
First Name
Last Name
Date Of Birth
I would like to get random data for each input from a property file.
This is how the property file looks. Field name and = should be ignored.
- First Name= Robert, Brian, Shawn, Bay, John, Paul
- Last Name= Jerry, Adam ,Lu , Eric
- Date of Birth= 01/12/12,12/10/12,1/2/17
Example: For First Name: File should randomly select one name from the following names
Robert, Brian, Shawn, Bay, John, Paul
Also I need to ignore anything before =
FileInputStream objfile = new FileInputStream(System.getProperty("user.dir") + path);
in = new BufferedReader(new InputStreamReader(objfile ));
String line = in.readLine();
while (line != null && !line.trim().isEmpty()) {
String eachRecord[]=line.trim().split(",");
Random rand = new Random();
//I need to pick first name randomly from the file from row 1.
send(firstName,(eachRecord[0]));
If you know that you're always going to have just those 3 lines in your property file, I would put each into a map with an index as the key, then randomly generate a key in the range of the map.
// your code here to read the file in
// The keys are indexes, so the maps are keyed by Integer.
HashMap<Integer, String> firstNameMap = new HashMap<Integer, String>();
HashMap<Integer, String> lastNameMap = new HashMap<Integer, String>();
HashMap<Integer, String> dobMap = new HashMap<Integer, String>();
String line;
while ((line = in.readLine()) != null) {
String[] parts = line.split("=");
String field = parts[0].trim();
if (field.equals("First Name")) {
String[] values = parts[1].split(",");
for (int i = 0; i < values.length; ++i) {
firstNameMap.put(i, values[i].trim());
}
}
else if (field.equals("Last Name")) {
// do the same as FN but for lastnamemap
}
else if (field.equals("Date of Birth")) {
// do the same as FN but for dobmap
}
}
// Now you can use the size of the map and a random number to get a value
// first name for instance (the upper bound of nextInt is exclusive):
int randomNum = ThreadLocalRandom.current().nextInt(0, firstNameMap.size());
System.out.println("First Name: " + firstNameMap.get(randomNum));
// and you would do the same for the other fields
The code can easily be refactored with some helper methods to make it cleaner, we'll leave that as a HW assignment :)
This way you have a cache of all your values that you can call at any time to get a random value. I realize this isn't the most optimal solution, having nested loops and 3 different maps, but if your input file only contains 3 lines and you're not expecting to have millions of inputs it should be just fine.
Haven't programmed stuff like this in a long time.
Feel free to test it, and let me know if it works.
The result of this code should be a HashMap object called values
You can then get the specific fields you want from it, using get(field_name)
For example - values.get("First Name"). Make sure to use the correct case, because "first name" won't work.
If you want it all to be lower case, you can just add .toLowerCase() at the end of the line that puts the field and value into the HashMap
import java.lang.Math;
import java.util.HashMap;
public class Test
{
// arguments are passed using the text field below this editor
public static void main(String[] args)
{
// set the value of "in" here, so you actually read from it
HashMap<String, String> values = new HashMap<String, String>();
String line;
while (((line = in.readLine()) != null) && !line.trim().isEmpty()) {
if(!line.contains("=")) {
continue;
}
String[] lineParts = line.split("=");
String[] eachRecord = lineParts[1].split(",");
System.out.println("adding value of field type = " + lineParts[0].trim());
// now add the mapping to the values HashMap - values[field_name] = random_field_value
values.put(lineParts[0].trim(), eachRecord[(int) (Math.random() * eachRecord.length)].trim());
}
System.out.println("First Name = " + values.get("First Name"));
System.out.println("Last Name = " + values.get("Last Name"));
System.out.println("Date of Birth = " + values.get("Date of Birth"));
}
}
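The snippet above leaves the reader variable "in" undefined. As a self-contained sketch of the same approach, the version below takes the lines as a list so it can run as-is. The names RandomFieldPicker and pick are invented for this example, and the input lines are copied from the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class RandomFieldPicker {

    // Picks one random value per "Field= a, b, c" line.
    // Everything before the first '=' is treated as the field name.
    static Map<String, String> pick(List<String> lines, Random random) {
        Map<String, String> picked = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a "name=value" line
            }
            String field = line.substring(0, eq).trim();
            String[] options = line.substring(eq + 1).split(",");
            picked.put(field, options[random.nextInt(options.length)].trim());
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "First Name= Robert, Brian, Shawn, Bay, John, Paul",
            "Last Name= Jerry, Adam ,Lu , Eric",
            "Date of Birth= 01/12/12,12/10/12,1/2/17");
        System.out.println(pick(lines, new Random()));
    }
}
```

Passing the Random in as a parameter is a design choice for this sketch: a seeded instance makes the selection repeatable in tests.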

How to add values from a file into an array using split?

I have the following code:
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
String metaLine = "";
String [] metaData = new String [100000];
while ((metaLine = metaRead.readLine()) != null){
metaData = metaLine.split(",");
for (int i = 0; i < metaData.length; i++)
System.out.println(metaData[0]);
}
This is what's in the file:
testTable2 Name java.lang.Integer TRUE test
testTable2 age java.lang.String FALSE test
testTable2 ID java.lang.Integer FALSE test
I want the array to have at metaData[0] testTable2, metaData[1] would be Name, but when I run it at 0 I get testtable2testtable2testtable2, and at 1 I'd get NameageID and OutOfBoundsException.
Any ideas what to do in order to get the result I want?
Just print metaData[i] instead of metaData[0] and split each string by "[ ]+" (that means "1 or more spaces"):
metaData = metaLine.split("[ ]+");
As a result, you will get the following arrays:
[testTable2, Name, java.lang.Integer, TRUE, test]
[testTable2, age, java.lang.String, FALSE, test]
[testTable2, ID, java.lang.Integer, FALSE, test]
The code snippet that produces the preceding output:
while ((metaLine = metaRead.readLine()) != null) {
metaData = metaLine.split("[ ]+");
for (int i = 0; i < metaData.length; i++)
System.out.print(metaData[i] + " ");
System.out.println();
}
Also, I've written your task by using Java 8 and Stream API:
List<String> collect = metaRead
.lines()
.flatMap(line -> Arrays.stream(line.split("[ ]+")))
.collect(Collectors.toList());
And, finally, there is the most straight-forward way:
final int LINES = 3, WORDS = 5; // the sample file has 3 lines of 5 words each
String[] metaData = new String[LINES * WORDS]; // I don't like it
int i = 0;
while ((metaLine = metaRead.readLine()) != null) {
for (String s : metaLine.split("[ ]+")) metaData[i++] = s;
}
Change the following line inside the for loop of your code,
System.out.println(metaData[0]);
to
System.out.println(metaData[i]);
My answer may not fit your question completely, but as far as I can see your file format is TSV or CSV.
Maybe you should consider using OpenCSV for your problem.
The library will handle the reading and splitting for you.

Reading and matching contents of two big files

I have two files each having the same format with approximately 100,000 lines. For each line in file one I am extracting the second component or column and if I find a match in the second column of second file, I extract their third components and combine them, store or output it.
My implementation works, but the program runs extremely slowly: it takes more than an hour to iterate over the files, compare them and output all the results.
I read and store the data of both files in ArrayLists, then iterate over those lists and do the comparison. Below is my code. Is there a performance problem in it, or is this normal for such an operation?
Note: I was using String.split(), but I understand from another post that StringTokenizer is faster.
public ArrayList<String> match(String file1, String file2) throws IOException{
ArrayList<String> finalOut = new ArrayList<>();
try {
ArrayList<String> data = readGenreDataIntoMemory(file1);
ArrayList<String> data1 = readGenreDataIntoMemory(file2);
StringTokenizer st = null;
for(String line : data){
HashSet<String> genres = new HashSet<>();
boolean sameMovie = false;
String movie2 = "";
st = new StringTokenizer(line, "|");
//String line[] = fline.split("\\|");
String ratingInfo = st.nextToken();
String movie1 = st.nextToken();
String genreInfo = st.nextToken();
if(!genreInfo.equals("null")){
for(String s : genreInfo.split(",")){
genres.add(s);
}
}
StringTokenizer st1 = null;
for(String line1 : data1){
st1 = new StringTokenizer(line1, "|");
st1.nextToken();
movie2 = st1.nextToken();
String genreInfo2= st1.nextToken();
//If the movie name are similar then they should have the same genre
//Update their genres to be the same
if(!genreInfo2.equals("null") && movie1.equals(movie2)){
for(String s : genreInfo2.split(",")){
genres.add(s);
}
sameMovie = true;
break;
}
}
if(sameMovie){
finalOut.add(ratingInfo+""+movie1+""+genres.toString()+"\n");
}else if(sameMovie == false){
finalOut.add(line);
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
return finalOut;
}
I would use the Streams API
String file1 = "files1.txt";
String file2 = "files2.txt";
// get all the lines by movie name for each file.
Map<String, List<String[]>> map = Stream.of(Files.lines(Paths.get(file1)),
Files.lines(Paths.get(file2)))
.flatMap(p -> p)
.parallel()
.map(s -> s.split("[|]", 3))
.collect(Collectors.groupingByConcurrent(sa -> sa[1], Collectors.toList()));
// merge all the genres for each movie.
map.forEach((movie, lines) -> {
Set<String> genres = lines.stream()
.flatMap(l -> Stream.of(l[2].split(",")))
.collect(Collectors.toSet());
System.out.println("movie: " + movie + " genres: " + genres);
});
This has the advantage of being O(n) instead of O(n^2) and it's multi-threaded.
Do a hash join.
As of now you are doing a nested loop join, which is O(n^2); the hash join will be amortized O(n).
Put the contents of each file in a hash map, keyed by the field you want to match on (the second field).
Map<String,String> map1 = new HashMap<>();
// build the map from file1
Then do the hash join
for(String key1 : map1.keySet()){
if(map2.containsKey(key1)){
// do your thing you found the match
}
}
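To make the hash join concrete, here is a hedged sketch for the pipe-delimited lines in the question. It assumes every line has the form rating|movie|genres with "null" meaning no genres. The class name HashJoinSketch, the "|" separated output format and the sample data are choices made for this example, not the asker's exact output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class HashJoinSketch {

    // Lines are assumed to look like "rating|movie|genre1,genre2" ("null" for no genres).
    static List<String> match(List<String> file1, List<String> file2) {
        // Build phase: index the second file by movie name, once.
        Map<String, String> genresByMovie = new HashMap<>();
        for (String line : file2) {
            String[] parts = line.split("\\|", 3);
            if (!parts[2].equals("null")) {
                // First usable match wins, mirroring the break in the original loop.
                genresByMovie.putIfAbsent(parts[1], parts[2]);
            }
        }

        // Probe phase: one constant-time lookup per line of the first file.
        List<String> out = new ArrayList<>();
        for (String line : file1) {
            String[] parts = line.split("\\|", 3);
            String other = genresByMovie.get(parts[1]);
            if (other == null) {
                out.add(line); // no match: keep the line unchanged
                continue;
            }
            // TreeSet removes duplicates and gives a stable, sorted genre order.
            TreeSet<String> genres = new TreeSet<>(Arrays.asList(other.split(",")));
            if (!parts[2].equals("null")) {
                genres.addAll(Arrays.asList(parts[2].split(",")));
            }
            out.add(parts[0] + "|" + parts[1] + "|" + String.join(",", genres));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("5|Alien|Horror", "3|Heat|null", "4|Up|Animation");
        List<String> b = Arrays.asList("2|Alien|SciFi,Horror", "1|Heat|Crime");
        match(a, b).forEach(System.out::println);
    }
}
```

For 100,000 lines per file this does roughly 200,000 line parses instead of up to 10 billion inner-loop iterations, which is where the original running time goes.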

how to read two consecutive commas from .csv file format as unique value in java

Suppose csv file contains
1,112,,ASIF
The following code drops the empty value between the two consecutive commas.
The code provided is more than is strictly required.
String p1=null, p2=null;
while ((lineData = Buffreadr.readLine()) != null)
{
row = new Vector(); int i=0;
StringTokenizer st = new StringTokenizer(lineData, ",");
while(st.hasMoreTokens())
{
row.addElement(st.nextElement());
if (row.get(i).toString().startsWith("\"")==true)
{
while(row.get(i).toString().endsWith("\"")==false)
{
p1= row.get(i).toString();
p2= st.nextElement().toString();
row.set(i,p1+", "+p2);
}
String CellValue= row.get(i).toString();
CellValue= CellValue.substring(1, CellValue.length() - 1);
row.set(i,CellValue);
//System.out.println(" Final Cell Value : "+row.get(i).toString());
}
eror=row.get(i).toString();
try
{
eror=eror.replace('\'',' ');
eror=eror.replace('[' , ' ');
eror=eror.replace(']' , ' ');
//System.out.println("Error "+ eror);
row.remove(i);
row.insertElementAt(eror, i);
}
catch (Exception e)
{
System.out.println("Error exception "+ eror);
}
//}
i++;
}
How can I read two consecutive commas in a .csv file as a separate (empty) value in Java?
Here is an example of doing this by splitting to String array. Changed lines are marked as comments.
// Start of your code.
row = new Vector(); int i=0;
String[] st = lineData.split(",", -1); // Changed: -1 also keeps trailing empty values
for (int t = 0; t < st.length; t++) { // Changed
row.addElement(st[t]); // Changed
if (row.get(i).toString().startsWith("\"") == true) {
while (row.get(i).toString().endsWith("\"") == false && t + 1 < st.length) { // Changed
p1 = row.get(i).toString();
p2 = st[++t]; // Changed: take the next piece of the quoted value
row.set(i, p1 + ", " + p2);
}
...// Rest of Code here
}
StringTokenizer skips empty tokens. This is its documented behaviour. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Just use String.split(",") and you are done.
Just read the whole line into a string then do string.split(",").
The resulting array should have exactly what you are looking for...
If you need to check for "escaped" commas then you will need some regex for the query instead of a simple ",".
while ((lineData = Buffreadr.readLine()) != null) {
String[] row = lineData.split(",");
// Now process the array however you like, each cell in the csv is one entry in the array
}
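The difference between the two approaches can be checked with a short, hedged demo. It uses the sample line from the question and assumes there are no quoted fields containing commas. The class name EmptyFieldDemo and the helper splitKeepingEmpty are made up for this example.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

class EmptyFieldDemo {

    // Splits one CSV line, keeping empty fields (including trailing ones).
    // Assumes no quoted fields containing commas.
    static String[] splitKeepingEmpty(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "1,112,,ASIF";
        // StringTokenizer treats consecutive delimiters as one, so the empty field vanishes.
        System.out.println(new StringTokenizer(line, ",").countTokens());
        // split keeps the empty string between the two commas.
        System.out.println(Arrays.toString(splitKeepingEmpty(line)));
        // Without the -1 limit, split drops trailing empty fields.
        System.out.println("1,112,,".split(",").length + " vs " + splitKeepingEmpty("1,112,,").length);
    }
}
```

The negative limit matters only when a line ends in one or more empty fields; plain split(",") already preserves empty fields in the middle of a line.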
