Split csv according to field in Java

I want to split a csv file according to the last "field".
For instance the csv file contains:
a,1
b,2
c,3
d,1
The numbers indicate categories.
This file should be split into separate files according to the numbers (i.e. the categories), so that three files result.
first file:
a,1
d,1
second file:
b,2
third file:
c,3
The naive method would be to read the csv line by line, split each string at "," and separate the last element (here the number). Afterwards I could check the number of the current line and write the line to the matching FileWriter.
But: I do not know how many categories there will be as I want to keep the system extensible. Therefore the number of needed FileWriters is unknown.
As an alternative I could read the complete csv file for each category. In the first iteration only lines of category "1" would be processed and written into "1.csv", in the second step only lines of category "2" go into "2.csv" and so on.
But: This means the file has to be read as many times as categories exist which could be quite often.
Do you know whether there is an elegant solution for this purpose?
I also appreciate linux-based solutions! Maybe it is not necessary to create a Java program?
I guess that awk could be the tool of choice?
Thanks for your help!

Try this awk one-liner:
awk -F, '{print >> ("output" $NF ".csv")}' input.csv
It will read each line and write it to the appropriate output csv file, based on the value of the last field of the line.
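If you would rather stay in Java, the same single-pass idea can be sketched with a map from category to writer, so the number of categories does not have to be known in advance. This is only a sketch: the class name CsvSplitter, the method split and the output naming (<category>.csv, as in the question) are assumptions, and it assumes the category never contains a comma.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from the answers above.
class CsvSplitter {

    /** Writes each line of input to outDir/<lastField>.csv in a single pass. */
    static void split(Path input, Path outDir) throws IOException {
        // One writer per category, created lazily when the category first appears.
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Everything after the last comma is the category.
                String category = line.substring(line.lastIndexOf(',') + 1).trim();
                BufferedWriter writer = writers.get(category);
                if (writer == null) {
                    // Creates the file, or truncates it if it already exists.
                    writer = Files.newBufferedWriter(
                            outDir.resolve(category + ".csv"), StandardCharsets.UTF_8);
                    writers.put(category, writer);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }
}
```

Calling CsvSplitter.split(Paths.get("input.csv"), Paths.get(".")) on the example data should give 1.csv, 2.csv and 3.csv. Every category keeps a file handle open until the end, so with a very large number of categories you may hit the operating system's open-file limit.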

I would do it in a more generic way. In this case I don't need to know all the values in the second column in advance, so this is automatic:
total.csv:
a,1
b,2
c,3
d,1
script.sh:
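#!/bin/bash
# Read line by line; "for line in $(cat ...)" would split on any whitespace.
while IFS= read -r line
do
  # The second field is the category and becomes the file name.
  filename=$(echo "$line" | awk -F "," '{print $2}')
  echo "$line" >> "$filename.csv"
done < total.csv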
outputs: 1.csv 2.csv 3.csv

Related

Look for specific sequence of characters in a file and store a sequence of characters/numbers after it in Java

I am trying to open up a .htm file and read a revision number of a file from it.
Let's say these are the lines in the file.
I don't want to read this.
I don't want to read this.
I don't want to read this. I don't want to read this. I don't want to read this. I don't want to read this.
Now, I want to look for the following characters "HIJK # " and read the number following it. Say 123456.
It will be in a line like this HIJK # 123456
I want to look for the specific sequence of characters "HIJK # " and read the number following it and store that number in a variable. Any help is appreciated. Thanks!
Simple approach:
1. Read the file line by line (it is an .htm file, so maybe use a parser such as JSoup).
2. On each line (read as a string), use contains("HIJK #").
3. If step 2 is true, do replaceAll(".*HIJK # (\\d+).*", "$1");
4. Bingo! You have your number.
Happy coding.
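The check and the extraction can also be combined into one step with java.util.regex. A minimal sketch, assuming the marker is exactly "HIJK # " followed by digits; the class name RevisionFinder and the method findRevision are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the answer above.
class RevisionFinder {

    // The literal marker "HIJK # " followed by one or more digits (captured).
    private static final Pattern REVISION = Pattern.compile("HIJK # (\\d+)");

    /** Returns the digits following "HIJK # " in the text, or null if absent. */
    static String findRevision(String text) {
        Matcher matcher = REVISION.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

You would call findRevision on each line (or on the whole file content) and stop at the first result that is not null. Convert it with Long.parseLong if you need a numeric variable.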

Find the same String in different Strings

I want to extract these strings (XXXXX, GGGGG, PPPPP) from the following strings:
COPY XXXXX,PFX='PPPPP';
COPY XXXXX,PFX='PPPPP',GRUPPE='GGGGGG';
COPY XXXXX;
COPY XXXXX,'PPPPP';
COPY 'XXXXX','PPPPP','GGGGG';
COPY 'XXXXX','PPPPP',SUPPR='YES';
COPY XXXXX,PPPPP,GGGGG;
My problem is that all these strings are different and I can't extract them. For every single string I can write a regex, but not one method that handles them all.
xxxx can be e.g. TWT000
PPPP can be e.g. TWS000
GGGG can be e.g. TWSOOO
Any chance of getting all string types in one method to extract XXXX,PPPP,GGGG?
Split every string by space, then split by comma (,). For example, for COPY XXXXX,PFX='PPPPP',GRUPPE='GGGGGG';
first step:
COPY
XXXXX,PFX='PPPPP',GRUPPE='GGGGGG';
next step:
XXXXX
PFX='PPPPP'
GRUPPE='GGGGGG';
Each line is a cell in the array after splitting. Use some ifs, with a regexp check if you must, to skip what you don't need; for example, in SUPPR='YES'; you don't want to parse YES.
Now extract everything inside the '' quotation marks.
You have successfully extracted your data.
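The splitting steps above could look roughly like this in Java. It is a sketch under a few assumptions: the values never contain commas or spaces, SUPPR is the only option to skip, and unquoted values (as in COPY XXXXX,PPPPP,GGGGG;) are taken as they are. The class name CopyStatementParser and the method extract are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the answer above.
class CopyStatementParser {

    /** Extracts the values of a COPY statement, skipping the SUPPR option. */
    static List<String> extract(String statement) {
        String s = statement.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1); // drop the trailing semicolon
        }
        List<String> values = new ArrayList<>();
        // First split: separate the COPY keyword from the argument list.
        String[] parts = s.split("\\s+", 2);
        if (parts.length < 2) {
            return values;
        }
        // Second split: one token per comma-separated argument.
        for (String token : parts[1].split(",")) {
            token = token.trim();
            int eq = token.indexOf('=');
            if (eq >= 0) {
                String key = token.substring(0, eq).trim();
                if (key.equals("SUPPR")) {
                    continue; // SUPPR='YES' is not one of the wanted values
                }
                token = token.substring(eq + 1).trim(); // keep only the value
            }
            // Strip the surrounding single quotes, if any.
            if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
                token = token.substring(1, token.length() - 1);
            }
            values.add(token);
        }
        return values;
    }
}
```

With this, every statement yields its values in order (XXXXX first, then PPPPP, then GGGGG where present), so one method covers all seven variants.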
Use batch files and cmd commands; you will find a lot of helpful functions there.
@echo off
findstr /m /c:%1 %2
if %errorlevel%==0 (
echo found
)
%1 is the string you are looking for
%2 is the text file you are searching into
and you can output the result into another file with ">"

Is it possible to append columns to a text file in java?

I am aware that
FileOutputStream out = new FileOutputStream(file_name, true);
allows appending rows to a file. Is there a way to append columns of data to a non-empty text file starting from the first row?
For example, file.txt contains:
Name Address
ABC OtherLand
Can we later modify file.txt to be:
Name Address PhoneNumber
ABC OtherLand 3333333333
I've heard of the awk command in Unix. If there isn't a way to do this directly in the java programming language, would appreciate if someone could share code-bits on calling awk using java syscalls.
Thanks!
No, you can't. There is no such option: you can only open a file to read it, to overwrite it, or to append content at its end.
To achieve this, you will need to:
Read each line of the file.
Append the new content to each line.
Write the result to a temporary file, then replace the original file with it.
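The steps above can be sketched in plain Java, without calling awk. This is only an illustration: the class name ColumnAppender, the method appendColumn and the single-space column separator are assumptions, and the final step that moves the temporary file over the original is added here to complete the picture.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not taken from the answer above.
class ColumnAppender {

    /** Appends values.get(i) to line i of the file, going through a temporary file. */
    static void appendColumn(Path file, List<String> values) throws IOException {
        // Create the temporary file next to the original so the final move stays on one volume.
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "columns", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            int row = 0;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                if (row < values.size()) {
                    writer.write(' ');            // assumed column separator
                    writer.write(values.get(row));
                }
                writer.newLine();
                row++;
            }
        }
        // Replace the original file with the extended copy.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For the example, appendColumn(Paths.get("file.txt"), Arrays.asList("PhoneNumber", "3333333333")) should turn the two lines into "Name Address PhoneNumber" and "ABC OtherLand 3333333333". Lines without a matching value are copied unchanged.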

Carriage Return and Line Feed windows and Linux java application

I am working on an integration test application. This is what I do in the test case:
I read a test input file, which is stored in CVS, and write it to a file in the file system. The application polls the directory for the file, processes it and creates the output file. I poll the directory for the output file, and the test case is successful if the contents of both files are equal (I read both the input file and the output file into strings and compare them).
The problem is that this test case fails when it runs on a Linux system. The reason is that the file stored in CVS was checked in from a Windows system and contains CRLF line terminators, whereas the generated output file has LF line terminators. When I read these files and compare them character by character, they mismatch.
Could anyone help here?
You can check the line separator for the host operating system using System.getProperty("line.separator")
Since you're using text files, you can also compare the file contents line by line. Check LineNumberReader.readLine() for that.
You can try to compare them by lines, e.g. with FileUtils from Apache Commons IO.
List<String> file1 = FileUtils.readLines(...);
List<String> file2 = FileUtils.readLines(...);
return file1.equals(file2);
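If you don't want the extra dependency, a similar comparison is possible with java.nio.file.Files, whose readAllLines treats "\r\n", "\n" and "\r" all as line terminators. A minimal sketch; the class name LineCompare, the method sameLines and the UTF-8 encoding are assumptions:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not taken from the answer above.
class LineCompare {

    /** Returns true if both files contain the same lines, ignoring the line terminator style. */
    static boolean sameLines(Path first, Path second) throws IOException {
        return Files.readAllLines(first, StandardCharsets.UTF_8)
                .equals(Files.readAllLines(second, StandardCharsets.UTF_8));
    }
}
```

Both files are read completely into memory, which should be fine for typical test fixtures.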
You could remove all the '\r' characters from the downloaded file? Or replace the "\r\n" Windows string by the "\n" Linux one. Beware of the Mac case too: end of line could be identified by "\r".
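Since you already read both files into strings, the replacement can be applied to both strings just before comparing them. A small sketch; the class name LineEndings and the method normalize are made up for illustration:

```java
// Hypothetical helper, not taken from the answer above.
class LineEndings {

    /** Converts Windows ("\r\n") and old Mac ("\r") line endings to "\n". */
    static String normalize(String text) {
        // Replace "\r\n" first, so a Windows line ending does not become two line feeds.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

The test would then compare LineEndings.normalize(expected) with LineEndings.normalize(actual) instead of the raw strings.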
When you check in the file, you can tell CVS it's a binary file (cvs add -kb), and then CVS will not convert line endings along the way.
This has other drawbacks too, e.g. no proper diff, but if you really test character by character, I guess you don't need that.
Please note that you must specify -kb when adding the file, you can't change it later.

Regarding Java Split Command CSV File Parsing

I have a csv file in the format below. I get an issue if either one of the csv records below is read by the program:
"D",abc"def,"","0429"292"0","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
"D","abc"def","","04292920","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
The split command below is used to ignore the commas inside the double quotes. I got it from an earlier post, "Regarding Java Split Command Parsing Csv File":
String[] items = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", 15);
System.out.println("items.length" + items.length);
The items.length is printed as 14 instead of 15. The abc"def is not recognized as an individual field and is incorrectly stored as
"D",abc"def in items[0]. I want it to be stored as follows:
items[0] should be "D" and items[1] should be abc"def
The same issue happens when the value is "abc"def". I want it to be stored as:
items[0] should be "D" and items[1] should be "abc"def"
Also, this split command works perfectly if the double quotes are doubled inside the quoted field (for example D,"abc""def",1).
How can I resolve this issue?
I think you would be much better off writing a parser for the CSV files rather than trying to use a regular expression. Once you start dealing with CSV files with carriage returns within the fields, the regex will probably fall apart. It wouldn't take that much code to write a simple while loop that goes through all the characters and splits up the data. It would be a lot easier to deal with "non-standard"* CSV files such as yours with a parser than with a regex.
*I say non-standard because there isn't really an official standard for CSV, and when you're dealing with CSV files from many different systems, you see lots of weird things, like the abc"def field as shown above.
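Such a character loop could look roughly like the sketch below. The rules are assumptions chosen to fit the two sample records, not a standard: a quote opens a quoted field only at the very start of a field, a quote closes it only when followed by a comma or the end of the line, a doubled quote inside a quoted field stands for one quote, and any other quote is kept as data. The class name LenientCsvParser and the method parseLine are made up, and the sketch returns the field content without the enclosing quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the answer above.
class LenientCsvParser {

    /** Splits one CSV line into fields, tolerating stray quotes inside fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;     // currently inside a quoted field?
        boolean fieldStart = true;  // at the first character of a field?

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean lastChar = i + 1 == line.length();
            char next = lastChar ? '\0' : line.charAt(i + 1);

            if (quoted) {
                if (c == '"' && (lastChar || next == ',')) {
                    quoted = false;        // closing quote
                } else if (c == '"' && next == '"') {
                    field.append('"');     // doubled quote stands for one quote
                    i++;
                } else {
                    field.append(c);       // ordinary data, including stray quotes
                }
            } else if (c == ',') {
                fields.add(field.toString());  // field separator
                field.setLength(0);
                fieldStart = true;
                continue;
            } else if (c == '"' && fieldStart) {
                quoted = true;             // opening quote
            } else {
                field.append(c);           // unquoted data, e.g. abc"def
            }
            fieldStart = false;
        }
        fields.add(field.toString());      // last field of the line
        return fields;
    }
}
```

Both sample records should come out with 15 fields, with D as the first field and abc"def as the second. Line breaks inside quoted fields are not handled, because the sketch works on one line at a time.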
opencsv is a great simple and light weight CSV parser for Java. It will easily handle your data.
If possible, changing your CSV format would make the solution very simple.
See the following for an overview of Delimiter Separated Values, a common format on Unix-based systems:
http://www.faqs.org/docs/artu/ch05s02.html#id2901882
opencsv is a very simple API and a good choice for CSV parsing. The file can also be preprocessed with Linux sed commands before it is processed in Java: if the file is not in a proper format, convert the column delimiter (here ",") into a pipe or another unique delimiter, so that opencsv can easily tell field values and column delimiters apart. Use the power of Linux with your Java code.
