I have a class that tries to read a CSV file using Apache Commons CSV. So far my code is working, except that I am not getting the result I expect.
My code is displaying a duplicate of the second column in the CSV file, as below:
support#gmail.com
google
google.com
support#gmail.com
google
tutorialspoint
info#tuto.com
google
My CSV File
Name,User Name,Password
google.com,support#gmail.com,google
tutorialspoint,info#tuto.com,google
I expect to get something like this:
google.com
support#gmail.com
google
tutorialspoint
info#tuto.com
google
Here is my block that parses the CSV using Apache Commons CSV:
public List<String> readCSV(String[] fields) {
    // HERE WE START PROCESSING THE READ CSV CONTENTS
    List<String> contents = new ArrayList<String>();
    FileReader fileReader = null;
    CSVParser csvFileParser = null;
    // HERE WE START PROCESSING
    if (fields != null) {
        // Create the CSVFormat object with the header mapping
        CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);
        try {
            // Create a new list to be filled by the CSV file data
            List<String> content = new ArrayList<String>();
            // Initialize the FileReader object
            fileReader = new FileReader(FilePath);
            // Initialize the CSVParser object
            csvFileParser = new CSVParser(fileReader, csvFileFormat);
            // Get a list of CSV file records
            List<CSVRecord> csvRecords = csvFileParser.getRecords();
            // Read the CSV file records starting from the second record to skip the header
            for (int i = 1; i < csvRecords.size(); i++) {
                CSVRecord record = csvRecords.get(i);
                // Collect the requested fields from this record
                for (int j = 0; j < fields.length; j++) {
                    content.add(record.get(fields[j]));
                }
                // Here we submit to contents
                contents.addAll(content);
                System.out.println(contents.size());
            } // end of loop
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                fileReader.close();
                csvFileParser.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    // Here we return
    return contents;
}
I just can't figure out what I'm missing here; any help is welcome.
The reason is that you're adding the entire String list content to contents on each iteration:
contents.addAll(content);
Since content is never cleared, it grows with every record, and its accumulated values get re-added each time. Either clear content on each iteration, or just change
content.add(record.get(fields[j]));
to
contents.add(record.get(fields[j]));
and remove the
contents.addAll(content);
line.
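For illustration, here is a minimal sketch of the second fix applied to the record loop from the question (same variables as above):

for (int i = 1; i < csvRecords.size(); i++) {
    CSVRecord record = csvRecords.get(i);
    for (int j = 0; j < fields.length; j++) {
        // Add each field value straight to the result list; no intermediate list needed
        contents.add(record.get(fields[j]));
    }
}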
I have a CSV with headers, for example:
Title,Project ID,summary,priority
1,1,Test summary,High
Now I want to get the list of headers that are passed in the CSV file.
NOTE: The headers passed will be different every time.
Thanks in advance
You can use CSVReader from the opencsv library:
String fileName = "data.csv";
CSVReader reader = new CSVReader(new FileReader(fileName));
// If the first line is the header
String[] header = reader.readNext();
You can read the CSV file line by line and split each line at the commas. The split method returns an array, and each array element contains one value from the line read.
Suppose the Title and Project ID fields are of integer type: then whichever two elements are integers, treat the first as Title and the second as Project ID. The remaining strings can be considered as Summary and Priority.
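A minimal sketch of that approach (the file name data.csv is just a placeholder; the naive split is only safe if values contain no commas):

BufferedReader br = new BufferedReader(new FileReader("data.csv"));
String headerLine = br.readLine();           // first line holds the headers
String[] headers = headerLine.split(",");    // naive split on commas
String line;
while ((line = br.readLine()) != null) {
    String[] values = line.split(",");       // values[i] corresponds to headers[i]
}
br.close();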
You could use org.apache.commons.csv.CSVParser from Apache Commons. It has methods to get both the headers and the content.
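For example, a short sketch, assuming the file's first line is the header (the file name is a placeholder):

CSVParser parser = new CSVParser(new FileReader("data.csv"), CSVFormat.DEFAULT.withHeader());
Map<String, Integer> headers = parser.getHeaderMap();  // header name -> column index
List<CSVRecord> records = parser.getRecords();         // the remaining content rows
parser.close();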
Try the below to read the header alone from a CSV file:
BufferedReader br = new BufferedReader(new FileReader("myfile.csv"));
String header = br.readLine();
if (header != null) {
    String[] columns = header.split(",");
}
Apache Commons CSV: to get only the header content from a CSV file and store it in a list, use the code below.
List<Map<String, Integer>> list = new ArrayList<>();
try (Reader reader = Files.newBufferedReader(Paths.get(CSV_FILE_PATH))) {
    CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT.withHeader());
    Map<String, Integer> header = csvParser.getHeaderMap();
    list.add(header);
    list.forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}
Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files, which is working fine:
ArrayList<String[]> lines = new ArrayList<String[]>();
...
for (FileStatus item : items) {
    // Ignoring files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }
    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line;
    line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As an added note, based on research from the below discussions:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading
I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3GB. I have not encountered any issues to date, but we did use file splitting to handle the large data set. Below is the code you can use for reading, to test the same.
public class HDFSClientTesting {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            //System.loadLibrary("libhadoop.so");
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            conf.addResource(new Path("core-site.xml"));

            String filename = "/dir/00000027";
            long byteOffset = 3185041;

            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(filename), conf);
            Text key = new Text();
            Text value = new Text();

            // Jump straight to the record at the given byte offset and read it
            rdr.seek(byteOffset);
            rdr.next(key, value);

            // Plain text
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            File file = new File("test.gz");
            file.createNewFile();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
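One caveat, as far as the Hadoop API goes: SequenceFile.Reader.seek(long) expects a byte offset that lands exactly on a record boundary. If you only have an approximate position, sync(long) advances the reader to the next sync point so that next() can resume safely:

rdr.sync(byteOffset);           // position the reader at the next sync point after byteOffset
while (rdr.next(key, value)) {
    // process records from the sync point onward
}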
I have a DB2 database with an XML column. I would like to read data from it and save each XML to a separate file.
Here is a part of my code:
final List<Map<String, Object>> myList = dbcManager.createQuery(query).getResultList();
int i = 0;
for (final Map<String, Object> element : myList) {
    i++;
    String filePath = "C://elements//elem_" + i + ".xml";
    File file = new File(filePath);
    if (!file.exists()) {
        file.createNewFile();
    }
    BufferedWriter out = new BufferedWriter(new FileWriter(filePath));
    out.write(element.get("columnId"));
    out.close();
}
Now, I have an error in the line out.write(element.get("columnId"));, because element.get("columnId") is of type Object and it should be, for example, a String.
My question is: to which type should I convert (cast) element.get("columnId") to save it to an XML file?
You should use the ResultSet.getSQLXML() method to read the XML column value, then use an appropriate method of the SQLXML class, e.g. getString() or getCharacterStream(). See the java.sql.SQLXML Javadoc for more info.
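A minimal sketch of that approach (the query, column name, and output path are made up for illustration):

try (Connection conn = DriverManager.getConnection(url, user, password);
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT xml_col FROM my_table")) {
    int i = 0;
    while (rs.next()) {
        SQLXML xml = rs.getSQLXML("xml_col");    // read the XML column as SQLXML
        String content = xml.getString();        // materialize it as a String
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter("C://elements//elem_" + (++i) + ".xml"))) {
            out.write(content);
        }
        xml.free();                              // release the SQLXML resources
    }
}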
I'm trying to write a method that simply reads in a CSV file and stores the data from the file. Here is a link to a screenshot of the CSV file I am trying to read in, and the code for this method:
http://i.imgur.com/jsGTg.png
public static void correctPrices(String correctfile) {
    String data;
    Date date;
    SimpleDateFormat formatter = new SimpleDateFormat("MM/dd/yyyy");
    File correctedfile = new File(correctfile);
    Scanner correct;
    try {
        correct = new Scanner(correctedfile);
        correct.nextLine(); // to avoid reading the heading
        ArrayList<Date> correctdate = new ArrayList<Date>();
        ArrayList<String> correctdata = new ArrayList<String>();
        while (correct.hasNext()) {
            correctdata.add(correct.nextLine());
            //data = correct.nextLine();
            //String[] corrected = correct.nextLine().split(",");
            //date = formatter.parse(corrected[0]);
            //correctdate.add(date);
        }
        for (int i = 0; i < correctdata.size(); i++) {
            System.out.println(correctdata.get(i));
        }
    } catch (FileNotFoundException ex) {
        Logger.getLogger(DataHandler.class.getName()).log(Level.SEVERE, null, ex);
    }
}
As expected, this code outputs the last 2 lines of the file. However, when I un-comment data = correct.nextLine(); in the while loop, the output only returns the second line of the CSV, and not the last line. I'm a little baffled by this. All I tried to do was store the line in another variable, so why would the last line be omitted? Thanks for your help and time; let me know if you need any additional info!
The problem is that when you call correct.nextLine(), it reads in a line and then advances a pointer in the file past what was just read. Since you call it multiple times in the loop, the pointer advances multiple times, skipping lines. What you should do is read the line just once, at the beginning of the while loop, using
data = correct.nextLine();
and then replace correct.nextLine() everywhere else it appears in the loop with data.
In other words, your while loop would look like
while (correct.hasNext()) {
    data = correct.nextLine();
    correctdata.add(data);
    String[] corrected = data.split(",");
    date = formatter.parse(corrected[0]);
    correctdate.add(date);
}
I'm reading 2 CSV files: store_inventory & new_acquisitions.
I want to be able to compare the store_inventory csv file with new_acquisitions.
1) If the item names match just update the quantity in store_inventory.
2) If new_acquisitions has a new item that does not exist in store_inventory, then add it to the store_inventory.
Here is what I have done so far, but it's not very good. I added comments where I need to add tasks 1 & 2.
Any advice or code to do the above tasks would be great! Thanks.
File new_acq = new File("/src/test/new_acquisitions.csv");
Scanner acq_scan = null;
try {
    acq_scan = new Scanner(new_acq);
} catch (FileNotFoundException ex) {
    Logger.getLogger(mainpage.class.getName()).log(Level.SEVERE, null, ex);
}
String itemName;
int quantity;
Double cost;
Double price;

File store_inv = new File("/src/test/store_inventory.csv");
Scanner invscan = null;
try {
    invscan = new Scanner(store_inv);
} catch (FileNotFoundException ex) {
    Logger.getLogger(mainpage.class.getName()).log(Level.SEVERE, null, ex);
}
String itemNameInv;
int quantityInv;
Double costInv;
Double priceInv;

while (acq_scan.hasNext()) {
    String line = acq_scan.nextLine();
    if (line.charAt(0) == '#') {
        continue;
    }
    String[] split = line.split(",");
    itemName = split[0];
    quantity = Integer.parseInt(split[1]);
    cost = Double.parseDouble(split[2]);
    price = Double.parseDouble(split[3]);
    while (invscan.hasNext()) {
        String line2 = invscan.nextLine();
        if (line2.charAt(0) == '#') {
            continue;
        }
        String[] split2 = line2.split(",");
        itemNameInv = split2[0];
        quantityInv = Integer.parseInt(split2[1]);
        costInv = Double.parseDouble(split2[2]);
        priceInv = Double.parseDouble(split2[3]);
        if (itemName == itemNameInv) {
            //update quantity
        }
    }
    //add new entry into csv file
}
Thanks again for any help. =]
I suggest you use one of the existing CSV parsers, such as Commons CSV or Super CSV, instead of reinventing the wheel. It should make your life a lot easier.
Your implementation makes the common mistake of breaking the line on commas by using line.split(","). This does not work because the values themselves might have commas in them. If that happens, the value must be quoted, and you need to ignore commas within the quotes. The split method cannot do this -- I see this mistake a lot.
Here is the source of an implementation that does it correctly:
http://agiletribe.purplehillsbooks.com/2012/11/23/the-only-class-you-need-for-csv-files/
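To give a sense of what quote-awareness involves, here is a deliberately simplistic sketch; it handles quoted commas but not escaped quotes or embedded newlines, which is exactly why a real parser library is the better choice:

// Split on commas that sit outside double quotes, then strip the surrounding quotes.
String line = "widget,\"bolts, assorted\",42";
String[] parts = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
for (int i = 0; i < parts.length; i++) {
    parts[i] = parts[i].replaceAll("^\"|\"$", "");
}
// parts is now ["widget", "bolts, assorted", "42"]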
With the help of the open-source library uniVocity-parsers, you could get this done with pretty clean code, as follows:
private void processInventory() throws IOException {
    /**
     * ---------------------------------------------
     * Read CSV rows into lists of beans you defined
     * ---------------------------------------------
     */
    // 1st, configure the CSV reader with a row processor attaching the bean definition
    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setLineSeparator("\n");
    BeanListProcessor<Inventory> rowProcessor = new BeanListProcessor<Inventory>(Inventory.class);
    settings.setRowProcessor(rowProcessor);
    settings.setHeaderExtractionEnabled(true);

    // 2nd, parse all rows from each CSV file into a list of beans
    CsvParser parser = new CsvParser(settings);
    parser.parse(new FileReader("/src/test/store_inventory.csv"));
    List<Inventory> storeInvList = rowProcessor.getBeans();

    parser.parse(new FileReader("/src/test/new_acquisitions.csv"));
    List<Inventory> newAcqList = rowProcessor.getBeans();
    Iterator<Inventory> newAcqIterator = newAcqList.iterator();

    // 3rd, process the beans with business logic
    while (newAcqIterator.hasNext()) {
        Inventory newAcq = newAcqIterator.next();
        boolean isItemIncluded = false;
        // Use a fresh iterator for each acquisition; a single shared iterator
        // would be exhausted after the first pass over the inventory
        Iterator<Inventory> storeInvIterator = storeInvList.iterator();
        while (storeInvIterator.hasNext()) {
            Inventory storeInv = storeInvIterator.next();
            // 1) If the item names match, just update the quantity in store_inventory
            if (storeInv.getItemName().equalsIgnoreCase(newAcq.getItemName())) {
                storeInv.setQuantity(newAcq.getQuantity());
                isItemIncluded = true;
            }
        }
        // 2) If new_acquisitions has a new item that does not exist in store_inventory,
        //    then add it to the store_inventory.
        if (!isItemIncluded) {
            storeInvList.add(newAcq);
        }
    }
}
Just follow this code sample I worked out according to your requirements. Note that the library provides a simplified API and significant performance for parsing CSV files.
The operation you are performing will require that, for each item in your new acquisitions, you search every item in inventory for a match. This is not only inefficient, but the scanner that you have set up for your inventory file would need to be reset after each item.
I would suggest that you add your new acquisitions and your inventory to collections, then iterate over your new acquisitions and look each new item up in your inventory collection. If the item exists, update it. If it doesn't, add it to the inventory collection. For this, it might be good to write a simple class to contain an inventory item; it could be used for both the new acquisitions and the inventory. For a fast lookup, I would suggest a HashSet or HashMap for your inventory collection. A sketch of the idea follows below.
At the end of the process, don't forget to persist the changes to your inventory file.
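A minimal sketch of that idea (the Item class and the parsing of the two files are simplified placeholders; whether a match adds to or replaces the quantity is up to your requirements):

class Item {
    String name;
    int quantity;
    double cost, price;
}

Map<String, Item> inventory = new HashMap<String, Item>();
// ... fill 'inventory' from store_inventory.csv, keyed by item name ...
// ... parse new_acquisitions.csv into 'newAcquisitions' ...
for (Item acq : newAcquisitions) {
    Item existing = inventory.get(acq.name);
    if (existing != null) {
        existing.quantity += acq.quantity;   // item exists: update the quantity
    } else {
        inventory.put(acq.name, acq);        // new item: add it to the inventory
    }
}
// ... finally, write inventory.values() back out to store_inventory.csv ...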
As Java doesn't support parsing of CSV files natively, we have to rely on a third-party library. Opencsv is one of the best libraries available for this purpose. It's open source and is shipped with an Apache 2.0 license, which makes it possible for commercial use.
This should help you and others in similar situations!
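For reading, a minimal opencsv sketch looks something like this (the file name is a placeholder):

CSVReader reader = new CSVReader(new FileReader("data.csv"));
String[] row;
while ((row = reader.readNext()) != null) {
    // row[i] holds the i-th value of the current line
    System.out.println(Arrays.toString(row));
}
reader.close();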
For writing to CSV
public void writeCSV() {
    // Delimiter used in the CSV file
    final String NEW_LINE_SEPARATOR = "\n";
    // CSV file header
    final Object[] FILE_HEADER = { "Employee Name", "Employee Code", "In Time", "Out Time", "Duration", "Is Working Day" };
    String fileName = "fileName.csv";
    List<Object> objects = new ArrayList<Object>();
    FileWriter fileWriter = null;
    CSVPrinter csvFilePrinter = null;
    // Create the CSVFormat object with "\n" as the record delimiter
    CSVFormat csvFileFormat = CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR);
    try {
        fileWriter = new FileWriter(fileName);
        csvFilePrinter = new CSVPrinter(fileWriter, csvFileFormat);
        csvFilePrinter.printRecord(FILE_HEADER);
        // Write the object list to the CSV file
        // (getValue1() etc. stand in for your own bean's getters)
        for (Object object : objects) {
            List<String> record = new ArrayList<String>();
            record.add(object.getValue1().toString());
            record.add(object.getValue2().toString());
            record.add(object.getValue3().toString());
            csvFilePrinter.printRecord(record);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            fileWriter.flush();
            fileWriter.close();
            csvFilePrinter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You can use the Apache Commons CSV API.
FYI, this answer: https://stackoverflow.com/a/42198895/6549532
Read / Write Example
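For instance, a minimal read sketch with Commons CSV (file and column names are placeholders; assumes the first record is the header):

Reader in = new FileReader("data.csv");
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String name = record.get("Name");    // access columns by header name
    System.out.println(name);
}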