How to reorder a 60mb CSV file - java

I have a .csv file that is ordered in a certain way. I want to reorder it by another field. Your ideas would be much appreciated.
I only have to do this once, not multiple times, so performance is not too much of an issue.
What I am thinking:
Create a Java object to hold each of the fields, then build an ArrayList of these objects. I will then sort the ArrayList on the field I want (I can sort an ArrayList of objects on one member of the object, right?) and print the reordered ArrayList to a .csv file.

Sounds like it would work, but it is also overkill. If you have a Unix box or Cygwin you could just do
sort -t , -k <field number> file
This breaks each line into fields at the commas and sorts starting at the given field (fields are numbered from 1).
sort -t , -k 2,2 file
sorts by the second field only.

Can't you just load the csv into Excel, use the sort function to reorder it, and then save the result as a new csv file?

If you have access to a Linux box then use sort as suggested above. But, if it has to be Java then at least use an existing library to parse the CSV file. The format is hellishly complicated to parse if you want to handle all the corner cases correctly. I'd suggest a library like OpenCSV.
This code snippet shows how to use the library (with all error handling omitted!):
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
// CSVReader and CSVWriter come from OpenCSV
// (package au.com.bytecode.opencsv in 2.x, com.opencsv in later releases).

/**
 * Sorts a CSV file by a fixed column.
 *
 * @param col The zero-based column to sort by.
 * @param in The input CSV file.
 * @param out The output writer to receive the reordered CSV.
 */
public static void sort(final int col, final Reader in, final Writer out)
        throws IOException {
    final List<String[]> csvContent = new ArrayList<String[]>();
    // parse CSV file
    final CSVReader reader = new CSVReader(in);
    String[] line;
    while ((line = reader.readNext()) != null) {
        csvContent.add(line);
    }
    reader.close();
    // sort CSV content
    Collections.sort(csvContent, new Comparator<String[]>() {
        @Override
        public int compare(final String[] o1, final String[] o2) {
            // adjust here for numeric sort, etc.
            return o1[col].compareTo(o2[col]);
        }
    });
    // write sorted content
    final CSVWriter writer = new CSVWriter(out);
    writer.writeAll(csvContent);
    writer.close();
}
You can adjust the code to handle different separator characters, quote chars, numeric sorting, etc.
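For the numeric sort mentioned in the comment, here is a minimal sketch. It assumes the rows are already parsed into String[] (for example by the reader above), that any header line has been removed, and that every value in the chosen column is something Double.parseDouble accepts. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class NumericColumnSort {

    // Returns a new list sorted by the numeric value in the given zero-based column.
    // Assumes every row has that column and that it parses as a double.
    static List<String[]> sortByNumericColumn(List<String[]> rows, int col) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingDouble(
                (String[] row) -> Double.parseDouble(row[col].trim())));
        return sorted;
    }
}
```

With the values "10", "9" and "100" in the sort column, a plain String comparison would order them "10", "100", "9"; comparing the parsed numbers orders them 9, 10, 100.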

If you know how to use Vim: http://vim.wikia.com/wiki/Working_with_CSV_files
CSV files (comma-separated values) are often used to save tables of data in plain text. Following are some useful techniques for working with CSV files. You can:
Highlight all text in any column.
View fields (convert csv text to columns or separate lines).
Navigate using the HJKL keys to go left, down, up, right by cell (hjkl work as normal).
Search for text in a specific column.
Sort lines by column.
Delete a column.
Specify a delimiter other than comma.

Related

Write ArrayList of Object that contain ArrayList to CSV

I am tasked with scraping data from a webpage and writing it, along with other information, into a CSV. Currently I use JSoup to scrape the website, but my problem is that I am not sure how to write the data to a CSV.
I store the data of each scraped page inside an object called CSVObject:
import java.util.ArrayList;

public class CSVObject {
    String name;
    String title;
    String description;
    ArrayList<String> color;
    ArrayList<String> size;
    ArrayList<Float> price;
}
I store these Objects in an ArrayList<CSVObject>
The name, title, description is from the scraped data but the color, size and price are from user input. They can choose multiple and it will add to the ArrayList in the Object.
The desired file output is something like this:
Name   Title          Description        Color  Size  Price
Shirt  Holiday Shirt  Shirt Description  Black  S     15.99
Shirt                                    Black  M     19.99
Shirt                                    Black  L     24.99
Shirt                                    Green  S     15.99
Shirt                                    Green  M     19.99
Shirt                                    Green  L     24.99
Pants  Movie Pants    Pants Description  Red    S     17.99
...
I did some digging and found that the Java CSV Library mentioned in "How to serialize object to CSV file?" can help write a file to CSV, but I am not sure how to format it to get the desired output. So what should I do to write the file as intended?
Flat file
Comma-Separated Values (CSV) and Tab-Delimited formats are for flat files, a single table in each. That means one set of rows that all share the same set of columns.
To export the data as seen in your example data, repeat the values in the first columns that you have suppressed. Then you would have a set of rows all sharing the same set of columns.
Hierarchy
According to your Java class, you have a hierarchy of data. That does not fit CSV format. Square peg, round hole.
To match the structure of your Java class, you should be serializing your data in a hierarchical format such as XML or JSON.
Not-really-CSV
If you insist on using that not-really-CSV format you showed, you need nested loops.
Loop your set of objects. For each of those objects, loop the lists contained within.
On the first time through the lists, write out all columns. For subsequent times in the inner loop, suppress those values, writing only a COMMA character to maintain the column count.
Straight-forward logic, nothing tricky, following the same steps you would take if writing these values by hand to paper.
Of course, any field values containing your delimiter character (COMMA, etc.) must be enclosed within quotes. Or just enclose all fields in quotes.
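The nested loops could look roughly like this. It is only a sketch under assumptions: Item is a made-up stand-in for the CSVObject class in the question, and the color, size and price lists are assumed to be parallel (entry i of each list describes the same variant). Text fields are always quoted and embedded quotes are doubled.

```java
import java.util.List;

final class NestedCsvWriter {

    // Hypothetical stand-in for the CSVObject class in the question.
    static final class Item {
        final String name;
        final String title;
        final String description;
        final List<String> colors;
        final List<String> sizes;
        final List<Double> prices;

        Item(String name, String title, String description,
             List<String> colors, List<String> sizes, List<Double> prices) {
            this.name = name;
            this.title = title;
            this.description = description;
            this.colors = colors;
            this.sizes = sizes;
            this.prices = prices;
        }
    }

    // Encloses a field in quotes and doubles any embedded quote character.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    static String toCsv(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {                          // outer loop: one object
            for (int i = 0; i < item.colors.size(); i++) { // inner loop: one variant
                boolean first = (i == 0);
                // Shared columns are written once, then left empty;
                // the comma alone keeps the column count stable.
                sb.append(first ? quote(item.name) : "").append(',');
                sb.append(first ? quote(item.title) : "").append(',');
                sb.append(first ? quote(item.description) : "").append(',');
                sb.append(quote(item.colors.get(i))).append(',');
                sb.append(quote(item.sizes.get(i))).append(',');
                sb.append(item.prices.get(i)).append('\n');
            }
        }
        return sb.toString();
    }
}
```

To get a true flat file instead, write the name, title and description on every row rather than only when first is true.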
Here's a quick and dirty version. It assumes your lists of colors, prices and sizes always have the same length:
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

interface CSVObject {
    String name();
    String title();
    String description();
    List<String> color();
    List<String> size();
    List<Double> price();
}

// Placeholder: fill this with the scraped objects.
List<CSVObject> data = List.of();
// One output row per index i; name, title and description repeat on every row.
String csv = data.stream()
    .flatMap(co -> IntStream.range(0, co.color().size())
        .mapToObj(i -> new String[] {
            co.name(), co.title(), co.description(),
            co.color().get(i), co.size().get(i), co.price().get(i).toString()
        }))
    .map(sa -> String.join(",", sa))
    .collect(Collectors.joining("\n"));

Writing to a specific position of a file

I have a text file which gets created by a batch script where the field position are always the same. Within that file, I want to go to a specific position and add a field to it.
For example, suppose my file has only two lines and has the following fields:
number customer_id account_no address price plan
number customer_id account_no address price plan
I want to add an extra field between address and price so the new file will look similar to this:
number customer_id account_no address newfield price plan
number customer_id account_no address newfield price plan
I couldn't find any Utility class that can go to a specific position of a line in a file and write to it.
I could do it the tedious way of placing the fields to an array including the whitespace and spitting it out field by field to a new file, however, that was too cumbersome and very tedious (since it has more than 90 fields) and was wondering if there is an easier way.
Whether you use Java or anything else, there is no way to insert data in the middle of a file. You can overwrite existing bytes in place or append to the end, but an insertion means rewriting everything that follows the insertion point.
Anyway, if you need to insert a new column and keep the format of the entire data file consistent, you have to write whitespace for the rows without a value, so rewriting all rows is inevitable.
"I couldn't find any Utility class that can go to a specific position of a line in a file and write to it."
If you couldn't find one, you can write one yourself. It is actually not that tedious. Just write a utility class yourself, so you can use it in future.
// A brief example; the method bodies are left for you to fill in.
public final class FileUtility {
    private static String filepath;

    public static void setFilePath(String filepath) {
        FileUtility.filepath = filepath;
    }

    public static int searchField(String fieldname, int lineNo) {
        // return position of given field name in a specific line
        throw new UnsupportedOperationException("not implemented");
    }

    public static void insertDataAt(String data, int column) {
        // insert the given data at the given column
        throw new UnsupportedOperationException("not implemented");
    }

    public static boolean dataExist(String data, int lineNo) {
        // return true if given data exists at given line number
        throw new UnsupportedOperationException("not implemented");
    }
}
Forget about inserting into the middle of the file. Files are sequences of bytes, so you would first have to move every byte after the insertion point N bytes forward to make room for your modification. That costs more than processing the file on the fly and writing the new lines to another file.
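Processing the file on the fly could be sketched like this, assuming the fields sit at fixed character positions so that the new field goes in at the same offset on every line. The class name, method names and the offset are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FixedWidthInsert {

    // Returns the line with newField inserted at the given zero-based character offset.
    // Lines shorter than the offset are padded with spaces first.
    static String insertAt(String line, int offset, String newField) {
        StringBuilder sb = new StringBuilder(line);
        while (sb.length() < offset) {
            sb.append(' ');
        }
        return sb.insert(offset, newField).toString();
    }

    // Streams the input line by line into a new file; the input file is not modified.
    static void rewrite(Path in, Path out, int offset, String newField) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(insertAt(line, offset, newField));
                writer.newLine();
            }
        }
    }
}
```

Only one line is held in memory at a time, so the size of the file does not matter. Once the output has been checked, it can replace the original file.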

Generic Class for JTable

I got a task which I am not sure how to solve.
I have to fill a JTable with rows I get from a .txt document. The problem is that there are multiple .txt documents, which have more or fewer rows and columns for the JTable.
example:
inside the cars.txt:
id;hp;price;quantity
1;100;7000;5
4;120;20000;2
7;300;80000;3
inside the bikes.txt
id;price;quantity;color;year
3;80;20;red;2010
5;200;40;green;2011
12;150;10;blue;2007
So, when a .txt is chosen a JDialog will pop up with a JTable inside, where the data will be shown.
I thought that I could maybe create a "class Anything" with an instance variable String[][] whose dimensions I set after reading the .txt: once the data is read, I can count how many rows and how many columns it has.
With the cars.txt example it would be String[3][4] (3 data rows, 4 columns).
Is that a good way to work with or is there a better way to do it?
Thanks for the help :D
Your question is a bit vague on what you want to do specifically.
Do you want to simply fill the table with all data given? Or do you only want certain columns used? When you choose the text files are you aware of which column names they have (can you hardcode this or not).
A good start would be:
DefaultTableModel dtm = (DefaultTableModel) yourJTable.getModel();
// Split the text you read from the file into rows.
String[] rowSplit = yourTxtFileThatYouRead.split("\n");
// This assumes that the first line of the file contains the column headers.
dtm.setColumnIdentifiers(rowSplit[0].split(";"));
// Start the iteration at 1 to skip the column headers.
for (int i = 1; i < rowSplit.length; ++i) {
    // addRow notifies the table itself, so no explicit fire call is needed.
    dtm.addRow(rowSplit[i].split(";"));
}
The first part sets the column headers and allows the number of columns to vary from file to file.
The second part sequentially adds rows.
As shown in How to Use Tables: Creating a Table Model, you can extend AbstractTableModel to manage models of arbitrary dimensions. Let your model manage a List<List<String>>. Parse the first line of each file into a List<String> that is accessed by your implementations of getColumnCount() and getColumnName(). Parse subsequent lines into one List<String> per row; access the List of such rows in your implementation of getValueAt(). A related example that manages a Map<String, String> is shown here. Although more complex, you can use Class Literals as Runtime-Type Tokens for non-string data; return the token in your implementation of getColumnClass() to get the default render and editor for supported types. Finally, consider one of these file based JDBC drivers for flat files.
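A minimal sketch of such a model, assuming the file has already been read into a non-empty list of lines with the header first. The class name DelimitedTableModel is made up; only the overridden methods belong to AbstractTableModel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.swing.table.AbstractTableModel;

final class DelimitedTableModel extends AbstractTableModel {

    private final List<String> columnNames;
    private final List<List<String>> rows = new ArrayList<>();

    // lines: the file content, header line first; delimiter: a regex such as ";".
    DelimitedTableModel(List<String> lines, String delimiter) {
        columnNames = Arrays.asList(lines.get(0).split(delimiter, -1));
        for (String line : lines.subList(1, lines.size())) {
            rows.add(Arrays.asList(line.split(delimiter, -1)));
        }
    }

    @Override
    public int getRowCount() {
        return rows.size();
    }

    @Override
    public int getColumnCount() {
        return columnNames.size();
    }

    @Override
    public String getColumnName(int column) {
        return columnNames.get(column);
    }

    @Override
    public Object getValueAt(int rowIndex, int columnIndex) {
        List<String> row = rows.get(rowIndex);
        // Rows shorter than the header yield an empty cell instead of an exception.
        return columnIndex < row.size() ? row.get(columnIndex) : "";
    }
}
```

The same class then serves cars.txt and bikes.txt alike, for example via new JTable(new DelimitedTableModel(Files.readAllLines(path), ";")), because the column count comes from the header line rather than from the code.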

how to read a specific row from csv file using CsvReader in java

I am using CsvReader library and want to read a specific row from a csv file in java.
Sample csv:
Name, Address, Email-Id
student, studentaddress, student@email.com
student2, student2address, student2@email.com
employee, employeeaddres1, employee@email.com
I want to read the row where name is 'student2'.
Could you please provide a solution?
Thanks in advance.
As rows have different sizes in bytes, and as the CSV format doesn't contain an index, you can't get random access directly to one row.
So you must read all preceding rows and simply skip them until you reach the desired one.
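Reading and skipping could be sketched as follows with nothing but the JDK. This assumes a simple CSV with no quoted fields or embedded commas (for anything more, let a CSV library do the parsing) and that the name is the first column. The class and method names are made up.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.Optional;

final class CsvRowFinder {

    // Scans the rows in order and returns the first one whose first column equals name.
    // Reading stops as soon as a match is found; closing the reader is left to the caller.
    static Optional<String[]> findByName(Reader source, String name) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .map(line -> Arrays.stream(line.split(",", -1))
                        .map(String::trim)
                        .toArray(String[]::new))
                .filter(fields -> fields[0].equals(name))
                .findFirst();
    }
}
```

Called with a FileReader on the sample file and the name "student2", this returns the trimmed fields of that row, or an empty Optional if no row matches.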
I have some experience with this type of operation.
Just try this API: http://opencsv.sourceforge.net/
It has an option to skip the first n records, e.g.:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'', 2);
This skips the first 2 records.
Have a look through it.

How to store matrix information in MySQL?

I'm working on an application that analyzes music similarity. In order to do that I process audio data and store the results in txt files. For each audio file I create 2 files: one containing 16 values (each value can look like this: 2.7000023942731723) and the other containing 16 rows, each row containing 16 values like the one previously shown.
I'd like to store the contents of these 2 files in a table of my MySQL database.
My table looks like:
Name varchar(100)
Author varchar (100)
In order to add the content of those 2 files I think I need to use the BLOB data type:
file1 blob
file2 blob
My question is: how should I store this info in the database? I'm working with Java, where I have a double array containing the 16 values (for file1) and a matrix containing the file2 info. Should I process the values as strings and add them to the columns in my database?
Thanks
Hope I don't get negative repped into oblivion with this crazy answer, but I am trying to think outside the box. My first question is, how are you processing this data after a potential query? If I were doing something similar, I would likely use something like MATLAB or Octave, which have a specific notation for representing matrices. It is basically a bunch of comma and semicolon delimited text with square brackets at the right spots. I would store just a string that my mathematics software or module can parse natively. After all, it doesn't sound like you want to do some kind of query based on a data point.
I think you need to normalize a schema like this if you intend to keep it in a relational database.
Sounds like you have a matrix table that has a one-to-many relationship with its files.
If you insist on one denormalized table, one way to do it would be to store the name of the file, its author, the name of its matrix, and its row and column position in the named matrix that owns it.
Please clarify one thing: Is this a matrix in the linear algebra sense? A mathematical entity?
If yes, and you only use the matrix in its entirety, then maybe you can store it in a single column as a blob. That still forces you to serialize and deserialize to a string or blob every time it goes into and comes out of the database.
Do you need to query the data (say for all the values that are bigger than 2.7) or just store it (you always load the whole file from the database)?
Given the information in the comment, I would save the files in a BLOB or TEXT column, as said in other answers. You don't even need a line delimiter, since integer division and a modulus operation on the index of each value give you its row and column in the matrix.
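The index arithmetic could look like this. It is a sketch that assumes the values are stored as one space-separated string in a TEXT column and that the number of columns (16 for the matrix in the question) is known when reading back. The class and method names are made up.

```java
import java.util.StringJoiner;

final class MatrixText {

    // Writes the matrix row by row into one space-separated string, with no line delimiter.
    static String flatten(double[][] matrix) {
        StringJoiner joiner = new StringJoiner(" ");
        for (double[] row : matrix) {
            for (double value : row) {
                joiner.add(Double.toString(value));
            }
        }
        return joiner.toString();
    }

    // Rebuilds the matrix: value i belongs to row i / cols and column i % cols.
    // Assumes a non-empty text whose value count is a multiple of cols.
    static double[][] unflatten(String text, int cols) {
        String[] tokens = text.trim().split("\\s+");
        double[][] matrix = new double[tokens.length / cols][cols];
        for (int i = 0; i < tokens.length; i++) {
            matrix[i / cols][i % cols] = Double.parseDouble(tokens[i]);
        }
        return matrix;
    }
}
```

Double.toString and Double.parseDouble round-trip a double exactly, so no precision is lost by going through text.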
I think the problem that dedalo is facing is that he's working with arrays (I assume one is jagged, one is multi-dimensional) and he wants to serialize these to a BLOB.
But arrays can't be written to a column as they are, so he's asking how to go about doing this.
The simplest way to go about it would be to loop through the array and build a string, as Dave suggested, and store the string. This would let you read the contents straight from the value in the database instead of deserializing the data whenever you need to inspect it, as duffymo points out.
If you'd like to know how to serialize the array into BLOB...(this just seems like overkill)
You are able to serialize one-dimensional arrays and jagged arrays, e.g.:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Test {
    public static void main(String[] args) throws Exception {
        // Serialize an int[]
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("test.ser"));
        out.writeObject(new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        out.flush();
        out.close();
        // Deserialize the int[]
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("test.ser"));
        int[] array = (int[]) in.readObject();
        in.close();
        // Print out contents of deserialized int[]
        System.out.println("It is " + (array instanceof Serializable) + " that int[] implements Serializable");
        System.out.print("Deserialized array: " + array[0]);
        for (int i = 1; i < array.length; i++) {
            System.out.print(", " + array[i]);
        }
        System.out.println();
    }
}
As for which data type to store it as in MySQL, there are only four BLOB types to choose from:
The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB
Which one is best depends on the size of the serialized object. I'd imagine BLOB, which holds up to 65,535 bytes, would be good enough.
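To get bytes for a BLOB column rather than a file on disk, the same serialization can go through an in-memory stream. This is a sketch with made-up class and method names; the checked exceptions are wrapped only to keep the calls short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class MatrixBlob {

    // Serializes the matrix into a byte array suitable for a BLOB column.
    static byte[] toBytes(double[][] matrix) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(matrix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restores a matrix from bytes previously produced by toBytes.
    static double[][] fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (double[][]) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The byte array can be handed to PreparedStatement.setBytes when inserting and read back with ResultSet.getBytes. A 16 x 16 matrix of doubles is 2,048 bytes of data plus some stream overhead, well within a plain BLOB.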
