I'm writing a file translator for my company that grabs data from a source file and writes a bunch of delimited records to a target file. The records have the form:
HEADER*REC 1*REC 2*REC 3*REC 4
If a record is empty, and there is another record that can come after it, then the value is not printed, but the delimiter is included, e.g.:
HEADER*REC 1**REC 3*REC 4
If a record is empty, and it is the last record in the series, then the value and the delimiter are omitted, e.g.:
HEADER*REC 1*REC 2*REC 3
I was trying to think of a nice way to describe this in code, other than (pseudocode):
if last record is empty
print this
otherwise
print this other thing
I guess the code isn't too ugly, but I'd like a nicer solution. I'm using a StringBuilder to write the data for each transaction (each set of records corresponds to a transaction, so I can iterate through a TransactionSet object), and where possible I try to avoid copious switch/if statements. If anyone knows of a nicer or more elegant way to do this, I would love to hear it.
EDIT: Clarified block of pseudocode
You can do it like this:
System.out.print("HEADER");
StringBuilder sep = new StringBuilder();
for (String rec : headings) {
    sep.append("*");                 // always owe a separator for this slot
    if (rec != null && !rec.isEmpty()) {
        System.out.print(sep + rec); // flush any pending separators, then the value
        sep.setLength(0);            // start collecting separators afresh
    }
}
System.out.println();
This way it only prints a "*" when a non-empty value comes after it, so trailing empty records never produce trailing delimiters.
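For example, with headings {"REC 1", "", "REC 3", ""} this prints HEADER*REC 1**REC 3: the separator owed by the empty second record is only flushed when REC 3 arrives, and the separator owed by the empty last record is never printed at all.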
My program needs to read files that have different data structures, with a configurable separator.
In my properties-file you can set the separator and put coordinates for values of different variables:
separator = ;
variable1 = 1,7
variable2 = 2,42
I would like a way to access a value by line and column, using some kind of coordinates.
I'm thinking of a syntax like this:
file.get(1,7,";")
(which would give you the value at the 1st line and 7th column, using the given separator)
Does someone know a library or a code snippet that does exactly this?
Using String.split():
public String get(File file, int lineNumber, int column, String separator) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
        String line = null;
        for (int i = 0; i < lineNumber; i++) { // advance to the requested (1-based) line
            line = reader.readLine();
        }
        return line.split(separator)[column - 1]; // split() treats the separator as a regex
    }
}
You can use OpenCSV or SuperCSV, for example. I'm not aware of any library that does your 'coordinates' getter, but it's as simple as reading the CSV with the given separator into a list of lists and then calling
csv.get(0).get(6)
(mind that List indices are zero-based, so the 1st line and 7th column become indices 0 and 6).
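For illustration, a minimal sketch of that list-of-lists reading with OpenCSV (the file name and separator are assumptions, and readAll() loads the whole file into memory):
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import java.io.FileReader;
import java.util.List;

public class CsvCoordinates {
    public static void main(String[] args) throws Exception {
        // Build a reader that splits on ';' instead of the default ','.
        try (CSVReader reader = new CSVReaderBuilder(new FileReader("data.csv")) // hypothetical file
                .withCSVParser(new CSVParserBuilder().withSeparator(';').build())
                .build()) {
            List<String[]> rows = reader.readAll();
            System.out.println(rows.get(0)[6]); // 1st line, 7th column (zero-based indices)
        }
    }
}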
This seems like simple file processing. You should first process the file:
create ArrayList<ArrayList<String>> processedFile
Read every line, splitting it with line.split(separator)
Store the array above in the ArrayList processedFile at current index
increase the index with every line
Once processedFile is ready, you can simply use processedFile.get(row).get(column). Also, once the file is processed, every query is O(1). These hints should be enough; try writing the code yourself and you will learn more (a sketch follows below for reference).
PS: Take care of NullPointerExceptions wherever required.
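If you want to check your attempt against something, here is a minimal sketch of the steps above (the file name and separator are assumptions):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProcessedFile {
    public static void main(String[] args) throws IOException {
        String separator = ";";
        List<List<String>> processedFile = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("data.txt"))) { // hypothetical file
            String line;
            while ((line = br.readLine()) != null) {
                // Split each line and store it at the current index.
                processedFile.add(Arrays.asList(line.split(separator)));
            }
        }
        // O(1) lookups afterwards (zero-based indices):
        System.out.println(processedFile.get(0).get(6));
    }
}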
I have two files which are very large, say 50,000 lines each. I need to compare these two files and identify the changes. However, the catch is that if a line is present at a different position, it should not be reported as a difference.
For example, consider this:
File A.txt
xxxxx
yyyyy
zzzzz
File B.txt
zzzzz
xxxx
yyyyy
So if this is the content of the files, my code should give the output xxxx (or both xxxx and xxxxx).
Of course the easiest way would be storing each line of one file in a
List<String>
and comparing with the other
List<String>.
But this seems to take a lot of time. I have also tried using DiffUtils in Java, but it doesn't recognize lines present at different line numbers as the same. So is there any other algorithm that might help me?
In general a HashSet would be the best solution, but since we are dealing with strings there are two possible approaches:
saving one file's lines in a HashSet and looking up the other file's lines in it, or
saving one file's lines in a Trie and looking up the other file's lines in it.
In this post you can find a comparison between HashSets and Tries: How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?
Probably using a Set is the easiest way:
// FileUtils below comes from Apache Commons IO (org.apache.commons.io.FileUtils)
Set<String> set1 = new HashSet<String>(FileUtils.readLines(file1));
Set<String> set2 = new HashSet<String>(FileUtils.readLines(file2));

Set<String> similars = new HashSet<String>(set1);
similars.retainAll(set2);

set1.removeAll(similars); // now set1 contains distinct lines in file1
set2.removeAll(similars); // now set2 contains distinct lines in file2

System.out.println(set1); // prints distinct lines in file1
System.out.println(set2); // prints distinct lines in file2
You need to handle the case where the same record appears more than once in a file. For example, if a record appears twice in file A and once in file B, you need to record that as one extra occurrence.
Since we have to keep track of the number of occurrences, you need one of:
A Multiset
A Map from record to count, e.g. a Map<String, Integer>
With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that: it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. Let's consider that approach (the Multiset is simpler).
With the Map, 'adding' a record means looking for an entry for that String in the Map. If there is one, replace the value with value+1 for that key. If there isn't, create an entry with the value 1. 'Removing' an entry means looking for an entry for that key and, if you find it, replacing the value with value-1. If that reduces the value to 0, remove the entry.
Create a Map for each file.
Read a record for one of the files
Check to see if that record exists in the other Map.
If it exists in the other Map, remove that entry (see above for what that means)
If it doesn't exist, add it to the Map for this file (see above)
Repeat until end, alternating files.
The contents of the two Maps will give you the records that appeared in that file but not the other.
Doing this as you go along, rather than building both Maps up front, keeps memory usage down, but probably doesn't have a big impact on performance. A sketch of this approach follows below.
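A minimal sketch of the Map-based bookkeeping just described (the file names are assumptions):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FileDiff {
    // 'Add' a record: bump its occurrence count.
    private static void add(Map<String, Integer> counts, String line) {
        counts.merge(line, 1, Integer::sum);
    }

    // 'Remove' a record: decrement its count, dropping the entry at zero.
    // Returns false if no occurrence was present.
    private static boolean remove(Map<String, Integer> counts, String line) {
        Integer n = counts.get(line);
        if (n == null) return false;
        if (n == 1) counts.remove(line); else counts.put(line, n - 1);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> onlyInA = new HashMap<>();
        Map<String, Integer> onlyInB = new HashMap<>();
        try (BufferedReader a = new BufferedReader(new FileReader("A.txt"));
             BufferedReader b = new BufferedReader(new FileReader("B.txt"))) {
            String lineA, lineB;
            do {
                lineA = a.readLine();
                lineB = b.readLine();
                // If the record exists in the other file's Map, cancel it out;
                // otherwise record it against this file.
                if (lineA != null && !remove(onlyInB, lineA)) add(onlyInA, lineA);
                if (lineB != null && !remove(onlyInA, lineB)) add(onlyInB, lineB);
            } while (lineA != null || lineB != null);
        }
        System.out.println("Only in A: " + onlyInA);
        System.out.println("Only in B: " + onlyInB);
    }
}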
I think this will be useful:
BufferedReader reader1 = new BufferedReader(new FileReader("C:\\file1.txt"));
BufferedReader reader2 = new BufferedReader(new FileReader("C:\\file2.txt"));

String line1 = reader1.readLine();
String line2 = reader2.readLine();

boolean areEqual = true;
int lineNum = 1;

while (line1 != null || line2 != null) {
    if (line1 == null || line2 == null) {
        areEqual = false;
        break;
    } else if (!line1.equalsIgnoreCase(line2)) {
        areEqual = false;
        break;
    }
    line1 = reader1.readLine();
    line2 = reader2.readLine();
    lineNum++;
}

if (areEqual) {
    System.out.println("Two files have same content.");
} else {
    System.out.println("Two files have different content. They differ at line " + lineNum);
    System.out.println("File1 has " + line1 + " and File2 has " + line2 + " at line " + lineNum);
}

reader1.close();
reader2.close();
You could try parsing the first file, storing all of its lines in a HashMap, and then checking whether a mapping is present for each line of the second file.
This is still O(n), though.
Just do a byte comparison with buffered streams (a Reader works on characters; for raw bytes use an InputStream). This will be the fastest way to compare two files: first check whether the file lengths are the same, then read a block of bytes from one file and compare it with the corresponding block of the other.
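A minimal sketch of that byte-level comparison (reading through BufferedInputStream so the underlying I/O happens in blocks):
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ByteCompare {
    // Returns true if the two files are byte-for-byte identical.
    public static boolean sameContent(File f1, File f2) throws IOException {
        if (f1.length() != f2.length()) return false; // cheap length check first
        try (BufferedInputStream in1 = new BufferedInputStream(new FileInputStream(f1));
             BufferedInputStream in2 = new BufferedInputStream(new FileInputStream(f2))) {
            int b;
            while ((b = in1.read()) != -1) {
                if (b != in2.read()) return false; // first differing byte
            }
            return true;
        }
    }
}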
Or just use FileUtils.contentEquals(file1, file2); from org.apache.commons.io.FileUtils.
You can use FileUtils.contentEquals(file1, file2)
It will compare the contents of the 2 files.
Find more information here
Assume I have a StringBuffer with values "1 \n 2 \n 3 \n...etc" where \n is a line break.
How would I add these values to an existing CSV file as a column using Java? Specifically, this would be the last column.
For example, let's say I have a CSV file that looks like this:
5, 2, 5
2, 3, 1
3, 5, 2
..
etc.
The output should look like this given the StringBuffer after using the method to add the column to the csv file:
5, 2, 5, 1
2, 3, 1, 2
3, 5, 2, 3
..
etc.
I also plan to add columns with 1000s of values so I am looking for something that does not have high memory consumption.
Thanks ahead of time.
Edit: Columns may be different sizes. I see people saying to add it at the end of each line. The problem is, it will add the values to the wrong columns and I cannot have that happen. I thank you all for your suggestions though as they were very good.
Edit 2: I have received critique about my use of StringBuffer, and yes, I agree: if this problem were isolated, I would also suggest StringBuilder. The context of this problem is a program that has synchronized threads (acting as scenarios) collecting response times for a range of concurrent threads. The concurrent threads execute concurrent queries to a database, and once a query has executed, the result is appended to a StringBuffer. All the response times for each synchronized thread are appended to a StringBuffer and written to a CSV document. There can be several threads with the same response time. I could use StringBuilder, but then I would have to manually synchronize the threads appending the response times, and in my case I do not think it would make much of a difference in performance while adding an unnecessary amount of code. I hope this helps, and I once again thank you all for your concerns and suggestions. If after reading this you are still not convinced that I should use StringBuffer, then I ask that we please take this discussion offline.
Edit 3: I have figured out how to work around the issue of adding the columns when the rows are different sizes: I simply add a comma for every missing column (note also that my rows grow with each column). It looks like #BorisTheSpider's conceptual solution actually works with this modification. The problem is I am not sure how to add the text at the end of each line. My code so far (I removed code to conserve space):
// Before this code there is a statement to create a test.csv file
// (this file has no values before this loop occurs).
for (int p = 0; p < (max + 1); p = p + inc) {
    threadThis2(p);
    // threadThis2 appends several comma-delimited values to the StringBuffer.
    // p represents the number of threads/queries to execute at the same time.
    comma = p / inc; // how many commas to put if there is nothing on the line
    for (int i = 0; i < comma; i++) {
        commas.append(",");
    }
    br = new BufferedReader(new FileReader("test.csv"));
    List<String> avg = Arrays.asList(sb.toString().split(", "));
    for (int i = 0; i < avg.size(); i++) {
        if (br.readLine() == null) {
            w.write(commas.toString() + avg.get(i).toString() + ", \n");
        } else {
            w.write(avg.get(i).toString() + ", \n");
        }
    }
    br.close();
    sb.setLength(0);
    commas.setLength(0);
}
Please note this code is in its early stages (I will of course declare all the variables outside the for loop later on). So far this code works. The problem is that the columns are not side by side, which is what I want. I understand I may be required to create temporary files but I need to approach this problem very carefully as I might need to have a lot of columns in the future.
Apparently there are two basic requirements:
Append a column to an existing CSV file
Allow concurrent operation
To achieve Requirement #1, the original file has to be read and rewritten as a new file, including the new column, irrespective of its location (i.e., in a StringBuffer or elsewhere).
The best (and only generic) way of reading a CSV file would be via a mature and field-proven library, such as OpenCSV, which is lightweight and commercially-friendly, given its Apache 2.0 license. Otherwise, one has to either do many simplifications (e.g., always assume single-line CSV records), or re-invent the wheel by implementing a new CSV parser.
In either case, a simple algorithm is needed, e.g.:
Initialize a CSV reader or parser object from the library used (or from whatever custom solution is used), supplying the existing CSV file and the necessary parameters (e.g., field separator).
Read the input file record-by-record, via the reader or parser, as a String[] or List<String> structure.
Manipulate the structure returned for every record to add or delete any extra fields (columns), in memory.
Add blank fields (i.e., just extra separators, 1 per field), if desired or needed.
Use a CSV writer from the library (or manually implement a writer) to write the new record to the output file.
Append a newline character at the end of each record written to the output file.
Repeat for all the records in the original CSV file.
This approach is also scalable, as it does not require any significant in-memory processing; a sketch using OpenCSV follows below.
For Requirement #2, there are many ways of supporting concurrency and in this scenario it is more efficient to do it in a tailored manner (i.e., "manually" in the application), as opposed to relying on a thread-safe data structure like StringBuffer.
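As a rough illustration of the algorithm above, a sketch using OpenCSV (the file names and the source of the new column values are made up for the example):
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Arrays;
import java.util.Iterator;

public class AppendColumn {
    public static void main(String[] args) throws Exception {
        // Hypothetical new column values; in the real program they would come
        // from the StringBuffer described in the question.
        Iterator<String> newValues = Arrays.asList("1", "2", "3").iterator();
        try (CSVReader reader = new CSVReader(new FileReader("test.csv"));
             CSVWriter writer = new CSVWriter(new FileWriter("out.csv"))) {
            String[] record;
            while ((record = reader.readNext()) != null) {
                String[] extended = Arrays.copyOf(record, record.length + 1);
                // Pad with a blank field if we run out of new values.
                extended[record.length] = newValues.hasNext() ? newValues.next() : "";
                writer.writeNext(extended);
            }
        }
    }
}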
Is it possible to read a specific line using SuperCsv?
Suppose a .csv file contains 100 lines and I want to read line number 11.
CSV files usually contain variable-length records, which means it is impossible to "jump" to a specified record. The only solution is to sequentially read CSV records from the beginning of the file, while keeping a count, until you reach the needed record.
I have not found any special API in SuperCsv for skipping lines, so I guess you will have to call the CsvListReader#read() method 11 times to reach the line you want.
I don't know if other CSV reading libraries will have a "jump-to-line" feature, and even if they do, it is unlikely to perform any better than manually skipping to the required line, for the reason given in the first paragraph.
Here is a simple solution which you can adapt:
listReader = new CsvListReader(new InputStreamReader(new FileInputStream(CSVFILE), CHARSET), CsvPreference.TAB_PREFERENCE);
listReader.getHeader(false);
while ((listReader.read(processors)) != null) {
    if (listReader.getLineNumber() == 11) { // e.g. the 11th line
        System.out.println("Do whatever you need.");
    }
}
I am trying to retrieve data from a table and convert each row into CSV format, like
s12, james, 24, 1232, Salaried
The code below does the job, but takes a long time with tables exceeding 100,000 rows.
Please advise on optimization techniques:
while (rset1.next() != false) {
    sr = sr + "\n";
    for (int j = 1; j <= rsMetaData.getColumnCount(); j++) {
        if (j < 5) {
            sr = sr + rset1.getString(j).toString() + ",";
        } else {
            sr = sr + rset1.getString(j).toString();
        }
    }
}
Two approaches, in order of preference:
Stream the output
PrintWriter csvOut = ... // Construct a writer from an OutputStream, say to a file
while (rs.next())
csvOut.println(...) // Write a single line
(note that you should ensure that your Writer / OutputStream is buffered, although many are by default)
Use a StringBuilder
StringBuilder sb = new StringBuilder();
while (rs.next())
sb.append(...) // Write a single line
The idea here is that appending Strings in a loop is a bad idea. In Java, Strings are immutable, which means that to append to a string you have to copy the entire string and then write more at the end. Since you are appending a little bit at a time, you end up with many, many copies of the string which aren't really useful.
If you're writing to a File, it's most efficient just to write directly out with a stream or a Writer. Otherwise you can use the StringBuilder, which is tuned to be much more efficient at appending many small strings together.
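Putting the streaming approach together, a minimal sketch (the method shape and file name are illustrative):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.ResultSet;

public class CsvExport {
    // Streams each row straight to a buffered file instead of building one giant String.
    public static void export(ResultSet rs, String fileName) throws Exception {
        int columnCount = rs.getMetaData().getColumnCount(); // look this up once
        try (PrintWriter csvOut = new PrintWriter(new BufferedWriter(new FileWriter(fileName)))) {
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int j = 1; j <= columnCount; j++) {
                    if (j > 1) line.append(',');
                    line.append(rs.getString(j));
                }
                csvOut.println(line); // one row per line
            }
        }
    }
}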
I'm no Java expert, but I think it's bad practice to call something like getColumnCount() in a loop condition: it gets re-evaluated on every iteration instead of comparing against a fixed number. Instead, assign that number to a variable once and compare j against the variable.
You might want to use a StringBuilder to build the string, that's much more efficient when you're doing a lot of concatenation. Also if you have that much data, you might want to consider writing it directly to wherever you're going to put it instead of building it in memory at first, if that's a file or a socket, for example.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
sr.append('\n');
for (int j = 1; j <= columnCount; j++) {
sr.append(rset1.getString(j));
if (j < 5) {
sr.append(',');
}
}
}
As a completely different, but undoubtedly the most optimal, alternative: use the DB-provided export facilities. It's unclear which DB you're using, but as per your question history you seem to be doing a lot with Oracle. In that case, you can export a table into a CSV file using UTL_FILE.
See also:
Generating CSV files using Oracle
Stored procedure example on Ask Tom
As the other answers say, stop appending to a String. In Java, String objects are immutable, so each append must do a full copy of the string, turning this into an O(n^2) operation.
The other big slowdown is fetch size. By default, the driver is likely to fetch one row at a time. Even if this takes 1 ms, that limits you to a thousand rows per second. A remote database, even on the same network, will be much worse. Try calling setFetchSize(1000) on the Statement. Beware that setting the fetch size too big can cause out-of-memory errors with some database drivers.
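For example (the Connection and table name are placeholders; the types are from java.sql):
Statement stmt = connection.createStatement();
stmt.setFetchSize(1000); // ask the driver for 1000 rows per network round trip
ResultSet rset1 = stmt.executeQuery("SELECT * FROM employees"); // hypothetical table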
I don't believe minor code changes are going to make a substantive difference. I'd surely use a StringBuffer however.
He's going to be reading a million rows over a wire, assuming his database is on a separate machine. First, if performance is unacceptable, I'd run that code on the database server and take the network out of the equation. If it's the sort of code that gets run once a week as a batch job, that may be OK.
Now, what are you going to do with the StringBuffer or String once it is fully loaded from the database? We're looking at a String that could be 50 MB long.
This should be one iota faster, since it removes the unneeded (j < 5) check.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    for (int j = 1; j < columnCount; j++) {
        sr.append(rset1.getString(j)).append(",");
    }
    // I suspect the 'if (j < 5)' really meant, "if we aren't on the last
    // column then tack on a comma." So we always tack it on above and
    // write the last column and a newline now.
    sr.append(rset1.getString(columnCount)).append("\n");
}
Another option is to change the SELECT so it returns a single comma-separated string. Then we read the one-column result and append it to the StringBuilder.
I forget the syntax now, but something like:
select column1 || ',' || column2 || ',' ... from table;
Now we don't need the loop and the comma-concatenation business:
StringBuilder sr = new StringBuilder();
while (rset1.next()) {
    sr.append(rset1.getString(1)).append("\n");
}
}