What I want to do...
I have XML files with names like SomeName999999blablabla.xml with lots of content, where almost every line contains the string "999999". I need identical XML files where 999999 is replaced by 888888, 777777, and so on, both in the file name and in the file's content.
The problem...
My code works fine and actually creates all the files I need, BUT there are sometimes tiny errors. Like in one line an E is "randomly" replaced by a D (it seems to always be one letter lower than what it's supposed to be, but I can't confirm that 100%). It's not a lot, maybe one or two instances across 60 files, each file being about 100MB. But since it's XML this is a real problem, as it often is a schema violation, which causes a crash in later processing.
I have absolutely no idea where this is coming from or how to fix it, please help.
My code so far...
private void createMandant(String mandant) throws Exception {
    String line;
    File dir = new File(TestConstants.getXmlDirectory());
    for (File file : dir.listFiles()) {
        if (file.getName().endsWith(".xml") && file.getName().contains("999999")) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            FileWriter fw = new FileWriter(file.getAbsolutePath().replace("999999", mandant));
            while ((line = br.readLine()) != null) {
                fw.write(line.replace("999999", mandant) + "\r\n");
            }
            br.close();
            fw.close();
        }
    }
}
Environment...
We are on Java 6. As mentioned before the files are quite large. Like 100MB, several hundred thousand lines each.
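For files of this size on Java 6 (no try-with-resources), the same copy-and-replace loop can also be written with buffered output and finally blocks so the streams are always closed; this is only a robustness sketch of the code above (the method name createMandantBuffered is made up), not an explanation of the corrupted characters:

private void createMandantBuffered(String mandant) throws Exception {
    File dir = new File(TestConstants.getXmlDirectory());
    for (File file : dir.listFiles()) {
        if (file.getName().endsWith(".xml") && file.getName().contains("999999")) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            try {
                BufferedWriter bw = new BufferedWriter(
                        new FileWriter(file.getAbsolutePath().replace("999999", mandant)));
                try {
                    String line;
                    while ((line = br.readLine()) != null) {
                        bw.write(line.replace("999999", mandant));
                        bw.write("\r\n");
                    }
                } finally {
                    bw.close(); // flushes the buffer and closes the underlying FileWriter
                }
            } finally {
                br.close();
            }
        }
    }
}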
It appears to be a problem with String.replace()
I have replaced it with StringBuilder:
while ((line = br.readLine()) != null) {
    index = 0;
    // fw.write(line.replace("999999", mandant) + "\r\n");
    StringBuilder builder = new StringBuilder(line);
    index = builder.indexOf("999999");
    if (index >= 0) { // indexOf returns -1 when not found; 0 is a valid hit
        // note: this replaces only the first occurrence of "999999" in the line
        fw.write(builder.replace(index, index + 6, mandant).toString() + "\r\n");
    } else {
        fw.write(line + "\r\n");
    }
}
... and now it seems to work. Two runs have already completed without any problems.
But that seems very strange. Could it really be that a heavily used function like String.replace() just randomly gets single letters wrong every few million method calls?
Related
I have this app where I process a very large file, extract the lines that have the same first 5 characters (I call this currentlineId), and use them to create an object and do something with it. A sample of the file contents:
AZDFS12345678998765432345678
AZDFS09876545432345678987654
AZDFS34568987654567890987654
AZDFS12345670987654345678998
AZDFS12345098734567765123456
// the lines above have the same first 5 characters, they create Object1.
FGHJUY121324
FGHJUY089909
FGHJUYTTUTUU
//same for the lines above, they create Object2.
NB: the lines will always be in an order where lines with the same first 5 characters are together (above/below each other), so I won't have lines all over the place.
My current function code:
private void processScpFile(File file) {
    LOGGER.info("Processing File: {} ", file.getName());
    try (var br = new BufferedReader(new FileReader(file))) {
        String currentLine;
        String lastLineId = null;
        List<String> similarLineIdsList = new ArrayList<>();
        while ((currentLine = br.readLine()) != null) {
            if (StringUtils.isEmpty(lastLineId)) {
                lastLineId = currentLine.substring(0, 5);
            }
            if (lastLineId.equals(currentLine.substring(0, 5))) {
                similarLineIdsList.add(currentLine);
            } else if (!lastLineId.equals(currentLine.substring(0, 5))) {
                doSomethinsWithTheList(similarLineIdsList);
                similarLineIdsList.clear();
                similarLineIdsList.add(currentLine);
                lastLineId = currentLine.substring(0, 5);
            }
        }
        doSomethinsWithTheList(similarLineIdsList);
    } catch (IOException e) {
        LOGGER.error("Couldn't read file, {}", e.getMessage(), e);
    }
}
Now this has worked well up until now, but going forward I have to process files where I would have, for instance, over 100k lines with the same first 5 characters, which makes this process very slow.
Please, do you have any suggestion on how to make this process faster? Thank you.
Edit: just to be precise, it's generating the list of lines with the same first 5 characters that gets slower as the number of similar lines grows.
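For reference, here is a lightly reworked sketch of the same grouping loop that computes the five-character prefix once per line and swaps in a fresh list instead of calling clear(); whether this actually helps depends on where the time is really spent, since the per-group work in doSomethinsWithTheList is untouched:

private void processScpFile(File file) {
    LOGGER.info("Processing File: {} ", file.getName());
    try (var br = new BufferedReader(new FileReader(file))) {
        String currentLine;
        String lastLineId = null;
        List<String> group = new ArrayList<>();
        while ((currentLine = br.readLine()) != null) {
            String lineId = currentLine.substring(0, 5); // compute the prefix once per line
            if (lastLineId != null && !lineId.equals(lastLineId)) {
                doSomethinsWithTheList(group);
                group = new ArrayList<>(); // drop the old backing array instead of clearing it
            }
            group.add(currentLine);
            lastLineId = lineId;
        }
        if (!group.isEmpty()) {
            doSomethinsWithTheList(group); // flush the final group
        }
    } catch (IOException e) {
        LOGGER.error("Couldn't read file, {}", e.getMessage(), e);
    }
}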
Hopefully my explanation does me some justice. I am pretty new to Java. I have a text file that looks like this:
Java
The Java Tutorials
http://docs.oracle.com/javase/tutorial/
Python
Tutorialspoint Java tutorials
http://www.tutorialspoint.com/python/
Perl
Tutorialspoint Perl tutorials
http://www.tutorialspoint.com/perl/
I have properties for language name, website description, and website url. Right now, I just want to list the information from the text file exactly how it looks, but I need to assign those properties to them.
The problem I am getting is "index 1 is out of bounds for length 1"
try {
    BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"));
    while (in.readLine() != null) {
        TutorialWebsite tw = new TutorialWebsite();
        str = in.readLine();
        String[] fields = str.split("\\r?\\n");
        tw.setProgramLanguage(fields[0]);
        tw.setWebDescription(fields[1]);
        tw.setWebURL(fields[2]);
        System.out.println(tw);
    }
} catch (IOException e) {
    e.printStackTrace();
}
I wanted to test something, so I removed the new lines, put commas instead, and made it str.split(","), which printed it out just fine, but I'm sure I would get points taken off if I changed the format.
readline returns a "string containing the contents of the line, not including any line-termination characters", so why are you trying to split each line on "\\r?\\n"?
Where is str declared? Why are you reading two lines for each iteration of the loop, and ignoring the first one?
I suggest you start from
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
and work from there.
The first readline gets the language, the second gets the description, and the third gets the url, and then the pattern repeats. There is nothing to stop you using readline three times for each iteration of the while loop.
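A minimal sketch of that three-reads-per-iteration idea, reusing the TutorialWebsite class and file name from the question (the null checks for an incomplete trailing record are an added assumption):

try (BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"))) {
    String language;
    while ((language = in.readLine()) != null) {
        String description = in.readLine();
        String url = in.readLine();
        if (description == null || url == null) {
            break; // incomplete record at the end of the file
        }
        TutorialWebsite tw = new TutorialWebsite();
        tw.setProgramLanguage(language);
        tw.setWebDescription(description);
        tw.setWebURL(url);
        System.out.println(tw);
    }
} catch (IOException e) {
    e.printStackTrace();
}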
You can read the whole file into a String like this:
// str will hold all the file contents
StringBuilder str = new StringBuilder();
// try with resources, to make sure BufferedReader is closed safely
try (BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"))) {
    String line;
    while ((line = in.readLine()) != null) {
        str.append(line);
        str.append("\n");
    }
} catch (IOException e) {
    e.printStackTrace();
}
Later you can split the string with
String[] fields = str.toString().split("[\\n\\r]+");
Why not try it like this.
allocate a List to hold the TutorialWebsite instances.
use try with resources to open the file, read the lines, and trim any white space.
put the lines in an array
then iterate over the array, filling in the class instance
then print the list.
The loop ensures the array length is a multiple of nFields, discarding any remainder. So if your total lines are not divisible by nFields you will not read the remainder of the file. You would still have to adjust the setters if additional fields were added.
int nFields = 3;
List<TutorialWebsite> list = new ArrayList<>();
try (BufferedReader in = new BufferedReader(new FileReader("tutorials.txt"))) {
    String[] lines = in.lines().map(String::trim).toArray(String[]::new);
    for (int i = 0; i < (lines.length / nFields) * nFields; i += nFields) {
        TutorialWebsite tw = new TutorialWebsite();
        tw.setProgramLanguage(lines[i]);
        tw.setWebDescription(lines[i + 1]);
        tw.setWebURL(lines[i + 2]);
        list.add(tw);
    }
} catch (IOException ioe) {
    ioe.printStackTrace();
}
list.forEach(System.out::println);
An improvement would be to use a constructor and pass the strings to it when each instance is created.
And remember the file name as specified is relative to the directory in which the program is run.
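If the file is not found at runtime, printing the working directory shows where that relative name is being resolved (just a debugging aid, not part of the solution above):

// the directory the relative file name "tutorials.txt" is resolved against
System.out.println(System.getProperty("user.dir"));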
Following this answer -->
How do I sort very large files
I need only the merge step on N already sorted files on disk.
I want to sort them into one big file; my limitation is memory: not more than K lines in memory (K < N), so I cannot fetch them all and then sort. Preferably with Java.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (with not more than K lines in memory) and store the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2);
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        // Merge 2 files based on which string is greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2);
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
You can use something like this
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String, BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());

    try (FileWriter fw = new FileWriter(target)) {
        String header = null;
        for (String file : input) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line = br.readLine();
            if (line == null) br.close();
            else {
                if (header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if (line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for (;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if (next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if (line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    } catch (Throwable t) {
        // close any readers still in the queue, then rethrow the original failure
        for (Map.Entry<String, BufferedReader> br : lines) try {
            br.getValue().close();
        } catch (Throwable next) {
            if (t != next) t.addSuppressed(next);
        }
        throw t;
    }
}
Note that this code, unlike the code in your question, handles the header line. Also unlike your original code, it does not delete the input files once they have been merged; if you do want them deleted, one option is to open each input with Files.newInputStream(Paths.get(file), StandardOpenOption.DELETE_ON_CLOSE) wrapped in an InputStreamReader, instead of the plain construction used above:
BufferedReader br = new BufferedReader(new FileReader(file));
It holds exactly as many lines in memory as you have files.
While in principle it is possible to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for a questionable little saving. E.g. you already have N strings in memory when calling this method, due to the fact that you have N file names.
However, when you want to reduce the number of lines held at the same time, at all costs, you can simply use the approach shown in your question. Merge the first two files into a temporary file, merge that temporary file with the third into another temporary file, and so on, until merging the temporary file with the last input file into the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering while trying to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge that temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files into the final result, to get a memory consumption scaling with K < N. This approach allows you to tune K to a reasonable ratio to N, trading memory for speed. I think, in most practical cases, K == N will work just fine.
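Purely as an illustration of that staged approach, here is a sketch that reuses the mergeFiles method above; the helper name mergeWithLimit and the temporary-file naming are made up, and it assumes K >= 2:

public static void mergeWithLimit(String target, int k, List<String> input) throws IOException {
    List<String> remaining = new ArrayList<>(input);
    int tempIndex = 0;
    while (remaining.size() > k) {
        // merge the first k inputs (including the previous partial result, if any)
        // into a temporary file, then carry that file forward
        String temp = target + ".part" + (tempIndex++);
        mergeFiles(temp, remaining.subList(0, k).toArray(new String[0]));
        List<String> next = new ArrayList<>();
        next.add(temp);
        next.addAll(remaining.subList(k, remaining.size()));
        remaining = next;
    }
    mergeFiles(target, remaining.toArray(new String[0]));
    // the intermediate .part files are left on disk in this sketch
}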
@Holger gave a nice answer assuming that K >= N.
You can extend it to the K<N case by using mark(int) and reset() methods of the BufferedInputStream.
The parameter of mark is how many bytes a single line can have.
The idea is as follows:
Instead of putting all N lines in the TreeMap, you can have only K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the stream it came from, so when you read it again the same data can pop up.
You have to keep track of the maximum line not kept in the TreeSet; let's call it the lower bound. Once there are no elements in the TreeSet greater than the maintained lower bound, you scan all the files once again and repopulate the set.
I'm not sure if this approach is optimal, but it should be OK.
Moreover, you have to be aware that BufferedInputStream has an internal buffer at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain the buffering on your own.
This is my debut question here, so I will try to be as clear as I can.
I have a sentences.txt file like this:
Galatasaray beat Juventus 1-0 last night.
I'm going to go wherever you never can find me.
Papaya is such a delicious thing to eat!
Damn lecturer never gives more than 70.
What's in your mind?
As is obvious, there are 5 sentences, and my objective is to write a listSize method that returns the number of sentences listed here.
public int listSize(int sentence_total)
{
    // the code is supposed to be here.
    return sentence_total;
}
All help is appreciated.
To read a file and count its lines, use a java.io.LineNumberReader, plugged on top of a FileReader. Call readLine() on it until it returns null, then getLineNumber() to know the last line number, and you're done!
Alternatively (Java 7+), you can use the NIO2 Files class to fully read the file at once into a List<String>, then return the size of that list.
BTW, I don't understand why your method takes that int as a parameter, if it's supposed to be the value to compute and return?
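A minimal sketch of a listSize method following that LineNumberReader approach (no parameter needed, since the count comes from the file; the file name is taken from the question):

public int listSize() throws IOException {
    // read every line; once readLine() returns null, getLineNumber()
    // reports how many lines have been read
    LineNumberReader reader = new LineNumberReader(new FileReader("sentences.txt"));
    try {
        while (reader.readLine() != null) {
            // nothing to do; we only need the count
        }
        return reader.getLineNumber();
    } finally {
        reader.close();
    }
}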
Using LineNumberReader:
LineNumberReader reader = new LineNumberReader(new FileReader(new File("sentences.txt")));
reader.skip(Long.MAX_VALUE);
System.out.println(reader.getLineNumber() + 1); // +1 because line index starts at 0
reader.close();
Use the following code to get the number of lines in the file:
try {
    File file = new File("filePath");
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String line;
    int totalLines = 0;
    while ((line = reader.readLine()) != null) {
        totalLines++;
    }
    reader.close();
    System.out.println(totalLines);
} catch (Exception ex) {
    ex.printStackTrace(System.err);
}
You could do:
Path file = Paths.get("route/to/myFile.txt");
int numLines = Files.readAllLines(file).size();
If you want to limit them or process them lazily:
Path file = Paths.get("route/to/myFile.txt");
long numLines = Files.lines(file).limit(maxLines).count();
I'm trying to learn Java/Android and right now I'm doing some experiments with the replaceAll function. But I've found that with large text files the process gets sluggish, so I was wondering if there is a way to skip the "useless" parts of a file to get better performance. (Note: just skip them, not delete them.)
Note: I am not trying to "count lines" or "println" or "system.out", I'm just replacing strings and saving the changes in the same file.
Example
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
and so on....like a zillion times
I want to replace all "AAAA" with "BBBB", but there are large portions of data between the strings I am replacing. Also, these portions always begin with "CCCC" and end with "DDDD".
Here's the code I am using to replace the string.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "", oldtext = "";
while ((line = reader.readLine()) != null) {
    oldtext += line + "\r\n";
}
reader.close();
// Replacing "AAAA" strings
String newtext = oldtext.replaceAll("AAAA", "BBBB");
FileWriter writer = new FileWriter("my_file.txt");
writer.write(newtext);
writer.close();
I think reading all the lines is inefficient, especially when those parts won't be modified (and they represent 90% of the file).
Does anyone know a solution?
You are wasting a lot of time on this line --
oldtext += line + "\r\n";
In Java, String is immutable, which means you can't modify it. Therefore, when you do the concatenation, Java is actually making a complete copy of oldtext. So, for every line in your file, you are recopying every line that came before into your new String. Take a look at StringBuilder for a way to build a String avoiding these copies.
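For comparison, a minimal sketch of the same accumulation done with a StringBuilder, which only matters if you really do need the whole file in one String:

StringBuilder oldtext = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    // append() grows an internal buffer instead of copying the whole
    // accumulated text on every line
    oldtext.append(line).append("\r\n");
}
String contents = oldtext.toString();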
However, in your case, you do not need the whole file in memory, because you can process line by line. By moving your replaceAll and write into your loop, you can operate on each line as you read it. This will keep the memory footprint of the routine down, because you are only keeping a single line in memory.
Note that since the FileWriter is opened before you read the input file, you need to have a different name for the output file. If you want to keep the same name, you can do a renameTo on the File after you close it.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
FileWriter writer = new FileWriter("my_out_file.txt");
String line = "";
while ((line = reader.readLine()) != null) {
    // Replacing "AAAA" strings
    String newtext = line.replaceAll("AAAA", "BBBB");
    // write the line terminator back, since readLine() strips it
    writer.write(newtext + "\r\n");
}
reader.close();
writer.close();
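And a sketch of the renameTo step mentioned above, assuming the output file name used in the code; note that File.renameTo reports failure by returning false rather than throwing:

// replace the original file with the rewritten one, keeping the old name
File original = new File("my_file.txt");
File rewritten = new File("my_out_file.txt");
if (!original.delete() || !rewritten.renameTo(original)) {
    System.err.println("Could not replace my_file.txt with the rewritten copy");
}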