Write string to huge file - java

I know there are several threads about this problem, but I think mine is a little different because of the size.
In my example I want to write 1.7 million lines to a text file; in the worst case there could be many more. These lines are generated for SQL*Loader to bulk-load data into a table, so the file can be very large, because SQL*Loader can handle that.
Now I want to write the big file as fast as I can. This is my current method:
BufferedWriter bw = new BufferedWriter(new FileWriter("out.txt"), 40000);
int u = profils.size() - 1;
for (int z = 0; z < u; z++) {
    for (int b = 0; b < z; b++) {
        p = getValue();
        if (!Double.isNaN(p) && p > 0.55) {
            bw.write(map.get(z) + ";" + map.get(b) + ";" + p + "\n");
        }
    }
}
bw.close();
For my 1.7 million lines I need about 20 minutes. Can I make that faster with some method that I don't know about?
My FileChannel attempt:
File out = new File("out.txt");
FileOutputStream fileOutputStream = new FileOutputStream(out, true);
FileChannel fileChannel = fileOutputStream.getChannel();
ByteBuffer byteBuffer = null;
int u = profils.size() - 1;
for (int z = 0; z < u; z++) {
    for (int b = 0; b < z; b++) {
        p = getValue();
        if (!Double.isNaN(p) && p > 0.55) {
            String str = indexToSubstID.get(z) + ";" + indexToSubstID.get(b) + ";" + p + "\n";
            byteBuffer = ByteBuffer.wrap(str.getBytes(Charset.forName("ISO-8859-1")));
            fileChannel.write(byteBuffer);
        }
    }
}
fileOutputStream.close();

FileChannel is the way to go; it is designed for large volumes of writes. Read the API documentation here.
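That said, the attempt above wraps a fresh ByteBuffer for every line, which costs an allocation and a channel write per line. Here is a minimal sketch of batching lines into one reused direct buffer before each channel write; the file name, line content, and buffer size are placeholder assumptions, and lines are assumed shorter than the buffer:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class BatchedChannelWrite {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("out.txt"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // reused for all lines
            for (int line = 0; line < 1_700_000; line++) {
                byte[] bytes = ("a;b;" + line + "\n").getBytes(StandardCharsets.ISO_8859_1);
                if (buffer.remaining() < bytes.length) { // batch full: write it out
                    buffer.flip();
                    while (buffer.hasRemaining()) channel.write(buffer);
                    buffer.clear();
                }
                buffer.put(bytes);
            }
            buffer.flip(); // write the final partial batch
            while (buffer.hasRemaining()) channel.write(buffer);
        }
    }
}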


How to append to DataOutputStream in Java?

I want my program to save URL addresses, one at a time, to a file.
These addresses need to be saved in UTF format to ensure they are correct.
My problem is that the file is overwritten all the time, instead of appended:
DataOutputStream DOS = new DataOutputStream(new FileOutputStream(filen, true));
Count = 0;
if (LinkToCheck != null) {
    System.out.println(System.currentTimeMillis() + " SaveURL_ToRelatedURLS d " + LinkToCheck.Get_SelfRelationValue() + " vs " + Class_Controller.InterestBorder);
    if (LinkToCheck.Get_SelfRelationValue() > Class_Controller.InterestBorder) {
        DOS.writeUTF(LinkToCheck.Get_URL().toString() + "\n");
        Count++;
    }
}
DOS.close();
This code does NOT append, so how do I make it append?
You actually should not keep the stream open and write on every iteration. Why don't you simply create a string that contains all the information and write it at the end?
Example:
DataOutputStream DOS = new DataOutputStream(new FileOutputStream(filen, true));
int count = 0; // variables should be camelCase, btw
StringBuilder resultBuilder = new StringBuilder();
if (LinkToCheck != null) {
    System.out.println(System.currentTimeMillis() + " SaveURL_ToRelatedURLS d " + LinkToCheck.Get_SelfRelationValue() + " vs " + Class_Controller.InterestBorder);
    if (LinkToCheck.Get_SelfRelationValue() > Class_Controller.InterestBorder) {
        resultBuilder.append(LinkToCheck.Get_URL().toString()).append("\n");
        count++;
    }
}
DOS.writeUTF(resultBuilder.toString());
DOS.close();
Hope that helps.
You can achieve this without the DataOutputStream.
Here's a simplified example using just the FileOutputStream:
String filen = "C:/testfile.txt";
FileOutputStream FOS = new FileOutputStream(filen, true);
FOS.write(("String" + "\r\n").getBytes("UTF-8"));
FOS.close();
This will just write "String" every time, but you should get the idea.
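Note also that DataOutputStream.writeUTF writes a two-byte length prefix in modified UTF-8, which is usually not what you want in a plain text file. Here is a sketch of the same append using a Writer with an explicit charset instead; the file name and URL are placeholders:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class AppendUtf8 {
    public static void main(String[] args) throws IOException {
        // second FileOutputStream argument = true opens the file in append mode
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("C:/testfile.txt", true), StandardCharsets.UTF_8))) {
            out.write("http://example.com");
            out.newLine();
        }
    }
}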
The problem turned out to be that I had forgotten I put "filen.delete();" somewhere else.
This is why you (I) should take breaks while coding :p

For-loop not executing File writing commands

I have a program that is supposed to generate a file with a random integer name and append 200 characters of data to each file. I have already succeeded in being able to create a file:
BufferedOutputStream bos = new BufferedOutputStream(Files.newOutputStream(
new File("C:/Users/mirvine/Desktop/SPAM").toPath()));
And I have gotten it to write chars to the file:
bos.write((char)rand.nextInt(255));
But when I combine the two with a for loop, it doesn't work:
try {
    while (true) {
        int i = rand.nextInt();
        File outputFile = new File("C:/Users/mirvine/Desktop/SPAM/" + String.valueOf(i) + ".txt");
        bos = new BufferedOutputStream(Files.newOutputStream(outputFile.toPath()));
        PrintWriter writer = new PrintWriter(bos);
        for (int qu = 0; qu <= 2000; qu++) {
            writer.write((char) rand.nextInt(255));
            System.out.println("Total " + String.valueOf(qu) + " characters written to " + String.valueOf(i) + ".txt!");
        }
        System.out.println("File named \'" + String.valueOf(i) + ".txt\' created!");
    }
} catch (Exception e) {
    e.printStackTrace();
    return;
}
I get the output "Total: (number) characters written to (whatever).txt!", but it won't actually write the characters. However, if I make the loop infinite (by changing qu++ to qu--), it will write the characters, though of course to only one file. I even tried changing it to a while loop and it didn't work. Any ideas?
Consider replacing the BufferedOutputStream and PrintWriter with a FileWriter, which takes your file as a constructor argument.
Also, make sure you flush and close the stream after finishing with it.
You should use a BufferedWriter, which is more efficient, and don't forget to close it at the end. The infinite loop is useless.
int i = rand.nextInt();
File outputFile = new File("C:/Users/mirvine/Desktop/SPAM/" + String.valueOf(i) + ".txt");
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos));
for (int qu = 0; qu <= 2000; qu++) {
    bw.write((char) rand.nextInt(255));
    System.out.println("Total " + String.valueOf(qu) + " characters written to " + String.valueOf(i) + ".txt!");
}
bw.close();
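The same loop with try-with-resources handles the close (and the implied flush) automatically even if an exception is thrown mid-loop. A minimal sketch; the directory path is taken from the question and is an assumption about your machine:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

public class SpamFileDemo {
    public static void main(String[] args) throws IOException {
        Random rand = new Random();
        int i = rand.nextInt();
        // close() runs automatically at the end of the block and flushes the buffer
        try (BufferedWriter bw = new BufferedWriter(
                new FileWriter("C:/Users/mirvine/Desktop/SPAM/" + i + ".txt"))) {
            for (int qu = 0; qu <= 2000; qu++) {
                bw.write((char) rand.nextInt(255));
            }
        }
        System.out.println("File named '" + i + ".txt' created!");
    }
}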

split very large text file by max rows

I want to split a huge file containing strings into a set of new (smaller) files and tried to use NIO.2.
I do not want to load the whole file into memory, so I tried it with BufferedReader.
The smaller text files should be limited by a number of text rows.
The solution works; however, I want to ask whether someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and NIO.2:
public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    int i = 1;
    try (BufferedReader reader = Files.newBufferedReader(bigFile)) {
        String line = null;
        int lineNum = 1;
        Path splitFile = Paths.get(i + "split.txt");
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
        while ((line = reader.readLine()) != null) {
            if (lineNum > maxRows) {
                writer.close();
                lineNum = 1;
                i++;
                splitFile = Paths.get(i + "split.txt");
                writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
        }
        writer.close();
    }
}
Beware of the difference between the direct use of InputStreamReader/OutputStreamWriter and their subclasses, and the Reader/Writer factory methods of Files. While in the former case the system's default encoding is used when no explicit charset is given, the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it's either Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and avoid surprises if you switch between the various ways to create a Reader or Writer.
If you want to split at line boundaries, there is no way around looking into the file's contents, so you can't optimize it the way you could when merging.
If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks on the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.
public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    // assumes static imports of StandardOpenOption.READ, CREATE_NEW, WRITE
    MappedByteBuffer bb;
    try (FileChannel in = FileChannel.open(bigFile, READ)) {
        bb = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for (int start = 0, pos = 0, end = bb.remaining(), i = 1, lineNum = 1; pos < end; lineNum++) {
        while (pos < end && bb.get(pos++) != '\n');
        if (lineNum < maxRows && pos < end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try (FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            bb.position(start).limit(pos);
            while (bb.hasRemaining()) out.write(bb);
            bb.clear();
            start = pos;
            lineNum = 0;
        }
    }
}
The drawbacks are that it doesn't work with encodings like UTF-16 or EBCDIC and, unlike BufferedReader.readLine(), it won't support a lone '\r' as line terminator as used in old Mac OS 9.
Further, it only supports files smaller than 2 GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another.
These issues could be fixed but would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn't expect much more, as the I/O dominates here) and would be even smaller as the complexity rises, I don't think it's worth it.
The bottom line is that for this task the Reader/Writer approach is sufficient, but you should pay attention to the Charset used for the operation.
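For example, a charset-explicit variant of the question's reader/writer setup could look like the following minimal sketch; UTF-8 and the file names are assumptions:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetExplicitCopy {
    public static void main(String[] args) throws IOException {
        // naming the charset documents the intention even where it matches the default
        try (BufferedReader reader = Files.newBufferedReader(
                     Paths.get("big.txt"), StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(
                     Paths.get("1split.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}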
I made a slight modification to @nimo23's code, taking into account the option of adding a header and a footer to each of the split files. It also outputs the files into a directory with the same name as the original file, with _split appended to it. The code below:
public static void splitTextFiles(String fileName, int maxRows, String header, String footer) throws IOException {
    File bigFile = new File(fileName);
    int i = 1;
    String ext = fileName.substring(fileName.lastIndexOf("."));
    String fileNoExt = bigFile.getName().replace(ext, "");
    File newDir = new File(bigFile.getParent() + "\\" + fileNoExt + "_split");
    newDir.mkdirs();
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName))) {
        String line = null;
        int lineNum = 1;
        Path splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%03d", i) + ext);
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
        while ((line = reader.readLine()) != null) {
            if (lineNum == 1) {
                writer.append(header);
                writer.newLine();
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
            if (lineNum > maxRows) {
                writer.append(footer);
                writer.close();
                lineNum = 1;
                i++;
                splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%03d", i) + ext);
                writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            }
        }
        if (lineNum <= maxRows) { // early exit
            writer.append(footer);
        }
        writer.close();
    }
    System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
}

Shouldn't the content of a file remain unaltered if I read it byte-by-byte?

Why does the following code alter "öäüß"? (I am using it to split big files into multiple small ones...)
InputStream is = new BufferedInputStream(new FileInputStream(file));
File newFile;
BufferedWriter bw;
newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
bw = new BufferedWriter(new FileWriter(newFile));
try {
    byte[] c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ((readChars = is.read(c)) != -1)
        for (int i = 0; i < readChars; i++) {
            bw.write(c[i]);
            if (c[i] == '\n')
                if (++lineCount % linesPerFile == 0) {
                    bw.close();
                    newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                    files.add(newFile);
                    bw = new BufferedWriter(new FileWriter(newFile));
                }
        }
} finally {
    bw.close();
    is.close();
}
My understanding of character encoding is that everything should remain the same as long as I keep each byte the same. Why does this code alter bytes?
Thanks a bunch in advance~
==================== SOLUTION ====================
The mistake is that FileWriter interprets bytes and shouldn't be used to just output plain bytes; thanks @meriton and @Jonathan Rosenne. Just changing everything to BufferedOutputStream didn't do it, though, since BufferedOutputStream was too slow! I ended up improving my file split-and-copy code to use a bigger read array and to only call write() when necessary ...
File newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
InputStream iS = new BufferedInputStream(new FileInputStream(file));
OutputStream oS = new FileOutputStream(newFile); // BufferedOutputStream wrapper toooo slow!
try {
    byte[] c;
    if (linesPerFile > 65536)
        c = new byte[65536];
    else
        c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ((readChars = iS.read(c)) != -1) {
        int from = 0;
        for (int idx = 0; idx < readChars; idx++)
            if (c[idx] == '\n' && ++lineCount % linesPerFile == 0) {
                oS.write(c, from, idx + 1 - from);
                oS.close();
                from = idx + 1;
                newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                files.add(newFile);
                oS = new FileOutputStream(newFile);
            }
        oS.write(c, from, readChars - from);
    }
} finally {
    iS.close();
    oS.close();
}
An InputStream reads bytes, an OutputStream writes them. A Reader reads characters, a Writer writes them.
You read with an InputStream, and write with a FileWriter. That is, you read bytes, but write characters. Specifically,
bw.write(c[i]);
invokes the method
public void write(int c) throws IOException
whose Javadoc says:
Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.
That is, the byte is implicitly converted into an int, and then reinterpreted as a unicode code point, which is then written to the file using the platform default encoding (because you don't specify the encoding the FileWriter should use).
You are reading bytes and writing characters. The line bw.write(c[i]); assumes each byte is a character, but in the input file this is not necessarily so, it depends on the encoding that was used. Encodings such as UTF-8 may use 2 or more bytes per character, and you are converting each byte individually. For example, in UTF-8, ö is encoded as 2 bytes, hexadecimal c3 b6. As you process them individually, you may see the first character as Ã.
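To see the corruption in isolation, here is a small sketch using a StringWriter in place of the question's FileWriter; the 'ö' example follows the answer above:
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

public class BytePerCharDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ö".getBytes(StandardCharsets.UTF_8); // two bytes: 0xC3 0xB6
        StringWriter sw = new StringWriter();
        for (byte b : utf8) {
            // same call shape as bw.write(c[i]) in the question: the byte is
            // sign-extended to an int and its low 16 bits become one char,
            // so the two UTF-8 bytes turn into two unrelated characters
            sw.write(b);
        }
        System.out.println(sw.toString().equals("ö")); // false
    }
}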
Try to debug your while condition (readChars = is.read(c)) != -1. If it never becomes false, you get an infinite loop, bw.close() is never called, and the file stays open; if you try to perform some other operation on the file at the same time, it can get corrupted and you will get undesired results.

Java - Scanner not scanning after a certain number of lines

I'm doing some relatively simple I/O in Java. I have a .txt file that I'm reading from using a Scanner and a .txt file I'm writing to using a BufferedWriter. Another Scanner then reads that file and another BufferedWriter then creates another .txt file. I've provided the code below just in case, but I don't know if it will help much, as I don't think the code is the issue here. The code compiles without any errors, but it's not doing what I expect. For some reason, charReader will only read about half of its file, then hasNext() will return false, even though the end of the file hasn't been reached. These aren't big text files: statsReader's file is 34 KB and charReader's file is 29 KB, which is even weirder, because statsReader reads its entire file fine, and it's bigger! Also, I do have that code surrounded in a try/catch; I just didn't include it.
From what I've looked up online, this may happen with very large files, but these are quite small, so I'm pretty lost.
My OS is Windows 7 64-bit.
Scanner statsReader = new Scanner(statsFile);
BufferedWriter statsWriter = new BufferedWriter(new FileWriter(outputFile));
while (statsReader.hasNext()) {
    statsWriter.write(statsReader.next());
    name = statsReader.nextLine();
    temp = statsReader.nextLine();
    if (temp.contains("form")) {
        name += " " + temp;
        temp = statsReader.next();
    }
    statsWriter.write(name);
    statsWriter.newLine();
    statsWriter.write(temp);
    if (!(temp = statsReader.next()).equals("-"))
        statsWriter.write("/" + temp);
    statsWriter.write("\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "");
    statsWriter.newLine();
    statsReader.nextInt();
}
Scanner charReader = new Scanner(charFile);
BufferedWriter codeWriter = new BufferedWriter(new FileWriter(codeFile));
while (charReader.hasNext()) {
    color = charReader.next();
    name = charReader.nextLine();
    name = name.replaceAll("\t", "");
    typing = charReader.next();
    place = charReader.nextInt();
    area = charReader.nextInt();
    def = charReader.nextInt();
    shape = charReader.nextInt();
    size = charReader.nextInt();
    spe = charReader.nextInt();
    index = typing.indexOf('/');
    if (index == -1) {
        typeOne = determineType(typing);
        typeTwo = '0';
    } else {
        typeOne = determineType(typing.substring(0, index));
        typeTwo = determineType(typing.substring(index + 1, typing.length()));
    }
}
SSCCE:
public class Tester {
    public static void main(String[] args) {
        File statsFile = new File("stats.txt");
        File testFile = new File("test.txt");
        try {
            Scanner statsReader = new Scanner(statsFile);
            BufferedWriter statsWriter = new BufferedWriter(new FileWriter(testFile));
            while (statsReader.hasNext()) {
                statsWriter.write(statsReader.nextLine());
                statsWriter.newLine();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is a classic problem: you need to flush and close the output stream (in this case statsWriter) before reading the file.
Being buffered, it doesn't actually write to the file with every call to write. Calling flush forces it to complete any pending write operations.
Here's the javadoc for OutputStream.flush():
Flushes this output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
After you have written your file with your statsWriter, you need to call:
statsWriter.flush();
statsWriter.close();
or simply:
statsWriter.close(); // this will call flush();
This is because you are using a BufferedWriter: it does not write everything out to the file as you call the write methods, but rather in buffered chunks. When you call flush() and close(), it empties all the content it still has in its buffer out to the file, and closes the stream.
You will need to do the same for your second writer.
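A minimal sketch of the fix using try-with-resources, which closes (and therefore flushes) the writer before the file is read again; the file name follows the SSCCE above:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class FlushBeforeRead {
    public static void main(String[] args) throws IOException {
        try (BufferedWriter statsWriter = new BufferedWriter(new FileWriter("test.txt"))) {
            statsWriter.write("line 1");
            statsWriter.newLine();
        } // close() runs here and flushes the buffer to disk
        // only now is the file guaranteed to be complete for the next reader
        try (Scanner reader = new Scanner(new File("test.txt"))) {
            while (reader.hasNextLine()) {
                System.out.println(reader.nextLine());
            }
        }
    }
}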
