Good evening!
I have been reading through material on the internet for hours now, but I can't find any way to get the content of a file from the internet into an int array.
I have a .txt file (which I download from the internet) that is loaded via a BufferedInputStream. There is a byte array which I tried to make use of, but I didn't have much success. Inside the file are random letters such as "abcCC". Now I need the int value of each character (such as 97, 98, 99, 67, 67). I would add them to an array and then count how often a specific value appears. My problem, though, is getting those values from the stream; I can't seem to find a way to do so.
Thank you for any ideas!
The Java API already contains a method that seems very convenient for you. InputStream defines the following method.
public abstract int read() throws IOException;
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255. If no byte is available because the end of the stream has been reached, the value -1 is returned. This method blocks until input data is available, the end of the stream is detected, or an exception is thrown.
It should make it trivial to read integers from a file one byte at a time (assuming the mapping from characters to integers is one-to-one; an int is actually four bytes in size).
An example of using this method to read characters individually as int values and then casting to char follows. Again, this assumes that each character is encoded as a single byte and that there is a one-to-one mapping from characters to ints. If you're dealing with a multi-byte character encoding and/or you want to support integers greater than 255, the problem becomes more complex.
import java.io.ByteArrayInputStream;

public static void main(String[] args) {
    ByteArrayInputStream in = new ByteArrayInputStream("abc".getBytes());
    int value;
    while ((value = in.read()) != -1) { // read() returns -1 at end of stream
        System.out.println((char) value);
    }
}
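To address the counting part of the question, here is a minimal sketch that tallies how often each byte value occurs. The URL is a placeholder, and the sketch assumes a single-byte encoding such as ASCII:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class CharCounter {
    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the real location of your .txt file.
        try (InputStream in = new BufferedInputStream(
                new URL("http://example.com/letters.txt").openStream())) {
            int[] counts = new int[256];        // one slot per possible byte value
            int value;
            while ((value = in.read()) != -1) { // read() returns 0-255, or -1 at EOF
                counts[value]++;
            }
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] > 0) {
                    System.out.println((char) i + " (" + i + "): " + counts[i]);
                }
            }
        }
    }
}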
I would use a java.util.Scanner.
You can initialize it with many different sources, including:
File
InputStream
Readable
Later you can process the whole input line by line, or in any way you like.
Please refer to the excellent javadoc of Scanner.
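A minimal sketch of that approach, assuming the file has already been downloaded to a local path (the file name is made up):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScanExample {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner scanner = new Scanner(new File("downloaded.txt"));
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            for (char c : line.toCharArray()) {
                System.out.println((int) c); // the int value of each character
            }
        }
        scanner.close();
    }
}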
Regards,
z1pi
I am attempting to convert a program that reads a binary file in C++ to java. The file is in little-endian.
fread(n, sizeof (unsigned), 1, inputFile);
The C++ snippet above reads one unsigned integer into the variable 'n'.
I am currently using this method to accomplish the same thing:
public static int readInt(RandomAccessFile inputStream) throws IOException {
    byte[] buffer = new byte[4];
    inputStream.readFully(buffer);                // read exactly 4 bytes
    ByteBuffer wrapped = ByteBuffer.wrap(buffer);
    wrapped.order(ByteOrder.LITTLE_ENDIAN);       // interpret them little-endian
    return wrapped.getInt();
}
but this method sometimes differs in its result from the C++ example. I haven't been able to determine which parts of the file cause this method to fail, but I know it does. For example, when reading one part of the file my readInt method returns 543974774 but the C++ version returns 1.
Is there a better way to read little-endian values in Java? Or is there some obvious flaw in my implementation? Any help understanding where I could be going wrong, or how I could read these values more effectively, would be very much appreciated.
Update:
I am using RandomAccessFile because I frequently require fseek functionality, which RandomAccessFile provides in Java.
543974774 is, in hex, 206C6576.
There is no endianness on the planet that turns 206C6576 into 1. The problem is therefore that you aren't reading what you think you're reading: if the C++ code reads 4 bytes (or even a variable, unknown number of bytes) and turns them into 1, then your Java code wasn't reading the same bytes. Your C++ code and your Java code are out of sync: at some point, your C++ code read, for example, 2 bytes where your Java code read 4, or vice versa.
The problem isn't in your readInt method; that does the job properly every time.
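One way to confirm the two readers have drifted apart is to dump the raw bytes before decoding them. A sketch (the method name is mine):

public static int readIntDebug(RandomAccessFile in) throws IOException {
    byte[] buffer = new byte[4];
    in.readFully(buffer);
    // Print the raw bytes so you can compare file positions with the C++ reader.
    for (byte b : buffer) {
        System.out.printf("%02X ", b & 0xFF);
    }
    System.out.println();
    return ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN).getInt();
}

In this case the four bytes would be 76 65 6C 20, which is ASCII "vel ", a strong hint that the Java reader has wandered into text data.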
I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out-of-memory exception, as the amount of data handled is huge: about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out-of-memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the correct direction on how to handle this?
Note: the file may contain records of different lengths, hence splitting it based on record length seems not possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if English-speaking you can do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    while (total < buf.length) {
        int num = in.read(buf, total, buf.length - total);
        if (num < 0) {
            break;                  // end of stream before the buffer filled
        }
        total += num;
    }
    return total == 0 ? -1 : total; // bytes actually read, or -1 at end of stream
}
If all the records are the same length, create an array of the record length and the above method will read one record at a time, as sketched below.
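A sketch under the fixed-length assumption (the record length of 100 and the process() call are made up):

final int RECORD_LENGTH = 100;             // assumption: use your real record length
byte[] record = new byte[RECORD_LENGTH];
try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
    while (readBuffer(in, record) > 0) {
        process(record);                   // hypothetical: decode COMP-3 / EBCDIC fields here
    }
}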
Finally, the JRecord project has classes to read fixed-length files, etc. It can do COMP-3 conversion. Note: I am the author of JRecord.
I'm running into an out-of-memory exception as the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out-of-memory exception, which I can understand
Me too.
but I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record or, in both cases, at least within the fixed part of the record.
I'll have to split it based on an attribute, record_id, at a particular position (say at the beginning of each record) that will tell me the record length
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
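A sketch of that record-at-a-time approach; the one-byte record_id attribute and the lengthFor() lookup are assumptions, so adapt them to your actual layout:

try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("data.bin")))) {
    while (true) {
        int recordId;
        try {
            recordId = in.readUnsignedByte(); // the record_id attribute (assumed 1 byte)
        } catch (EOFException e) {
            break;                            // clean end of file
        }
        int length = lengthFor(recordId);     // hypothetical: map record_id to record length
        byte[] record = new byte[length];
        in.readFully(record);                 // the rest of the record
        // decode the EBCDIC / COMP-3 fields of this record here
    }
}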
First of all, let me explain what I need to do.
I need to read a file (whose size could be anything from 1 byte to 2 GB); 2 GB is the maximum because I use a MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder, and then I put this StringBuilder into an ArrayList.
However, I also need to do the following:
The user can specify blockSize, which is the number of chars I have to read into the StringBuilder (basically the number of file bytes converted to chars).
Once I have collected the user-defined char count, I create a copy of the StringBuilder and put it into the ArrayList.
These steps are performed for every char read. The problem is with the StringBuilder: if the file is big (over 500 MB), I get an OutOfMemoryError.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at java.lang.StringBuilder.<init>(StringBuilder.java:106)
at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1
I am posting my code; maybe someone could suggest improvements or alternatives.
public class ReadFile {

    //matrix block size
    public int blockSize = 100;
    public int charCounter = 0;

    public ArrayList readFile(File file) throws FileNotFoundException, IOException {
        FileChannel fc = new FileInputStream(file).getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
        ArrayList characters = new ArrayList();
        int counter = 0;
        StringBuilder sb = new StringBuilder(); //blockSize-1
        while (mbb.hasRemaining()) {
            char charAscii = (char) mbb.get();
            counter++;
            charCounter++;
            if (counter == blockSize) {
                sb.append(charAscii);
                characters.add(new StringBuilder(sb)); //new StringBuilder(sb)
                sb.delete(0, sb.length());
                counter = 0;
            } else {
                sb.append(charAscii);
            }
            if (!mbb.hasRemaining()) {
                characters.add(sb);
            }
        }
        fc.close();
        return characters;
    }
}
EDIT:
I am doing the Burrows-Wheeler transformation. I have to read every file and then, according to the block size, create as many matrices as needed. I believe the wiki explains it better than I can:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
If you load large files, it's not entirely surprising that you run out of memory.
How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g. using the -Xmx setting)?
Bear in mind that you will need at least twice as much memory as the file size, because Java uses UTF-16 internally, which takes at least 2 bytes for each character, while your input is one byte per character. So to load a 2 GB file you will need at least 4 GB of heap just for storing this text data.
Also, you need to sort out the logic in your code: you do the same sb.append(charAscii) in both the if and the else, and you test !mbb.hasRemaining() in every iteration of a while (mbb.hasRemaining()) loop.
As I asked in your previous question, do you need to store StringBuilders, or would the resulting Strings be OK? Storing strings would save space because StringBuilder allocates memory in big chunks (I think it doubles in size every time it runs out of space!) so may waste a lot.
If you do have to use StringBuilders then pre-sizing them to the value of blockSize would make the code more memory-efficient (and faster).
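A sketch of the loop reworked along those lines, reusing mbb and blockSize from the question (storing Strings instead of builders, and pre-sizing the one builder that remains):

List<String> blocks = new ArrayList<>();
StringBuilder sb = new StringBuilder(blockSize); // pre-sized: no incremental regrowing
while (mbb.hasRemaining()) {
    sb.append((char) mbb.get());
    if (sb.length() == blockSize) {
        blocks.add(sb.toString());               // store the String, not the builder
        sb.setLength(0);                         // reuse the same builder
    }
}
if (sb.length() > 0) {
    blocks.add(sb.toString());                   // trailing partial block
}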
I try to use MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder, and then I put this StringBuilder into an ArrayList.
This sounds more like a problem than a solution. I suggest to you that the file already is ASCII, or character data; that it could be read pretty efficiently using a BufferedReader; and that it can be processed one line at a time.
So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing including the MappedByteBuffer is consuming memory on a truly heroic scale.
If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.
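A minimal sketch of that line-at-a-time approach (the file name is made up):

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("input.txt"),
                              StandardCharsets.US_ASCII))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process one line at a time; only the current line is held in memory
    }
}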
I am a beginner and I have a file with variable-sized records; there are two fields per row,
i.e. a 7-15 digit key, followed by a space, and then a string which is also variable-sized for each record.
I am trying to read only a page-sized number of bytes into my buffer and then process them.
The problem is this: I use java.io.RandomAccessFile and its seek method to reach a particular line, then I use the readFully method to read those 1024 bytes into my buffer. I have written functions to convert bytes into an int and bytes into a String, but the problem is that I don't know how many bytes form the 7-15 digit key and how many bytes form the string.
When you say a row, do you mean each row ends with a line separator? If that is the case, you can use something like BufferedReader's readLine() method. That gives you a String which is one line, without the line separator.
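If each row really is "key, space, string", a sketch of splitting one such line (this assumes the key is decimal digits; a 15-digit key still fits in a long):

String line = reader.readLine();
if (line != null) {
    int space = line.indexOf(' ');
    long key = Long.parseLong(line.substring(0, space)); // the 7-15 digit key
    String value = line.substring(space + 1);            // the rest of the row
}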
I use BufferedReader's readLine() method to read lines of text from a socket.
There is no obvious way to limit the length of the line read.
I am worried that the source of the data can (maliciously or by mistake) write a lot of data without any line feed character, and this will cause BufferedReader to allocate an unbounded amount of memory.
Is there a way to avoid that? Or do I have to implement a bounded version of readLine() myself?
The simplest way to do this will be to implement your own bounded line reader.
Or even simpler, reuse the code from this BoundedBufferedReader class.
Actually, coding a readLine() that works the same as the standard method is not trivial. Dealing with the 3 kinds of line terminator CORRECTLY requires some pretty careful coding. It is interesting to compare the different approaches of the above link with the Sun version and Apache Harmony version of BufferedReader.
Note: I'm not entirely convinced that either the bounded version or the Apache version is 100% correct. The bounded version assumes that the underlying stream supports mark and reset, which is certainly not always true. The Apache version appears to read-ahead one character if it sees a CR as the last character in the buffer. This would break on MacOS when reading input typed by the user. The Sun version handles this by setting a flag to cause the possible LF after the CR to be skipped on the next read... operation; i.e. no spurious read-ahead.
Another option is Apache Commons' BoundedInputStream:
InputStream bounded = new BoundedInputStream(is, MAX_BYTE_COUNT);
BufferedReader reader = new BufferedReader(new InputStreamReader(bounded));
String line = reader.readLine();
The limit for a String is about 2 billion chars (Integer.MAX_VALUE). If you want the limit to be smaller, you need to read the data yourself. You can read one char at a time from the buffered stream until the limit or a newline char is reached.
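A sketch of that char-at-a-time approach (the method name is mine; if the limit is hit, the rest of the line is left in the stream):

static String readBoundedLine(Reader in, int max) throws IOException {
    StringBuilder sb = new StringBuilder();
    boolean sawAnything = false;
    int c;
    while (sb.length() < max && (c = in.read()) != -1) {
        sawAnything = true;
        if (c == '\n') return sb.toString();   // line terminator reached
        if (c != '\r') sb.append((char) c);    // drop the CR of a CRLF pair
    }
    return sawAnything ? sb.toString() : null; // null only at end of stream
}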
Perhaps the easiest solution is to take a slightly different approach. Instead of attempting to prevent a DoS by limiting one particular read, limit the entire amount of raw data read. In this way you don't need to worry about using special code for every single read and loop, so long as the memory allocated is proportionate to incoming data.
You can either meter the Reader, or probably more appropriately, the undecoded Stream or equivalent.
There are a few ways round this:
- if the amount of data overall is very small, load the data in from the socket into a buffer (a byte array or ByteBuffer, depending on what you prefer), then wrap the BufferedReader around the data in memory (via a ByteArrayInputStream, etc.);
- just catch the OutOfMemoryError, if it occurs; catching this error is generally not reliable, but in the specific case of catching array allocation failures it is basically safe (it does not, however, solve the issue of any knock-on effect that one thread allocating large amounts from the heap could have on other threads running in your application);
- implement a wrapper InputStream that will only read so many bytes, then insert this between the socket and the BufferedReader (see the sketch after this list);
- ditch BufferedReader and split your lines via the regular-expressions framework (implement a CharSequence whose chars are pulled from the stream, and then define a regular expression that limits the length of lines); in principle, a CharSequence is supposed to be random access, but for a simple "line splitting" regex you will probably find in practice that successive chars are always requested, so that you can "cheat" in your implementation.
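A sketch of the wrapper-stream option (the class name is mine; mark/reset and skip are left untouched for brevity):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LimitedInputStream extends FilterInputStream {

    private long remaining;

    public LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1; // limit reached: report end of stream
        int b = super.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(b, off, (int) Math.min(len, remaining));
        if (n != -1) remaining -= n;
        return n;
    }
}

Insert it between the socket and the BufferedReader, e.g. new BufferedReader(new InputStreamReader(new LimitedInputStream(socket.getInputStream(), MAX_BYTES))).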
In BufferedReader, instead of using String readLine(), use int read(char[] cbuf, int off, int len); you can then use boolean ready() to see if you got it all, and convert it into a String using the constructor String(char[] value, int offset, int count).
If you don't care about the whitespace and you just want a maximum number of characters per line, then the proposal Stephen suggested is really simple:
import java.io.BufferedReader;
import java.io.IOException;

public class BoundedReader extends BufferedReader {

    private final int bufferSize;
    private final char[] buffer;

    BoundedReader(final BufferedReader in, final int bufferSize) {
        super(in);
        this.bufferSize = bufferSize;
        this.buffer = new char[bufferSize];
    }

    @Override
    public String readLine() throws IOException {
        int no;

        /* read up to bufferSize */
        if ((no = this.read(buffer, 0, bufferSize)) == -1) return null;
        String input = new String(buffer, 0, no).trim();

        /* skip the rest of the line */
        while (no >= bufferSize && ready()) {
            if ((no = read(buffer, 0, bufferSize)) == -1) break;
        }

        return input;
    }
}
Edit: this is intended to read lines from a user terminal. It blocks until the next line is available and returns a bufferSize-bounded String; any further input on the line is discarded.