Taking input in Java - java

I have some confusion if someone can help. Tried searching the web for it but didn't get any satisfying answer.
Why don't we simply use System.in.somemethod() to take input in Java just like we do for output? Like System.out.println() is used so why not System.in as it is? Why is there a long process for Input?

The only methods that System.in, an InputStream, provides are the overloads of read. Sure, you could do something like:
byte[] bytes = new byte[5];
System.in.read(bytes);
System.out.println(Arrays.toString(bytes));
to read five bytes from the console. But this has the following disadvantages:
You need to handle the checked IOException. (not shown in the code snippet above)
Hard to work with bytes. (unless you want them specifically)
You usually just want to read the input until the end of a line. With this it's hard to know where the end of a line is.
So that's why people use Scanners to wrap the System.in stream into something more user-friendly.
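As a rough sketch of the wrapping described above, a Scanner can turn any InputStream, System.in included, into a sequence of lines. The class name LineInput and the method readLines are made up for this illustration.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LineInput {
    // Hypothetical helper: collects every line until end-of-input.
    // Scanner hides both the byte-to-character decoding and the
    // checked IOException that raw System.in.read() would force on us.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        Scanner scanner = new Scanner(in);
        while (scanner.hasNextLine()) {
            lines.add(scanner.nextLine());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Type some lines, then signal end-of-input (Ctrl-D or Ctrl-Z).
        for (String line : readLines(System.in)) {
            System.out.println("read: " + line);
        }
    }
}
```

The same method works with any other InputStream, which makes it easy to try out without typing at the console.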

Taking input from the command line will always be trickier than just outputting data. This is because there is no way to know that the input is semantically correct, structured correctly or even syntactically correct.
If you just want to read bytes from System.in then a lot of the uncertainty of the input disappears. In that case there are only two things to take into account: I/O errors and end-of-input - both of which are also present for System.out. The only other thing that may be tricky is that an InputStream may not return all the bytes that are requested in a single call to read.
So reading data from System.in isn't hard; interpreting the data - which often comes down to parsing the data or validating the data - is the hard part. And that's why often the Scanner class is used to make sense of the input.
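To illustrate the point about a single read call returning fewer bytes than requested, here is a minimal sketch of the usual loop around read. The names ReadFully and readFully are chosen for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFully {
    // Keeps calling read until the buffer is full or the stream ends.
    // Returns the number of bytes actually stored in the buffer.
    static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) {
                break; // end of input before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[5];
        InputStream in = new ByteArrayInputStream("hello world".getBytes());
        int n = readFully(in, buffer);
        System.out.println(n + " bytes: " + new String(buffer, 0, n));
    }
}
```

Both things mentioned above show up here: the IOException in the signature and the -1 check for end-of-input.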

Just as you cannot call System.out.somemethod() in place of System.out.println(), you cannot call System.in.somemethod() in place of System.in.read().

Related

Reading ahead with BufferedReader (Java)

I'm writing a parser for files that look like this:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
I want to get information preceded by certain tags (DEFINITION, VERSION etc.) but some descriptions cover multiple lines and I do need all of it. This is a problem when using BufferedReader to read my file.
I almost figured it out by using mark() and reset() but when executing my program I noticed that it only works for one tag and other tags are somehow skipped. This is the code I have so far:
Pattern pTag = Pattern.compile("^[A-Z]{2,}"); // regex: 2 or more uppercase letters is a tag
Matcher mTagCurr = pTag.matcher(line);
if (mTagCurr.find()) {
    reader.mark(1000);
    String nextLine = reader.readLine();
    Matcher mTagNext = pTag.matcher(nextLine);
    if (mTagNext.find()) {
        reader.reset();
        continue;
    }
    Pattern pWhite = Pattern.compile("^\\s{6,}");
    Matcher mWhite = pWhite.matcher(nextLine);
    while (mWhite.find()) {
        line = line.concat(nextLine);
    }
    System.out.println(line);
}
This piece of code is supposed to find tags and concatenate descriptions that cover more than one line. Some answers I found here advised using Scanner. This is not an option for me. The files I work with can be very large (largest I encountered was >50GB) and by using BufferedReader I wish to put less of a strain on my system.
I suggest accumulating the information as you read it in a single-pass parser. I suspect this will be simpler and faster in this case.
BTW, you want to cache your Patterns, as creating them is quite expensive. You may find that you want to avoid using them entirely in some cases.
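A single-pass version could look roughly like the sketch below. It assumes that a tag line starts with two or more uppercase letters in the first column and that every other non-blank line continues the previous entry. The names TagAccumulator, startsWithTag and readEntries are made up for the example, and plain character checks stand in for the regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;

class TagAccumulator {
    // Assumption: a tag is two or more uppercase letters at the very
    // start of the line (continuation lines are indented).
    static boolean startsWithTag(String line) {
        return line.length() >= 2
                && Character.isUpperCase(line.charAt(0))
                && Character.isUpperCase(line.charAt(1));
    }

    // Reads the input once. An entry is handed to the sink only when the
    // next tag (or end of input) is seen, so mark()/reset() is not needed.
    static void readEntries(BufferedReader reader, Consumer<String> sink)
            throws IOException {
        StringBuilder current = null;
        String line;
        while ((line = reader.readLine()) != null) {
            if (startsWithTag(line)) {
                if (current != null) {
                    sink.accept(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null && !line.trim().isEmpty()) {
                // Any other non-blank line is treated as a continuation.
                current.append(' ').append(line.trim());
            }
        }
        if (current != null) {
            sink.accept(current.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        String sample = "DEFINITION  first part\n"
                + "            second part\n"
                + "ACCESSION   U49845\n";
        readEntries(new BufferedReader(new StringReader(sample)), System.out::println);
    }
}
```

Because each entry goes straight to the sink, only one entry is held in memory at a time, which matters for very large files.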
The code starts by finding a continuation line and calling reset() if it does not find it, but the code that reads additional lines does not seem to do that. Could it be reading the start of another section in the Genbank file and not putting it back? I don't see all the loop control code here, but what I do see appears to be correct.
If all else fails and you need something easy, there's always BioJava (see How to Read a Genbank File with Biojava3 and see if it helps). I have tried to use BioJava for my own projects, but it always falls a little short.
When I have written FASTA and FASTQ parsers, I read into a byte or char buffer and process it that way, but there is more buffer management code to write. That way, I don't have to worry about putting bytes back in a buffer. This can also avoid regex, which can be expensive in a time-critical application. Of course, this takes more time to implement.
Tip: For the fastest implementation, if you are managing the buffer yourself, check out NIO (Java NIO Tutorial). I have seen it give up to a 10x speedup in some cases (writing data). The only drawback is that I have not found an easy way to read gzipped sequence data with NIO yet.

Reading different encoding from the same InputStream [duplicate]

I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails knowledge of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, and the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering using it for an application that wants to read stuff byte-wise.
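For the bit-array exercise, byte-wise reading could be sketched as follows. The sketch assumes the file holds ASCII decimal numbers of at most 7 digits separated by non-digit bytes such as newlines. The names NumberBits and readNumbers are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.BitSet;

class NumberBits {
    // Reads one byte at a time and sets bit n for every number n found.
    // Assumption: numbers have at most 7 digits, so n < 10,000,000.
    static BitSet readNumbers(InputStream in) throws IOException {
        BitSet seen = new BitSet(10_000_000); // one bit per possible value
        int value = 0;
        boolean inNumber = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b >= '0' && b <= '9') {
                value = value * 10 + (b - '0');
                inNumber = true;
            } else if (inNumber) {
                seen.set(value);
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) {
            seen.set(value); // last number had no trailing separator
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new FileInputStream(path) instead.
        InputStream in = new ByteArrayInputStream("1234567\n0000042\n".getBytes());
        BitSet seen = readNumbers(in);
        System.out.println(seen.get(1234567) + " " + seen.get(42) + " " + seen.get(7));
    }
}
```

No decoding to characters is involved, so nothing reads ahead. The price is one read call per byte, which is slow on an unbuffered FileInputStream.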

Buffered Input Stream mark read limit

I am learning how to use an InputStream. I was trying to use mark for BufferedInputStream, but when I try to reset I get this exception:
java.io.IOException: Resetting to invalid mark
I think this means that my mark read limit is set wrong. I actually don't know how to set the read limit in mark(). I tried like this:
is = new BufferedInputStream(is);
is.mark(is.available());
This is also wrong.
is.mark(16);
This also throws the same exception.
How do I know what read limit I am supposed to set? Since I will be reading different file sizes from the input stream.
mark is sometimes useful if you need to inspect a few bytes beyond what you've read to decide what to do next, then you reset back to the mark and call the routine that expects the file pointer to be at the beginning of that logical part of the input. I don't think it is really intended for much else.
If you look at the javadoc for BufferedInputStream it says
The mark operation remembers a point in the input stream and the reset operation causes all the bytes read since the most recent mark operation to be reread before new bytes are taken from the contained input stream.
The key thing to remember here is once you mark a spot in the stream, if you keep reading beyond the marked length, the mark will no longer be valid, and the call to reset will fail. So mark is good for specific situations and not much use in other cases.
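The "inspect a few bytes, then reset" use case could be sketched like this. The helper name peek and the class name PeekDemo are made up. The read limit passed to mark() is exactly the number of bytes the helper is willing to read before resetting.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

class PeekDemo {
    // Returns up to n upcoming bytes without consuming them.
    static byte[] peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n); // the mark stays valid while we read at most n bytes
        byte[] head = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(head, total, n - total);
            if (r == -1) {
                break; // stream is shorter than n bytes
            }
            total += r;
        }
        in.reset(); // rewind to the marked position
        return Arrays.copyOf(head, total);
    }

    public static void main(String[] args) throws IOException {
        BufferedInputStream in = new BufferedInputStream(
                new ByteArrayInputStream("GIF89a...image data".getBytes()));
        System.out.println(new String(peek(in, 6)));
        System.out.println((char) in.read()); // still starts at the first byte
    }
}
```

Since the helper never reads more than the limit it passed to mark(), the reset() call cannot fail with "Resetting to invalid mark".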
This will read 5 times from the same BufferedInputStream.
for (int i = 0; i < 5; i++) {
    inputStream.mark(inputStream.available() + 1);
    // Read from input stream
    Thumbnails.of(inputStream).forceSize(160, 160).toOutputStream(out);
    inputStream.reset();
}
The value you pass to mark() is the read limit: the number of bytes you may read after the mark while still being able to reset() back to it. If you need to reset to the beginning of the stream, you will need a buffer as big as the entire stream. This is probably not a great design, as it will not scale well to large streams. If you need to read the stream twice and you don't know the source of the data (e.g. if it's a file, you could just re-open it), then you should probably copy it to a temp file so you can re-read it at will.

equivalent of ungetc In java

I'm writing a program in Java (I already wrote a version in C too). I really need to put a character back into the input stream after I read it. This can be accomplished with ungetc() in C/C++, and I was wondering how I can do the same thing in Java.
For those of you who don't know C/C++:
char myc = (char)System.in.read();
I check the value of myc, and now I want to put myc back into System.in, so that when I read from System.in again I get that value. (But how can I do this?)
NOTE: I'm looking for the exact technique. Please do not advise me to cache it or log it somewhere else and read it from there, because I know how to do that kind of stuff. What I'm interested in is the equivalent of ungetc() in Java, if there is one.
Cheers.
You are looking for PushbackInputStream
Java's IO library is designed so the primitives are really basic and additional functionality is added through composition. For example, if you wanted to buffer input from a file, you would call new BufferedInputStream(new FileInputStream("myfile.txt")); or if you wanted to read from the stream as text using UTF-8 encoding you'd call new InputStreamReader(new FileInputStream("myfile.txt"), "UTF-8");
So what you want to do is create a PushbackInputStream with new PushbackInputStream(System.in);
The caveat here is you're not actually pushing the bytes back onto standard input. Once you've read from System.in it's gone, and no other code accessing System.in will be able to get at that byte. Anything you push back will only ever be available from the particular PushbackInputStream you created to handle the data.
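A small sketch of how that looks in use. The class UngetDemo and the helper skipSpaces are invented for the example, and an in-memory stream stands in for System.in so it can be run without typing anything.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class UngetDemo {
    // Skips leading spaces, then pushes the first non-space byte back
    // so that the next read() on the same stream returns it again.
    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) == ' ') {
            // keep skipping
        }
        if (c != -1) {
            in.unread(c); // plays the role of ungetc(c, stream) in C
        }
    }

    public static void main(String[] args) throws IOException {
        // new PushbackInputStream(System.in) would work the same way.
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream("   x y".getBytes()));
        skipSpaces(in);
        System.out.println((char) in.read());
    }
}
```

The one-argument constructor allows a single byte of pushback. A second constructor argument sets a larger pushback buffer if more than one byte has to be unread.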
If the input stream supports marking you can use the mark() and reset() methods to achieve what you intend.
You can wrap System.in into a BufferedInputStream or a BufferedReader to gain marking support.
A RandomAccessFile can .seek( .getFilePointer() - 1 ).

How do I identify that I am at the last byte of a serialized Java object?

Question
What are the terminating characters/byte sequences (if there are any) in serialized Java objects?
Background
I'm working on a small self-education project where I would like to serialize Java objects and write them to a stream, where they are read and then deserialized. Since I will need to identify the borders between serialized objects, and I can't be sure that the current object is not the last one, is there a terminating character that is always there that I can use as my identifier?
I noticed that there is a magic number ACED that allows me to identify the start of the object, so how do I identify the end?
EDIT:
If there is no terminating character, are there any safe terminating characters/sequences that I can use (insert) to identify the end of the object?
In theory you should always be able to find the end of an object; in practice you cannot. As I understand it, the problem is that customised writeObject implementations that call neither defaultWriteObject nor writeFields have a non-standard representation.
I've played about with serialisation in the past. Including creating streams for use when I've been doing unusual things to the ObjectInputStream. It's not pleasant(!).
You can read the details in the spec, and the source is worth a read.
There are none. AFAIK the only requirement is that the deserialiser knows when to stop reading when given a corresponding serialisation. Subject to that, the serialiser can write whatever it wants, in any position, not just the last.
If you're old skool, dump a 32-bit length field at the beginning and refuse to handle objects bigger than 4 gig.
Nu skool, you just make sure your read and your write logic are consistent and don't care about the length.
You can add a terminating object to your object stream. e.g. null or a special String.
However, I suggest that you instead serialize each object to a byte[] (for example with an ObjectOutputStream wrapped around a ByteArrayOutputStream) and write the length of the byte[] followed by its data. This way each serialized object is independent and you always know where it finishes.
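The length-prefix idea could be sketched as follows. The names FramedObjects, writeFramed and readFramed are hypothetical, and the 4-byte length written by writeInt is one possible choice of prefix rather than anything the serialization format requires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class FramedObjects {
    // Serializes obj on its own, then writes: 4-byte length, then the bytes.
    static void writeFramed(DataOutputStream out, Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(obj);
        }
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads one frame. readInt throws EOFException when no frame is left.
    static Object readFramed(DataInputStream in) throws IOException, ClassNotFoundException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative frame length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until the whole frame has arrived
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        writeFramed(out, "first");
        writeFramed(out, 42);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(sink.toByteArray()));
        System.out.println(readFramed(in));
        System.out.println(readFramed(in));
    }
}
```

The reader never has to look inside the serialized bytes to find the border. It reads the length, reads that many bytes, and is then positioned at the start of the next frame.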
Have you considered applying a record-marking layer similar to HTTP Chunked encoding?
The chunked encoding is intended to solve a generalization of this scenario: identifying the end of a message of indeterminate length that contains no identifiable end of its own and is embedded in a longer stream without ending it.
