Substitute chars on Java 1.4 InputStream - java

I have an InputStream that is returning, for example:
<?xml version='1.0' ?><env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><bbs:rule xmlns:bbs="http://com.foo/bbs">
I then pass the stream to a method that returns a byte array.
I'd like to replace "com.foo" with something else, like "org.bar", before I pass the stream to the byte[] method.
What is a good way to do that?

If you have a byte array you can turn it into a String. Pay attention to the encoding; in the example I use UTF-8. (String.replace(CharSequence, CharSequence) is a Java 5 addition; on 1.4, replaceAll("com\\.foo", "org.bar") does the same job.) I think this is a simple way to do it:
String newString = new String(byteArray, "utf-8");
newString = newString.replace("com.foo", "org.bar");
return newString.getBytes("utf-8");
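If you still need to get the byte array out of the InputStream first (Java 1.4 has no InputStream.readAllBytes()), a plain copy loop does it. A minimal sketch, with is standing in for the stream from the question:
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[4096];
int n;
while ((n = is.read(buf)) != -1) {
    out.write(buf, 0, n);
}
byte[] byteArray = out.toByteArray(); // feed this to the snippet above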

One way is to wrap your InputStream in your own FilterInputStream subclass that does the transformation on the fly. It will have to be a look-ahead stream that checks every "c" character to see if it is followed by "om.foo" and, if so, makes the substitution. You'll have to override read(), and also read(byte[], int, int), because FilterInputStream's default implementation of the latter reads straight from the wrapped stream and would bypass your substitution.
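A rough sketch of that idea (the class name and helper fields are mine, not from the question). It leans on the fact that "com.foo" and "org.bar" are the same length, so the matched bytes can be rewritten in place:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Illustrative sketch only. A production version would also need to re-scan the
// look-ahead buffer for a match that starts inside it (e.g. "ccom.foo").
class SubstitutingInputStream extends FilterInputStream {

    private static final byte[] FROM = "com.foo".getBytes();
    private static final byte[] TO = "org.bar".getBytes();

    private final byte[] pending = new byte[FROM.length];
    private int pendingPos = 0;
    private int pendingLen = 0;

    SubstitutingInputStream(InputStream in) {
        super(in);
    }

    public int read() throws IOException {
        if (pendingPos < pendingLen) {
            return pending[pendingPos++] & 0xff;
        }
        int b = in.read();
        if (b != FROM[0]) {
            return b; // also covers end of stream (-1)
        }
        // Look ahead FROM.length bytes and substitute if they match.
        pending[0] = (byte) b;
        pendingLen = 1;
        while (pendingLen < FROM.length) {
            int next = in.read();
            if (next == -1) {
                break;
            }
            pending[pendingLen++] = (byte) next;
        }
        if (pendingLen == FROM.length && Arrays.equals(pending, FROM)) {
            System.arraycopy(TO, 0, pending, 0, TO.length);
        }
        pendingPos = 0;
        return pending[pendingPos++] & 0xff;
    }

    // FilterInputStream.read(byte[], int, int) calls the wrapped stream directly,
    // so route it through read() to keep the substitution. For simplicity this
    // blocks until len bytes or end of stream.
    public int read(byte[] b, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) {
                return count == 0 ? -1 : count;
            }
            b[off + count++] = (byte) c;
        }
        return count;
    }
}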

A stream reads/writes bytes. Trying to replace text in a binary representation is asking for trouble. So the first thing to do would be wrapping this stream into a Reader (like InputStreamReader) which will take care of translating the binary data into character information for you. You'll have to know the encoding of your streamed data, however, to make sure it is interpreted correctly. For example, UTF-8 or ISO-8859-1.
Once you have your textual data, you can think of how to replace parts of it. One way to do this is using regular expressions. However, this means you'll first have to read the entire stream into a string, do the substitution and then return the byte array. For large amounts of data, this might be inefficient.
Since you're dealing with XML data, you could make use of a higher-level approach and parse the XML in some way that allows you to process the contents without having to store them entirely in an intermediate format. A SAXParser with your own ContentHandler would do the trick. As events arrive, simply write them out again but with the proper alterations. Another approach would be an XSLT transformation with some extension function magic.
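For illustration, a hedged sketch of the SAX route (the filter class and the identity-Transformer wiring are my choices, not something given in the question): an XMLFilter rewrites the namespace URI as the events stream past, and the identity transform serializes them back out.
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

class NamespaceRewriteFilter extends XMLFilterImpl {

    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        super.startPrefixMapping(prefix, fix(uri));
    }

    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(fix(uri), localName, qName, atts);
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(fix(uri), localName, qName);
    }

    private String fix(String uri) {
        return uri.replaceAll("com\\.foo", "org.bar");
    }

    // Drive the filter with the identity transform and collect the rewritten bytes.
    static byte[] rewrite(InputStream xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        NamespaceRewriteFilter filter = new NamespaceRewriteFilter();
        filter.setParent(spf.newSAXParser().getXMLReader());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(filter, new InputSource(xml)), new StreamResult(out));
        return out.toByteArray();
    }
}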
Wasn't there supposed to be some support for stream manipulations like this in java.nio? Or was this planned for an upcoming Java version?

This may not be the most efficient way to do it, but it certainly works.
InputStream is = // input;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
// Use the same charset on both sides so the bytes round-trip unchanged.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos, "UTF-8"));
String line = null;
while ((line = reader.readLine()) != null)
{
    if (line.contains("com.foo"))
    {
        line = line.replace("com.foo", "org.bar");
    }
    writer.write(line);
    writer.newLine(); // readLine() strips line terminators, so put one back
}
writer.flush(); // without this the buffered output never reaches baos
return baos.toByteArray();

Related

java.io: Two ways to obtain a buffered character stream from an unbuffered byte one

I am switching to Java from C++ and am now going through some of the documentation on Java IO. If I want to make a buffered character stream from an unbuffered byte stream, I can do it in two ways:
Reader input1 = new BufferedReader(new InputStreamReader(new FileInputStream("Xanadu.txt")));
and
Reader input2 = new InputStreamReader(new BufferedInputStream(new FileInputStream("Xanadu.txt")));
So I can make it a character stream and then buffer it, or vice versa.
What is the difference between them and which is better?
Functionally, there is no difference. The two versions will behave the same way.
There is likely to be a difference in performance, with the first version probably a bit faster than the second when you read characters from the Reader one at a time.
In the first version, an entire buffer full of data will be converted from bytes to chars in a single operation. Then each read() call on the Reader will fetch a character directly from the character buffer.
In the second version, each read() call on the Reader performs one or more read() calls on the input stream and converts only those bytes read to a character.
If I was going to implement this (precise) functionality, I would do it like this:
Reader input = new BufferedReader(new FileReader("Xanadu.txt"));
and let FileReader deal with the bytes-to-characters decoding under the hood.
There is a case for using an InputStreamReader, but only if you need to specify the character set for the bytes-to-characters conversion explicitly.
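For example, a sketch of that variant (assuming Java 7+ and a file known to be UTF-8):
Reader input = new BufferedReader(
        new InputStreamReader(new FileInputStream("Xanadu.txt"), StandardCharsets.UTF_8));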

It's the String conversion again: UNIX Windows-1252 to String

I'm downloading a website in Java, using all this:
myUrl = new URL("here is my URL");
in = new BufferedReader(new InputStreamReader(myUrl.openStream()));
In this file however there are some special characters like ä,ö and ü. I need to be able to print these out properly.
I try to encode the Strings using:
String encodedString = new String(toEncode.getBytes("Windows-1252"), "UTF-8");
But all it does is replace these special characters with a ?.
When I open the .html file I downloaded from Chrome in Notepad++, it says (in the bottom right corner) UNIX and Windows-1252. That's all I know about the file's encoding.
What more steps can I take to figure out what is wrong?
--AND--
How can I convert this file so that I can properly read and print it in Java?
Sorry if this question is kind of stupid... I simply don't know any better and couldn't find anything on the internet.
OK, so you are mixing a lot of things here.
First of all, you do:
new InputStreamReader(myUrl.openStream())
this will open a reader, yes; however, it will do so using your default JRE/OS Charset. Maybe not what you want.
Try and specify that you want UTF_8 (note, Java 7+ code):
try (
    final InputStream in = myUrl.openStream();
    final Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
) {
    // read from the reader here
}
Now, what you are mixing...
You read from an InputStream; an InputStream only knows how to read bytes.
But you want text; and in Java, text means a sequence of chars.
Let us forget for a moment that you want chars and focus on the fact that you want text; let us substitute a carrier pigeon for a char.
Now, what you need to do is to transform this stream of bytes into a stream of carrier pigeons. For this, you need a particular process. And in this case, the process is called decoding.
Back to Java, now. There also exists a process which does the reverse: encoding a stream of carrier pigeons (or chars) into a stream of bytes.
The trick... There exist several ways to do that; Unicode refers to them as character encodings; and in Java, the base class which provides both encoders and decoders is a Charset.
Now, an InputStreamReader accepts a Charset as a second argument... Which you should ALWAYS specify. If you DO NOT, this:
new InputStreamReader(in);
will be equivalent to:
new InputStreamReader(in, Charset.defaultCharset());
and Charset.defaultCharset() is Not. Guaranteed. To. Be. The. Same. Amongst. Implementations. Of. JREs.
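Since Notepad++ reports the downloaded page as Windows-1252, a minimal sketch for this particular case would name that charset when building the reader, rather than re-encoding afterwards:
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(myUrl.openStream(), Charset.forName("windows-1252")))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line); // the Strings now hold ä, ö and ü correctly
    }
}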

Read lines from Java FileInputStream without losing my place

I have a FileInputStream. I'd like to read character-oriented, linewise data from it, until I find a particular delimiter. Then I'd like to pass the FileInputStream, with the current position set immediately after the end of the delimiter line, to a library that needs an InputStream.
I can use a BufferedReader to walk through the file a line at a time, and everything works great. However, this leaves the underlying file stream in
BufferedReader br = new BufferedReader(new InputStreamReader(myFileStream))
at a non-deterministic position -- the BufferedReader had to look ahead, and I don't know how far, and AFAICT there's no way to tell the BufferedReader to rewind the underlying stream to just after the last-returned line.
Is this the best solution? It seems crazy to have a ReaderInputStream(BufferedReader(InputStreamReader(FileInputStream))) but it's the only way I've seen to avoid rolling my own. I'd really like to avoid writing my own entire stream-that-reads-lines implementation if at all possible.
You cannot unbuffer a buffered reader. You have to use the same wrapper for the life of the application. In your situation I would use
DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
String line = dis.readLine();
While DataInputStream.readLine() is deprecated, it could work for you if you are careful. Otherwise your only option is to read the bytes yourself and parse the text using the required encoding.
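If the deprecated readLine() bothers you, a hand-rolled byte-at-a-time line reader never reads past the delimiter, so the stream is left exactly where the library expects it. A minimal sketch, assuming '\n' line endings (optionally preceded by '\r') and an ASCII-compatible encoding such as UTF-8; the helper name is made up:
static String readRawLine(InputStream in, String charsetName) throws IOException {
    // Slow on an unbuffered FileInputStream, but it consumes nothing beyond the newline.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
        buf.write(b);
    }
    byte[] bytes = buf.toByteArray();
    if (b == -1 && bytes.length == 0) {
        return null; // end of stream
    }
    int len = bytes.length;
    if (len > 0 && bytes[len - 1] == '\r') {
        len--; // drop a trailing carriage return
    }
    return new String(bytes, 0, len, charsetName);
}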

How to force UTF-16 while reading/writing in Java?

I see that you can specify UTF-16 as the charset via Charset.forName("UTF-16"), and that you can create a new UTF-16 decoder via Charset.forName("UTF-16").newDecoder(), but I only see the ability to specify a CharsetDecoder on InputStreamReader's constructor.
So how do you specify UTF-16 when reading any stream in Java?
Input streams deal with raw bytes. When you read directly from an input stream, all you get is raw bytes where character sets are irrelevant.
The interpretation of raw bytes into characters, by definition, requires some sort of translation: how do I translate from raw bytes into a readable string? That "translation" comes in the form of a character set.
This "added" layer is implemented by Readers. Therefore, to read characters (rather than bytes) from a stream, you need to construct a Reader of some sort (depending on your needs) on top of the stream. For example:
InputStream is = ...;
Reader reader = new InputStreamReader(is, Charset.forName("UTF-16"));
This will cause reader.read() to read characters using the character set you specified. If you would like to read entire lines, use BufferedReader on top:
BufferedReader reader = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-16")));
String line = reader.readLine();
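The title also asks about writing; the symmetric move on output is an OutputStreamWriter constructed with the same Charset. A minimal sketch:
OutputStream os = ...;
Writer writer = new BufferedWriter(new OutputStreamWriter(os, Charset.forName("UTF-16")));
writer.write("some text");
writer.flush(); // flush (or close) so the buffered chars are actually encoded and written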

How to detect illegal UTF-8 byte sequences to replace them in java inputstream?

The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding).
I want to do my best to extract as much information as possible.
The file contains a few illegal byte sequences, and those should be replaced with the replacement character.
It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.
Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc
Is there something like that available (commercially or as free software)?
Thanks
-stephan
Solution:
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
A CharsetDecoder works on buffers rather than streams, so the simplest way to use it on a stream is to hand it to an InputStreamReader, as the solution above does. If you really need a filtered InputStream (bytes in, cleaned-up bytes out), you can re-encode the decoded characters and feed them through a java.io.PipedOutputStream/PipedInputStream pair.
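A minimal usage sketch of that idea, building on the solution's inputReader (it buffers everything in memory rather than using a real pipe): bad sequences come back as U+FFFD, and the cleaned text is re-encoded so downstream code can keep using an InputStream.
StringBuilder sb = new StringBuilder();
char[] chunk = new char[8192];
int n;
while ((n = inputReader.read(chunk)) != -1) {
    sb.append(chunk, 0, n);
}
InputStream cleaned = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));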
One way would be to read the first few bytes and check for a Byte Order Mark, if one exists. More information on BOMs: http://en.wikipedia.org/wiki/Byte_order_mark (the page has a table of the BOM bytes). One problem, however, is that UTF-8 does not require a BOM. Another way to solve the problem is pattern recognition, reading a few bytes at a time. Either way, this is the more complicated solution.
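A hedged sketch of the BOM check described above, using a PushbackInputStream so non-BOM bytes can be returned to the stream:
PushbackInputStream pin = new PushbackInputStream(istream, 3);
byte[] bom = new byte[3];
int n = pin.read(bom, 0, 3);
boolean hasUtf8Bom = n == 3
        && (bom[0] & 0xff) == 0xEF && (bom[1] & 0xff) == 0xBB && (bom[2] & 0xff) == 0xBF;
if (!hasUtf8Bom && n > 0) {
    pin.unread(bom, 0, n); // not a UTF-8 BOM, so push the bytes back
}
// read the rest of the stream from pin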
The behaviour you want is already the default for InputStreamReader. So there is no need to specify it yourself. This suffices:
final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
