My understanding is that this is a common scenario, but Java doesn't have a baked-in solution and I've been searching on and off for more than a day now. I have tried the CircularCharBuffer from the Ostermiller library, but that uses some sort of reader that constantly waits for new input, so I couldn't get readLine() to detect the end of the content (it would just hang).
So could someone tell me how I could do a conversion? For what it's worth, I'm converting multiple (potentially many) PDF files to raw text using the PDFBox lib. The PDFBox API puts the content onto a Writer, after which I need to get at the content for further processing (so BufferedReader/Writer is not actually essential, but I do need some kind of Reader/Writer). I know that this is possible using StringReader/Writer, but I'm not sure that it's efficient, plus I lose the readLine() method.
This is a bit like asking how to convert a pig into an elephant ... :-)
OK, there are two ways to address this problem (using the Java libraries):
You can capture the data written to a buffered writer so that it can then be read using a buffered reader (a sketch follows this list). Basically, you do this by:
using your BufferedWriter to write to a StringWriter or CharArrayWriter,
closing it,
extracting the resulting stuff from the SW / CAW as a String,
wrapping the String in a StringReader, and
wrapping the StringReader in a BufferedReader.
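For example, a minimal sketch of that first approach (inside a method that throws IOException; all classes from java.io):

StringWriter sw = new StringWriter();
BufferedWriter writer = new BufferedWriter(sw);
writer.write("line one");
writer.newLine();
writer.write("line two");
writer.close(); // flushes the buffer into the StringWriter

BufferedReader reader = new BufferedReader(new StringReader(sw.toString()));
String line;
while ((line = reader.readLine()) != null) {
    // process each line; readLine() returns null at end of content
}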
You can create a PipedReader / PipedWriter pair and wrap them with BufferedReader and BufferedWriter respectively (see the threaded sketch below).
The two approaches both have disadvantages:
The first one requires you to complete the writing before constructing the read side. That means you need space to hold the entire stream content in memory, and you can't do producer-side and consumer-side processing in parallel.
The second one requires you to produce and consume in separate threads ... or risk having the pipeline block permanently.
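And a minimal sketch of the second approach, with the producer on its own thread as just described (Java 8+ lambda for brevity; again inside a method that throws IOException):

PipedWriter pipeWriter = new PipedWriter();
BufferedReader reader = new BufferedReader(new PipedReader(pipeWriter));

new Thread(() -> {
    try (BufferedWriter writer = new BufferedWriter(pipeWriter)) {
        writer.write("produced on another thread");
        writer.newLine();
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();

String line;
while ((line = reader.readLine()) != null) {
    // consume; readLine() returns null once the producer closes its end
}
reader.close();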
Conceptually speaking, the Ostermiller library is really a reimplementation of PipedReader / PipedWriter. (And some of the advantages of his reimplementation were mooted in Java 1.6, which allows you to specify the pipe's buffer size. Mark support is interesting, but I can imagine some problems, depending on how you used it.)
You might also be able to find a PipedReader / PipedWriter replacement that uses a flexible buffer that grows and contracts as required. (At least ... this is conceptually possible.)
The CircularCharBuffer from the Ostermiller lib has two methods, getWriter() and getReader(), to get a reader on the content of a writer, and vice versa. The reason the reader was hanging at the final readLine() was that I wasn't calling close() on the writer after I had finished writing to it. So the final readLine() was waiting for new content on the writer that was never going to arrive.
The Ostermiller library can be found here.
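For completeness, a minimal sketch of the fixed version (assuming the com.Ostermiller.util package from that library, and the getWriter()/getReader() methods described above):

CircularCharBuffer ccb = new CircularCharBuffer();
Writer writer = ccb.getWriter();
writer.write("content from PDFBox goes here\n");
writer.close(); // without this, the final readLine() blocks forever

BufferedReader reader = new BufferedReader(ccb.getReader());
String line;
while ((line = reader.readLine()) != null) {
    // process each line; null marks the end because the writer was closed
}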
I'm trying to learn how to use DeflaterOutputStream as something to kill time during my winter break. I'm confused because when I look at the documentation https://docs.oracle.com/javase/7/docs/api/java/util/zip/DeflaterOutputStream.html, it says that deflate() is used to write compressed data to the underlying OutputStream, while write() is used to write data to the DeflaterOutputStream (the compressing OutputStream) to be compressed.
However, when I look at sample code on the internet, none of it uses deflate() at all. All the code I've seen so far just calls write() on the DeflaterOutputStream without ever calling deflate().
https://stackoverflow.com/a/13060441/12181863
https://www.programcreek.com/java-api-examples/?api=java.util.zip.DeflaterOutputStream
I noticed that the code wraps a FileOutputStream in the DeflaterOutputStream, but how do they interact? Does it automatically call deflate() to send compressed data to the FileOutputStream when data is written to the DeflaterOutputStream?
It's protected: it is intended for anything subclassing that stream, and you're not subclassing it, so as far as you are concerned it is an implementation detail you cannot include in your reasoning and which isn't meant for you to invoke.
Unless, of course, you subclass it.
Which you could - it's sort of a toolkit for building LZ-based compression streams on top of. That's why both GZIPOutputStream and ZipOutputStream extend it: those are different containers that more or less use the same compression technology, and they do invoke that deflate(). Unless you're developing your own LZ-based compression system or implementing a reader for an existing non-zip, non-gzip, non-deflate-based compression format, this is not meant for you.
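Purely to illustrate the visibility point, a hypothetical subclass (the protected fields def, buf, and out are inherited from DeflaterOutputStream and FilterOutputStream):

class CountingDeflaterStream extends DeflaterOutputStream {
    long compressedBytes; // how much compressed output went downstream

    CountingDeflaterStream(OutputStream out) {
        super(out);
    }

    @Override
    protected void deflate() throws IOException {
        // mirrors the default behavior, plus counting
        int len = def.deflate(buf, 0, buf.length);
        if (len > 0) {
            compressedBytes += len;
            out.write(buf, 0, len);
        }
    }
}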
These kinds of output streams are called 'filter streams': they do not themselves represent any resource, they wrap around one. They can wrap around any OutputStream (or any InputStream; the concept works on 'both sides', so to speak) and modify bytes in transit.
var out = new DeflaterOutputStream(whatever) creates a new deflater stream that will compress any data you send to it (via out.write(stuff)), and it will in turn take the compressed data and send it on to whatever. It does the job of:
take bytes (as per out.write) and buffer as much as is needed to compress them,
then process the compressed data, as it becomes available, by sending it to the wrapped output stream (whatever, in this example) by calling its write method.
The basic usage is:
Create a resource, such as Files.newOutputStream or someSocket.getOutputStream or httpServletResponse.getOutputStream() or System.out or anything else that produces a stream - it's an abstract concept for a reason: to make things flexible.
Wrap that resource into a DeflaterOutputStream
Write all your data to the DeflaterOutputStream. Forget about the original - you made it so you could pass it to the DeflaterOutputStream, and that's where your interaction with the underlying stream ends.
Close the deflater stream (which will end up closing the underlying stream as well).
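Put together, the whole pattern looks like this (the file name is just an example; classes from java.util.zip, java.nio.file, and java.nio.charset):

try (OutputStream fileOut = Files.newOutputStream(Paths.get("data.deflated"));
     DeflaterOutputStream out = new DeflaterOutputStream(fileOut)) {
    out.write("hello, compressed world".getBytes(StandardCharsets.UTF_8));
} // closing the deflater stream finishes the compression and closes fileOut

// and the mirror-image filter stream for reading it back:
try (InputStream in = new InflaterInputStream(
        Files.newInputStream(Paths.get("data.deflated")))) {
    byte[] original = in.readAllBytes(); // Java 9+
}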
Is it possible to create a "TransformerOutputStream", which extends the standard java.io.OutputStream, wraps a provided output stream and applies an XSL transformation? I can't find any combination of APIs which allows me to do this.
The key point is that, once created, the TransformerOutputStream may be passed to other APIs which accept a standard java.io.OutputStream.
Minimal usage would be something like:
java.io.InputStream in = getXmlInput();
java.io.OutputStream out = getTargetOutput();
javax.xml.transform.Templates templates = createReusableTemplates(); // could also use S9API
TransformerOutputStream tos = new TransformerOutputStream(out, templates); // extends OutputStream
com.google.common.io.ByteStreams.copy(in, tos);
// possibly flush/close tos if required by implementation
That's a JAXP example, but as I'm currently using Saxon an S9API solution would be fine too.
The main avenue I've pursued is along the lines of:
a class which extends java.io.OutputStream and implements org.xml.sax.ContentHandler
an XSL transformer based on an org.xml.sax.ContentHandler
But I can't find implementations of either of these, which seems to suggest that either no one else has ever tried to do this, there is some problem which makes it impractical, or my search skills just are not that good.
I can understand that with some templates an XML transformer may require access to the entire document and so a SAX content handler may provide no advantage, but there must also be simple transformations which could be applied to the stream as it passes through? This kind of interface would leave that decision up to the transformer implementation.
I have written and am currently using a class that provides this interface, but it just collects the output data in an internal buffer, then uses a standard JAXP StreamSource to read it on flush or close, so it ends up buffering the entire document.
You could make your TransformerOutputStream extend ByteArrayOutputStream, and its close() method could take the underlying byte[] array, wrap it in a ByteArrayInputStream, and invoke a transformation with the input taken from this InputStream.
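A minimal sketch of that buffering variant (class and constructor are made up to match the question's usage; JAXP classes from javax.xml.transform and its stream subpackage):

class TransformerOutputStream extends ByteArrayOutputStream {
    private final OutputStream target;
    private final Templates templates;

    TransformerOutputStream(OutputStream target, Templates templates) {
        this.target = target;
        this.templates = templates;
    }

    @Override
    public void close() throws IOException {
        try {
            // transform the buffered document and push the result downstream
            templates.newTransformer().transform(
                    new StreamSource(new ByteArrayInputStream(toByteArray())),
                    new StreamResult(target));
        } catch (TransformerException e) {
            throw new IOException(e);
        }
    }
}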
But it seems you also want to avoid putting the entire contents of the stream in memory. So let's assume that the transformation you want to apply is an XSLT 3.0 streamable transformation. Unfortunately, although Saxon as a streaming XSLT transformer operates largely in push mode (by "push" I mean that the data supplier invokes the data consumer, whereas "pull" means that the data consumer invokes the data supplier), the first stage, of reading and parsing the input, is always in pull mode -- I don't know of an XML parser to which you can push lexical XML input.
This means there's a push-pull conflict here. There are two solutions to a push-pull conflict. One is to buffer the data in memory (which is the ByteArrayOutputStream approach mentioned earlier). The other is to use two threads, with one writing to a shared buffer and the other reading from it. This can be achieved using a PipedOutputStream in the writing thread (https://docs.oracle.com/javase/8/docs/api/index.html?java/io/PipedOutputStream.html) and a PipedInputStream in the reading thread.
Caveat: I haven't actually tried this, but I see no reason why it shouldn't work.
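For the record, a sketch of that two-thread arrangement, reusing the templates and target output from the question's example:

PipedOutputStream pipeOut = new PipedOutputStream();
PipedInputStream pipeIn = new PipedInputStream(pipeOut);

new Thread(() -> {
    try {
        // consumer thread: pulls from the pipe and transforms as data arrives
        templates.newTransformer().transform(
                new StreamSource(pipeIn), new StreamResult(out));
    } catch (TransformerException e) {
        e.printStackTrace();
    }
}).start();

// hand pipeOut to whatever produces the XML (e.g. ByteStreams.copy(in, pipeOut)),
// then close it so the parser sees end-of-input and the transformation finishes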
Note that the topic of streaming in XSLT 3.0 is fairly complex; you will need to learn about it before you can make much progress here. I would start with Abel Braaksma's talk from XML London 2014: https://xmllondon.com/2014/presentations/braaksma
I'm working with a library that I have to provide with an InputStream and a PrintStream. It uses the InputStream to gather data for processing and the PrintStream to provide results. I'm stuck using this library and its API cannot be altered.
There are two issues with this that I think have related solutions.
First, the data that needs to be read via the InputStream is not available upfront. Instead, the data is dynamically created by a different part of the application and given to my code as a String via method call. My code's job is to somehow let the library read this data through the provided InputStream as I receive it.
Second, I need to somehow get the result that is written to the PrintStream and send it to another part of the application as a String. This needs to happen as soon after the data is written to the PrintStream as possible.
What it looks like I need are two stream objects that behave more or less like buffers. I need an InputStream that I can shove data into whenever I have it, and a PrintStream whose contents I can grab whenever it has some. This seems a little awkward to me, but I'm not sure how else to do it.
I'm wondering if anything already exists that allows this kind of behavior or if there is a different (better) solution that will work in the situation I've described. The only thing I can come up with is to try to implement streams with this behavior, but that can become complicated fast (especially since the InputStream needs to block until data is available).
Any ideas?
Edit: To be clear, I'm not writing the library. I'm writing code that is supposed to provide the library with an InputStream to read data from and a PrintStream to write data to.
Looks like both streams need to be constantly read from and written to, so you'll need two threads independent of each other. The pattern resembles JMS a little: you feed information to a "queue" or "topic", wait for it to be processed, then take it from an "output" queue/topic. This may introduce additional moving parts, but you could write a simple client to place info onto a JMS queue, have a listener grab messages and feed them to the input stream constantly, and then another piece of code to read from the output stream and do what you need with it.
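If JMS is heavier machinery than you want, the same shape can be sketched with the piped classes in java.io (names here are illustrative; the constructors throw IOException):

// feed side: the library blocks reading libraryIn until you push data
PipedInputStream libraryIn = new PipedInputStream();
PrintStream feeder = new PrintStream(new PipedOutputStream(libraryIn), true);
feeder.println(dataString); // call this each time the application hands you a String

// result side: the library writes to libraryOut; a thread drains the other end
PipedInputStream results = new PipedInputStream();
PrintStream libraryOut = new PrintStream(new PipedOutputStream(results), true);
new Thread(() -> {
    try (BufferedReader r = new BufferedReader(new InputStreamReader(results))) {
        String line;
        while ((line = r.readLine()) != null) {
            // pass each result line on to the rest of the application
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();

The library gets libraryIn and libraryOut; the pipes provide the blocking behavior for free, as long as each side runs on its own thread.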
Hope this helps.
Whenever I have to work with topics like file handling or socket programming, I have to look for sample code on the internet to see how the xxxStream, xxxReader, and xxxWriter classes are used.
I want to be able to use them on my own and know how they work.
How do I go about learning that?
The main idea is simple.
Streams are for binary read/write. Readers/Writers are for character read/write (in Java a byte is not a char, since char is Unicode). If it is possible to convert a binary stream into a character sequence, there is an appropriate reader for the stream.
For example, FileInputStream (which extends InputStream) reads a file as binary. If it is a text file you want to read, you wrap this object in an InputStreamReader (which extends Reader), providing a character set. Now you are able to read characters.
If you want to use readLine(), you need to wrap this reader in a BufferedReader.
Similarly with writers.
So, the idea is wrapping to get new abilities.
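For example, the classic chain for reading a text file line by line (the file name is just an example; StandardCharsets is from java.nio.charset, the rest from java.io):

BufferedReader reader = new BufferedReader(        // adds readLine()
        new InputStreamReader(                     // decodes bytes into characters
                new FileInputStream("notes.txt"),  // raw bytes from the file
                StandardCharsets.UTF_8));          // the character set
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();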
First of all, you have to learn and understand what streams are. If you don't understand the concepts behind them, you will always be copying code. So read the "Basic I/O" lesson of the Java tutorial: http://docs.oracle.com/javase/tutorial/essential/io/streams.html. A comprehensive presentation on this topic is this one from javapassion.com: http://www.javapassion.com/javase/javaiostream.pdf.
While reading, do as I usually tell my students: "write code and make mistakes" :-)
On this website you can find a variety of examples of how to write your own streams in Java: http://java.sun.com/developer/technicalArticles/Streams/WritingIOSC/
Just looking at the examples sometimes helps you much more than the explanations...
Please scroll to the middle and bottom of the page.
I want to write a program in Java with support for Unix pipelines. The problem is that my input files are images, and I need some way to separate them from one another.
I thought that this would be no problem, because I can read the InputStream using ImageIO.read() without resetting the position. But it isn't that simple: ImageIO.read() closes the stream every time an image is read, so I can't read more than one file from stdin. Do you have a solution for this?
The API for read() mentions, "This method does not close the provided InputStream after the read operation has completed; it is the responsibility of the caller to close the stream, if desired." You might also check the result for null and verify that a suitable ImageReader is available.
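So a loop along these lines should be possible (a sketch only; whether successive reads line up depends on the format's reader leaving the stream positioned at the start of the next image):

BufferedImage image; // javax.imageio.ImageIO, java.awt.image.BufferedImage
while ((image = ImageIO.read(System.in)) != null) {
    // process the image; null means no suitable reader or no further image data
}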