I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continuous InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP server. What I want to do is take file[1..N], open up streams, and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2, and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another file, but I would like to handle them as one continuous stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in the JDK! Quoting the JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate an arbitrary number of InputStreams, while the two-argument constructor of SequenceInputStream accepts only two. But since a SequenceInputStream is also an InputStream, you can apply it recursively (nest them):
new SequenceInputStream(
    new SequenceInputStream(
        new SequenceInputStream(file1, file2),
        file3
    ),
    file4
);
...you get the idea.
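Note that SequenceInputStream also has a constructor taking an Enumeration<? extends InputStream>, so arbitrarily many streams can be chained without nesting:

InputStream combined = new SequenceInputStream(
        Collections.enumeration(Arrays.asList(file1, file2, file3, file4)));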
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)

  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted
    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows:
val mergedStream = stream1++stream2++stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams and then create the sequence of input streams:
List<InputStream> iss;
try (Stream<Path> paths = Files.list(Paths.get("/your/path"))) { // Files.list returns a stream that should be closed
    iss = paths.filter(Files::isRegularFile)
            .map(f -> {
                try {
                    return new FileInputStream(f.toString());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            })
            .collect(Collectors.toList());
}
InputStream sequence = new SequenceInputStream(Collections.enumeration(iss));
Here is a more elegant solution using Vector; this example is Android-specific, but a Vector works in any Java environment:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
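To consume the combined stream, a minimal continuation might look like this:

while (scanner.hasNextLine()) {
    System.out.println(scanner.nextLine());
}
scanner.close(); // also closes the SequenceInputStream, which closes each remaining stream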
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
  new SequenceInputStream(streams.asJavaEnumeration)
Related
I saw a few discussions on this but couldn't quite understand the right solution:
I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(...)
        .withPrefix(...));
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey())); // repeat while objectListing.isTruncated()
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
    AmazonS3 s3Client = getAmazonS3Client(properties);
    S3Object object = s3Client.getObject(new GetObjectRequest(...));
    InputStream is = object.getObjectContent();
    List<String> lines = new LinkedList<>();
    String str;
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        if (is != null) {
            while ((str = reader.readLine()) != null) {
                lines.add(str);
            }
        } else {
            ...
        }
    } finally {
        ...
    }
    return lines.iterator();
}
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...), but I'm not sure which is the best-practice way.
The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a tree walk (as wildcard pattern matching of paths does).
The code in the post is doing the all-children listing, which delivers way better performance; it's essentially what ships with Hadoop 2.8 as s3a listFiles(path, recursive); see HADOOP-13208.
After getting that listing, you've got strings to object paths which you can then map to s3a/s3n paths for Spark to handle as text file inputs, and to which you can then apply work:
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
And as requested, here's the Java code used.
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
// repeat while objectListing truncated
JavaRDD<String> events = sc.textFile(String.join(",", keys));
Note that I switched s3n to s3a, because, provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one which gets maintained and tested against Spark workloads by people (me). See The history of Hadoop's S3 connectors.
You may use sc.textFile to read multiple files.
You can pass multiple file URLs as its argument.
You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards.
Ex:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Reference from this answer.
I guess if you try to parallelize while reading, AWS will be utilizing the executors and that will definitely improve performance.
val bucketName = xxx
val request = xxx // a ListObjectsRequest for the bucket
val s3 = new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "SecretKey"))
val df = sc
  .parallelize(s3.listObjects(request).getObjectSummaries.map(_.getKey).toList)
  .flatMap { key =>
    Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines
  }
I need to write something at a text file's beginning. I have a text file with content and I want to write something before this content. Say I have:
Good afternoon sir,how are you today?
I'm fine,how are you?
Thanks for asking,I'm great
After modifying,I want it to be like this:
Page 1-Scene 59
25.05.2011
Good afternoon sir,how are you today?
I'm fine,how are you?
Thanks for asking,I'm great
Just made up the content :) How can I modify a text file this way?
You can't really modify it that way - file systems don't generally let you insert data at arbitrary locations - but you can (a code sketch follows the steps):
Create a new file
Write the prefix to it
Copy the data from the old file to the new file
Move the old file to a backup location
Move the new file to the old file's location
Optionally delete the old backup file
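A minimal sketch of those steps using java.nio.file (the prefix is assumed to fit in memory; file and method names are illustrative, and no backup is kept):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public static void prepend(Path file, String prefix) throws IOException {
    // steps 1-3: create a new file, write the prefix, copy the old content after it
    Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "prepend", ".tmp");
    try (OutputStream out = Files.newOutputStream(tmp)) {
        out.write(prefix.getBytes(StandardCharsets.UTF_8));
        Files.copy(file, out);
    }
    // steps 4-6: move the new file into place, replacing the old one
    Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
}

For the question's example: prepend(Paths.get("dialogue.txt"), "Page 1-Scene 59\n25.05.2011\n");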
Just in case it will be useful for someone, here is the full source code of a method to prepend lines to a file using the Apache Commons IO library. The code does not read the whole file into memory, so it will work on files of any size.
public static void prependPrefix(File input, String prefix) throws IOException {
    LineIterator li = FileUtils.lineIterator(input);
    File tempFile = File.createTempFile("prependPrefix", ".tmp");
    BufferedWriter w = new BufferedWriter(new FileWriter(tempFile));
    try {
        w.write(prefix);
        while (li.hasNext()) {
            w.write(li.next());
            w.write("\n");
        }
    } finally {
        IOUtils.closeQuietly(w);
        LineIterator.closeQuietly(li);
    }
    FileUtils.deleteQuietly(input);
    FileUtils.moveFile(tempFile, input);
}
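For the question's example, a call might look like this (file name is hypothetical):

prependPrefix(new File("dialogue.txt"), "Page 1-Scene 59\n25.05.2011\n");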
I think what you want is random access. Check out the related Java tutorial. However, I don't believe you can just insert data at an arbitrary point in the file; if I recall correctly, you'd only overwrite the data. If you wanted to insert, you'd have to have your code:
1. copy a block,
2. overwrite with your new stuff,
3. copy the next block,
4. overwrite with the previously copied block,
5. return to step 3 until there are no more blocks.
As #atk suggested, java.nio.channels.SeekableByteChannel is a good interface. But it is available from 1.7 only.
Update: If you have no issue using FileUtils, then use:
String fileString = FileUtils.readFileToString(file);
This isn't a direct answer to the question, but often files are accessed via InputStreams. If this is your use case, then you can chain input streams via SequenceInputStream to achieve the same result. E.g.
InputStream inputStream = new SequenceInputStream(
        new ByteArrayInputStream("my line\n".getBytes()),
        new FileInputStream(new File("myfile.txt")));
I will leave it here just in case anyone needs it:
// Buffer the second file's bytes first, then the first file's bytes
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
try (FileInputStream fileInputStream1 = new FileInputStream(fileName1);
     FileInputStream fileInputStream2 = new FileInputStream(fileName2)) {
    while (fileInputStream2.available() > 0) {
        byteArrayOutputStream.write(fileInputStream2.read());
    }
    while (fileInputStream1.available() > 0) {
        byteArrayOutputStream.write(fileInputStream1.read());
    }
}
// Overwrite the first file with the combined content
try (FileOutputStream fileOutputStream = new FileOutputStream(fileName1)) {
    byteArrayOutputStream.writeTo(fileOutputStream);
}
I stole this snippet off the web. But it looks to be limited to 4096 bytes and is quite ugly IMO. Anyone know of a better approach? I'm actually using Groovy btw...
String streamToString(InputStream input) {
    StringBuffer out = new StringBuffer();
    byte[] b = new byte[4096];
    for (int n; (n = input.read(b)) != -1;) {
        out.append(new String(b, 0, n));
    }
    return out.toString();
}
EDIT:
I found a better solution in Groovy:
InputStream exportTemplateStream = getClass().getClassLoader().getResourceAsStream("export.template")
assert exportTemplateStream: "[export.template stream] resource not found"
String exportTemplate = exportTemplateStream.text
Some good and fast answers. However, I think the best one is that Groovy has added a getText() method to InputStream, so all I had to do was stream.text. And good call on the 4096 comment.
For Groovy
filePath = ... //< a FilePath object
stream = filePath.read() //< InputStream object
// Specify the encoding, and get the String object
//content = stream.getText("UTF-16")
content = stream.getText("UTF-8")
The InputStream class reference
getText() without an encoding argument uses the current system encoding, e.g. "UTF-8".
Try IOUtils from Apache Commons:
String s = IOUtils.toString(inputStream, "UTF-8");
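If you prefer a Charset constant over a charset name, Commons IO (2.3+) also has an overload taking a Charset:

String s = IOUtils.toString(inputStream, StandardCharsets.UTF_8);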
It's reading the input in chunks of 4096 bytes (4 KB), but the size of the actual string is not limited, as it keeps reading more and appending it to the StringBuffer.
You can do it fairly easily using the Scanner class:
String streamToString(InputStream input) {
    Scanner s = new Scanner(input);
    StringBuilder builder = new StringBuilder();
    while (s.hasNextLine()) {
        builder.append(s.nextLine() + "\n");
    }
    return builder.toString();
}
That snippet has a bug: if the input uses a multi-byte character encoding, there's a good chance that a single character will span two reads (and not be convertible). It also has the semi-bug that it relies on the platform's default encoding.
Instead, use Jakarta Commons IO. In particular, the version of IOUtils.toString() that takes an InputStream and applies an encoding to it.
For future reviewers who have similar problems, please note that both IOUtils from Apache and Groovy's InputStream.getText() method require the stream to complete or be closed before returning. If you are working with a persistent stream, you will need to deal with the "ugly" example that Phil originally posted, or work with non-blocking IO.
You can try something similar to this
new FileInputStream( new File("c:/tmp/file.txt") ).eachLine { println it }
I've got two text files that I want to grab as streams and convert to strings. Ultimately, I want to merge the two separate files.
So far, I've got
//get the input stream of the files.
InputStream is = cts.getClass().getResourceAsStream("/files/myfile.txt");
// convert the stream to string
System.out.println(cts.convertStreamToString(is));
getResourceAsStream doesn't take multiple strings as arguments. So what do I need to do? Separately convert them and merge together?
Can anyone show me a simple way to do that?
It sounds like you want to concatenate streams. You can use a SequenceInputStream to create a single stream from multiple streams. Then read the data from this single stream and use it as you need.
Here's an example:
String encoding = "UTF-8"; /* You need to know the right character encoding. */
InputStream s1 = ..., s2 = ..., s3 = ...;
Enumeration<InputStream> streams =
        Collections.enumeration(Arrays.asList(s1, s2, s3));
Reader r = new InputStreamReader(new SequenceInputStream(streams), encoding);
char[] buf = new char[2048];
StringBuilder str = new StringBuilder();
while (true) {
    int n = r.read(buf);
    if (n < 0)
        break;
    str.append(buf, 0, n);
}
r.close();
String contents = str.toString();
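For the question's two resource files, the source streams might be obtained like this (the second file name is a guess, since the question only names one):

InputStream s1 = cts.getClass().getResourceAsStream("/files/myfile.txt");
InputStream s2 = cts.getClass().getResourceAsStream("/files/myfile2.txt");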
You can utilize commons-io, which has the ability to read a Stream into a String:
http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html#toString%28java.io.InputStream%29
Offhand, I can think of a couple of ways:
Create a StringBuilder, then convert each stream to a string and append it to the StringBuilder.
Or, create a writable in-memory stream (e.g. a ByteArrayOutputStream), stream each input stream into it, and then convert that to a string.
Create a loop that loads the text of each file into a StringBuilder. Then, once each file's data is appended, call toString() on the builder (a minimal sketch follows).
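A minimal sketch of that loop, reusing the question's convertStreamToString helper (resource names are illustrative):

StringBuilder sb = new StringBuilder();
for (String name : new String[] { "/files/myfile.txt", "/files/myfile2.txt" }) {
    try (InputStream in = cts.getClass().getResourceAsStream(name)) {
        sb.append(cts.convertStreamToString(in));
    }
}
String merged = sb.toString();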
I would like to create a simple program (in Java) which edits text files - particularly one which inserts arbitrary pieces of text at random positions in a text file. This feature is part of a larger program I am currently writing.
Reading the description of java.io.RandomAccessFile, it appears that any write operations performed in the middle of a file would actually overwrite the existing content. This is a side effect which I would like to avoid (if possible).
Is there a simple way to achieve this?
Thanks in advance.
Okay, this question is pretty old, but FileChannels have existed since Java 1.4 and I don't know why they aren't mentioned anywhere when dealing with the problem of replacing or inserting content in files. FileChannels are fast; use them.
Here's an example (ignoring exceptions and some other stuff):
public void insert(String filename, long offset, byte[] content) {
    RandomAccessFile r = new RandomAccessFile(new File(filename), "rw");
    RandomAccessFile rtemp = new RandomAccessFile(new File(filename + "~"), "rw");
    long fileSize = r.length();
    FileChannel sourceChannel = r.getChannel();
    FileChannel targetChannel = rtemp.getChannel();
    sourceChannel.transferTo(offset, (fileSize - offset), targetChannel);
    sourceChannel.truncate(offset);
    r.seek(offset);
    r.write(content);
    long newOffset = r.getFilePointer();
    targetChannel.position(0L);
    sourceChannel.transferFrom(targetChannel, newOffset, (fileSize - offset));
    sourceChannel.close();
    targetChannel.close();
}
Well, no, I don't believe there is a way to avoid overwriting existing content with a single, standard Java IO API call.
If the files are not too large, just read the entire file into an ArrayList (one entry per line) and either rewrite entries or insert new entries for new lines.
Then overwrite the existing file with new content, or move the existing file to a backup and write a new file.
Depending on how sophisticated the edits need to be, your data structure may need to change.
Another method would be to read characters from the existing file while writing to the edited file and edit the stream as it is read.
Java does have a way to memory-map files (FileChannel.map), so what you can do is extend the file to its new length, map the file, memmove all the bytes down to the end to make a hole, and write the new data into the hole.
This works in C. Never tried it in Java.
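For reference, a rough, untested sketch of that idea in Java (limited to files under 2 GB, since MappedByteBuffer indices are ints, and copying byte by byte for clarity):

static void insertViaMap(String filename, long offset, byte[] content) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile(filename, "rw");
         FileChannel ch = raf.getChannel()) {
        long oldLen = ch.size();
        // mapping beyond the current size extends the file to the new length
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, oldLen + content.length);
        // "memmove" the tail toward the end, copying backward so the regions don't clobber each other
        for (long i = oldLen - 1; i >= offset; i--) {
            map.put((int) (i + content.length), map.get((int) i));
        }
        // write the new data into the hole
        for (int j = 0; j < content.length; j++) {
            map.put((int) offset + j, content[j]);
        }
        map.force(); // flush changes to disk
    }
}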
Another way I just thought of to do the same, but with random file access (sketched in code after the steps):
Seek to the end - 1 MB
Read 1 MB
Write that to original position + gap size.
Repeat for each previous 1 MB working toward the beginning of the file.
Stop when you reach the desired gap position.
Use a larger buffer size for faster performance.
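A hedged sketch of those steps with RandomAccessFile (names like gapPosition and gapSize are illustrative; writing past the old end extends the file):

static void openGap(RandomAccessFile f, long gapPosition, long gapSize) throws IOException {
    byte[] buf = new byte[1024 * 1024]; // 1 MB blocks, as suggested; a larger buffer is faster
    long readPos = f.length();
    while (readPos > gapPosition) {
        int chunk = (int) Math.min(buf.length, readPos - gapPosition);
        readPos -= chunk;
        f.seek(readPos);            // seek back one block
        f.readFully(buf, 0, chunk); // read it
        f.seek(readPos + gapSize);  // write it at its original position + gap size
        f.write(buf, 0, chunk);
    }
    // the caller can now seek(gapPosition) and write gapSize bytes of new data
}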
You can use the following code:
BufferedReader reader = null;
BufferedWriter writer = null;
ArrayList<String> list = new ArrayList<>();

try {
    reader = new BufferedReader(new FileReader(fileName));
    String tmp;
    while ((tmp = reader.readLine()) != null)
        list.add(tmp);
    OUtil.closeReader(reader);

    list.add(0, "Start Text");
    list.add("End Text");

    writer = new BufferedWriter(new FileWriter(fileName));
    for (int i = 0; i < list.size(); i++)
        writer.write(list.get(i) + "\r\n");
} catch (Exception e) {
    e.printStackTrace();
} finally {
    OUtil.closeReader(reader);
    OUtil.closeWriter(writer);
}
I don't know if there's a handy way to do it directly, other than to do the following (a sketch appears after this answer):
read the beginning of the file and write it to target
write your new text to target
read the rest of the file and write it to target.
About the target: you can construct the new contents of the file in memory and then overwrite the old content of the file if the files handled aren't too big. Or you can write the result to a temporary file.
The thing would probably be easiest to do with streams; RandomAccessFile doesn't seem to be meant for inserting in the middle (AFAIK). Check the tutorial if you need to.
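A rough sketch of those three steps with streams (InputStream.transferTo needs Java 9+; names are illustrative):

static void insertAt(Path source, Path target, long offset, byte[] text) throws IOException {
    try (InputStream in = Files.newInputStream(source);
         OutputStream out = Files.newOutputStream(target)) {
        byte[] buf = new byte[8192];
        long remaining = offset;
        while (remaining > 0) { // 1. read the beginning of the file and write it to target
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) break;
            out.write(buf, 0, n);
            remaining -= n;
        }
        out.write(text);    // 2. write the new text to target
        in.transferTo(out); // 3. read the rest of the file and write it to target
    }
}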
I believe the only way to insert text into an existing text file is to read the original file and write the content in a temporary file with the new text inserted. Then erase the original file and rename the temporary file to the original name.
This example is focused on inserting a single line into an existing file, but it still may be of use to you.
If it is a text file, read the existing file into a StringBuffer and append the new content to the same StringBuffer. Now you can write the StringBuffer to the file, so the file contains both the existing and new text.
As @xor_eq's answer's edit queue is full, here in a new answer is a more documented and slightly improved version of his:
public static void insert(String filename, long offset, byte[] content) throws IOException {
    File temp = Files.createTempFile("insertTempFile", ".temp").toFile(); // Create a temporary file to save content to
    try (RandomAccessFile r = new RandomAccessFile(new File(filename), "rw"); // Open file for read & write
         RandomAccessFile rtemp = new RandomAccessFile(temp, "rw"); // Open temporary file for read & write
         FileChannel sourceChannel = r.getChannel(); // Channel of file
         FileChannel targetChannel = rtemp.getChannel()) { // Channel of temporary file
        long fileSize = r.length();
        sourceChannel.transferTo(offset, (fileSize - offset), targetChannel); // Copy content after insert index to temporary file
        sourceChannel.truncate(offset); // Remove content past insert index from file
        r.seek(offset); // Go to back of file (now insert index)
        r.write(content); // Write new content
        long newOffset = r.getFilePointer(); // The current offset
        targetChannel.position(0L); // Go to start of temporary file
        sourceChannel.transferFrom(targetChannel, newOffset, (fileSize - offset)); // Copy all content of temporary file to end of file
    }
    Files.delete(temp.toPath()); // Delete the temporary file as it is not needed anymore
}