Cloud Dataflow: reading entire text files rather than line by line - java

I'm looking for a way to read ENTIRE files so that every file is read in its entirety into a single String.
I want to pass a pattern of JSON text files such as gs://my_bucket/*/*.json and have a ParDo process each and every file in its entirety.
What's the best approach to this?

I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
FileBasedSource#isSplittable() you will want to override and return false. This will indicate that there is no intra-file splitting.
FileBasedSource#createForSubrangeOfFile(String, long, long) you will override to return a sub-source for just the file specified.
FileBasedSource#createSingleFileReader() you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
FileBasedReader#startReading(...) you will override to do nothing; the framework will already have opened the file for you, and it will close it.
FileBasedReader#readNextRecord() you will override to read the entire file as a single element.
[1] One easy special case is when you actually have a small number of files, you can expand them prior to job submission, and they all take the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
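For that special case, here is a rough sketch in the newer Beam style (expand() and readWholeFile() are hypothetical helpers, not SDK methods; expand() would list the matching paths at submission time, and readWholeFile() would read one path into a String):
List<String> files = expand("gs://my_bucket/*/*.json"); // hypothetical helper

PCollection<KV<String, String>> contents = p
    .apply(Create.of(files))
    .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            String path = c.element();
            // readWholeFile() is likewise hypothetical: read the path into one String
            c.output(KV.of(path, readWholeFile(path)));
        }
    }));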

I was looking for a similar solution myself. Following Kenn's recommendations and a few other references such as XMLSource.java, I created the following custom source, which seems to be working fine.
I am not a developer, so if anyone has suggestions on how to improve it, please feel free to contribute.
// Imports assume the Dataflow 1.x SDK (com.google.cloud.dataflow.sdk.*);
// adjust the package names for Apache Beam 2.x if needed.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.NoSuchElementException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.KvCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSource;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.util.CoderUtils;
import com.google.cloud.dataflow.sdk.values.KV;

public class FileIO {
    // Match TextIO.
    public static Read.Bounded<KV<String, String>> readFilepattern(String filepattern) {
        return Read.from(new FileSource(filepattern, 1));
    }

    public static class FileSource extends FileBasedSource<KV<String, String>> {
        private String filename = null;

        public FileSource(String fileOrPattern, long minBundleSize) {
            super(fileOrPattern, minBundleSize);
        }

        public FileSource(String filename, long minBundleSize, long startOffset, long endOffset) {
            super(filename, minBundleSize, startOffset, endOffset);
            this.filename = filename;
        }

        // This will indicate that there is no intra-file splitting.
        @Override
        public boolean isSplittable() {
            return false;
        }

        @Override
        public boolean producesSortedKeys(PipelineOptions options) throws Exception {
            return false;
        }

        @Override
        public void validate() {}

        @Override
        public Coder<KV<String, String>> getDefaultOutputCoder() {
            return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
        }

        @Override
        public FileBasedSource<KV<String, String>> createForSubrangeOfFile(String fileName, long start, long end) {
            return new FileSource(fileName, getMinBundleSize(), start, end);
        }

        @Override
        public FileBasedReader<KV<String, String>> createSingleFileReader(PipelineOptions options) {
            return new FileReader(this);
        }
    }

    /**
     * A reader that reads an entire file of text from a {@link FileSource}.
     */
    private static class FileReader extends FileBasedSource.FileBasedReader<KV<String, String>> {
        private static final Logger LOG = LoggerFactory.getLogger(FileReader.class);
        private ReadableByteChannel channel = null;
        private long nextOffset = 0;
        private long currentOffset = 0;
        private boolean isAtSplitPoint = false;
        private final ByteBuffer buf;
        private static final int BUF_SIZE = 1024;
        private KV<String, String> currentValue = null;
        private String filename;

        public FileReader(FileSource source) {
            super(source);
            buf = ByteBuffer.allocate(BUF_SIZE);
            buf.flip(); // start with an empty (fully drained) buffer
            this.filename = source.filename;
        }

        private int readFile(ByteArrayOutputStream out) throws IOException {
            int byteCount = 0;
            while (true) {
                if (!buf.hasRemaining()) {
                    buf.clear();
                    int read = channel.read(buf);
                    if (read < 0) {
                        break; // EOF
                    }
                    buf.flip();
                }
                byte b = buf.get();
                byteCount++;
                out.write(b);
            }
            return byteCount;
        }

        @Override
        protected void startReading(ReadableByteChannel channel) throws IOException {
            this.channel = channel;
        }

        @Override
        protected boolean readNextRecord() throws IOException {
            currentOffset = nextOffset;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int offsetAdjustment = readFile(buf);
            if (offsetAdjustment == 0) {
                // EOF
                return false;
            }
            nextOffset += offsetAdjustment;
            isAtSplitPoint = true;
            currentValue = KV.of(this.filename,
                    CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), buf.toByteArray()));
            return true;
        }

        @Override
        protected boolean isAtSplitPoint() {
            return isAtSplitPoint;
        }

        @Override
        protected long getCurrentOffset() {
            return currentOffset;
        }

        @Override
        public KV<String, String> getCurrent() throws NoSuchElementException {
            return currentValue;
        }
    }
}

A much simpler method is to generate the list of filenames and write a function to process each file individually. I'm showing Python, but Java is similar:
def generate_filenames():
    for shard in range(0, 300):
        yield 'gs://bucket/some/dir/myfilname-%05d-of-00300' % shard

with beam.Pipeline(...) as p:
    (p
     | beam.Create(generate_filenames())
     | beam.FlatMap(lambda filename: readfile(filename))
     | ...)

FileIO does that for you without the need to implement your own FileBasedSource.
Create matches for each of the files that you want to read:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt"))
Also, you can match like this if you do not want Dataflow to throw an exception when no file is found for your file pattern:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt").withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
Read your matches using FileIO:
.apply("Read file matches", FileIO.readMatches())
The above code returns a PCollection of the type FileIO.ReadableFile (PCollection<FileIO.ReadableFile>). Then you create a DoFn that processes these ReadableFiles to meet your use case.
.apply("Process my files", ParDo.of(MyCustomDoFnToProcessFiles.create()))
You can read the entire documentation for FileIO here.


How to write data in file in specific format with FileOutputStream and PrintWriter

I have a class Artical:
the first variable is the code of the article, the second is the name of the article, and the third is the price of the article.
public class Artical {
    private final String codeOfArtical;
    private final String nameOfArtical;
    private double priceOfArtical;

    public Artical(String codeOfArtical, String nameOfArtical, double priceOfArtical) {
        this.codeOfArtical = codeOfArtical;
        this.nameOfArtical = nameOfArtical;
        this.priceOfArtical = priceOfArtical;
    }

    public void setPriceOfArtical(double priceOfArtical) {
        this.priceOfArtical = priceOfArtical;
    }

    public String getCodeOfArtical() {
        return codeOfArtical;
    }

    public String getNameOfArtical() {
        return nameOfArtical;
    }

    public double getPriceOfArtical() {
        return priceOfArtical;
    }
}
In the main class, I want to write something like:
Artical a1 = new Artical("841740102156", "LG Monitor", 600.00);
new ShowArticalClass(a1).do();
new WriteArticalInFileClass(new File("baza.csv"), a1).do();
so that the data in the file is written in a format like this:
841740102156; Monitor LG; 600.00;
914918414989; Intel CPU; 250.00;
Those 2 classes, ShowArticalClass and WriteArticalInFileClass, aren't important; they are abstract classes.
So my question is: how do I format the output to look like this, with every Artical written on a new line?
A very naive implementation can be the following:
Create a class that in turn creates a CSVWriter (assuming you want to write to a CSV). That class will expose a public method allowing you to pass in the path where the desired CSV file lives, as well as the Artical object you want to write to that file. Using that class, you will format your data and write it to the file. An example of this could be:
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvWriter {
    private static final Object LOCK = new Object();
    private static CsvWriter writer;

    private CsvWriter() {}

    public static CsvWriter getInstance() {
        synchronized (LOCK) {
            if (null == writer) {
                writer = new CsvWriter();
            }
            return writer;
        }
    }

    public void writeCsv(String filePath, Artical content) throws IOException {
        try (var writer = createWriter(filePath)) {
            writer.append(getDataline(content)).append("\n");
        }
    }

    private String getDataline(Artical content) {
        // Swap "," for "; " here if you want the exact format from the question.
        return String.join(",", content.getCodeOfArtical(), content.getNameOfArtical(),
                Double.toString(content.getPriceOfArtical()));
    }

    private PrintWriter createWriter(String stringPath) throws IOException {
        var path = Paths.get(stringPath);
        try {
            if (Files.exists(path)) {
                System.out.printf("File under path %s exists. Will append to it%n", stringPath);
                return new PrintWriter(new FileWriter(path.toFile(), true));
            }
            return new PrintWriter(path.toFile());
        } catch (Exception e) {
            System.out.println("An error has occurred while writing to a file");
            throw e;
        }
    }
}
Note that this takes into account the case where the provided file is already in place (thus appending to it). In any other case the file will be created and written to directly.
Call this write method in a fashion similar to this:
public static void main(String... args) throws IOException {
    var artical = new Artical("1", "Test", 10.10);
    CsvWriter.getInstance().writeCsv("/tmp/test1.csv", artical);

    var artical2 = new Artical("2", "Test", 11.14);
    CsvWriter.getInstance().writeCsv("/tmp/test1.csv", artical2);
}
With that as a starting point, you can go ahead and modify the code to handle a list of Artical objects.
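For instance, a minimal sketch of such an overload (same assumptions as the class above):
public void writeCsv(String filePath, List<Artical> contents) throws IOException {
    try (var writer = createWriter(filePath)) {
        for (Artical artical : contents) {
            writer.append(getDataline(artical)).append("\n");
        }
    }
}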
If you really need to support CSV files, though, I would strongly recommend looking into the various CSV-related libraries that are out there instead of implementing your own code.

Treat only the first n items of a query [duplicate]

My application needs only a fixed number of records to be read and processed. How can I limit this if I am using a FlatFileItemReader?
In a DB-based ItemReader, I return null/an empty list when the max limit is reached.
How can I achieve the same if I am using an org.springframework.batch.item.file.FlatFileItemReader?
For the FlatFileItemReader as well as any other ItemReader that extends AbstractItemCountingItemStreamItemReader, there is a maxItemCount property. By configuring this property, the ItemReader will continue to read until either one of the following conditions has been met:
The input has been exhausted.
The number of items read equals the maxItemCount.
In either of the two above conditions, null will be returned by the reader, indicating to Spring Batch that the input is complete.
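For example, a minimal configuration sketch (MyRecord, the resource path, and myLineMapper() are placeholders for your own types):
@Bean
public FlatFileItemReader<MyRecord> reader() {
    FlatFileItemReader<MyRecord> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("input.csv")); // placeholder path
    reader.setLineMapper(myLineMapper());                    // placeholder mapper
    reader.setMaxItemCount(100); // read at most 100 items, then return null
    return reader;
}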
If you have any custom ItemReader implementations that need to satisfy this requirement, I'd recommend extending the AbstractItemCountingItemStreamItemReader and going from there.
The best approach is to write a delegate that tracks the number of records read and stops after a fixed count; the component should take care of the execution context to allow restartability:
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemStream;

public class CountMaxReader<T> implements ItemReader<T>, ItemStream {
    private int count = 0;
    private int max = 0;
    private ItemReader<T> delegate;

    public void setDelegate(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    public void setMax(int max) {
        this.max = max;
    }

    @Override
    public T read() throws Exception {
        T next = null;
        if (count < max) {
            next = delegate.read();
            ++count;
        }
        return next;
    }

    @Override
    public void open(ExecutionContext executionContext) {
        ((ItemStream) delegate).open(executionContext);
        // restore the count on restart
        count = executionContext.getInt("count", 0);
    }

    @Override
    public void close() {
        ((ItemStream) delegate).close();
    }

    @Override
    public void update(ExecutionContext executionContext) {
        ((ItemStream) delegate).update(executionContext);
        executionContext.putInt("count", count);
    }
}
This works with any reader.
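Wiring it up could look like this sketch (flatFileItemReader() is a placeholder for any reader that also implements ItemStream):
CountMaxReader<MyRecord> reader = new CountMaxReader<>();
reader.setDelegate(flatFileItemReader()); // placeholder delegate
reader.setMax(100); // stop after 100 records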
import org.springframework.batch.item.file.FlatFileItemReader;

public class CountMaxFlatFileItemReader extends FlatFileItemReader {
    private int counter;
    private int maxCount;

    public void setMaxCount(int maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public Object read() throws Exception {
        if (counter >= maxCount) {
            return null; // this will stop reading
        }
        counter++;
        return super.read();
    }
}
Something like this should work. The reader stops reading as soon as null is returned.

Best way to make Jersey 2.x refuse requests with incorrect Content-Length?

I need to make Jersey refuse requests with an incorrect content-length. I'm checking the content-length with a ContainerRequestFilter, like so:
import java.io.IOException;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.core.Response;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContentLengthRequiredRequestFilter implements ContainerRequestFilter {
    private static Logger LOG = LoggerFactory.getLogger(ContentLengthRequiredRequestFilter.class);

    @Override
    public void filter(ContainerRequestContext requestContext) throws IOException {
        if (javax.ws.rs.HttpMethod.POST.equals(requestContext.getMethod())
                || javax.ws.rs.HttpMethod.PUT.equals(requestContext.getMethod())) {
            int givenContentLength = requestContext.getLength();
            if (givenContentLength == -1) {
                // no content-length given, but it is required for PUT and POST requests
                requestContext.abortWith(Response.status(Response.Status.LENGTH_REQUIRED)
                        .entity("No content-length provided.").build());
            } else {
                // now check if the given content-length is actually correct.
                // since I only have a reference to an entity stream, it seems
                // that reading the entire stream and then resetting it is not
                // a good solution.
            }
        }
    }
}
As you can see in the code block comment, should I be checking this somewhere else, perhaps somewhere the entity is already available or where I can get the total size of the body without causing the stream to be read twice? Or is there a better way to get the body size there?
Thank you!!
If you don't want to buffer the incoming entity (input stream), I'd take a look at the ReaderInterceptor interface. Instances of this contract are invoked only when incoming requests contain an entity (typically POST, PUT). In the interceptor you can pretty much do anything with the entity. A simple snippet (not covering all cases) could look like:
import java.io.IOException;
import java.io.InputStream;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ReaderInterceptor;
import javax.ws.rs.ext.ReaderInterceptorContext;

public class MyInterceptor implements ReaderInterceptor {
    @Override
    public Object aroundReadFrom(final ReaderInterceptorContext context) throws IOException, WebApplicationException {
        final InputStream old = context.getInputStream();
        final String first = context.getHeaders().getFirst("Content-Length");
        final long declared = first == null ? -1L : Long.valueOf(first);

        context.setInputStream(new InputStream() {
            private long length = 0;
            private int mark = 0;

            @Override
            public int read() throws IOException {
                final int read = old.read();
                readAndCheck(read != -1 ? 1 : 0);
                return read;
            }

            @Override
            public int read(final byte[] b) throws IOException {
                final int read = old.read(b);
                readAndCheck(read != -1 ? read : 0);
                return read;
            }

            @Override
            public int read(final byte[] b, final int off, final int len) throws IOException {
                final int read = old.read(b, off, len);
                readAndCheck(read != -1 ? read : 0);
                return read;
            }

            @Override
            public long skip(final long n) throws IOException {
                final long skip = old.skip(n);
                readAndCheck(skip); // skip() never returns a negative value
                return skip;
            }

            @Override
            public int available() throws IOException {
                return old.available();
            }

            @Override
            public void close() throws IOException {
                old.close();
            }

            @Override
            public synchronized void mark(final int readlimit) {
                mark += readlimit;
                old.mark(readlimit);
            }

            @Override
            public synchronized void reset() throws IOException {
                this.length = 0;
                readAndCheck(mark);
                old.reset();
            }

            @Override
            public boolean markSupported() {
                return old.markSupported();
            }

            private void readAndCheck(final long read) {
                this.length += read;
                if (this.length > declared) {
                    throw new WebApplicationException(
                            Response.status(Response.Status.LENGTH_REQUIRED)
                                    .entity("Request body exceeds the declared Content-Length.")
                                    .build());
                }
            }
        });

        final Object entity = context.proceed();
        context.setInputStream(old);
        return entity;
    }
}
In the interceptor above I am setting my own input stream, which counts and checks the number of bytes read from the original input stream. This implementation, however, also depends on how the underlying container processes the input stream (i.e. whether it also checks content-length when reading the input stream).
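For completeness, registering the interceptor in a plain Jersey 2.x setup might look like this (the package name is a placeholder):
public class MyApplication extends ResourceConfig {
    public MyApplication() {
        packages("com.example.resources"); // placeholder resource package
        register(MyInterceptor.class);     // enables the content-length check
    }
}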

Dynamically run java code with Process

I have created a class that dynamically compiles, loads in a CustomClassLoader, and executes in-memory java source (i.e.: without class files) by invoking its main method.
I need to capture the StdOut, StdIn, and StdErr, although that's not possible in my current code. (Compiler API + Classloader + Reflection)
My requirements might be the same as asked in this question - and as suggested by the accepted answer - use java.lang.Process. This would be easier if I had physical files available in the file system, but I do not in this case.
I am planning to remove the Classloader + Reflection strategy and use the suggestion instead; however, I'm not familiar with actually redirecting the Std* streams using the Process class.
How can I do this in Java 7? (Snippets are highly appreciated) Or more importantly, is there a better approach?
Take a backup of the existing output stream:
PrintStream realSystemOut = System.out;
Set it to another output stream (a FileOutputStream, or some other stream):
PrintStream overridePrintStream = new PrintStream(new FileOutputStream("log.txt"));
System.setOut(overridePrintStream);
----- process -----
Then place the actual stream back into System.out:
System.setOut(realSystemOut);
Thanks
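Putting it together, a minimal runnable sketch that captures the output in memory and restores the original stream afterwards (runMainViaReflection() stands in for however you invoke the compiled class):
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class CaptureExample {
    public static void main(String[] args) throws Exception {
        PrintStream realSystemOut = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer));
        try {
            runMainViaReflection(); // placeholder for invoking the in-memory class's main
        } finally {
            System.setOut(realSystemOut); // always restore, even if main throws
        }
        System.out.println("captured: " + buffer.toString());
    }

    private static void runMainViaReflection() {
        // placeholder standing in for the dynamically compiled class's main method
        System.out.println("Hello from dynamic code");
    }
}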
Java allows you to supply your own PrintStream to override stdout and stderr, and an InputStream for stdin.
Personally, I don't like simply throwing away the original stream, because I tend to only want to redirect or parse it, not stop it (although you could do that as well).
Here is a simple example of the idea...
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;

public class RedirectStdOut {
    public static void main(String[] args) {
        Consumer stdConsumer = new Consumer() {
            @Override
            public void processLine(StreamCapturer capturer, String text) {
            }

            @Override
            public void processCharacter(StreamCapturer capturer, char character) {
                capturer.getParent().print(character);
            }
        };
        StreamCapturer stdout = new StreamCapturer(stdConsumer, System.out);
        StreamCapturer stderr = new StreamCapturer(stdConsumer, System.err);
        System.setOut(new PrintStream(stdout));
        System.setErr(new PrintStream(stderr));
        System.out.println("This is a test");
        System.err.println("This is an err");
    }

    public static interface Consumer {
        public void processLine(StreamCapturer capturer, String text);
        public void processCharacter(StreamCapturer capturer, char character);
    }

    public static class StreamCapturer extends OutputStream {
        private StringBuilder buffer;
        private Consumer consumer;
        private PrintStream parent;
        private boolean echo = false;

        public StreamCapturer(Consumer consumer, PrintStream parent) {
            buffer = new StringBuilder(128);
            this.parent = parent;
            this.consumer = consumer;
        }

        public PrintStream getParent() {
            return parent;
        }

        public boolean shouldEcho() {
            return echo;
        }

        public void setEcho(boolean echo) {
            this.echo = echo;
        }

        @Override
        public void write(int b) throws IOException {
            char c = (char) b;
            String value = Character.toString(c);
            buffer.append(value);
            if (value.equals("\n")) {
                // pass the buffered line (not just the newline) to the consumer
                consumer.processLine(this, buffer.toString());
                buffer.delete(0, buffer.length());
            }
            consumer.processCharacter(this, c);
            if (shouldEcho()) {
                parent.print(c);
            }
        }
    }
}
Now the StreamCapturer has the ability to echo the output if you want; I've turned it off to demonstrate the use of the Consumer. I would normally use the Consumer to process what is coming through the stream. Based on your needs, you can wait for the complete line or process the individual characters...

Determine file being used by FileHandler

I am creating a java.util.logging.FileHandler that is allowed to cycle through files. When multiple instances of my application are run, a new log file is created for each instance of the application. I need to know what file is being used by the application because I want to upload the log file to my servers for further review. How can I tell what file is being used by a certain FileHandler?
The easiest way is to put some kind of identifier in the file name itself, i.e. the pattern argument when you create the FileHandler. Since these are instances of the same application, one way to distinguish them is by their process id, so you could make that part of the pattern. A better approach is to pass in an identifier through the command line and use that to make your filename. That way you control the files being created in some sense. Finally, if your application has some knowledge of why it's different from all the others, for example it connects to a particular database server, then you could just use that database server name as part of the filename.
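For example, here is a short sketch of embedding the process id in the pattern (parsing the pid out of the runtime name is an assumption about the JVM's "pid@hostname" format):
// %t = temp dir, %g = generation number (standard FileHandler pattern escapes)
String runtimeName = java.lang.management.ManagementFactory.getRuntimeMXBean().getName();
String pid = runtimeName.split("@")[0]; // assumes "pid@hostname"
FileHandler handler = new FileHandler("%t/myapp-" + pid + "-%g.log", 1024 * 1024, 5);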
EDIT: There does not seem to be any API to get the name of the file being used by a FileHandler. I would suggest looking into the logging extensions in x4juli (which ports a lot of the log4j functionality to the java.util.logging specs):
http://www.x4juli.org/
You should be able to substitute an instance of their FileHandler which provides a getFile() method:
http://www.x4juli.org/api/org/x4juli/handlers/FileHandler.html
Actually, you could do this much more simply by extending FileHandler yourself. For example...
MyFileHandler.java:
import java.io.IOException;
import java.util.logging.FileHandler;

public class MyFileHandler extends FileHandler {
    protected String _MyFileHandler_Patern;

    public MyFileHandler(String pattern) throws IOException {
        super(pattern); // without this, FileHandler's default pattern would be used
        _MyFileHandler_Patern = pattern;
    }

    public String getMyFileHandlerPattern() {
        return _MyFileHandler_Patern;
    }
}
DeleteMe.java:
import java.io.IOException;
import java.util.logging.Handler;
import java.util.logging.Logger;

public class DeleteMe {
    public static void main(String[] args) throws IOException {
        Logger log = Logger.getLogger(DeleteMe.class.getName());
        MyFileHandler output = new MyFileHandler("output.log");
        log.addHandler(output);
        for (Handler handler : log.getHandlers()) {
            if (handler instanceof MyFileHandler) {
                MyFileHandler x = (MyFileHandler) handler;
                if ("output.log".equals(x.getMyFileHandlerPattern())) {
                    System.out.println("found handler writing to output.log");
                }
            }
        }
    }
}
OK, I do have to say that FileHandler not providing a way to determine the log file is seriously dumb.
I wound up writing a function called "chooseFile()" which searches /tmp for the next available log file name and returns that file. You can then pass the name of that file into new FileHandler().
/**
 * Utility: select a log file. File is created immediately to reserve
 * its name.
 */
static public File chooseFile(final String basename) throws IOException {
    final int nameLen = basename.length();
    File tmpDir = new File(System.getProperty("java.io.tmpdir"));
    String[] logs = tmpDir.list(new FilenameFilter() {
        public boolean accept(File d, String f) {
            return f.startsWith(basename);
        }
    });
    int count = 0;
    if (logs.length > 0) {
        for (String name : logs) {
            // atoi is the author's helper (not shown): parse the leading
            // digits of the suffix, returning 0 if there are none
            int n = atoi(name.substring(nameLen));
            if (n >= count) count = n + 1;
        }
    }
    String filename = String.format("%s%d.log", basename, count);
    File logFile = new File(tmpDir, filename);
    logFile.createNewFile();
    return logFile;
}
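Usage would then be along these lines:
File logFile = chooseFile("myapp-"); // e.g. /tmp/myapp-0.log
FileHandler handler = new FileHandler(logFile.getPath());
// the application now knows exactly which file to upload later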
Here's my rather hacky way around it. It works for the default if you don't use any format strings, and should work if you use the g and u format strings in filename, but not the others.
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FriendlyFileHandler extends FileHandler {

    /***
     * In order to ensure the most recent log file is the file this one owns,
     * we flush before checking the directory for the most recent file.
     *
     * But we must keep other log handlers from flushing in between and making
     * a NEW recent file.
     */
    private static Object[] flushLock = new Object[0];

    private String pattern;

    public FriendlyFileHandler(String pattern, int maxLogLengthInBytes, int count)
            throws IOException, SecurityException {
        super(pattern, maxLogLengthInBytes, count);
        this.pattern = pattern;
    }

    /***
     * Finds the most recent log file matching the pattern.
     * This is just a guess - if you have a complicated pattern
     * format it may not work.
     *
     * IMPORTANT: This log file is still in use. You must
     * removeHandler() on the logger first, .close() this handler,
     * then add a NEW handler to your logger. THEN, you can read
     * the file.
     *
     * Currently supported format strings: g, u
     *
     * @return A File of the current log file, or null on error.
     */
    public synchronized File getCurrentLogFile() {
        synchronized (flushLock) {
            // so the file has the most recent date on it.
            flush();

            final String patternRegex =
                    // handle incremental number formats ("\\\\d*" inserts the
                    // regex \d* into the replacement)
                    pattern.replaceAll("%[gu]", "\\\\d*") +
                    // handle default case where %g is appended to end
                    "(\\.\\d*)?$";
            final Pattern re = Pattern.compile(patternRegex);
            final Matcher matcher = re.matcher("");

            // check all files in the directory where this log would be
            final File basedir = new File(pattern).getParentFile();
            final File[] logs = basedir.listFiles(new FileFilter() {
                @Override
                public boolean accept(final File pathname) {
                    // only get files that are part of the pattern
                    matcher.reset(pathname.getAbsolutePath());
                    return matcher.find();
                }
            });

            return findMostRecentLog(logs);
        }
    }

    private File findMostRecentLog(File[] logs) {
        if (logs.length > 0) {
            long mostRecentDate = 0;
            int mostRecentIdx = 0;
            for (int i = 0; i < logs.length; i++) {
                final long d = logs[i].lastModified();
                if (d >= mostRecentDate) {
                    mostRecentDate = d;
                    mostRecentIdx = i;
                }
            }
            return logs[mostRecentIdx];
        } else {
            return null;
        }
    }

    @Override
    public synchronized void flush() {
        // only let one Handler flush at a time.
        synchronized (flushLock) {
            super.flush();
        }
    }
}
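The hand-off described in the javadoc could then be sketched as (logger and the pattern are placeholders):
File current = friendlyHandler.getCurrentLogFile();
logger.removeHandler(friendlyHandler);
friendlyHandler.close();
// ... read or upload `current` here ...
logger.addHandler(new FriendlyFileHandler("/tmp/app-%u.log", 1000000, 3));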
