My application needs only a fixed number of records to be read and processed. How can I limit this if I am using a FlatFileItemReader?
In a DB-based ItemReader, I return null/an empty list when the max limit is reached.
How can I achieve the same if I am using an org.springframework.batch.item.file.FlatFileItemReader?
For the FlatFileItemReader, as well as any other ItemReader that extends AbstractItemCountingItemStreamItemReader, there is a maxItemCount property. With this property configured, the ItemReader will continue to read until one of the following conditions is met:
The input has been exhausted.
The number of items read equals the maxItemCount.
In either case, the reader returns null, indicating to Spring Batch that the input is complete.
If you have any custom ItemReader implementations that need to satisfy this requirement, I'd recommend extending the AbstractItemCountingItemStreamItemReader and going from there.
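For example, a minimal Java-config sketch (the resource path, the MyRecord type, the line mapper and the limit of 100 are illustrative placeholders, not from the original question):

@Bean
public FlatFileItemReader<MyRecord> limitedReader() {
    FlatFileItemReader<MyRecord> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("input/records.csv")); // placeholder path
    reader.setLineMapper(myRecordLineMapper());                      // assumed to be defined elsewhere
    reader.setMaxItemCount(100); // read() returns null after 100 items, even if the file has more
    return reader;
}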
The best approach is to write a delegate that tracks the number of records read and stops after a fixed count; the component should also take care of the execution context to allow restartability.
public class CountMaxReader<T> implements ItemReader<T>, ItemStream {

    private int count = 0;
    private int max = 0;
    private ItemReader<T> delegate;

    public void setDelegate(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    public void setMax(int max) {
        this.max = max;
    }

    @Override
    public T read() throws Exception {
        T next = null;
        if (count < max) {
            next = delegate.read();
            ++count;
        }
        return next;
    }

    @Override
    public void open(ExecutionContext executionContext) {
        ((ItemStream) delegate).open(executionContext);
        count = executionContext.getInt("count", 0);
    }

    @Override
    public void close() {
        ((ItemStream) delegate).close();
    }

    @Override
    public void update(ExecutionContext executionContext) {
        ((ItemStream) delegate).update(executionContext);
        executionContext.putInt("count", count);
    }
}
This works with any reader.
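For example, a hedged wiring sketch (the bean names, the wrapped FlatFileItemReader and the limit of 100 are illustrative, not from the original answer):

@Bean
public CountMaxReader<MyRecord> countMaxReader(FlatFileItemReader<MyRecord> fileReader) {
    CountMaxReader<MyRecord> reader = new CountMaxReader<>();
    reader.setDelegate(fileReader); // any ItemReader that also implements ItemStream
    reader.setMax(100);             // stop after 100 records
    return reader;
}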
public class CountMaxFlatFileItemReader extends FlatFileItemReader<Object> {

    private int counter;
    private int maxCount;

    public void setMaxCount(int maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public Object read() throws Exception {
        counter++;
        if (counter > maxCount) {
            return null; // returning null stops the step once maxCount items have been read
        }
        return super.read();
    }
}
Something like this should work. The reader stops reading as soon as null is returned.
In the following code, tick emits a new object every three seconds. I'm trying to count the number of emitted objects every second using groupedWithin (which ignores empty groups). Is there any way in Akka Streams for the following code to print 0 in periods when tick does not emit any objects?
Source.tick(Duration.ZERO, Duration.ofSeconds(3), new Object())
.groupedWithin(Integer.MAX_VALUE, Duration.ofSeconds(1))
.map(List::size)
.runWith(Sink.foreach(e -> System.out.println(e)), materializer);
In other words, I'd like the output of this code to be this sequence: 1 0 0 1 0 0 1 ... (every second) instead of 1 1 1 ... (every three seconds).
EDIT: This is the best workaround I have come up with so far (using keepAlive to send some special objects if the upstream is idle):
Source.tick(Duration.ZERO, Duration.ofSeconds(3), new Object())
.keepAlive(Duration.ofSeconds(1), KeepAliveElement::new)
.groupedWithin(Integer.MAX_VALUE, Duration.ofSeconds(1))
.map(lst -> lst.stream().filter(e -> !(e instanceof KeepAliveElement)).collect(Collectors.toList()))
.map(List::size)
.runWith(Sink.foreach(e -> System.out.println(e)), materializer);
Is there any better way to do this?
I thought this would be of normal difficulty; I was wrong. One thing I wanted to ensure is that the flow counting the items passing through the stream does not keep a reference to each item it sees: if many items arrive in an aggregation period, you end up with an unnecessarily large list in memory (even if only for a second) and pay the cost of appending (many) items to it. The following solution, although complex, keeps only a counter.
NOTE: Although I tested the happy scenario, I cannot say this is battle-proven, so use with caution!
Based on Akka's GroupedWeightedWithin and the documentation here:
public class CountInPeriod<T> extends GraphStage<FlowShape<T, Integer>> {
public Inlet<T> in = Inlet.<T>create("CountInPeriod.in");
public Outlet<Integer> out = Outlet.<Integer>create("CountInPeriod.out");
private FlowShape<T, Integer> shape = FlowShape.of(in, out);
private Duration duration;
public CountInPeriod(Duration duration) {
this.duration = duration;
}
@Override
public GraphStageLogic createLogic(Attributes inheritedAttributes) {
return new TimerGraphStageLogic(shape) {
private int counter = 0;
private int bufferPushCounter = -1;
{
setHandler(in, new AbstractInHandler() {
@Override public void onPush() throws Exception {
grab(in);
counter++;
pull(in);
}
});
setHandler(out, new AbstractOutHandler() {
@Override public void onPull() throws Exception {
if (bufferPushCounter >= 0) {
push(out, bufferPushCounter);
bufferPushCounter = -1;
}
}
});
}
@Override
public void preStart() throws Exception {
scheduleWithFixedDelay(CountInPeriod.class, duration, duration);
pull(in);
}
@Override
public void onTimer(Object timerKey) throws Exception {
if (isAvailable(out)) emitCounter();
else bufferPush();
}
private void emitCounter() {
push(out, counter);
counter = 0;
bufferPushCounter = -1;
}
private void bufferPush() {
bufferPushCounter = counter;
counter = 0;
}
};
}
@Override
public FlowShape<T, Integer> shape() {
return shape;
}
}
Test code:
public class GroupTicked {
final static ActorSystem as = ActorSystem.create("as");
public static void main(String... args) throws Exception {
CompletionStage<Done> done = Source.tick(Duration.ZERO, Duration.ofSeconds(3), new Object())
.take(7) // to finish in finite time...
.via(new CountInPeriod<>(Duration.ofSeconds(1)))
.runWith(Sink.foreach(e -> System.out.println(System.currentTimeMillis() + " -> " + e)), as);
done.thenAccept(x -> as.terminate());
}
}
I'm quite new to Flink stream processing. Here is my requirement:
Alert the user when 2 or more elements are received in the last 20 seconds. If fewer than 2 elements are received in 20 seconds, don't alert; just restart the count and the timer.
The count and interval vary for each element.
Here's my code:
dataStream
.keyBy("id")
.window(EventTimeSessionWindows.withDynamicGap((event) -> event.getThresholdInterval()))
.trigger(new CountTriggerWithTimeout<TimeWindow>())
TriggerCode:
public class CountTriggerWithTimeout<W extends TimeWindow> extends Trigger<SystemEvent, W> {
private ReducingStateDescriptor<Long> countState =
new ReducingStateDescriptor<Long>("count", new Sum(), LongSerializer.INSTANCE);
private ReducingStateDescriptor<Long> processedState =
new ReducingStateDescriptor<Long>("processed", new Sum(), LongSerializer.INSTANCE);
@Override
public TriggerResult onElement(SystemEvent element, long timestamp, W window, TriggerContext ctx)
throws Exception {
ReducingState<Long> count = ctx.getPartitionedState(countState);
ReducingState<Long> processed = ctx.getPartitionedState(processedState);
count.add(1L);
processed.add(0L);
if (count.get() >= element.getThresholdCount() && processed.get() == 0) {
processed.add(1L);
return TriggerResult.FIRE_AND_PURGE;
}
if (timestamp >= window.getEnd()) {
return TriggerResult.PURGE;
}
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public void clear(W window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(countState).clear();
ctx.getPartitionedState(processedState).clear();
}
@Override
public boolean canMerge() {
return true;
}
class Sum implements ReduceFunction<java.lang.Long> {
@Override
public Long reduce(Long value1, Long value2) throws Exception {
return value1 + value2;
}
}
}
Earlier when I was using
dataStream
.timeWindow(Time.seconds(1))
.trigger(new CountTriggerWithTimeout<TimeWindow>())
everything was working perfectly fine. Since there is a requirement to read the window duration from the element, I started using EventTimeSessionWindows and added the canMerge() function to the trigger. Since then, nothing works: clear() is never invoked, nor are onProcessingTime() and onEventTime(), and I see that timestamp is always set to the same value, irrespective of when the element was received.
My requirement is to "fire & purge" when count >= threshold within event.getThresholdInterval(). If count < threshold within event.getThresholdInterval(), then purge, i.e. invoke clear() to reset the count and state and restart. Is there a way to achieve this with timeWindow instead of EventTimeSessionWindows?
Please help me fix this issue.
Thanks...
Why don't you use a simple tumbling window of 20 seconds and count the elements in it:
source
.keyBy("id")
.timeWindow(Time.seconds(20))
.process(new ProcessWindowFunction<Tuple2<String, Integer>, String, Tuple, TimeWindow>() {
@Override
public void process(Tuple key, ProcessWindowFunction<Tuple2<String, Integer>, String, Tuple, TimeWindow>.Context ctx,
Iterable<Tuple2<String, Integer>> in, Collector<String> out) throws Exception {
if (Lists.newArrayList(in).size() >= 2) {
out.collect("Two or more elements between "
+ Instant.ofEpochMilli(ctx.window().getStart())
+ " " + Instant.ofEpochMilli(ctx.window().getEnd()));
}
}
})
In a Spring-based application I have a service which performs the calculation of some Index. Index is relatively expensive to calculate (say, 1s) but relatively cheap to check for actuality (say, 20ms). Actual code does not matter, it goes along the following lines:
public Index getIndex() {
return calculateIndex();
}
public Index calculateIndex() {
// 1 second or more
}
public boolean isIndexActual(Index index) {
// 20ms or less
}
I'm using Spring Cache to cache the calculated index via #Cacheable annotation:
@Cacheable(cacheNames = CacheConfiguration.INDEX_CACHE_NAME)
public Index getIndex() {
return calculateIndex();
}
We currently configure GuavaCache as cache implementation:
@Bean
public Cache indexCache() {
return new GuavaCache(INDEX_CACHE_NAME, CacheBuilder.newBuilder()
.expireAfterWrite(indexCacheExpireAfterWriteSeconds, TimeUnit.SECONDS)
.build());
}
@Bean
public CacheManager indexCacheManager(List<Cache> caches) {
SimpleCacheManager cacheManager = new SimpleCacheManager();
cacheManager.setCaches(caches);
return cacheManager;
}
What I also need is to check if cached value is still actual and refresh it (ideally asynchronously) if it is not. So ideally it should go as follows:
When getIndex() is called, Spring checks if there is a value in the cache.
If not, new value is loaded via calculateIndex() and stored in the cache
If yes, the existing value is checked for actuality via isIndexActual(...).
If old value is actual, it is returned.
If old value is not actual, it is returned, but removed from the cache and loading of the new value is triggered as well.
Basically I want to serve the value from the cache very fast (even if it is obsolete) but also trigger refreshing right away.
What I've got working so far is checking for actuality and eviction:
@Cacheable(cacheNames = INDEX_CACHE_NAME)
@CacheEvict(cacheNames = INDEX_CACHE_NAME, condition = "target.isObsolete(#result)")
public Index getIndex() {
return calculateIndex();
}
This check triggers eviction if the result is obsolete, and the old value is still returned immediately even in that case. But it does not refresh the value in the cache.
Is there a way to configure Spring Cache to actively refresh obsolete values after eviction?
Update
Here's a MCVE.
public static class Index {
private final long timestamp;
public Index(long timestamp) {
this.timestamp = timestamp;
}
public long getTimestamp() {
return timestamp;
}
}
public interface IndexCalculator {
public Index calculateIndex();
public long getCurrentTimestamp();
}
@Service
public static class IndexService {
@Autowired
private IndexCalculator indexCalculator;
@Cacheable(cacheNames = "index")
@CacheEvict(cacheNames = "index", condition = "target.isObsolete(#result)")
public Index getIndex() {
return indexCalculator.calculateIndex();
}
public boolean isObsolete(Index index) {
if (index == null) {
return true;
}
long indexTimestamp = index.getTimestamp();
long currentTimestamp = indexCalculator.getCurrentTimestamp();
return indexTimestamp < currentTimestamp;
}
}
Now the test:
@Test
public void test() {
final Index index100 = new Index(100);
final Index index200 = new Index(200);
when(indexCalculator.calculateIndex()).thenReturn(index100);
when(indexCalculator.getCurrentTimestamp()).thenReturn(100L);
assertThat(indexService.getIndex()).isSameAs(index100);
verify(indexCalculator).calculateIndex();
verify(indexCalculator).getCurrentTimestamp();
when(indexCalculator.getCurrentTimestamp()).thenReturn(200L);
when(indexCalculator.calculateIndex()).thenReturn(index200);
assertThat(indexService.getIndex()).isSameAs(index100);
verify(indexCalculator, times(2)).getCurrentTimestamp();
// I'd like to see indexCalculator.calculateIndex() called after
// indexService.getIndex() returns the old value but it does not happen
// verify(indexCalculator, times(2)).calculateIndex();
assertThat(indexService.getIndex()).isSameAs(index200);
// Instead, indexCalculator.calculateIndex() os called on
// the next call to indexService.getIndex()
// I'd like to have it earlier
verify(indexCalculator, times(2)).calculateIndex();
verify(indexCalculator, times(3)).getCurrentTimestamp();
verifyNoMoreInteractions(indexCalculator);
}
I'd like to have the value refreshed shortly after it was evicted from the cache. At the moment it is refreshed only on the next call to getIndex(). If the value had been refreshed right after eviction, that would save me 1s later on.
I've tried @CachePut, but it also does not give me the desired effect. The value is refreshed, but the method is always executed, no matter what condition or unless say.
The only way I see at the moment is to call getIndex() twice (the second time async/non-blocking). But that's kind of stupid.
I would say the easiest way of doing what you need is to create a custom Aspect which will do all the magic transparently and which can be reused in more places.
So assuming you have spring-aop and aspectj dependencies on your class path the following aspect will do the trick.
@Aspect
@Component
public class IndexEvictorAspect {
@Autowired
private Cache cache;
@Autowired
private IndexService indexService;
private final ReentrantLock lock = new ReentrantLock();
@AfterReturning(pointcut = "execution(* hello.IndexService.getIndex(..))", returning = "index")
public void afterGetIndex(Object index) {
if(indexService.isObsolete((Index) index) && lock.tryLock()){
try {
Index newIndex = indexService.calculateIndex();
cache.put(SimpleKey.EMPTY, newIndex);
} finally {
lock.unlock();
}
}
}
}
Several things to note
As your getIndex() method does not have parameters, its result is stored in the cache under the key SimpleKey.EMPTY.
The code assumes that IndexService is in the hello package.
Something like the following could refresh the cache in the desired way and keep the implementation simple and straightforward.
There is nothing wrong with writing clear and simple code, provided it satisfies the requirements.
@Service
public static class IndexService {
@Autowired
private IndexCalculator indexCalculator;
public Index getIndex() {
Index cachedIndex = getCachedIndex();
if (isObsolete(cachedIndex)) {
evictCache();
asyncRefreshCache();
}
return cachedIndex;
}
@Cacheable(cacheNames = "index")
public Index getCachedIndex() {
return indexCalculator.calculateIndex();
}
public void asyncRefreshCache() {
CompletableFuture.runAsync(this::getCachedIndex);
}
@CacheEvict(cacheNames = "index")
public void evictCache() { }
public boolean isObsolete(Index index) {
if (index == null) {
return true;
}
long indexTimestamp = index.getTimestamp();
long currentTimestamp = indexCalculator.getCurrentTimestamp();
return indexTimestamp < currentTimestamp;
}
}
EDIT1:
The caching abstraction based on @Cacheable and @CacheEvict will not work in this case. Their behaviour is as follows: on a @Cacheable call, if the value is in the cache it is returned from the cache, otherwise it is computed, put into the cache and then returned; on a @CacheEvict call the value is removed from the cache, so from that moment there is no value in the cache, and the next incoming @Cacheable call will force recalculation and repopulation of the cache. Using @CacheEvict(condition="...") only checks, during that call, whether the value should be removed from the cache. So after each invalidation the @Cacheable method will run the heavyweight routine to populate the cache.
To have the value stored in the cache manager and updated asynchronously, I would propose to reuse the following routine:
@Inject
@Qualifier("my-configured-caching")
private Cache cache;

private ReentrantLock lock = new ReentrantLock();

public Index getIndex() {
    Index storedCache;
    synchronized (this) {
        storedCache = cache.get("singleKey_Or_AnythingYouWant", Index.class);
        if (storedCache == null) {
            this.lock.lock();
            storedCache = indexCalculator.calculateIndex();
            this.cache.put("singleKey_Or_AnythingYouWant", storedCache);
            this.lock.unlock();
        }
    }
    if (isObsolete(storedCache)) {
        if (!lock.isLocked()) {
            lock.lock();
            this.asyncUpgrade();
        }
    }
    return storedCache;
}
The first construct is synchronized, just to block all upcoming calls until the first call populates the cache.
Then the system checks whether the cache should be regenerated. If yes, a single asynchronous update of the value is triggered, and the current thread returns the cached value. Calls arriving while the cache is being recalculated simply return the most recent value from the cache, and so on.
With a solution like this you can reuse the large amounts of memory of, let's say, a Hazelcast cache manager, as well as multiple key-based cache storage, and keep your complex logic of cache refreshing and eviction.
Or, if you like the @Cacheable annotations, you can do it the following way:
@Cacheable(cacheNames = "index", sync = true)
public Index getCachedIndex() {
return new Index();
}
@CachePut(cacheNames = "index")
public Index putIntoCache() {
return new Index();
}
public Index getIndex() {
Index latestIndex = getCachedIndex();
if (isObsolete(latestIndex)) {
recalculateCache();
}
return latestIndex;
}
private ReentrantLock lock = new ReentrantLock();
@Async
public void recalculateCache() {
if (!lock.isLocked()) {
lock.lock();
putIntoCache();
lock.unlock();
}
}
This is almost the same as above, but reuses Spring's caching annotation abstraction.
ORIGINAL:
Why are you trying to resolve this via caching? If this is a simple value (not key-based), you can organize your code in a simpler manner, keeping in mind that a Spring service is a singleton by default.
Something like that:
@Service
public static class IndexService {
@Autowired
private IndexCalculator indexCalculator;
private Index storedCache;
private ReentrantLock lock = new ReentrantLock();
public Index getIndex() {
if (storedCache == null ) {
synchronized (this) {
this.lock.lock();
Index result = indexCalculator.calculateIndex();
this.storedCache = result;
this.lock.unlock();
}
}
if (isObsolete()) {
if (!lock.isLocked()) {
lock.lock();
this.asyncUpgrade();
}
}
return storedCache;
}
@Async
public void asyncUpgrade() {
Index result = indexCalculator.calculateIndex();
synchronized (this) {
this.storedCache = result;
}
this.lock.unlock();
}
public boolean isObsolete() {
long currentTimestamp = indexCalculator.getCurrentTimestamp();
if (storedCache == null || storedCache.getTimestamp() < currentTimestamp) {
return true;
} else {
return false;
}
}
}
I.e. the first call is synchronized and you have to wait until the results are populated. Then, if the stored value is obsolete, the system will perform an asynchronous update of the value, but the current thread will receive the stored "cached" value.
I have also introduced a reentrant lock to restrict the upgrade of the stored index to a single thread at a time.
I would use a Guava LoadingCache in your index service, as shown in the code sample below:
LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
.maximumSize(1000)
.refreshAfterWrite(1, TimeUnit.MINUTES)
.build(
new CacheLoader<Key, Graph>() {
public Graph load(Key key) { // no checked exception
return getGraphFromDatabase(key);
}
public ListenableFuture<Graph> reload(final Key key, Graph prevGraph) {
if (neverNeedsRefresh(key)) {
return Futures.immediateFuture(prevGraph);
} else {
// asynchronous!
ListenableFutureTask<Graph> task = ListenableFutureTask.create(new Callable<Graph>() {
public Graph call() {
return getGraphFromDatabase(key);
}
});
executor.execute(task);
return task;
}
}
});
You can create an async reloading cache loader by calling Guava's method:
public abstract class CacheLoader<K, V> {
...
public static <K, V> CacheLoader<K, V> asyncReloading(
final CacheLoader<K, V> loader, final Executor executor) {
...
}
}
The trick is to run the reload operation in a separate thread, using a ThreadPoolExecutor for example (see the sketch after the list below):
On first call, the cache is populated by the load() method, thus it may take some time to answer,
On subsequent calls, when the value needs to be refreshed, it's being computed asynchronously while still serving the stale value. It will serve the updated value once the refresh has completed.
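As a hedged wiring sketch (the String key, the one-minute refresh interval and the single-thread executor are illustrative choices, not requirements; calculateIndex() stands for the expensive computation from the question):

ExecutorService executor = Executors.newSingleThreadExecutor();

LoadingCache<String, Index> indexCache = CacheBuilder.newBuilder()
        .refreshAfterWrite(1, TimeUnit.MINUTES)
        .build(CacheLoader.asyncReloading(
                new CacheLoader<String, Index>() {
                    @Override
                    public Index load(String key) {
                        return calculateIndex(); // the expensive computation
                    }
                },
                executor));

// getUnchecked() serves the stale value while a refresh runs on the executor.
Index index = indexCache.getUnchecked("index");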
I think it can be something like
@Autowired
IndexService indexService; // self injection
@Cacheable(cacheNames = INDEX_CACHE_NAME)
@CacheEvict(cacheNames = INDEX_CACHE_NAME, condition = "target.isObsolete(#result) && #indexService.calculateIndexAsync()")
public Index getIndex() {
return calculateIndex();
}
public boolean calculateIndexAsync() {
someAsyncService.run(new Runnable() {
public void run() {
indexService.updateIndex(); // require self reference to use Spring caching proxy
}
});
return true;
}
@CachePut(cacheNames = INDEX_CACHE_NAME)
public Index updateIndex() {
return calculateIndex();
}
The above code has a problem: if you call getIndex() again while the index is being updated, it will be calculated again. To prevent this, it is better not to use @CacheEvict and to let @Cacheable return the obsolete value until the index has been recalculated.
@Autowired
IndexService indexService; // self injection
@Cacheable(cacheNames = INDEX_CACHE_NAME, condition = "!(target.isObsolete(#result) && #indexService.calculateIndexAsync())")
public Index getIndex() {
return calculateIndex();
}
public boolean calculateIndexAsync() {
if (!someThreadSafeService.isIndexBeingUpdated()) {
someAsyncService.run(new Runnable() {
public void run() {
indexService.updateIndex(); // require self reference to use Spring caching proxy
}
});
}
return false;
}
@CachePut(cacheNames = INDEX_CACHE_NAME)
public Index updateIndex() {
return calculateIndex();
}
I'm looking for a way to read ENTIRE files so that every file will be read entirely into a single String.
I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and have a ParDo process each and every file entirely.
What's the best approach to it?
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
FileBasedSource#isSplittable() you will want to override and return false. This will indicate that there is no intra-file splitting.
FileBasedSource#createForSubrangeOfFile(String, long, long) you will override to return a sub-source for just the file specified.
FileBasedSource#createSingleFileReader() you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
FileBasedReader#startReading(...) you will override to do nothing; the framework will already have opened the file for you, and it will close it.
FileBasedReader#readNextRecord() you will override to read the entire file as a single element.
[1] One easy special case is when you have a small number of files, you can expand them prior to job submission, and they all take about the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
I was looking for a similar solution myself. Following Kenn's recommendations and a few other references such as XMLSource.java, I created the following custom source, which seems to work fine.
I am not a developer, so if anyone has suggestions on how to improve it, please feel free to contribute.
public class FileIO {
// Match TextIO.
public static Read.Bounded<KV<String,String>> readFilepattern(String filepattern) {
return Read.from(new FileSource(filepattern, 1));
}
public static class FileSource extends FileBasedSource<KV<String,String>> {
private String filename = null;
public FileSource(String fileOrPattern, long minBundleSize) {
super(fileOrPattern, minBundleSize);
}
public FileSource(String filename, long minBundleSize, long startOffset, long endOffset) {
super(filename, minBundleSize, startOffset, endOffset);
this.filename = filename;
}
// This will indicate that there is no intra-file splitting.
@Override
public boolean isSplittable(){
return false;
}
@Override
public boolean producesSortedKeys(PipelineOptions options) throws Exception {
return false;
}
@Override
public void validate() {}
@Override
public Coder<KV<String,String>> getDefaultOutputCoder() {
return KvCoder.of(StringUtf8Coder.of(),StringUtf8Coder.of());
}
@Override
public FileBasedSource<KV<String,String>> createForSubrangeOfFile(String fileName, long start, long end) {
return new FileSource(fileName, getMinBundleSize(), start, end);
}
@Override
public FileBasedReader<KV<String,String>> createSingleFileReader(PipelineOptions options) {
return new FileReader(this);
}
}
/**
* A reader that reads an entire text file from a {@link FileSource}.
*/
private static class FileReader extends FileBasedSource.FileBasedReader<KV<String,String>> {
private static final Logger LOG = LoggerFactory.getLogger(FileReader.class);
private ReadableByteChannel channel = null;
private long nextOffset = 0;
private long currentOffset = 0;
private boolean isAtSplitPoint = false;
private final ByteBuffer buf;
private static final int BUF_SIZE = 1024;
private KV<String,String> currentValue = null;
private String filename;
public FileReader(FileSource source) {
super(source);
buf = ByteBuffer.allocate(BUF_SIZE);
buf.flip();
this.filename = source.filename;
}
private int readFile(ByteArrayOutputStream out) throws IOException {
int byteCount = 0;
while (true) {
if (!buf.hasRemaining()) {
buf.clear();
int read = channel.read(buf);
if (read < 0) {
break;
}
buf.flip();
}
byte b = buf.get();
byteCount++;
out.write(b);
}
return byteCount;
}
@Override
protected void startReading(ReadableByteChannel channel) throws IOException {
this.channel = channel;
}
@Override
protected boolean readNextRecord() throws IOException {
currentOffset = nextOffset;
ByteArrayOutputStream buf = new ByteArrayOutputStream();
int offsetAdjustment = readFile(buf);
if (offsetAdjustment == 0) {
// EOF
return false;
}
nextOffset += offsetAdjustment;
isAtSplitPoint = true;
currentValue = KV.of(this.filename,CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), buf.toByteArray()));
return true;
}
@Override
protected boolean isAtSplitPoint() {
return isAtSplitPoint;
}
@Override
protected long getCurrentOffset() {
return currentOffset;
}
@Override
public KV<String,String> getCurrent() throws NoSuchElementException {
return currentValue;
}
}
}
A much simpler method is to generate the list of filenames and write a function to process each file individually. I'm showing Python, but Java is similar:
def generate_filenames():
for shard in xrange(0, 300):
yield 'gs://bucket/some/dir/myfilname-%05d-of-00300' % shard
with beam.Pipeline(...) as p:
(p | beam.Create(generate_filenames())
| beam.FlatMap(lambda filename: readfile(filename))
| ...)
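A rough Java equivalent under the same assumptions (readFile is a hypothetical helper that reads one file into a String, and p is the Pipeline):

List<String> filenames = new ArrayList<>();
for (int shard = 0; shard < 300; shard++) {
    filenames.add(String.format("gs://bucket/some/dir/myfilname-%05d-of-00300", shard));
}

p.apply(Create.of(filenames))
 .apply(ParDo.of(new DoFn<String, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) throws IOException {
         // readFile is a hypothetical helper that reads the whole file into a String
         c.output(readFile(c.element()));
     }
 }));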
FileIO does that for you without the need to implement your own FileBasedSource.
Create matches for each of the files that you want to read:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt"))
Also, you can match like this if you do not want Dataflow to throw an exception when no file is found for your file pattern:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt").withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
Read your matches using FileIO:
.apply("Read file matches", FileIO.readMatches())
The above code returns a PCollection of type FileIO.ReadableFile (PCollection<FileIO.ReadableFile>). Then you create a DoFn that processes these ReadableFiles to meet your use case, as sketched below.
.apply("Process my files", ParDo.of(MyCustomDoFnToProcessFiles.create()))
You can read the entire documentation for FileIO here.
I need to make Jersey refuse requests with an incorrect Content-Length. I'm checking the Content-Length with a ContainerRequestFilter, like so:
public class ContentLengthRequiredRequestFilter implements ContainerRequestFilter {
private static Logger LOG = LoggerFactory.getLogger(ContentLengthRequiredRequestFilter.class);
@Override
public void filter(ContainerRequestContext requestContext) throws IOException {
if (javax.ws.rs.HttpMethod.POST.equals(requestContext.getMethod())
|| javax.ws.rs.HttpMethod.PUT.equals(requestContext.getMethod())) {
int givenContentLength = requestContext.getLength();
if (givenContentLength == -1) {
// no content-length given, but it is required for PUT and POST requests
requestContext.abortWith(Response.status(Response.Status.LENGTH_REQUIRED).entity("No content-length provided.").build());
} else {
// now check if the given content-length is actually correct.
// since I only have a reference to an entity stream, it seems to be that
// reading the entire stream and then resetting it is not a good solution.
// Should I be checking this somewhere else, perhaps somewhere the entity is already available or where I can get the total size of the body without causing the stream to be read twice? Or is there a better way to get the body size here?
}
}
}
}
As you can see in the code block comment, should I be checking this somewhere else, perhaps somewhere the entity is already available or where I can get the total size of the body without causing the stream to be read twice? Or is there a better way to get the body size there?
Thank you!!
If you don't want to buffer the incoming entity (input stream), I'd take a look at the ReaderInterceptor interface. Instances of this contract are invoked only when the incoming request contains an entity (typically POST, PUT). In the interceptor you can do pretty much anything with the entity. A simple snippet (not covering all cases) could look like:
public class MyInterceptor implements ReaderInterceptor {
@Override
public Object aroundReadFrom(final ReaderInterceptorContext context) throws IOException, WebApplicationException {
final InputStream old = context.getInputStream();
final String first = context.getHeaders().getFirst("Content-Length");
final Long declared = first == null ? -1 : Long.valueOf(first);
context.setInputStream(new InputStream() {
private long length = 0;
private int mark = 0;
@Override
public int read() throws IOException {
final int read = old.read();
readAndCheck(read != -1 ? 1 : 0);
return read;
}
@Override
public int read(final byte[] b) throws IOException {
final int read = old.read(b);
readAndCheck(read != -1 ? read : 0);
return read;
}
@Override
public int read(final byte[] b, final int off, final int len) throws IOException {
final int read = old.read(b, off, len);
readAndCheck(read != -1 ? read : 0);
return read;
}
@Override
public long skip(final long n) throws IOException {
final long skip = old.skip(n);
readAndCheck(skip != -1 ? skip : 0);
return skip;
}
@Override
public int available() throws IOException {
return old.available();
}
@Override
public void close() throws IOException {
old.close();
}
@Override
public synchronized void mark(final int readlimit) {
mark += readlimit;
old.mark(readlimit);
}
@Override
public synchronized void reset() throws IOException {
this.length = 0;
readAndCheck(mark);
old.reset();
}
@Override
public boolean markSupported() {
return old.markSupported();
}
private void readAndCheck(final long read) {
this.length += read;
if (this.length > declared) {
throw new WebApplicationException(
Response.status(Response.Status.LENGTH_REQUIRED)
.entity("No content-length provided.")
.build());
}
}
});
final Object entity = context.proceed();
context.setInputStream(old);
return entity;
}
}
In the interceptor above I am setting my own input stream that counts and checks the number of bytes read from the original input stream. This implementation, however, also depends on how the underlying container processes the input stream (i.e. whether it also checks Content-Length when reading the input stream).
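For completeness, a hedged sketch of registering the interceptor (assuming a Jersey 2 ResourceConfig-based setup; MyApplication and the package name are hypothetical, MyInterceptor is the class from the snippet above):

public class MyApplication extends ResourceConfig {
    public MyApplication() {
        register(MyInterceptor.class);      // the length-checking interceptor above
        packages("com.example.resources");  // hypothetical package containing your resources
    }
}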