We are using the Spring Integration FileTailingMessageProducer (Apache Commons) for remotely tailing files and sending messages to RabbitMQ.
Obviously, when the Java process that contains the file tailer is restarted, the information about which lines have already been processed is lost. We would like to be able to restart the process and continue tailing at the last line we had previously processed.
I guess we will have to keep this state either in a file on the host or in a small database. The information stored in this file or DB will probably be a simple map from file ids (file names will not suffice, since files may be rotated) to line numbers:
file ids -> line number
I am thinking about subclassing the ApacheCommonsFileTailingMessageProducer.
The Java process will need to continually update this file or DB. Is there a method for updating this file when the JVM exits?
Has anyone done this before? Are there any recommendations on how to proceed?
Spring Integration has an abstraction, MetadataStore; it's a simple key/value abstraction, so it would be perfect for this use case.
There are several implementations. The PropertiesPersistingMetadataStore persists to a properties file and, by default, only persists on an ApplicationContext close() (destroy()).
It implements Flushable so it can be flush()ed more often.
The other implementations (Redis, MongoDB, Gemfire) don't need flushing because the data is written immediately.
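For illustration, here is a minimal sketch of keeping the file-id -> line-number state in a PropertiesPersistingMetadataStore (the TailPositionTracker wrapper is hypothetical, not part of the framework):

import java.io.IOException;
import org.springframework.integration.metadata.PropertiesPersistingMetadataStore;

public class TailPositionTracker {

    private final PropertiesPersistingMetadataStore store =
            new PropertiesPersistingMetadataStore();

    public TailPositionTracker(String baseDirectory) throws Exception {
        store.setBaseDirectory(baseDirectory); // directory for the backing .properties file
        store.afterPropertiesSet();
    }

    // Record the last processed line for a file id and flush to disk.
    public void record(String fileId, long lineNumber) throws IOException {
        store.put(fileId, Long.toString(lineNumber));
        store.flush(); // PropertiesPersistingMetadataStore is Flushable
    }

    // Where to resume after a restart (0 if the file was never seen).
    public long lastLine(String fileId) {
        String value = store.get(fileId);
        return (value == null) ? 0L : Long.parseLong(value);
    }
}

A subclass of the tailing producer could call record(...) for every line it emits and consult lastLine(...) on startup.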
A subclass would work; the file tailer is a simple bean and can be declared as a <bean/>, and there's no other "magic" done by the XML parser.
But, if you'd be interested in contributing it to the framework, consider adding the code to the adapter directly. Ideally, it would go in the superclass (FileTailingMessageProducerSupport) but I don't think we will have the ability to look at the file creation timestamp in the OSDelegatingFileTailingMessageProducer because we just get the line data streamed to us.
In any case, please open a JIRA Issue for this feature.
Related
I have some static configurations that will not be changed for each environment, for example, the mapping between client names and their ids. In this case, should I store them in a Spring YAML property file or in a database, e.g. MongoDB, so that they can be easily accessed via Java code?
On the one hand, consider that when you add a database component, you are adding an additional potential point of failure to your app. What will happen if the database is not accessible for any reason (crashed, under maintenance, network issues)?
On the other hand, it depends on exactly how your implementation will use the files. For example, if you add items to your mapping between clients and ids, will you need to restart/rebuild/redeploy your app? How many running instances of your app will you have?
So there is no single answer that fits all cases.
It is better to keep this in Spring YAML than in any database, because the I/O operations involved in a database call are comparatively expensive. Static configuration kept in a YAML or properties file is faster to access.
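As an illustration (assuming Spring Boot; the property names are made up), the client-name to id mapping could be bound from application.yml like this:

import java.util.Map;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds a block like:
//   clients:
//     mapping:
//       acme: 101
//       globex: 102
@Configuration
@ConfigurationProperties(prefix = "clients")
public class ClientMappingProperties {

    // Static client-name -> id mapping, loaded once at startup.
    private Map<String, Integer> mapping;

    public Map<String, Integer> getMapping() { return mapping; }

    public void setMapping(Map<String, Integer> mapping) { this.mapping = mapping; }
}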
I want to add a temporary suffix to a file while I am streaming it from a remote directory.
I am streaming the file from a remote directory using the Spring Integration DSL, and I want to make sure that a file is read by only a single application at a time. So I am thinking of adding a temporary prefix to the file while it is being streamed. I am using an outbound gateway to fetch the data.
Any pointers will be very helpful. Currently I am renaming the file before reading and after reading; I really don't want to do that.
Consider using file locking instead of renaming. Here is the relevant part from section 13.2, Reading Files, of the Spring Integration documentation:
When multiple processes are reading from the same directory it can be desirable to lock files to prevent them from being picked up concurrently. To do this you can use a FileLocker. There is a java.nio based implementation available out of the box, but it is also possible to implement your own locking scheme. The nio locker can be injected as follows:
<int-file:inbound-channel-adapter id="filesIn"
    directory="file:${input.directory}" prevent-duplicates="true">
    <int-file:nio-locker/>
</int-file:inbound-channel-adapter>
You can configure a custom locker like this:
<int-file:inbound-channel-adapter id="filesIn"
    directory="file:${input.directory}" prevent-duplicates="true">
    <int-file:locker ref="customLocker"/>
</int-file:inbound-channel-adapter>
[Note]
When a file inbound adapter is configured with a locker, it will take responsibility for acquiring a lock before the file is allowed to be received. It will not assume responsibility for unlocking the file. If you have processed the file and keep the locks hanging around, you have a memory leak. If this is a problem in your case, you should call FileLocker.unlock(File file) yourself at the appropriate time.
Please see the docs for Interface FileLocker and Class NioFileLocker for more info.
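If you do need to release locks yourself, a minimal sketch (the handler class and its wiring are assumptions, not from the docs) could look like this:

import java.io.File;
import org.springframework.integration.file.FileLocker;
import org.springframework.messaging.Message;

public class LockAwareFileHandler {

    private final FileLocker locker; // assumed to be the same locker bean the adapter uses

    public LockAwareFileHandler(FileLocker locker) {
        this.locker = locker;
    }

    // Invoked, for example, as a service activator with a File payload.
    public void handle(Message<File> message) {
        File file = message.getPayload();
        try {
            process(file);
        } finally {
            locker.unlock(file); // avoid the memory leak mentioned in the note above
        }
    }

    private void process(File file) {
        // ... business logic ...
    }
}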
I would use the Apache Commons FileUtils. https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FileUtils.html#moveFile
Typically what I do is write the file to a temporary working directory during the initial transfer. After the file is completely transferred, I compute a checksum to guarantee the file is correct. At that point I move the file to the final directory used by the other application's logic. As long as the working directory and the final directory are on the same filesystem, the move will be atomic. This guarantees no race conditions between the different parts of the application using the file.
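As a sketch of that pattern (the directory path and expected checksum are placeholders; assumes commons-io on the classpath):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class SafeFileDelivery {

    // Verify the transferred file, then publish it with an atomic move.
    public static void deliver(File incoming, long expectedCrc32) throws IOException {
        // The checksum guards against partial or corrupted transfers.
        if (FileUtils.checksumCRC32(incoming) != expectedCrc32) {
            throw new IOException("Checksum mismatch for " + incoming);
        }
        // Same filesystem => the move is a rename, hence atomic for consumers.
        FileUtils.moveFile(incoming, new File("/data/final", incoming.getName()));
    }
}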
I would like some information about the data flow in a Spring Batch processing but fail to find what I am looking for on the Internet (despite some useful questions on this site).
I am trying to establish standards for using Spring Batch in our company, and we are wondering how Spring Batch behaves when several processors in a step update data on different data sources.
This question focuses on a chunked process but feel free to provide information on other modes.
From what I have seen (please correct me if I am wrong), when a line is read, it follows the whole flow (reader, processors, writer) before the next is read (as opposed to silo processing, where the reader would process all lines, send them to the processor, and so on).
In my case, several processors read data (in different databases) and update it in the process, and finally the writer inserts data into yet another DB. For now the JobRepository is not linked to a database, but it would be an independent one, which makes things still a bit more complex.
This model cannot be changed since the data belongs to several business areas.
How are transactions managed in this case? Is the data committed only once the full chunk has been processed? Is there two-phase commit management, and how is it ensured? What development or configuration is needed to ensure data consistency?
More generally, what would your recommendations be in a similar case?
Spring Batch uses the Spring core transaction management, with most of the transaction semantics arranged around a chunk of items, as described in section 5.1 of the Spring Batch docs.
The transaction behaviour of the readers and writers depends on exactly what they are (e.g. file system, database, JMS queue, etc.), but if the resource is configured to support transactions, it will be enlisted by Spring automatically. The same goes for XA: if you make the resource endpoint XA compliant, it will use two-phase commit.
Getting back to the chunk transaction: it is set up on a per-chunk basis, so if you set the commit interval to 5 on a given tasklet, a new transaction (including all resources managed by the transaction manager) will be opened and closed for that number of reads (defined as the commit-interval).
But all of this is set up around reading from a single data source; does that meet your requirement? I'm not sure Spring Batch can manage a transaction where it reads data from multiple sources and writes the processor result into another database within a single transaction. (In fact, I can't think of anything that could do that...)
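For reference, a chunk-oriented step with a commit interval of 5 might be declared like this in Spring Batch Java configuration (the String item types and injected beans are placeholders):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ChunkStepConfig {

    @Bean
    public Step exampleStep(StepBuilderFactory steps,
                            ItemReader<String> reader,
                            ItemProcessor<String, String> processor,
                            ItemWriter<String> writer,
                            PlatformTransactionManager txManager) {
        return steps.get("exampleStep")
                .<String, String>chunk(5)      // one transaction per 5 items
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .transactionManager(txManager) // governs the chunk transaction
                .build();
    }
}

The transaction manager injected here is what determines which resources take part in the chunk transaction.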
I have just discovered that Apache commons-configuration can read properties from a DataSource, but it does not cache them. My application needs to read properties many times, and it is too slow to access the database each time.
I have a Camel application that sends all messages to routes that end with my custom beans. These beans are created with scope prototype (I believe in OOP), and each needs to read some properties and a data source (whose URL, name, etc. are read from properties) that depend on the current user, from a SQL DB. Each message I receive creates a bean, so the properties are reread.
Unfortunately, I am not free to choose where the properties are read from, because there is now another piece of software (a GUI user/properties manager, not written by me) that writes them to the DB. So I need to read the properties from it.
Can you suggest me an alternative?
You could use the Netflix Archaius project, which adds the caching behavior you are looking for as well as dynamic refresh capabilities. Archaius is built around Commons Configuration.
So, rather than subclassing the DatabaseConfiguration, you could use Archaius' DynamicConfiguration, which extends Commons' AbstractConfiguration. This class will cache whatever source you would like, and refresh the properties at an interval you specify using their poll scheduling class.
The only class you would have to implement is a PolledConfigurationSource which pulls data from the database and places it in a Map. Should be pretty simple.
https://github.com/Netflix/archaius/wiki/Users-Guide
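A hedged sketch of what that might look like (the table and column names are made up; the Archaius classes are from com.netflix.config):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import com.netflix.config.DynamicConfiguration;
import com.netflix.config.FixedDelayPollingScheduler;
import com.netflix.config.PollResult;
import com.netflix.config.PolledConfigurationSource;

public class DatabaseConfigSource implements PolledConfigurationSource {

    private final DataSource dataSource;

    public DatabaseConfigSource(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public PollResult poll(boolean initial, Object checkPoint) throws Exception {
        Map<String, Object> props = new HashMap<String, Object>();
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT prop_key, prop_value FROM app_properties")) {
            while (rs.next()) {
                props.put(rs.getString("prop_key"), rs.getString("prop_value"));
            }
        }
        return PollResult.createFull(props); // full snapshot, cached by Archaius
    }

    // Poll every 60 seconds; values are served from memory between polls.
    public static DynamicConfiguration build(DataSource ds) {
        return new DynamicConfiguration(
                new DatabaseConfigSource(ds),
                new FixedDelayPollingScheduler(0, 60000, false));
    }
}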
I have a simple NOOP file consumer configured in camel as follows:
file:///tst?delay=10000&idempotent=false&include=fileMatch&noop=true
Normally the user running the Camel application will not have write permissions to /tst, but does have read and write permissions to /tst/fileMatch. Unfortunately, I'm finding that Camel won't even poll for the file unless it has write permissions to /tst.
Is there a way around this?
Since the last answer was written, the Camel file component has had a relevant change:

Notice from Camel 2.10 onwards the read locks changed, fileLock and rename will also use a markerFile as well, to ensure not picking up files that may be in process by another Camel consumer running on another node (eg cluster). This is only supported by the file component (not the ftp component).
Hence, in Camel 2.10 or above, you still need write permission to use readLock=fileLock. You can use readLock=none, with the obvious impact that there will not be a read lock.
I shouldn't ask questions when I'm this tired. The reason that this doesn't work (as clearly stated in the component description) is that the default readLock strategy is markerFile (which needs to write the marker file in the directory). By changing this to readLock=fileLock I no longer need write permissions on the directory to read the file as the file system lock is placed on the file being read.
The working URI is:
file:///tst?delay=10000&idempotent=false&include=fileMatch&noop=true&readLock=fileLock
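For completeness, a minimal sketch of the same endpoint in the Camel Java DSL (the direct:process target is a placeholder):

import org.apache.camel.builder.RouteBuilder;

public class NoopFileRoute extends RouteBuilder {

    @Override
    public void configure() {
        // readLock=fileLock locks the file itself, so no write access to /tst
        // is required.
        from("file:///tst?delay=10000&idempotent=false"
                + "&include=fileMatch&noop=true&readLock=fileLock")
            .to("direct:process");
    }
}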