We are using the Spring Integration sftp:inbound-channel-adapter to transfer data from a remote host. We would like to keep the files on the remote host, so we tried the delete-remote-files="false" option.
<int-sftp:inbound-channel-adapter
    id="sftpInboundChannelAdapter"
    channel="filesToParse"
    session-factory="..."
    remote-directory="..."
    filename-pattern="..."
    local-directory="..."
    temporary-file-suffix=".tmp"
    delete-remote-files="false"
    auto-create-local-directory="true"
    local-filter="localFileFilter"/>
Unfortunately these files are then processed multiple times. Is there a way of keeping the remote files and not processing them multiple times?
EDIT: this is because a subsequent process deletes the file on the local side.
<bean id="localFileFilter" class="org.springframework.integration.file.filters.AcceptAllFileListFilter"/>
Note that the AcceptOnceFileListFilter (which is in fact the default), will only prevent duplicates for the current execution; it keeps its state in memory.
To avoid duplicates across executions, you should use a FileSystemPersistentAcceptOnceFileListFilter configured with an appropriate metadata store.
Note that the PropertiesPersistingMetadataStore only persists its state to disk on a normal application context shutdown (close), so the most robust solution is a Redis- or MongoDB-backed store (or your own implementation of ConcurrentMetadataStore).
You can also call flush() on the PropertiesPersistingMetadataStore from time-to-time (or within the flow).
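For reference, a minimal Java-configuration sketch of such a filter backed by a PropertiesPersistingMetadataStore; the base directory and key prefix are assumptions, not values from the original configuration:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
import org.springframework.integration.metadata.PropertiesPersistingMetadataStore;

@Configuration
public class LocalFilterConfig {

    // Persists the names of files already seen to a properties file,
    // so duplicates are also rejected after an application restart.
    @Bean
    public PropertiesPersistingMetadataStore metadataStore() {
        PropertiesPersistingMetadataStore store = new PropertiesPersistingMetadataStore();
        store.setBaseDirectory("/tmp/sftp-metadata"); // hypothetical location
        return store;
    }

    // Drop-in replacement for the localFileFilter bean referenced by local-filter.
    @Bean
    public FileSystemPersistentAcceptOnceFileListFilter localFileFilter(
            PropertiesPersistingMetadataStore metadataStore) {
        return new FileSystemPersistentAcceptOnceFileListFilter(metadataStore, "sftp-local-");
    }
}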
I changed the filter: it now only retrieves them once.
<bean id="localFileFilter" class="org.springframework.integration.file.filters.AcceptOnceFileListFilter"/>
Related
I want to add a temporary suffix to a file while I am streaming it from a remote directory.
I am streaming the file from a remote directory using the Spring Integration DSL and I want to make sure that each file is read by only a single application at a time, so I am thinking of adding a temporary prefix to the file while it is being streamed. I am using an outbound gateway to fetch the data.
Any pointers will be very helpful. Currently I am renaming the file before and after reading, and I really don't want to do that.
Consider using file locking instead of renaming. Here is the relevant part from the "13.2 Reading Files" section of the Spring Integration documentation:
When multiple processes are reading from the same directory it can be desirable to lock files to prevent them from being picked up concurrently. To do this you can use a FileLocker. There is a java.nio based implementation available out of the box, but it is also possible to implement your own locking scheme. The nio locker can be injected as follows:
<int-file:inbound-channel-adapter id="filesIn"
directory="file:${input.directory}" prevent-duplicates="true">
<int-file:nio-locker/>
</int-file:inbound-channel-adapter>
A custom locker you can configure like this:
<int-file:inbound-channel-adapter id="filesIn"
directory="file:${input.directory}" prevent-duplicates="true">
<int-file:locker ref="customLocker"/>
</int-file:inbound-channel-adapter>
[Note]
When a file inbound adapter is configured with a locker, it will take the responsibility to acquire a lock before the file is allowed to be received. It will not assume the responsibility to unlock the file. If you have processed the file and keep the locks hanging around, you have a memory leak. If this is a problem in your case you should call FileLocker.unlock(File file) yourself at the appropriate time.
Please see the docs for Interface FileLocker and Class NioFileLocker for more info.
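For illustration, a sketch (not part of the original answer) of a downstream component that releases the lock once processing is done; the class name and wiring are assumptions:

import java.io.File;
import org.springframework.integration.file.FileLocker;

public class LockedFileProcessor {

    private final FileLocker locker;

    // Inject the same locker instance the inbound adapter uses
    // (for example the NioFileLocker behind <int-file:nio-locker/>).
    public LockedFileProcessor(FileLocker locker) {
        this.locker = locker;
    }

    // Invoked from the flow (e.g. as a service activator) with the received file.
    public void process(File file) {
        try {
            // ... handle the file contents here ...
        } finally {
            locker.unlock(file); // release the lock so it does not linger in memory
        }
    }
}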
I would use the Apache Commons FileUtils. https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FileUtils.html#moveFile
Typically what I do is write the file to a temporary working directory during the initial transfer. After the file is completely transferred I compute a checksum to guarantee the file is correct. At that point I move the file to the final directory used by the other application's logic. As long as the working directory and the final directory are on the same filesystem, the move will be atomic. This guarantees no race conditions between the different parts of the application using the file.
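A rough sketch of that pattern, assuming Commons IO for the move and Commons Codec for the checksum; the directory and method names are placeholders:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.io.FileUtils;

public class AtomicHandoff {

    // Verify the transferred file against an expected checksum, then move it
    // into the directory watched by the downstream application.
    public void handOff(File workFile, String expectedMd5, File finalDir) throws IOException {
        try (InputStream in = new FileInputStream(workFile)) {
            if (!DigestUtils.md5Hex(in).equals(expectedMd5)) {
                throw new IOException("Checksum mismatch for " + workFile);
            }
        }
        // Atomic as long as workFile and finalDir are on the same filesystem.
        FileUtils.moveFile(workFile, new File(finalDir, workFile.getName()));
    }
}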
From this article we can learn that Spring-Batch holds the Job's status in some SQL repository.
And from this article we can learn that the location of the JobRepository can be configured: it can be in-memory or a remote DB.
So if we need to scale a batch job, should we run several different Spring-batch JARs, all configured to use the same shared DB in order to keep them synchronized?
Is this the right pattern / architecture?
Yes, this is the way to go. The problem that might happen when you launch the same job from different physical nodes is that you can create the same job instance twice. In this case, Spring Batch will not know which instance to pick up when restarting a failed execution. A shared job repository acts as a safeguard to prevent this kind of concurrency issues.
The job repository achieves this synchronization thanks to the transactional capabilities of the underlying database. The IsolationLevelForCreate can be set to an aggressive value (SERIALIZABLE is the default) in order to avoid the aforementioned issue.
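A minimal sketch, assuming Java configuration and an externally defined sharedDataSource that every node points at the same database:

import javax.sql.DataSource;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class SharedJobRepositoryConfig {

    // All nodes use the same repository tables, so concurrent creation of the
    // same job instance is detected by the database, not by each JVM alone.
    @Bean
    public JobRepository jobRepository(DataSource sharedDataSource,
                                       PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(sharedDataSource);
        factory.setTransactionManager(transactionManager);
        // SERIALIZABLE is already the default; shown only to make the setting explicit.
        factory.setIsolationLevelForCreate("ISOLATION_SERIALIZABLE");
        factory.afterPropertiesSet();
        return factory.getObject();
    }
}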
We are using the Spring Integration FileTailingMessageProducer (Apache Commons) for remotely tailing files and sending messages to RabbitMQ.
Obviously, when the Java process that contains the file tailer is restarted, the information about which lines have already been processed is lost. We would like to be able to restart the process and continue tailing at the last line we had previously processed.
I guess we will have to keep this state either in a file on the host or a small database. The information stored in this file or db will probably be a simple map mapping file ids (file names will not suffice, since files may be rotated) to line numbers:
file ids -> line number
I am thinking about subclassing the ApacheCommonsFileTailingMessageProducer.
The java process will need to continually update this file or db. Is there a method for updating this file when the JVM exits?
Has anyone done this before? Are there any recommendations on how to proceed?
Spring Integration has an abstraction, MetadataStore - it's a simple key/value abstraction, so it would be perfect for this use case.
There are several implementations. The PropertiesPersistingMetadataStore persists to a properties file and, by default, only persists on an ApplicationContext close() (destroy()).
It implements Flushable so it can be flush()ed more often.
The other implementations (Redis, MongoDB, Gemfire) don't need flushing because the data is written immediately.
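A hedged sketch of how the file-id-to-line-number map could sit on top of a MetadataStore; the class and key naming are assumptions, not existing framework code:

import java.io.Flushable;
import java.io.IOException;
import org.springframework.integration.metadata.ConcurrentMetadataStore;

public class TailPositionTracker {

    private final ConcurrentMetadataStore store;

    public TailPositionTracker(ConcurrentMetadataStore store) {
        this.store = store;
    }

    // Record the last processed line for a file id (not the file name,
    // since rotated files keep their id but change their name).
    public void markProcessed(String fileId, long lineNumber) {
        store.put(fileId, Long.toString(lineNumber));
        if (store instanceof Flushable) {
            try {
                ((Flushable) store).flush(); // persist eagerly for the properties-backed store
            } catch (IOException e) {
                throw new IllegalStateException("Could not flush metadata store", e);
            }
        }
    }

    // On restart, resume from the stored position (or from the beginning).
    public long lastProcessedLine(String fileId) {
        String value = store.get(fileId);
        return value == null ? 0L : Long.parseLong(value);
    }
}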
A subclass would work; the file tailer is a simple bean and can be declared as a <bean/> - there's no other "magic" done by the XML parser.
But, if you'd be interested in contributing it to the framework, consider adding the code to the adapter directly. Ideally, it would go in the superclass (FileTailingMessageProducerSupport) but I don't think we will have the ability to look at the file creation timestamp in the OSDelegatingFileTailingMessageProducer because we just get the line data streamed to us.
In any case, please open a JIRA Issue for this feature.
I would like some information about the data flow in a Spring Batch processing but fail to find what I am looking for on the Internet (despite some useful questions on this site).
I am trying to establish standards for using Spring Batch in our company and we are wondering how Spring Batch behaves when several processors in a step update data on different data sources.
This question focuses on a chunked process but feel free to provide information on other modes.
From what I have seen (please correct me if I am wrong), when a line is read, it follows the whole flow (reader, processors, writer) before the next is read (as opposed to a silo-processing where reader would process all lines, send them to the processor, and so on).
In my case, several processors read data (in different databases) and update it in the process, and finally the writer inserts data into yet another DB. For now the JobRepository is not linked to a database, but that would eventually be an independent one as well, making things a bit more complex.
This model cannot be changed since the data belongs to several business areas.
How is the transaction managed in this case? Is the data committed only once the full chunk is processed? And then, is there a 2-phase commit management? How is it ensured? What development or configuration should be made in order to ensure the consistency of data?
More generally, what would your recommendations be in a similar case?
Spring batch uses the Spring core transaction management, with most of the transaction semantics arranged around a chunk of items, as described in section 5.1 of the Spring Batch docs.
The transaction behaviour of the readers and writers depends on exactly what they are (e.g. file system, database, JMS queue, etc.), but if the resource is configured to support transactions then it will be enlisted by Spring automatically. The same goes for XA - if you make the resource endpoint XA compliant then it will use two-phase commits for it.
Getting back to the chunk transaction: a transaction is set up per chunk, so if you set the commit interval to 5 on a given tasklet, it will open and close a new transaction (including all resources managed by the transaction manager) for every 5 items read (the commit-interval).
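A minimal Java-configuration sketch of that chunk boundary (the XML equivalent is commit-interval="5" on the chunk element); the reader and writer beans are assumed to exist elsewhere:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChunkStepConfig {

    // One transaction per chunk: 5 reads, their processing, and one write
    // are committed (or rolled back) together.
    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<String> reader,
                           ItemWriter<String> writer) {
        return steps.get("importStep")
                .<String, String>chunk(5) // the commit interval
                .reader(reader)
                .writer(writer)
                .build();
    }
}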
But all of this is set up around reading from a single data source, does that meet your requirement? I'm not sure spring batch can manage a transaction where it reads data from multiple sources and writes the processor result into another database within a single transaction. (In fact I can't think of anything that could do that...)
I have a requirement to cache XML-backed Java bean objects read from a database. I am using an in-memory HashMap to maintain my Java objects, Spring for DI, and the WebLogic 11g app server.
Can you please suggest a mechanism to reload the cache when there is an update to the XML files?
You can make use of the WebLogic p13n cache for this purpose instead of using your own HashMap to cache the Java objects. You will have to configure the p13n-cache-config.xml file, which contains the TTL, max value, etc. for your cache.
On the first point, the cache is automatically reloaded once the TTL has elapsed. For manually clearing the cache, you can implement a servlet (which you can hit directly from your browser and restrict to a particular URL) that clears the cache you want to reload.
The WebLogic p13n cache also provides a method for cluster-aware cache clearing, in case you need it. If you want to use your own HashMap for caching instead, provide an update method for that HashMap that clears the Java objects you want reloaded and then calls the cache creation method again.
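If you stay with your own map, a hedged sketch of that update method; the XmlBeanDao that re-reads the XML-backed beans from the database is assumed, not an existing API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class XmlBeanCache {

    // Assumed DAO contract that reloads all XML-backed beans from the database.
    public interface XmlBeanDao {
        Map<String, Object> loadAll();
    }

    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final XmlBeanDao dao;

    public XmlBeanCache(XmlBeanDao dao) {
        this.dao = dao;
    }

    public Object get(String key) {
        return cache.get(key);
    }

    // The "update method": clear stale entries and rebuild from the database.
    // A servlet or other management endpoint can call this when the XML changes.
    public synchronized void reload() {
        Map<String, Object> fresh = dao.loadAll();
        cache.clear();
        cache.putAll(fresh);
    }
}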