Apache Camel from FTP to database - Java

Is it possible to solve the following scenario with Apache Camel:
Read from FTP (periodically), retrieve a zip file which contains XML, and store this XML in a database.
The main question is: which features exist in Camel, and which functionality do I need to write on my own?

Yes, your route could look something like this (off the top of my head):
JaxbDataFormat jaxb = new JaxbDataFormat("com.example.foobar");
from("ftp://user:pass#server:21/inbox")
.unmarshal().zip()
.split(xpath("//foo"))
.unmarshal(jaxb)
.to("jpa:com.example.foobar.Foo")
This will poll an FTP server, unzip the files, split the content into XML fragments, transform these into JPA entities and finally persist the objects in a database. There are many variations possible; depending on your use case you can omit the splitter EIP or, for example, choose another persistence mechanism (MyBatis, Spring-JDBC, etc.).
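For orientation, here is a fuller sketch of the same route as a RouteBuilder. Note this is only a sketch, not the original poster's setup: the endpoint URI and package names are placeholders, and it uses zipFile() (available from Camel 2.11), which unpacks an actual .zip archive, whereas zip() is the deflate-compression data format.

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.converter.jaxb.JaxbDataFormat;

public class FtpToDatabaseRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // JAXB context for the annotated classes in com.example.foobar
        JaxbDataFormat jaxb = new JaxbDataFormat("com.example.foobar");

        from("ftp://user:pass@server:21/inbox?delay=60000")  // poll the FTP inbox every minute
            .unmarshal().zipFile()                           // unpack the zip archive (Camel 2.11+)
            .split(xpath("//foo")).streaming()               // one exchange per <foo> fragment
                .unmarshal(jaxb)                             // XML fragment -> Foo instance
                .to("jpa:com.example.foobar.Foo")            // persist via the Camel JPA component
            .end();
    }
}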

Related

SDMX-ML: SAS libname XML

Eurostat data can be downloaded via a REST API. The response format of the API is an XML file formatted according to the SDMX-ML standard. With SAS, very conveniently, one can access XML files with the libname statement and the XML or XMLv2 engine.
Currently, I am using the XMLv2 engine together with the automap= option to generate an xmlmap to access the data. It works, but the resulting SAS data sets are very unstructured, and for another data set to be downloaded the data structure might change. Also, the request might depend on the DSD file that Eurostat provides for each database item within a different XML file.
Here comes the code:
%let path = /your/working/directory/;
filename map "&path.map.txt";
filename resp "&path.resp.txt";

proc http
    URL="http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011"
    METHOD="GET"
    OUT=resp;
run; quit;

libname resp XMLv2 automap=REPLACE xmlmap=map;

proc datasets;
    copy out=WORK in=resp;
run; quit;
With the code above, you can view all the downloaded data in your WORK library. It's a mess.
To download another time series, change the parameters of the URL according to Eurostat's description.
So here is my question:
Is there a way to easily generate an xmlmap from a call to the DSD file so that the data are stored in a well-structured way?
As the SDMX-ML standard is widely used in public institutions such as the ECB, Eurostat and the OECD, I am wondering if somebody has already implemented requests to these databases. I know about the tool from Banca Italia which uses a javaObject. However, I was wondering if there might be a solution without the javaObject.

How can I create a route with multiple connected FTP calls using Camel and Java DSL?

I have this synchronous pipeline that needs to be executed from time to time (let's say every 30 minutes):
Connect to an FTP server;
Read a .json file (single file) from folder A;
Unmarshal the content of the file (Class A) and add it to the route context;
Read all the .fixedlength files (multiple files) from folder B (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshal the content of the files (Class B) and do some logic;
Read all the .xml files (multiple files) from folder C (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshal the content of the files (Class C) and do some logic;
End the route.
It is a single pipeline created with the Java DSL. If an error happens, the process stops.
I'm really struggling with Camel to create this. Is it possible, or will I need to handle this manually? I created some demos, but none of them are working properly.
Any help will be appreciated.
I would approach this in the following manner:
All the interfaces to the FTP server where you read the files are separate routes. Their job is only to pick up the file; they don't deal with parsing or transformation.
Then create separate routes for actually receiving the data, parsing and transformation.
Finally, the delivery routes take the data and deliver it to your end destination.
This way you can customise the error handling, it's easier to find out what went wrong where, it's easier to change one part without affecting everything, and you can reuse the routes in several different places.
The way you describe your message pipeline, it seems beneficial to have 3 separate routes, each handling a different folder on your FTP server. You can have a timer that triggers all 3 every 30 minutes or so. The FTP component derives from Camel's File component, and there are a lot of useful parameters that would help with your routing logic here.
For each of your 3 routes you would have something like this:
from("ftp://foo#myserver?include=*.xml&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
.unmarshal()
...
You can find more info about filtering files by their extension here.
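Putting the advice together, a rough sketch of how the routes could be wired up with the Java DSL might look like this. Host names, folder names, file patterns and the processing steps are placeholders, not a verified setup:

import org.apache.camel.builder.RouteBuilder;

public class FtpFoldersRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // pick-up routes: poll each folder every 30 minutes (delay is in milliseconds)
        // and only move the files along; parsing happens in the direct: routes below
        from("ftp://user@host/folderA?include=.*\\.json&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:processA");

        from("ftp://user@host/folderB?include=.*\\.fixedlength&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:processB");

        from("ftp://user@host/folderC?include=.*\\.xml&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:processC");

        // processing routes: unmarshalling to Class A/B/C and the business logic
        // would go here, so error handling can be tuned per step
        from("direct:processA").log("control file ${file:name} picked up");
        from("direct:processB").log("fixed-length file ${file:name} picked up");
        from("direct:processC").log("xml file ${file:name} picked up");
    }
}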

Recursively scan documents for indexing in a folder in SolrJ

I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder and recursively scan it for documents to index into a collection:
bin/post -c gettingstarted afolder/
This has been useful for me to do mass indexing of all the files in the folder. Now that I'm moving to production, I plan to use SolrJ to do the indexing, as it can do more things like robustness checks and retries for indexes that fail.
However, I can't seem to find a way to do the same in SolrJ. Is it possible to do this in SolrJ? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use a ContentStreamUpdateRequest as shown in Uploading data with SolrJ:
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"), "application/pdf");
server.request(req);
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
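Combining the two points above, a rough sketch of a recursive indexer in SolrJ might look like the following. This assumes Solr 5.x with the /update/extract handler enabled on the collection; the folder, collection name and the literal.id choice are placeholders:

import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RecursiveIndexer {
    public static void main(String[] args) throws IOException {
        final SolrClient client = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");

        // walk the folder recursively and send every regular file to the
        // extracting request handler, letting Tika detect the real content type
        Files.walkFileTree(Paths.get("afolder"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                req.addFile(file.toFile(), "application/octet-stream");   // Tika sniffs the actual type
                req.setParam("literal.id", file.toString());              // use the path as the unique id
                req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit per file; a single commit at the end is cheaper
                try {
                    client.request(req);
                } catch (SolrServerException e) {
                    // retry/robustness logic would go here
                    e.printStackTrace();
                }
                return FileVisitResult.CONTINUE;
            }
        });

        client.close();
    }
}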

Camel: use a splitter without an aggregator

I'm new to Camel and I'd like to use it to read an XML file on an FTP server and run an asynchronous process for every NODE element of the XML.
To do that, I'll use a splitter to process every node (I use streaming because the XML file is big).
from("ftp://user@host:port/...")
    .split().tokenizeXML("node").streaming()
    .to("seda:processNode")
    .end();
Then the route for the node processor:
from("seda:processNode")
    .bean(lookup(MyNodeProcessor.class))
    .end();
I was wondering if it's OK to use a splitter without an aggregator? In my case, I don't need to aggregate the outcome of all the processed nodes.
I was also wondering if it's a problem in Camel to have many "split" threads going into a "dead end" instead of being aggregated?
The examples provided by Camel show a splitter without an aggregator, but they still provide an aggregationStrategy with the splitter. Is it mandatory?
No, this is perfectly fine. You can use the splitter without an aggregation strategy, which is the normal case, as in the Splitter EIP: http://camel.apache.org/splitter
If you use an aggregation strategy then it's more like the Composed Message Processor EIP: http://camel.apache.org/composed-message-processor.html, which in Camel can be done with the splitter alone.
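If you ever do need to combine the outcome of the nodes again, a minimal sketch of the splitter with an AggregationStrategy could look like this (the counting strategy and the endpoint URI are illustrative only, not part of the original answer):

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class SplitAggregateRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // example strategy: count how many nodes were split off
        AggregationStrategy countNodes = new AggregationStrategy() {
            @Override
            public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
                if (oldExchange == null) {
                    newExchange.getIn().setHeader("nodeCount", 1);   // first node
                    return newExchange;
                }
                Integer count = oldExchange.getIn().getHeader("nodeCount", Integer.class);
                oldExchange.getIn().setHeader("nodeCount", count + 1);
                return oldExchange;                                  // carried forward
            }
        };

        from("ftp://user@host:21/inbox")
            .split().tokenizeXML("node").streaming().aggregationStrategy(countNodes)
                .to("seda:processNode")
            .end()
            .log("dispatched ${header.nodeCount} nodes");            // runs after the split completes
    }
}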

Document processing in Liferay portal

I've been using Liferay a lot for the past 2 years, but I have never needed any extensive document management.
Now I have a portlet where users upload documents (MS Office OLE2 documents, ODS documents, PDF, etc.) and I have to persist them with all available metadata.
I know how I would do that without using Liferay: I'd probably use Apache Solr with Apache Tika (UpdateRichDocuments and ExtractingRequestHandler) or Apache Jackrabbit, which uses Apache Tika under the hood (org.apache.jackrabbit.extractor.*).
The problem is that if I look at the trunk of Liferay, there are some key classes:
Hooks (JCRHook, FileSystemHook, CMISHook, S3Hook) that are employed from within DLLocalServiceImpl more or less directly.
Another alternative is using DLAppLocalServiceImpl, which employs DLRepositoryLocalServiceImpl; the files are also persisted into the repository via Hooks, but a lot of additional work is done there.
There is no jackrabbit-text-extractors library in Liferay, so I suppose that if I wanted metadata to be extracted from PDF, DOC and ODS documents, I would have a very hard time... because the DL service layer doesn't accept additional properties.
I think I'd have to avoid using the DL services and the JCR hook and access Jackrabbit directly... but I would lose the compatibility and the possibility to migrate my repository, etc.
Could anybody please elaborate on this one? Thank you.
Use Solr for indexing and Jackrabbit for document storage. Managing the Liferay Document Library in code is fairly easy; just look at the DL*LocalServiceUtil classes, namely DLFolderLocalServiceUtil and DLFileLocalServiceUtil. By default Liferay just creates a matching folder/file structure on the hard drive (with the names changed), so you'd only need to write code or use Jackrabbit if you wanted more than this, since Liferay allows upload/download and viewing out of the box via the control panel and various portlets.
I haven't used Jackrabbit with Liferay, but once it is configured everything should be managed under the covers and you shouldn't need to worry about it on the front end.
When you say "with all metadata available" I'm not sure what is retained, but aside from renaming the file so that it can be tracked, there shouldn't be any other changes. It should be quick and easy to test by uploading a file of each type and checking the entries in the LIFERAY/data/document_library directory and its subdirectories. Again, this would be different if Jackrabbit is used.
Those two services, DLLocalServiceImpl and DLAppLocalServiceImpl, both are and will remain important, I suppose. The former is for direct access to the repository. Notice that when adding a file via this service you need to persist the corresponding DLFileEntry into the database and then reference that fileEntryId in addFile(..., fileEntryId, ...).
The latter service does additional work for you, mainly asset management and workflow.
Regarding your use case, I would avoid using the document library, because no metadata can go down into the JCR repository. Actually, the only metadata/custom properties that you could store would be custom properties, a.k.a. the Expando feature of Liferay Portal.
The best way for you seems to be to implement your own Jackrabbit hook to store data in the repository and let the Liferay document library use that repository.
I think Edgar is correct. If you check the current trunk via http://svn.liferay.com/repos/public/portal/trunk/portal-service/src/com/liferay/documentlibrary/service/DLLocalService.java (log in as guest with no password), you will no longer find the class DLFolderLocalServiceUtil. We are using the existing DLFolderLocalServiceUtil class as well. Thanks for the heads up. We will refactor our code so that when 6.1 comes around we can still use the Document Library services.
You always need to use DLAppServiceUtil (as Liferay specifically instructs). Here is my working code that saves a file to the CMS:
public static void saveFileToCMS(ActionRequest aReq, long groupId, String fileName, File filenameWithPath) {
    try {
        ServiceContext serviceContext = ServiceContextFactory.getInstance(
            Group.class.getName(), aReq);

        // prevents duplicate entries based on unique title name
        Random rand = new Random();
        Integer suffix = new Integer(rand.nextInt(10000));

        DLAppServiceUtil.addFileEntry(groupId, 0, fileName, "application/vnd.ms-excel",
            fileName + suffix.toString(), "description goes here", "changelogname",
            filenameWithPath, serviceContext);
        //log.info("Successfully added the new file");
    } catch (PortalException pe) {
        log.error("Portal Exception occurred while saving file to CMS");
        pe.printStackTrace();
    } catch (SystemException e) {
        log.error("System Exception occurred while saving file to CMS");
        e.printStackTrace();
    }
}
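For context, a hypothetical call site from a portlet action could look like the following; the ThemeDisplay lookup, the file path and the file name are placeholders and are not part of the original answer:

// hypothetical usage from a portlet action method
ThemeDisplay themeDisplay = (ThemeDisplay) aReq.getAttribute(WebKeys.THEME_DISPLAY);
File upload = new File("/tmp/report.xls");                                // placeholder path
saveFileToCMS(aReq, themeDisplay.getScopeGroupId(), "report.xls", upload); // scope group as repository id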
