How to log the content of a CSV file in Apache Camel?

I have the following code:
DataFormat bindy = new BindyCsvDataFormat(Employee.class);
from("file:src/main/resources/csv2?noop=true").routeId("route3").unmarshal(bindy).to("mock:result").log("${body[0].name}");
I am trying to log every line of the CSV file; currently I am only able to hardcode which line gets printed.
Do I have to use a loop even though I don't know the number of lines in the CSV? Or do I have to use a processor? What's the easiest way to achieve what I want?

The unmarshalling step produces an exchange whose body is a list with one item per CSV line. For that reason you can simply use the Camel Splitter to slice the original exchange into 1-N sub-exchanges (one per line/item of the list) and then log each of these lines:
from("file:src/main/resources/csv2?noop=true")
.unmarshal(bindy)
.split().body()
.log("${name}");
If you do not want to alter the original message, you can use the wiretap pattern in order to log a copy of the exchange:
from("file:src/main/resources/csv2?noop=true")
.unmarshal(bindy)
.wireTap("direct:logBody")
.to("mock:result");
from("direct:logBody")
.split().body()
.log("Row# ${exchangeProperty.CamelSplitIndex} : ${name}");

Related

Bindy skipping empty csv files

I built a Camel route that unmarshals a CSV file using Camel Bindy and writes the contents to a database. This works perfectly, apart from when the file is empty, which by the looks of it will happen a lot. The file does contain CSV headers but no relevant data, e.g.:
CODE;CATEGORY;PRICE;
In this case the following error is thrown:
java.lang.IllegalArgumentException: No records have been defined in the CSV
I tried adding allowEmptyStream = true to the bindy object that I use for unmarshalling. However, this does not seem to do much, as the same error appears.
Any ideas on how to skip processing these empty files are very welcome.
In your use case, the option allowEmptyStream must be set to true at the Bindy DataFormat level, as follows:
BindyCsvDataFormat bindy = new BindyCsvDataFormat(SomeModel.class);
bindy.setAllowEmptyStream(true);
from("file:some/path")
.unmarshal(bindy)
// The rest of the route
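If, in addition, you want the rest of the route to be skipped entirely for such files, one option (a sketch, not part of the original answer, assuming an empty file now unmarshals to an empty list) is to filter on the size of the unmarshalled body:

BindyCsvDataFormat bindy = new BindyCsvDataFormat(SomeModel.class);
bindy.setAllowEmptyStream(true);

from("file:some/path")
    .unmarshal(bindy)
    // drop exchanges whose unmarshalled list is empty, e.g. header-only files
    .filter(simple("${body.size} > 0"))
    .to("direct:writeToDatabase"); // hypothetical endpoint standing in for the rest of the route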

Splitting, aggregating, then stream-writing to one big file using Apache Camel

I have a large database from which I load huge numbers of records. I process them in batch mode using the Splitter and Aggregator patterns.
The step where I'm stuck is streaming each batch into one JSON file where I want them all to be stored. Here are the steps:
Fetch records from the DB
Process them in batches of N
Write each processed batch to the same big JSON file (missing step)
I tested this with the Append option of File2, but it writes multiple arrays inside an array. I could flatten this JSON, but that leads me to another question:
How do I stop the route from running, given two constraints:
After a batch run, the size at the start is not necessarily the same as at the end.
I tried to work with completionFromConsumer, but it does not work with Quartz consumers.
I have this route:
from(endpointsURL)
    .log(LoggingLevel.INFO, LOGGER, "Start fetching records")
    .bean(DatabaseFetch, "fetch")
    .split().method(InspectionSplittingStrategy.class, "splitItems")
    .aggregate(constant(true), batchAggregationStrategy())
        .completionPredicate(batchSizePredicate())
        .completionTimeout(BATCH_TIME_OUT)
    .log(LoggingLevel.INFO, LOGGER, "Start processing items")
    .bean(ItemProcessor, "process")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .log(LoggingLevel.INFO, LOGGER, "Processing done");
The problem here is, as I supposed, that my extract.json gets overwritten with every batch processed. I want every batch to be appended after the previous one.
I have no clue which design and which pattern to use to make this possible. Stream and File have good features, but in which way can I use them?
You need to tell Camel to append to the file if it exists: add fileExists=Append as an option to your file endpoint.
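For example, a sketch of the tail end of the route above with that option added (direct:writeBatch is only a placeholder for wherever the marshalled batch comes from):

from("direct:writeBatch")
    .marshal().json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    // fileExists=Append adds each new batch to the end of extract.json instead of overwriting it
    .to("file:/json?fileExists=Append&doneFileName=${file:name}.done");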
I changed the route to use only a splitting strategy:
from(endpointsURLs.get(START_AGENT))
    .bean(databaseFetch, "fetch")
    .split().method(SplittingStrategy.class, "splitItems")
        .parallelProcessing()
        .bean(databaseBatchExtractor, "launch")
    .end()
    .to("seda:generateExportFiles");

from("seda:generateExportFiles")
    .bean(databaseFetch, "fetchPublications")
    .multicast()
        .parallelProcessing()
        .to("direct:generateJson", "direct:generateCsv");

from("direct:generateJson")
    .log("generate JSON file")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .to("direct:notify");

from("direct:generateCsv")
    .log("generate CSV file")
    .bean(databaseFetch, "exportCsv")
    .to("direct:notify");

from("direct:notify")
    .log("generation done");
The important class is SplittingStrategy:
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.ImmutablePair;
import org.apache.commons.lang3.tuple.Pair;

public class SplittingStrategy {
    private static final int BATCH_SIZE = 500;
    private final AtomicInteger counter = new AtomicInteger();

    public Collection<List<Pair<Integer, Set<Integer>>>> splitItems(Map<Integer, Set<Integer>> itemsByID) {
        List<Pair<Integer, Set<Integer>>> rawList = itemsByID.entrySet().stream()
                .map(inspUA -> new ImmutablePair<>(inspUA.getKey(), inspUA.getValue()))
                .collect(Collectors.toList());
        // group the entries into batches of BATCH_SIZE; each batch becomes one sub-exchange of the split
        return rawList.parallelStream()
                .collect(Collectors.groupingBy(pair -> counter.getAndIncrement() / BATCH_SIZE))
                .values();
    }
}
With this strategy, instead of using an aggregator to re-assemble the items, I embedded the aggregation into the splitting itself:
Transform my HashMap into a collection of lists to be returned by the split method (cf. Splitter with POJO).
Split the items into batches of 500 using a groupingBy over a stream of the initial list.
Feel free to comment or share your opinion about it!

Apache Beam - Reading JSON and Stream

I am writing Apache Beam code where I have to read a JSON file that is placed in the project folder, read the data, and stream it.
This is my sample code to read the JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below, read the complete testdata from it, and then stream it.
{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}
The above code is not reading the JSON file; it prints
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?
This is my sample code to read the JSON. Is this the correct way of doing it?
To quickly answer your question: yes, your sample code is the correct way to read a file containing JSON, provided each line of the file contains a single JSON element. The TextIO input transform reads the file line by line, so if a single JSON element spans multiple lines it will not be parseable.
The second code sample has the same effect.
The above code is not reading the JSON file; it prints
The printed result is expected. The variable lines does not actually contain the JSON strings from the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after the transform is applied. Elements of the pipeline are accessed by applying subsequent transforms, and the actual JSON strings can be accessed inside the implementation of such a transform.
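As an illustration, here is a minimal sketch of such a subsequent transform that prints each line as it flows through the pipeline; the file path is the one from the question, while the class name, the PrintLines label, and the use of the default runner are assumptions for the example:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadJsonLines {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        PCollection<String> lines = p.apply("ReadMyFile",
                TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));

        // Elements only exist once the pipeline runs; this ParDo sees each line and prints it.
        lines.apply("PrintLines", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println("line: " + c.element());
                c.output(c.element());
            }
        }));

        p.run().waitUntilFinish();
    }
}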

Efficient Camel Content Based Router: Route XML messages to the correct recipient based on contained tag with Java DSL

The problem:
I need to process various huge XML files. Each file contains a certain node that I can use to identify the incoming XML message. Based on that node/tag, the message should be sent to a dedicated recipient.
The XML message should not be converted to a String and then checked with contains, as this would be really inefficient. Rather, XPath should be used to "probe" the message for the occurrence of the expected node.
The solution should be based on Camel's Java DSL. The code:
from("queue:foo")
.choice().xpath("//foo")).to("queue:bar")
.otherwise().to("queue:others");
suggested in Camel's documentation does not compile. I am using Apache Camel 2.19.0.
This compiles:
from("queue:foo")
.choice().when(xpath("//foo"))
.to("queue:bar")
.otherwise()
.to("queue:others");
You need the .when() to test predicate expressions when building a content-based router.

Talend iterate on tTikaExtractor

I'm trying to use the tTikaExtractor component to extract the content of several files in a folder.
It works with a single file, but when I add a tFileList component, I don't understand how to get the content of the two different files.
I think it is something related to flows/iterations, but I cannot manage to make it work.
For example, I have this simple job:
tFileList -(iterate)-> tTikaExtractor -(onComponentOk)-> tJava -(row1)-> tFileOutputJSON
In my Java component I only have this:
String content = (String) globalMap.get("tTikaExtractor_1_CONTENT");
row1.content = content;
But in my JSON output I only get the content of the last file, not of all the files!
Can you help me with this?
That is because you are not appending records to the output; it writes records one by one, so in the end only the last record is available in the file.
Perhaps you can write all the rows to a delimited file first, then use tFileInputDelimited--main--tFileOutputJSON to transfer all the rows.
