Camel Split InputStream by length not by token - java

I have an input file like this:
1234AA11BB4321BS33XY...
and I want to split it into single messages like this:
Message 1: 1234AA11BB
Message 2: 4321BS33XY
then transform the records into Java objects, marshal them to XML with JAXB, and aggregate about 1000 records into the outgoing message.
Transformation and marshalling are no problem, but I can't split the string above. There is no delimiter, only the length: every record is exactly 10 characters long.
I was wondering if there is an out-of-the-box solution like
split(body().tokenizeBySize(10)).streaming()
Since in reality each record consists of 300 characters and there may be 500,000 records in a file, I want to split an InputStream. In other examples I saw custom iterators used for splitting, but all of them were token- or XML-based.
Any ideas?
By the way, we are bound to Java 6 and Camel 2.13.4.
Thanks
Nick

The easiest way would be to split by the empty string - .split().tokenize("", 10).streaming() - meaning the tokenizer takes each character as a token and groups 10 tokens (characters) together - and then aggregate the chunks into a single group, e.g.:
@Override
public void configure() throws Exception {
    from("file:src/data?delay=3000&noop=true")
        .split().tokenize("", 10).streaming()
        .aggregate().constant(true) // all messages have the same correlator
            .aggregationStrategy(new GroupedMessageAggregationStrategy())
            .completionSize(1000)
            .completionTimeout(5000) // use a timeout or a predicate to know when to stop
        .process(new Processor() { // process the aggregate
            @Override
            public void process(final Exchange e) throws Exception {
                final List<Message> aggregatedMessages =
                        (List<Message>) e.getIn().getBody();
                StringBuilder builder = new StringBuilder();
                for (Message message : aggregatedMessages) {
                    builder.append(message.getBody()).append("-");
                }
                e.getIn().setBody(builder.toString());
            }
        })
        .log("Got ${body}")
        .delay(2000);
}
EDIT
Here's my memory consumption in streaming mode with a 2s delay for a 100 MB file (screenshot not reproduced here).

Why not let a normal Java class do the splitting and refer to it? See the Splitter documentation:
http://camel.apache.org/splitter.html
The code example below is taken from the documentation. The Java DSL uses method() to call the split method defined in a separate class.
from("direct:body")
// here we use a POJO bean mySplitterBean to do the split of the payload
.split().method("mySplitterBean", "splitBody")
Below you define your splitter and return each split message.
public class MySplitterBean {

    /**
     * The split body method returns something that is iterable, such as a java.util.List.
     *
     * @param body the payload of the incoming message
     * @return a list containing each split part
     */
    public List<String> splitBody(String body) {
        // since this is based on a unit test you can of course
        // use different logic for splitting, as Camel has out-of-the-box
        // support for splitting a String based on comma,
        // but this is for show and tell: since this is Java code
        // you have full control over how to split your messages
        List<String> answer = new ArrayList<String>();
        String[] parts = body.split(",");
        for (String part : parts) {
            answer.add(part);
        }
        return answer;
    }
}
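Applied to the fixed-length records from the question, the same bean approach can return a lazy Iterator so that the splitter, combined with .streaming(), never holds the whole file in memory. This is only a sketch, not a tested solution: the class name FixedLengthSplitterBean and the registry name are made up, the record length is hard-coded, and it assumes the platform default charset and Java 6 syntax.

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class FixedLengthSplitterBean {

    public Iterator<String> splitBody(InputStream body) {
        // assumes the platform default charset; pass one explicitly if needed
        final Reader reader = new InputStreamReader(body);
        return new Iterator<String>() {
            private String next = readRecord();

            private String readRecord() {
                char[] buf = new char[10]; // record length; 300 in the real file
                int read = 0;
                try {
                    while (read < buf.length) {
                        int n = reader.read(buf, read, buf.length - read);
                        if (n < 0) {
                            // end of stream: emit a trailing partial record, if any
                            return read == 0 ? null : new String(buf, 0, read);
                        }
                        read += n;
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                return new String(buf);
            }

            public boolean hasNext() {
                return next != null;
            }

            public String next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String current = next;
                next = readRecord();
                return current;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}

The route would then be something like .split().method("fixedLengthSplitterBean", "splitBody").streaming(), assuming the bean is registered under that name.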

Apache Camel: Remove specific lines from a file without checking each line

In my algorithm, I read a large file line by line (just a simple .txt format) and transform each line of the file into an object.
@Override
public void configure() {
    from("file:from/")
        .split(body().tokenize("\n"))
        .streaming()
        .process(handle());
}

private Processor handle() {
    return exchange -> {
        final String body = exchange.getIn().getBody(String.class);
        // convert the line to a DTO here (conversion omitted in the question)
        System.out.println(body);
    };
}
But the file's first and last lines should be removed; these lines start with \\test.
My question is: how can I drop these lines using the Apache Camel API, without checking each line for equality against \\test?
I don't want to do something like this for each file line (pseudocode):
if (getFirstStringCharacter().equals("\\test")) {
    removeString();
}
Perhaps Apache Camel can do some preliminary work before reading the file and simply ignore the first and last lines.
The split EIP sets (among others) two useful exchange properties on each split Exchange:
CamelSplitIndex
CamelSplitComplete
Assuming the line starting with \\test is always the first and the last one, your processor (handle) could skip the processing:
when CamelSplitIndex == 0 (first line)
or when CamelSplitComplete is true (last line)
Example: skip first line
from("...")
.split(body().tokenize("\n"))
.streaming()
.filter( simple("${exchangeProperty.CamelSplitIndex} > 0") )
.process( handle() );
To answer your last question:
.filter(simple("${exchangeProperty.CamelSplitComplete} == false"))
For more complex conditions, I recommend using a Camel Predicate, e.g.:

import org.apache.camel.support.builder.PredicateBuilder;

Predicate isNotFirst = PredicateBuilder.isGreaterThan(exchangeProperty("CamelSplitIndex"), constant(0));
Predicate isNotLast = PredicateBuilder.isNotEqualTo(exchangeProperty("CamelSplitComplete"), constant(true));
Predicate retained = PredicateBuilder.and(isNotFirst, isNotLast);

from("...")
    .split(body().tokenize("\n"))
    .streaming()
    .filter(retained)
    .process(handle());

How can I read this unstructured flat file in Java?

I have a text file.
Now I am trying to read it into a two-dimensional array.
Does anyone have example code, or know of a question that was already answered?
Consider this file divided in the middle, presenting two records in the same format. You need to design a class that contains the fields you want to extract from this file. After that you need to read the lines:
List<String> fileLines = Files.readAllLines(Path pathToYourFile, Charset cs);
and parse the file with the help of regular expressions. To simplify this task you may read the lines and then specify a regex per line.
class UnstructuredFile {
    private List<String> rawLines;

    public UnstructuredFile(List<String> rawLines) {
        this.rawLines = rawLines;
    }

    public List<FileRecord> readAllRecords() {
        // determine where each record starts and stops,
        // e.g. with list.subList(0, 5), or split the lines into a List<List<String>>
    }

    private FileRecord readOneRecord(List<String> record) {
        // read one record from the list
    }
}
In this class we first detect the start and end of every record, and then pass each record to the method that parses one FileRecord from the list.
Maybe you need to decouple the task even more. Consider you have one record:
------
data 1
data 2
data 3
------
We then create classes RecordRowOne, RecordRowTwo, etc. Every class has a regex that knows how to parse its particular row of the record string and returns particular results, like:
class RecordRowOne {
    // fields

    public RecordRowOne(String regex, String dataToParse) {
        // code
    }

    int getDataOne() {
        // parse
    }
}
Another row class in the example has methods like getDataTwo().
After you create all these row classes, pass them to the FileRecord class, which gets the data from all the row classes and represents one record of your file:
class FileRecord {
    // fields

    public FileRecord(RecordRowOne one, RecordRowTwo two) {
        // get all data from rows and set it to fields
    }

    // all getters for fields
}
That is the basic idea; a fleshed-out sketch follows.
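To make it concrete, here is a minimal self-contained sketch of that idea. All of it is hypothetical: the dash separator, the "data N" line format, and the class names are chosen to match the sample record above, so adapt the regex and grouping to your real file.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileRecord {
    private final List<String> values;

    FileRecord(List<String> values) {
        this.values = values;
    }

    public List<String> getValues() {
        return values;
    }
}

class UnstructuredFile {
    // matches lines like "data 1" and captures the value
    private static final Pattern DATA_LINE = Pattern.compile("data\\s+(\\S+)");
    private final List<String> rawLines;

    UnstructuredFile(List<String> rawLines) {
        this.rawLines = rawLines;
    }

    public List<FileRecord> readAllRecords() {
        List<FileRecord> records = new ArrayList<FileRecord>();
        List<String> current = new ArrayList<String>();
        for (String line : rawLines) {
            if (line.startsWith("---")) { // record separator line
                if (!current.isEmpty()) {
                    records.add(readOneRecord(current));
                    current = new ArrayList<String>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            records.add(readOneRecord(current)); // trailing record without separator
        }
        return records;
    }

    private FileRecord readOneRecord(List<String> recordLines) {
        List<String> values = new ArrayList<String>();
        for (String line : recordLines) {
            Matcher m = DATA_LINE.matcher(line);
            if (m.find()) {
                values.add(m.group(1));
            }
        }
        return new FileRecord(values);
    }
}

Feeding it the result of Files.readAllLines(...) then yields one FileRecord per dash-delimited block.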

How do I aggregate file content correctly with Apache Camel?

I am writing a tool to parse some very big files, and I am implementing it using Camel. I have used Camel for other things before and it has served me well.
I am doing an initial Proof of Concept on processing files in streaming mode, because if I try to run a file that is too big without it, I get a java.lang.OutOfMemoryError.
Here is my route configuration:
@Override
public void configure() throws Exception {
    from("file:" + from)
        .split(body().tokenize("\n")).streaming()
        .bean(new LineProcessor())
        .aggregate(header(Exchange.FILE_NAME_ONLY), new SimpleStringAggregator())
            .completionTimeout(150000)
        .to("file://" + to)
        .end();
}
from points to the directory where my test file is.
to points to the directory where I want the file to go after processing.
With that approach I could parse files that had up to hundreds of thousands of lines, so it's good enough for what I need. But I'm not sure the file is being aggregated correctly.
If I run cat /path_to_input/file I get this:
Line 1
Line 2
Line 3
Line 4
Line 5
Now if I run cat /path_to_output/file on the output directory I get this:
Line 1
Line 2
Line 3
Line 4
Line 5%
I think this might be a pretty simple thing, although I don't know how to solve it. Both files have slightly different byte sizes as well.
Here is my LineProcessor class:
public class LineProcessor implements Processor {
    @Override
    public void process(Exchange exchange) throws Exception {
        String line = exchange.getIn().getBody(String.class);
        System.out.println(line);
    }
}
And my SimpleStringAggregator class:
public class SimpleStringAggregator implements AggregationStrategy {
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) {
            return newExchange;
        }
        String oldBody = oldExchange.getIn().getBody(String.class);
        String newBody = newExchange.getIn().getBody(String.class);
        String body = oldBody + "\n" + newBody;
        oldExchange.getIn().setBody(body);
        return oldExchange;
    }
}
Maybe I shouldn't even worry about this, but I would just like to have it working perfectly since this is just a POC before I get to the real implementation.
It looks like your input file's last character is a line break. You split the file on \n, and the aggregator adds the \n back between lines but not after the last one, so the final line terminator is lost. One solution might be adding the \n back in the aggregator:
String body = oldBody + "\n" + newBody + "\n";
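Note that as written this appends a \n on every aggregation call while also prepending one, so from the third line on the output would contain doubled newlines. A variant of the SimpleStringAggregator above that re-appends the terminator exactly once per line (a sketch, assuming every input line, including the last, originally ended with \n):

public class SimpleStringAggregator implements AggregationStrategy {
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        // re-append the terminator that tokenize("\n") stripped from this line
        String newBody = newExchange.getIn().getBody(String.class) + "\n";
        if (oldExchange == null) {
            newExchange.getIn().setBody(newBody);
            return newExchange;
        }
        String oldBody = oldExchange.getIn().getBody(String.class);
        oldExchange.getIn().setBody(oldBody + newBody);
        return oldExchange;
    }
}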
The answer from 0X00me is probably correct, but you are probably also doing unneeded work.
I assume you are using a Camel version higher than 2.3, in which case you can drop the aggregation implementation completely, because according to the Camel documentation:
Camel 2.3 and newer:
The Splitter will by default return the original input message.
Change your route to something like this (I can't test it):
@Override
public void configure() throws Exception {
    from("file:" + from)
        .split(body().tokenize("\n")).streaming()
            .bean(new LineProcessor())
        .end()
        .to("file://" + to);
}
If you need custom aggregation then you need to implement the aggregator. I process files this way daily and always end up with exactly what I started with.

Spring REST Controller understanding arrays of strings when having special characters like blank spaces or commas

I am trying to write a Spring REST controller that takes an array of strings as an input parameter of an HTTP GET request.
The problem arises when, in some of the strings of the array, I use special characters like commas ,, blank spaces, or forward slashes /, no matter whether I URL-encode the query part of the GET request URL.
That means that the string "1/4 cup ricotta, yogurt" (edit: which needs to be considered one single ingredient, i.e. one string element of the input array) in either this format:
http://127.0.0.1:8080/[...]/parseThis?[...]&ingredients=1/4 cup ricotta, yogurt
this format (note the blank spaces encoded as + plus, rather than the hex code):
http://127.0.0.1:8080/[...]/parseThis?[...]&ingredients=1%2F4+cup+ricotta%2C+yogurt
or this format (note the blank spaces encoded as the hex code %20):
http://127.0.0.1:8080/[...]/parseThis?[...]&ingredients=1%2F4%20cup%20ricotta%2C%20yogurt
is not parsed properly.
The system does not recognize the input string as one single element of the array.
In the 2nd and 3rd cases the system splits the input string on the comma and returns an array of 2 elements rather than 1. I am expecting 1 element here.
The relevant code for the controller is:
@RequestMapping(
        value = "/parseThis",
        params = { "language", "ingredients" },
        method = RequestMethod.GET,
        headers = HttpHeaders.ACCEPT + "=" + MediaType.APPLICATION_JSON_VALUE)
@ResponseBody
public HttpEntity<CustomOutputObject> parseThis(
        @RequestParam String language,
        @RequestParam String[] ingredients) {
    try {
        CustomOutputObject responseFullData = parsingService.parseThis(ingredients, language);
        return new ResponseEntity<>(responseFullData, HttpStatus.OK);
    } catch (Exception e) {
        // TODO: handle the error
        return new ResponseEntity<>(HttpStatus.INTERNAL_SERVER_ERROR);
    }
}
I need to perform HTTP GET request against this Spring controller, that's a requirement (so no HTTP POST can be used here).
Edit 1:
If I add HttpServletRequest request to the signature of the method in the controller and add a log statement like log.debug("The query string is: '" + request.getQueryString() + "'");, then I see in the log a line like The query string is: '&language=en&ingredients=1%2F4+cup+ricotta%2C+yogurt' (so still URL-encoded).
Edit 2:
On the other hand, if I add WebRequest request to the signature of the method and log log.debug("The query string is: '" + request.getParameter("ingredients") + "'");, then I get a line in the log like The query string is: '1/4 cup ricotta, yogurt' (so URL-decoded).
I am using Apache Tomcat as a server.
Is there any filter or something I need to add/review to the Spring/webapp configuration files?
Edit 3:
The main problem is in the interpretation of a comma:
@ResponseBody
@RequestMapping(value = "test", method = RequestMethod.GET)
public String renderTest(@RequestParam("test") String[] test) {
    return test.length + ": " + Arrays.toString(test);
    // /app/test?test=foo,bar => 2: [foo, bar]
    // /app/test?test=foo,bar&test=baz => 2: [foo,bar, baz]
}
Can this behavior be prevented?
The path of a request parameter to your method argument goes through parameter value extraction followed by parameter value conversion. Now what happens is:
Extraction:
The parameter is extracted as a single String value. This is probably to allow simple attributes to be passed as simple string values for later value conversion.
Conversion:
Spring uses a ConversionService for the value conversion. In its default setup StringToArrayConverter is used, which unfortunately handles the string as a comma-delimited list.
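You can see the conversion step in isolation with a small sketch using Spring's DefaultConversionService, which registers StringToArrayConverter by default:

import org.springframework.core.convert.support.DefaultConversionService;

DefaultConversionService conversionService = new DefaultConversionService();
String[] parts = conversionService.convert("1/4 cup ricotta, yogurt", String[].class);
// parts is ["1/4 cup ricotta", "yogurt"]: the comma is treated as a delimiter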
What to do:
You are pretty much screwed with the way Spring handles single valued request parameters. So I would do the binding manually:
// Method annotations
public HttpEntity<CustomOutputObject> handlerMethod(WebRequest request) {
    String[] ingredients = request.getParameterValues("ingredients");
    // Do other stuff
}
You can also check what the Spring guys have to say about this, and the related SO question.
Well, you could register a custom conversion service (from this SO answer), but that seems like a lot of work. :) If it were me, I would skip the @RequestParam declaration in the method signature and parse the value using the incoming request object.
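If you would rather keep the String[] binding, one lighter-weight option (a sketch, not from the answers above) is a per-controller @InitBinder that replaces the comma-splitting editor. Passing a null separator to Spring's StringArrayPropertyEditor should make it keep each parameter value whole:

import org.springframework.beans.propertyeditors.StringArrayPropertyEditor;
import org.springframework.web.bind.WebDataBinder;
import org.springframework.web.bind.annotation.InitBinder;

// inside the controller class
@InitBinder
public void initBinder(WebDataBinder binder) {
    // null separator: each occurrence of ?ingredients=... stays one element
    binder.registerCustomEditor(String[].class, new StringArrayPropertyEditor(null));
}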
May I suggest you try the following format:
ingredients=egg&ingredients=milk&ingredients=butter
Appending &ingredients to the end will handle the case where the array has only a single value:
ingredients=egg&ingredients=milk&ingredients=butter&ingredients
ingredients=milk,skimmed&ingredients
The extra empty entry would need to be removed from the array; using a List<String> would make this easier.
Alternatively, if you are trying to implement a REST controller that pipes straight into a database with spring-data-jpa, you should take a look at spring-data-rest. Here is an example.
You basically annotate your repository with @RepositoryRestResource and Spring does the rest :)
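In that setup the controller disappears entirely; a minimal sketch (the Ingredient entity and repository names are hypothetical):

import org.springframework.data.repository.CrudRepository;
import org.springframework.data.rest.core.annotation.RepositoryRestResource;

@RepositoryRestResource(collectionResourceRel = "ingredients", path = "ingredients")
public interface IngredientRepository extends CrudRepository<Ingredient, Long> {
}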
A solution from here:
public String get(WebRequest req) {
    String[] ingredients = req.getParameterValues("ingredients");
    for (String ingredient : ingredients) {
        System.out.println(ingredient);
    }
    ...
}
This works for the case when you have a single ingredient containing commas.

Using camel to aggregate messages of same header

I have multiple clients that send files to a server. For one set of data there are two files that contain information about that data, each with the same name. When a file is received, the server sends a message to my queue containing the file path, file name, ID of the client, and the "type" of file it is (all have the same file extension, but there are two "types", call them A and B).
The two files for one set of data have the same file name. As soon as the server has received both files I need to start a program that combines the two. Currently I have something that looks like this:
from("jms:queue.name").aggregate(header("CamelFileName")).completionSize(2).to("exec://FILEPATH?args=");
Where I am stuck is the header("CamelFileName") part, and more specifically how the aggregator works.
With completionSize set to 2, does it just suck up all the messages and store them in some data structure until a second message that matches the first comes through? Also, does header() expect a specific value? I have multiple clients, so I was thinking of having the client ID and the file name in the header, but then again I don't know if I have to give a specific value. I also don't know if I can use a regex or not.
Any ideas or tips would be super helpful.
Thanks
EDIT:
Here is some code I have now. Based on my description of the problem here and in the comments on the selected answer, does it seem accurate (apart from closing brackets that I didn't copy over)?
public static void main(String[] args) throws Exception {
    CamelContext c = new DefaultCamelContext();
    c.addComponent("activemq", activeMQComponent("vm://localhost?broker.persistent=false"));
    //ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory("vm://localhost?broker.persistent=false");
    //c.addComponent("jms", JmsComponent.jmsComponentAutoAcknowledge(connectionFactory));
    c.addRoutes(new RouteBuilder() {
        public void configure() {
            from("activemq:queue:analytics.camelqueue")
                .aggregate(new MyAggregationStrategy())
                .header("subject")
                .completionSize(2)
                .to("activemq:queue:analytics.success");
        }
    });
    c.start();
    while (true) {
        System.out.println("Waiting on messages to come through for camel");
        Thread.sleep(2 * 1000);
    }
    //c.stop();
}
private static class MyAggregationStrategy implements AggregationStrategy {
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) {
            return newExchange;
        }
        // and here is where combo stuff goes
        String oldBody = oldExchange.getIn().getBody(String.class);
        String newBody = newExchange.getIn().getBody(String.class);
        boolean oldSet = oldBody.contains("set");
        boolean newSet = newBody.contains("set");
        boolean oldFlow = oldBody.contains("flow");
        boolean newFlow = newBody.contains("flow");
        if ((oldSet && newFlow) || (oldFlow && newSet)) {
            // they match, so return the new exchange with info
            // so the extractor can be started with exec
            String combined = oldBody + "\n" + newBody + "\n";
            newExchange.getIn().setBody(combined);
            return newExchange;
        } else {
            // no match so do something....
            return null;
        }
    }
}
You must supply an AggregationStrategy to define how you want to combine Exchanges.
If you are only interested in the file name and receive exactly 2 Exchanges, then you can just use the UseLatestAggregationStrategy to pass the newest Exchange through once 2 have been 'aggregated'.
That said, it sounds like you need to retain both Exchanges (one for each client ID) so you can pass that info on to the 'exec' step. If so, you can combine the Exchanges into a grouped holder using the built-in aggregation strategy enabled via the groupExchanges option, or specify a custom AggregationStrategy to combine them however you'd like. Just keep in mind that your 'exec' step needs to handle whatever aggregated structure you decide to use; a sketch of the grouped variant follows the links below.
see these unit tests for examples:
https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/test/java/org/apache/camel/processor/aggregator/AggregatorTest.java
https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/test/java/org/apache/camel/processor/aggregator/AggregateGroupedExchangeTest.java
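For the groupExchanges variant, a rough route sketch (untested; it assumes a Camel 2.x version where the grouped exchanges end up in the message body as a List<Exchange>, as older releases exposed them via an exchange property instead):

from("jms:queue.name")
    .aggregate(header("CamelFileName"))
        .completionSize(2)
        .groupExchanges()
    .process(new Processor() {
        public void process(Exchange exchange) throws Exception {
            @SuppressWarnings("unchecked")
            List<Exchange> group = exchange.getIn().getBody(List.class);
            // both correlated messages (file type A and B) are available here,
            // so the args for the exec step can be built from their headers
        }
    })
    .to("exec://FILEPATH?args=");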
