Drools for large volume of data


We have a requirement to process about 5 million messages a day and, based on certain business rules, generate a unique identifier for messages received asynchronously.
Use case:
The system receives message A, message B, message C and message D (standard XML format for all message types).
Business rule: If message A contains tag <tag1> and its value matches the value of any of <tag2>, <tag3> or <tag4> in message B, C or D, assign the identifier of the first match. If none matches, generate a new identifier and assign it to message A.
Similar rules apply for messages B, C and D.
We thought of using the Drools engine to support this use case, but we are not sure whether it will cope with this amount of data in near real time.
Has anyone used the Drools engine to process large amounts of data? If so, can you please share any issues or statistics?

For simple rules that just check the 4 conditions you describe, Drools will perform more than fast enough. Just make sure you compile the rules only once, not on every rule execution. You should likely see performance on the order of a few hundred thousand rule invocations per minute in a hot state against simple rules like the ones you describe above.
Take a look at these benchmarks to get a better idea:
https://github.com/winklerm/phreak-examples/tree/master/benchmark
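Setting the rule engine aside, the matching rule in the question amounts to a lookup from tag values already seen to the identifier assigned to them. The following is a minimal plain-Java sketch of that lookup, not a Drools rule. The class MessageCorrelator, the method assignId, and the idea of passing the already-extracted tag values as a list are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical correlator: remembers which identifier each tag value belongs to. */
final class MessageCorrelator {
    private final Map<String, Long> idByTagValue = new HashMap<>();
    private long nextId = 1;

    /**
     * Returns the identifier of the first tag value that is already known,
     * or a freshly generated identifier if none matches. All tag values of
     * the message are then indexed so that later messages can match them.
     */
    synchronized long assignId(List<String> tagValues) {
        Long id = null;
        for (String value : tagValues) {
            id = idByTagValue.get(value);
            if (id != null) {
                break; // first match wins
            }
        }
        if (id == null) {
            id = nextId++;
        }
        for (String value : tagValues) {
            idByTagValue.putIfAbsent(value, id);
        }
        return id;
    }

    public static void main(String[] args) {
        MessageCorrelator correlator = new MessageCorrelator();
        System.out.println(correlator.assignId(Arrays.asList("X1")));       // message B
        System.out.println(correlator.assignId(Arrays.asList("X1", "Y2"))); // message A shares X1
        System.out.println(correlator.assignId(Arrays.asList("Z9")));       // no match
    }
}
```

Whether a rule engine or a hash lookup like this is the better fit depends on how many such rules there are and how often they change.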

Related

Declaring configuration of custom configurable application in java?

So for a hobby project of mine, I would like to create an application that translates an HTTP call and request between two services.
The application does that based on a configuration that can be set by the user. The idea is that the application listens to an incoming API call translates the call and then forwards it.
Then the application waits for a response then translates the response and sends it back to the caller.
A translation can be as simple as renaming a field value in a body object or moving a header field into the body.
I think a translation should begin with mapping the correct URL, so here is an example of what I think a configuration could look like:
//request mapping
incoming URL = outgoing URL(
//Rename header value
header.someobject.renameto = "somevalue"
//Replace body object to header
body.someobject.replaceto.header
)
I was thinking that the configuration should be placed in a .txt file and read by the application.
My question is, are there other similar systems that use a configuration file for a configuration like this? And are there other/better ways to declare a configuration?

I have done something sort-of-similar in a different context (generate code from an input specification), so I will provide an outline of what I did to provide some food for thought. I used Config4* (disclosure: I developed that). If the approach I describe below is of interest to you, then I suggest you read Chapters 2 and 3 of the Config4* Getting Started Guide to get an overview of the Config4* syntax and API. Alternatively, express the concepts below in a different configuration syntax, such as XML.
Config4* is a configuration syntax, and the subset of syntax relevant to this discussion is as follows:
# this is a comment
name1 = "simple value";
name2 = ["a", "list of", "values"];
# a list can be laid out in columns to simulate a table of information
name3 = [
# item colour
#------------------
"car", "red",
"jeans", "blue",
"roses", "red",
];
In a code generator application, I used a table to provide rules to specify how to generate code for assigning values to fields of messages. If no rule was specified for a particular field, then some built-in rules provided default behaviour. The table looked something like the following:
field_rules = [
# wildcarded message.field instruction
#----------------------------------------------------------------
"Msg1.username", "#config:username",
"Msg1.password", "#config:password",
"Msg3.price", "#order:price",
"*.account", "#string:foobar",
"*.secondary_account", "#ignore",
"*.heartbeat_interval", "#expr:_heartbeatInterval * 1000",
"*.send_timestamp", "#now",
];
When my code generator wanted to generate code to assign a value to a field, the code generator constructed a string of the form "<message-name>.<field-name>", for example, Msg3.price. Then it examined the field_rules table line-by-line (starting from the top) to find a line in which the first column matched "<message-name>.<field-name>". The matching logic permitted * as a wildcard character that could match zero or more characters. (Conveniently, Config4* provides a patternMatch() utility operation that provides this functionality.)
If a match was found, then the value in the instruction column told the code generator what sort of code to generate. (If no match was found, then built-in rules were used, and if none of those applied, then no code was generated for the field.)
Each instruction was a string of the form "#<keyword>:optional,arguments". That was tokenized to provide the keyword and the optional arguments. The keyword was converted to an enum, and that drove a switch statement for generating code. For example:
The #config:username instruction specified that code should be
generated to assign the value of the username variable in a runtime
configuration file to the field.
The #order:price instruction specified that code should be generated
to assign the value returned from calling orderObj->getPrice() to the field.
The #string:foobar instruction specified the string literal foobar
should be assigned to the field.
The #expr:_heartbeatInterval * 1000 instruction specified that code should
be generated to assign the value of the expression _heartbeatInterval * 1000
to the field.
The #ignore instruction specified that no code should be generated to
assign a value to the field.
The #now instruction specified that code should be generated to assign
the current clock time to the field.
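The lookup and tokenizing steps described above could be sketched in Java as follows. This is an illustration only and does not use the Config4* API: the table is hard-coded as a string array, wildcardMatch is a hand-written stand-in for the patternMatch() utility mentioned earlier, and the names FieldRules, findInstruction, keywordOf and argumentsOf are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class FieldRules {
    /** Instruction keywords used in the table. */
    enum Keyword { CONFIG, ORDER, STRING, EXPR, IGNORE, NOW }

    // Two-column table stored as a flat list: pattern, instruction, pattern, instruction, ...
    static final String[] FIELD_RULES = {
        "Msg1.username",        "#config:username",
        "Msg1.password",        "#config:password",
        "Msg3.price",           "#order:price",
        "*.account",            "#string:foobar",
        "*.secondary_account",  "#ignore",
        "*.heartbeat_interval", "#expr:_heartbeatInterval * 1000",
        "*.send_timestamp",     "#now",
    };

    /** Returns true if text matches pattern, where '*' matches zero or more characters. */
    static boolean wildcardMatch(String pattern, String text) {
        String[] parts = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.matches(regex.toString(), text);
    }

    /** Scans the table from the top and returns the first matching instruction, or null. */
    static String findInstruction(String messageDotField) {
        for (int i = 0; i + 1 < FIELD_RULES.length; i += 2) {
            if (wildcardMatch(FIELD_RULES[i], messageDotField)) {
                return FIELD_RULES[i + 1];
            }
        }
        return null; // caller falls back to built-in rules
    }

    /** Extracts the keyword from an instruction such as "#order:price" or "#now". */
    static Keyword keywordOf(String instruction) {
        int colon = instruction.indexOf(':');
        String word = colon < 0 ? instruction.substring(1) : instruction.substring(1, colon);
        return Keyword.valueOf(word.toUpperCase(Locale.ROOT));
    }

    /** Extracts the text after the first ':' or an empty string if there is none. */
    static String argumentsOf(String instruction) {
        int colon = instruction.indexOf(':');
        return colon < 0 ? "" : instruction.substring(colon + 1);
    }

    public static void main(String[] args) {
        String instruction = findInstruction("Msg3.price");
        System.out.println(keywordOf(instruction) + " / " + argumentsOf(instruction));
    }
}
```

The enum returned by keywordOf is what would drive the switch statement.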
I have used the above technique in several projects, and each time I have invented instructions specific to the needs of the particular project. If you decide to use this technique, then obviously you will need to invent instructions to specify runtime translations rather than instructions to generate code. Also, don't feel you have to shoehorn all of your translation-based configuration into a single table. For example, you might use one table to provide a source URL -> destination URL mapping, and a different table to provide instructions for translating fields within messages.
If this technique works as well for you as it has worked for me on my projects, then you will end up with your translation application being an "engine" whose behaviour is driven entirely by a configuration file that, in effect, is a DSL (domain-specific language). That DSL file is likely to be quite compact (less than 100 lines), and will be the part of the application that is visible to users. Because of this, it is worthwhile investing effort to make the DSL as intuitive and easy-to-read/modify as possible, because doing that will make the translation application: (1) user friendly, and (2) easy to document in a user manual.

Question in developer guide for Kafka dsl api

I have a question that came up while working through this awesome guide:
https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html
My question is about the section "Example of semantics for table aggregations". In particular, look at the table in this section at timestamp 4: what is the mechanism that lets the aggregator perform "(E, 5 - 5)"?
My confusion is that the key has already been transformed from name ("alice") to region ("A") at the grouping step. How can "groupedTable" still know the original key in the aggregate and perform the subtraction?
Thanks in advance.

There are two mechanisms in place here:
the base store can get the old value for a key from the store, before it puts the new value into the store
if required, the upstream operator hosting the base store, will send both the new and old value to the downstream operator
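The two mechanisms can be imitated in plain Java to show why the subtraction lands on the old region. This is a simulation with ordinary maps, not the Kafka Streams API; the class and method names are hypothetical, and using the length of the name as the aggregated value is an assumption chosen so that "alice" contributes 5.

```java
import java.util.HashMap;
import java.util.Map;

final class TableAggregationSketch {
    // Stands in for the base store of the upstream table: name -> region.
    private final Map<String, String> regionByName = new HashMap<>();
    // Stands in for the downstream aggregate: region -> sum of contributions.
    final Map<String, Integer> sumByRegion = new HashMap<>();

    void upsert(String name, String region) {
        // Mechanism 1: fetch the old value for the key while storing the new one.
        String oldRegion = regionByName.put(name, region);

        // Mechanism 2: both old and new value go downstream, each re-keyed by
        // its own region. The old value feeds the subtractor, the new value
        // feeds the adder.
        if (oldRegion != null) {
            sumByRegion.merge(oldRegion, -name.length(), Integer::sum);
        }
        sumByRegion.merge(region, name.length(), Integer::sum);
    }

    public static void main(String[] args) {
        TableAggregationSketch sketch = new TableAggregationSketch();
        sketch.upsert("alice", "E");
        sketch.upsert("bob", "A");
        sketch.upsert("charlie", "A");
        sketch.upsert("alice", "A"); // alice moves from E to A
        System.out.println(sketch.sumByRegion);
    }
}
```

The downstream aggregate never needs the original key: the old value itself, once re-keyed by the grouping function, identifies which aggregate to subtract from.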

Patterns for this type of problem

I need advice regarding the pattern to use for the following problem.
There are many rows in a table, let us call them messages (identified by MSG_ID in the DB), which together correspond to a file. That is, the file has been split into many pieces and put into the database.
The parts belonging to a file can be identified using a GROUP_ID column, and MSG_ID identifies an individual message. The primary key is a combination of GROUP_ID and MSG_ID.
Now, each message consists of n logical records (typically payment instructions, each k x 128 bytes of data). Where the payment instruction currently being read ends can only be determined after reading the next 128 characters. Also, parts of a payment instruction can be in consecutive messages, which means a complete payment instruction can be spread across the end of MSG_ID n and the start of MSG_ID n+1.
We are using spring batch to do the processing.
I have tried querying the table and writing all the records to a flat file one by one and start the spring batch from there.
I would like to know whether there's any pattern which I can use to achieve the requirement without using a flat file.
Like,
Read MSG_ID 1 and GROUP_ID "ABC" from db
Separate the payment instructions and give each instruction to the processor.
When the end of MSG_ID 1 is reached, check whether the final record in hand forms a complete payment instruction; if not, read MSG_ID 2 for GROUP_ID "ABC" and append it to the leftover record.
Read until MSG_ID == N, where N is known before starting the read process.
Are there any ItemReaders in Spring or Iterator patterns in Java which I can use?
To be clearer, there are patterns for handling "if your logical record is spread across multiple rows". Is there any pattern for "if one row in the DB contains M logical records, where M may not be an integer, use this type of Iterator or ItemReader"?
Thanks.

I believe you should write a custom Reader (processing logic) for your scenario. It certainly doesn't seem like a common case.
The algorithm you proposed seems OK. You should have no trouble writing a Reader which reads a complete payment instruction and hands it off for further processing.
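As a rough sketch of such a reader, the buffering part could look like the plain Java iterator below. It is deliberately simplified: it only cuts fixed-length records out of the concatenated message payloads, carrying leftovers from one message into the next. Grouping several 128-character records into one payment instruction is left out, and the class name RecordIterator and its constructor parameters are hypothetical. The incoming iterator is assumed to deliver the payloads of one GROUP_ID in MSG_ID order.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class RecordIterator implements Iterator<String> {
    private final Iterator<String> messages; // message payloads in MSG_ID order
    private final int recordLength;          // 128 in the scenario described
    private final StringBuilder buffer = new StringBuilder();

    RecordIterator(Iterator<String> messages, int recordLength) {
        this.messages = messages;
        this.recordLength = recordLength;
    }

    @Override
    public boolean hasNext() {
        // Pull further messages until a whole record is buffered or input ends.
        while (buffer.length() < recordLength && messages.hasNext()) {
            buffer.append(messages.next());
        }
        // Any trailing partial record at the end of the input is not returned.
        return buffer.length() >= recordLength;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String record = buffer.substring(0, recordLength);
        buffer.delete(0, recordLength);
        return record;
    }

    public static void main(String[] args) {
        // Record length 4 keeps the demonstration short.
        Iterator<String> records =
            new RecordIterator(Arrays.asList("abcdef", "gh", "ijkl").iterator(), 4);
        while (records.hasNext()) {
            System.out.println(records.next());
        }
    }
}
```

The same buffering could sit inside the read() method of a Spring Batch ItemReader, returning null once no complete record is left.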

How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution in terms of performance and "readability/good coding style" to represent a (Java) enumeration (fixed set of constants) on the DB layer: an integer (or any number datatype in general) or a string representation?
Caveat: There are some database systems that support "Enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and might also differ in syntax, so I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems to me very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but hard to read ("0" vs. "1" etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but consumes much more storage space and is most likely also slower in regard to indexing.
My question is: did I miss some important aspects? What would you suggest using for an enum representation on the database layer?
Thank you very much!

In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.

I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
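On the Java side, both the short-code approach and the warning about ordinal() point to giving each constant an explicit, stable database value. A minimal sketch, reusing the transaction type codes from the example above; the enum name TransactionType and the methods code() and fromCode() are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

enum TransactionType {
    SALE("SA"), RETURN("RE"), SERVICE("SV"), LAYAWAY("LY");

    // Reverse lookup from database code to constant, built once.
    private static final Map<String, TransactionType> BY_CODE = new HashMap<>();
    static {
        for (TransactionType type : values()) {
            BY_CODE.put(type.code, type);
        }
    }

    private final String code;

    TransactionType(String code) {
        this.code = code;
    }

    /** Value to store in the database column; independent of declaration order. */
    String code() {
        return code;
    }

    /** Converts a value read from the database back to the constant. */
    static TransactionType fromCode(String code) {
        TransactionType type = BY_CODE.get(code);
        if (type == null) {
            throw new IllegalArgumentException("Unknown transaction type code: " + code);
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(TransactionType.LAYAWAY.code());
        System.out.println(TransactionType.fromCode("RE"));
    }
}
```

Because the stored value is the explicit code, constants can be reordered, renamed or inserted in the middle without touching existing rows.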

Storing a 2 dimensional table (decision table) in XML for efficient Query(ies)

I need to implement a routing table where there are a number of parameters.
For example, I am listing five attributes of the incoming message, plus the target, below:
Customer  Txn Group  Txn Type  Sender  Priority  Target
UTI       CORP       ONEOFF    ABC     LOW       TRG1
UTI       GOV        ONEOFF    ABC     LOW       TRG2
What is the best way to represent this data in XML so that it can be queried efficiently?
I want to store this data in XML, load it into memory using Java, and when a message comes in, identify the target based on the attributes.
Appreciate any inputs.
Thanks,
Manglu

Here is a pure XML representation that can be processed very efficiently as is, without the need to be converted into any other internal data structure:
<table>
<record Customer="UTI" Txn-Group="CORP"
Txn-Type="ONEOFF" Sender="ABC1"
Priority="LOW" Target="TRG1"/>
<record Customer="UTI" Txn-Group="Gov"
Txn-Type="ONEOFF" Sender="ABC2"
Priority="LOW" Target="TRG2"/>
</table>
There is an extremely efficient way to query data in this format using the <xsl:key> instruction and the XSLT key() function:
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:key name="kRec" match="record"
use="concat(@Customer,'+',@Sender)"/>
<xsl:template match="/">
<xsl:copy-of select="key('kRec', 'UTI+ABC2')"/>
</xsl:template>
</xsl:stylesheet>
when applied on the above XML document produces the desired result:
<record Customer="UTI"
Txn-Group="Gov" Txn-Type="ONEOFF"
Sender="ABC2" Priority="LOW"
Target="TRG2"/>
Do note the following:
There can be multiple <xsl:key>s defined that identify a record using different combinations of values to be concatenated together (whatever will be considered "keys" and/or "primary keys").
If an <xsl:key> is defined to use the concatenation of "primary keys" then a unique record (or no record) will be found when the key() function is evaluated.
If an <xsl:key> is defined to use the concatenation of "non-primary keys", then more than one record may be found when the key() function is evaluated.
The <xsl:key> instruction is the equivalent of defining an index in a database. This makes using the key() function extremely efficient.
In many cases there is no need to convert the above XML form to an intermediary data structure, either for understandability or for efficiency.

If you're loading it into memory, it doesn't really matter what form the XML takes - make it the easiest to read or write by hand, I would suggest. When you load it into memory, then you should transform it into an appropriate data structure. (The exact nature of the data structure would depend on the exact nature of the requirements.)
EDIT: This is to counter the arguments made in comments by Dimitre:
I'm not sure whether you thought I was suggesting that people implement their own hashtable - I certainly wasn't. Just keep a straight hashtable or perhaps a MultiMap for each column which you want to use as a key. Developers know how to use hashtables.
As for the runtime efficiency, which do you think is going to be more efficient:
You build some XSLT (and bear in mind this is foreign territory, at least relatively speaking, for most developers)
XSLT engine parses it. This step may be avoidable if you're using an XSLT library which lets you just parameterise an existing query. Even so, you've got some extra work to do.
XSLT engine hits hashtables (you hope, at least) and returns a node
You convert the node into a more useful data structure
Or:
You look up appropriate entries in your hashtable based on the keys you've been given, getting straight to a useful data structure
I think I'd trust the second one, personally. Using XSLT here feels like using a screwdriver to bash in a nail...
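A minimal sketch of the hashtable approach, assuming all five attributes together form the lookup key. The class RoutingTable and its methods are hypothetical names, and loading the routes from the XML file is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RoutingTable {
    // Key: the five routing attributes in a fixed order. Value: the target.
    private final Map<List<String>, String> targetByKey = new HashMap<>();

    void addRoute(String customer, String txnGroup, String txnType,
                  String sender, String priority, String target) {
        targetByKey.put(Arrays.asList(customer, txnGroup, txnType, sender, priority), target);
    }

    /** Returns the target for the given attributes, or null if no route is defined. */
    String targetFor(String customer, String txnGroup, String txnType,
                     String sender, String priority) {
        return targetByKey.get(Arrays.asList(customer, txnGroup, txnType, sender, priority));
    }

    public static void main(String[] args) {
        RoutingTable table = new RoutingTable();
        table.addRoute("UTI", "CORP", "ONEOFF", "ABC", "LOW", "TRG1");
        table.addRoute("UTI", "GOV", "ONEOFF", "ABC", "LOW", "TRG2");
        System.out.println(table.targetFor("UTI", "GOV", "ONEOFF", "ABC", "LOW"));
    }
}
```

Lists compare by content, so a list of the attribute values works as a composite key without writing equals() and hashCode() by hand. If lookups by a subset of the attributes are needed, a separate map per key combination would serve the same purpose.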

That depends on what is repeating and what could be empty. XML is not known for its efficient queryability, as it is neither fixed-length nor compact.

I agree with the previous two posters - you should definitely not keep the internal representation of this data in XML when querying as messages come in.
The XML representation can be anything, you could do something like this:
<routes>
<route customer="UTI" txn-group="CORP" txn-type="ONEOFF" .../>
...
</routes>
My internal representation would depend on the format of the message coming in, and the language. A simple representation would be a map, mapping a structure of data (i.e. the key fields from which the routing decision is made) to the info on the target route.
Depending on your performance requirements, you could keep the key/target information as strings, though in any high-performing system you'd probably want to do a straight memory comparison (in C/C++) or some form of integer comparison.

Yeah, your basic problem is that you're using "XML" and "efficient" in the same sentence.
Edit: No, seriously, yer killin' me. The fact that several people in this thread are using "highly efficient" to describe anything to do with operations on a data format that require string parsing just to find out where your fields are shows that several people in this thread do not even know what the word "efficient" means. Downvote me as much as you like for saying it. I can take it, coach.
