Sanitize cell values with SuperCSV


What is the best way to sanitize fields from a csv in supercsv? For example the First_Name column: trim the field, capitalize the first letter, remove various characters (quotes, commas, asterisks etc). Is it to write a custom CellProcessor like FmtName()? Maybe another one for FmtEmail() that lowercases everything, removes certain invalid characters?

I think the question you're trying to ask is:
"Is it better to write a custom cell processor that does all the
conversions for a column, or to chain multiple reusable processors
together?"
For example, with your first name example you could either:
a) write a custom cell processor which trimmed, capitalised and replaced all in the one processor:
new ParseFirstName()
b) chain together reusable processors (including the existing Super CSV processors and a new Capitalize custom cell processor that calls StringUtils.capitalize())
new Trim(new Capitalize(new StrReplace("[\",\\*]", "")))
I think it's really up to personal preference. Defining cell processors as done in b) can be quite verbose, but it means you can see all of the conversions/validation for all columns in the one place.
On the other hand, defining a custom cell processor for each column makes your cell processor setup very clean, but you may end up with duplicated code (e.g. if you wanted to capitalize multiple columns) and you can't see all the conversions at once. You'll also have more classes (more code).
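
To make option b) concrete without depending on the library, here is a minimal sketch in plain Java. It does not use the Super CSV API: NameSanitizer and its three steps are hypothetical names, and java.util.function.Function stands in for chained cell processors. The point it illustrates is that small reusable steps can be composed per column, and that their order matters (here the unwanted characters are removed before capitalizing, so a leading quote does not end up being the "first letter").

```java
import java.util.function.Function;

// Hypothetical stand-in for chained cell processors; not the Super CSV API.
final class NameSanitizer {
    // Small reusable steps, each doing one thing.
    static final Function<String, String> STRIP_CHARS = s -> s.replaceAll("[\",*]", "");
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> CAPITALIZE =
            s -> s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);

    // Column-specific chain: strip quotes/commas/asterisks, trim, then capitalize.
    static final Function<String, String> FIRST_NAME =
            STRIP_CHARS.andThen(TRIM).andThen(CAPITALIZE);

    static String sanitize(String raw) {
        return FIRST_NAME.apply(raw);
    }

    public static void main(String[] args) {
        System.out.println(sanitize("  \"john*\" "));
    }
}
```

The same CAPITALIZE step could be reused in the chain for any other column, which is the duplication that option a) tends to cause.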

Related

Decision Table with a condition on a set containing values coming from lists

Using Drools 6.0.1 I would like to set a CONDITION (on a column of a Decision Table) similar to the following one:
ProductDrools(productCategories contains $param)
(which would then go after the when of every rule in a drl file) where productCategories is a Set (e.g. a HashSet). For now I am only able to check it against one string per cell, e.g. "categoryA".
I would like to provide in the spreadsheet cell a List, array, Set (any Collection) of multiple strings representing categories e.g. "categoryA", "catB", "catX".
The official documentation for Drools 6.0.1 does not provide enough information regarding operators like contains.
Is this scenario achievable? How? Is there any documentation regarding this you could point me to?

Is it possible to develop a criteria-based search on strings in C# or Java?

I have a List of strings in C#. It contains paragraphs that are read from an MS Word file. For example:
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against those list elements.
My criterion is: if a list element is an exact match or at least an 85% match of the search string, then I want to retrieve that list element.
In our case:
list 0 -> matches my search string more closely.
list 1 -> also matches some of the text, but I think it falls below my criterion.
How do I do this kind of criteria-based search on strings?
I am still somewhat confused about the problem myself.
Your ideas and thoughts are welcome.

The keyword is DISTANCE, or "string distance", and also "paragraph similarity".
You seek to implement a function that expresses as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allowing for some forms of stemming. The order of the words, or at the least their proximity, may also be of import.
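
As a rough illustration of moving from character level to word level, the sketch below applies the classic Levenshtein recurrence to word tokens instead of characters. The class name WordEditDistance, the tokenization rule (lower-case, split on anything that is not a letter or digit) and the normalization to a 0..1 similarity are assumptions made for the example; it does no stop-word handling or stemming.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: Levenshtein distance computed over words, not characters.
final class WordEditDistance {
    // Lower-case the text and split on any run of non-letter, non-digit characters.
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    // Minimum number of word insertions, deletions and substitutions turning a into b.
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length];
    }

    // 1.0 means identical word sequences, 0.0 means nothing lines up.
    static double similarity(String x, String y) {
        String[] a = tokenize(x), b = tokenize(y);
        int longest = Math.max(a.length, b.length);
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

A threshold such as the 85% from the question would then be a comparison like similarity(search, item) >= 0.85. Note that this measure is order-sensitive, which may or may not be what is wanted.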
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
How many strings would have to be compared? (on average, maximum)
How many words/tokens do the strings contain? (on average, maximum)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
How fancy do we need to get with linguistic features?
Is it possible to pre-process the strings?
Are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time one can apply to this relatively open problem varies greatly and is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
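
A very reduced version of that idea might look like the sketch below. It is a simplification rather than full TF-IDF: term frequency is ignored, and each search word is weighted only by a smoothed inverse document frequency computed over the list being searched. The class name WeightedOverlap and the smoothing formula ln((N + 1) / (df + 1)) + 1 are choices made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical scorer: share of the query's IDF weight that a candidate covers.
final class WeightedOverlap {
    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private final int corpusSize;

    WeightedOverlap(List<String> corpus) {
        corpusSize = corpus.size();
        for (String doc : corpus) {
            // Count each word once per document.
            for (String w : words(doc)) documentFrequency.merge(w, 1, Integer::sum);
        }
    }

    // Distinct lower-cased words of a text; word order is deliberately discarded.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // Smoothed IDF: words found in many documents weigh less than rare ones.
    double idf(String word) {
        int df = documentFrequency.getOrDefault(word, 0);
        return Math.log((corpusSize + 1.0) / (df + 1.0)) + 1.0;
    }

    // Returns a value in [0, 1]: matched weight divided by total query weight.
    double score(String query, String candidate) {
        Set<String> candidateWords = words(candidate);
        double matched = 0, total = 0;
        for (String w : words(query)) {
            double weight = idf(w);
            total += weight;
            if (candidateWords.contains(w)) matched += weight;
        }
        return total == 0 ? 0.0 : matched / total;
    }
}
```

With this weighting, a candidate that shares only common words such as "the" with the search string scores lower than the raw count of matching words would suggest.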
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example Part-of-Speech Tagging (say for the purpose of avoiding false positives such as "SAW" as a noun (a tool to cut wood) versus "SAW" as the past tense of the verb "to see", or more likely to filter out some of the words outright based on their grammatical function), stemming and possibly semantic substitutions, concept extraction or latent semantic analysis.

You may want to look into Lucene for Java or Lucene.Net for C#. I don't think it'll handle the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could maybe run a separate query for each word, and then work out the percentage of words that matched yourself.

Here's an idea (and not a solution by any means but something to get started with)
using System;
using System.Collections.Generic;
using System.Linq;

// Assumes GetAllItems() is a static helper that loads your list.
private IEnumerable<string> SearchList = GetAllItems();

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    // Lower-cased, sorted words of the search paragraph
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
        // Distinct words that occur in both the item and the search paragraph
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common items, you can get the differential
    }
}

making a condition from a pattern

I have a table which contains following columns
dependentColumn : values table1.column2, table1.column3, table3.column4....
condition : values ([table1.column2.LAST3][=ABC][OR][=DEF]),
([table1.column2.ALL][=ABC]),
(([table1.column2][=ABC][OR][table1.column2][!="DEF"])[AND]
([table1.column2][!="DEF"]))
...
values: abc, [table1.column1.LAST3]
Now I need to parse the values contained in the condition column, write code that implements the conditions, and put the values into the dependentColumns.
My concern is building Java conditions from the conditions stored in the 'condition' column. The conditions are stored as a pattern, and there can be multiple conditions combined with ANDs and ORs. How do I tackle the problem? I know it's possible but I am a bit confused. Can I use the Stack class, though I have not used it before?
If there is a simple way to solve this, please tell me.

It's not totally clear what you're trying to do from your question but here's my understanding. It looks like you're trying to encode some values into some db objects described by the "dependentColumn" column of a database table where the values are defined by evaluating a domain-specific language (DSL) encoded in the "condition" column.
One critical aspect is how complex this DSL is. A simple language could be parsed by regular expressions and evaluated using a stack as you mentioned but from your example it looks like you could have grouped boolean expressions which might require the use of an actual parser generator (e.g. ANTLR).
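
To give a feel for what handling grouped boolean expressions involves, here is a small hand-written recursive-descent evaluator. It does not parse the bracketed syntax from the question; it assumes a much simpler, hypothetical grammar in which each comparison has already been reduced to a named true/false fact, and only parentheses, AND and OR remain (AND binds tighter than OR). The class name ConditionEvaluator and the fact names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical grammar:
//   expr   := term ("OR" term)*
//   term   := factor ("AND" factor)*
//   factor := "(" expr ")" | NAME
final class ConditionEvaluator {
    private final List<String> tokens = new ArrayList<>();
    private final Map<String, Boolean> facts;
    private int pos = 0;

    private ConditionEvaluator(String text, Map<String, Boolean> facts) {
        // Simplified tokenizer: parentheses and identifiers; other characters are skipped.
        Matcher m = Pattern.compile("\\(|\\)|[A-Za-z_][A-Za-z0-9_.]*").matcher(text);
        while (m.find()) tokens.add(m.group());
        this.facts = facts;
    }

    static boolean evaluate(String text, Map<String, Boolean> facts) {
        ConditionEvaluator e = new ConditionEvaluator(text, facts);
        boolean result = e.expr();
        if (e.pos != e.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + e.tokens.get(e.pos));
        }
        return result;
    }

    private boolean expr() {
        boolean value = term();
        while (accept("OR")) {
            boolean rhs = term(); // always parse the right side, even if value is already true
            value = value || rhs;
        }
        return value;
    }

    private boolean term() {
        boolean value = factor();
        while (accept("AND")) {
            boolean rhs = factor();
            value = value && rhs;
        }
        return value;
    }

    private boolean factor() {
        if (accept("(")) {
            boolean value = expr();
            if (!accept(")")) throw new IllegalArgumentException("Missing ')'");
            return value;
        }
        if (pos >= tokens.size()) throw new IllegalArgumentException("Unexpected end of input");
        String name = tokens.get(pos++);
        Boolean value = facts.get(name);
        if (value == null) throw new IllegalArgumentException("Unknown condition: " + name);
        return value;
    }

    // Consumes the next token if it equals the expected one.
    private boolean accept(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }
}
```

Once the real syntax grows beyond this (operators such as = and !=, quoted literals, suffixes like LAST3), a generated parser is likely to be easier to maintain than extending code like this by hand.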

Properties in java - can we have comma-separated keys with single value?

I want to have multiple keys (>1) for a single value in a properties file in my Java application. One simple way of doing this is to define each key on a separate line in the properties file and assign the same value to all of these keys. This approach increases the maintenance burden of the properties file. The other way (which I think could be a smart way) is to define comma-separated keys with the value on a single line, e.g.
key1,key2,key3=value
java.util.Properties doesn't support this out of the box. Has anybody done something similar before? I did google but didn't find anything.
--manish

I'm not aware of an existing solution, but it should be quite straightforward to implement:
String key = "key1,key2,key3", val = "value";
Map<String, String> map = new HashMap<String, String>();
for(String k : key.split(",")) map.put(k, val);
System.out.println(map);

One of the nice things about properties files is that they are simple. No complex syntax to learn, and they are easy on the eye.
Want to know what the value of the property foo is? Quickly scan the left column until you see "foo".
Personally, I would find it confusing if I saw a properties file like that.
If that's what you really want, it should be simple to implement. A quick first stab might look like this:
Open file
For each line:
    trim() whitespace
    If the line is empty or starts with a #, move on
    Split on "=" (with limit set to 2), leaving you with key and value
    Split key on ","
    For each key, trim() it and add it to the map, along with the trim()'d value
That's it.
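
The steps above can be sketched in Java roughly as follows. MultiKeyParser is a hypothetical name, and the method takes the lines as a list so that it is independent of how the file is opened; lines without an "=" are simply skipped, which is a choice made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper implementing the steps listed above.
final class MultiKeyParser {
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // blank line or comment
            String[] parts = line.split("=", 2); // limit 2: the value may itself contain '='
            if (parts.length < 2) continue;      // no '=' on this line, skipped in this sketch
            String value = parts[1].trim();
            for (String key : parts[0].split(",")) {
                String k = key.trim();
                if (!k.isEmpty()) map.put(k, value);
            }
        }
        return map;
    }
}
```

For a real file, the lines could come from java.nio.file.Files.readAllLines(path).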

Since java.util.Properties extends java.util.Hashtable, you could use Properties to load the data, then post-process the data.
The advantage to using java.util.Properties to load the data instead of rolling your own is that the syntax for properties is actually fairly robust, already supporting many of the useful features you might end up having to re-implement (such as splitting values across multiple lines, escapes, etc.).
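
A minimal sketch of that load-then-post-process approach is shown below; MultiKeyProperties and its method names are hypothetical. One caveat to be aware of: in the properties format an unescaped space ends the key, so "key1, key2=value" would be read as the key "key1," with the value "key2=value". The comma-separated keys therefore need to be written without spaces (or with the spaces escaped).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: let Properties do the parsing, then split compound keys.
final class MultiKeyProperties {
    static Properties load(Reader reader) throws IOException {
        Properties raw = new Properties();
        raw.load(reader); // handles comments, escapes and line continuations

        Properties expanded = new Properties();
        for (String compound : raw.stringPropertyNames()) {
            String value = raw.getProperty(compound);
            for (String key : compound.split(",")) {
                String k = key.trim();
                if (!k.isEmpty()) expanded.setProperty(k, value);
            }
        }
        return expanded;
    }

    // Convenience wrapper for in-memory text.
    static Properties loadFromString(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```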

How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution in terms of performance and "readability/good coding style" to represent a (Java) enumeration (fixed set of constants) on the DB layer: an integer (or any number datatype in general) or a string representation?
Caveat: There are some database systems that support "Enums" directly but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore this kind of datatype might not be available on all database systems and might also differ in syntax => I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems to me very efficient to store (for example it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but hard to read ("0" vs. "1" etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but consumes much more storage space and is most likely also slower in regard to indexing.
My question is, did I miss some important aspects? What would you suggest using for an enum representation on the database layer?
Thank you very much!

In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
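
On the Java side, the short-code approach could look something like the sketch below. The codes "SA", "RE", "SV" and "LY" are the ones from the example above; the enum itself, its field names and the fromCode lookup are hypothetical, and the values are hard-coded here for brevity rather than built from the lookup table at run time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum whose persisted form is the short mnemonic code, not the ordinal.
enum TransactionType {
    SALE("SA", "Sale"),
    RETURN("RE", "Return"),
    SERVICE("SV", "Service"),
    LAYAWAY("LY", "Layaway");

    private static final Map<String, TransactionType> BY_CODE = new HashMap<>();
    static {
        for (TransactionType t : values()) BY_CODE.put(t.code, t);
    }

    private final String code;        // value stored in each transaction record
    private final String description; // expanded text, as held in the lookup table

    TransactionType(String code, String description) {
        this.code = code;
        this.description = description;
    }

    String code() { return code; }

    String description() { return description; }

    // Maps a stored code back to the constant; unknown codes are rejected.
    static TransactionType fromCode(String code) {
        TransactionType t = BY_CODE.get(code);
        if (t == null) throw new IllegalArgumentException("Unknown transaction type code: " + code);
        return t;
    }
}
```

Because the stored value is the code, renaming a constant or changing a description does not require touching existing rows.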

I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
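
The ordinal() problem can be shown with a small, made-up example: StatusV1 and StatusV2 below stand for the same enum before and after a constant is added in the middle. A value stored by ordinal silently changes meaning, while a value stored by name still resolves correctly.

```java
// Hypothetical before/after versions of one enum, to show why ordinals are fragile.
final class EnumStorageDemo {
    enum StatusV1 { ENABLED, DISABLED }

    enum StatusV2 { ENABLED, SUSPENDED, DISABLED } // SUSPENDED added later, in the middle

    // Reading back a value that was stored as an ordinal.
    static StatusV2 byOrdinal(int storedOrdinal) {
        return StatusV2.values()[storedOrdinal];
    }

    // Reading back a value that was stored as the constant's name.
    static StatusV2 byName(String storedName) {
        return StatusV2.valueOf(storedName);
    }

    public static void main(String[] args) {
        int oldOrdinal = StatusV1.DISABLED.ordinal(); // 1 under the old definition
        String oldName = StatusV1.DISABLED.name();    // "DISABLED"

        System.out.println(byOrdinal(oldOrdinal)); // now resolves to SUSPENDED
        System.out.println(byName(oldName));       // still resolves to DISABLED
    }
}
```

Name-based storage has its own weak spot: renaming a constant breaks valueOf for existing rows, so renames need a data migration.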
