Best delimiter to separate multiple regex - java

I need to put multiple regular expressions in a single string and then parse it back into separate regexes, like below:
regex1<!>regex2<!>regex3
The problem is that I am not sure which delimiter would be best to use in place of the <!> shown in the example, so that I can safely split the string when parsing it back.
The constraints are that I cannot spread the string over multiple lines or use an XML or JSON string, because this string of expressions should be easily configurable.
Looking forward to any suggestions.
Edited:
Q: Why does it have to be a single string?
A: The system has a configuration manager that loads config from a properties file, and the properties contain lines like
com.some.package.Class1.Field1: value
com.some.package.Class1.Expressions: exp1<!>exp2<!>exp3
There is no way to write the value in multiple lines in the properties file. That's why.

The best way would be to use an invalid regex as the delimiter, such as **, because if it appeared in a normal regex that regex would not compile and would throw an exception (note: ++ is valid).
regex1+"**"+regex2
Now you can split it with this regex:
(?<!\\\\)[*][*](?![*])
(?<!\\\\) -> checks that the * is not escaped
(?![*]) -> avoids matching in the wrong place in a pattern like "A*"+"**"+"n+"
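A minimal sketch of splitting on that delimiter, assuming the expressions were joined with "**" as described. The class and method names (RegexListSplitter, splitExpressions) are made up for illustration:

```java
import java.util.regex.Pattern;

class RegexListSplitter {
    // "**" that is not preceded by a backslash and not followed by another '*'.
    private static final Pattern DELIMITER = Pattern.compile("(?<!\\\\)[*][*](?![*])");

    /** Splits the joined configuration value back into the individual expressions. */
    static String[] splitExpressions(String joined) {
        return DELIMITER.split(joined);
    }

    public static void main(String[] args) {
        // Joined form of "A*" + "**" + "n+" + "**" + "\\d{3}"
        for (String regex : splitExpressions("A***n+**\\d{3}")) {
            Pattern.compile(regex); // each piece should compile on its own
            System.out.println(regex);
        }
    }
}
```

For the input above the pieces are A*, n+ and \d{3}: the lookahead pushes the first split point to the last two stars of "***", so the A* quantifier stays attached to its expression.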
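Following is a list of sequences that are invalid as a regex:
[+
(+
[*
(*
[?
*+
** (delimiter would be (?<!\\\\)[*][*](?![*]))
?? (delimiter would be (?<!\\\\)[?][?](?![?]))
Each of these is invalid as a complete regex, but some can still occur inside a valid one in Java: [+], [*] and [?] are valid character classes, and a*+ and a?? are valid possessive and reluctant quantifiers. So check the delimiter you pick against the kind of expressions you expect.
While splitting you need to check that the delimiter is not escaped:
(?<!\\\\)delimiter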

The best delimiter depends on your requirements. As a best practice, use a sequence of special characters so that the chance of this sequence occurring in an expression is minimal,
like
$$**##$$
#$%&&%$#
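A small sketch of how such a delimiter could be used, assuming the second sequence above is chosen. The names LiteralDelimiterSplitter, splitExpressions and joinExpressions are hypothetical. Because the delimiter consists of regex metacharacters, it is passed through Pattern.quote before being handed to split:

```java
import java.util.regex.Pattern;

class LiteralDelimiterSplitter {
    // Delimiter taken from the examples above; any unlikely sequence would do.
    static final String DELIMITER = "#$%&&%$#";

    static String joinExpressions(String... expressions) {
        return String.join(DELIMITER, expressions);
    }

    static String[] splitExpressions(String joined) {
        // Pattern.quote makes split() treat the delimiter as literal text, not as a regex.
        return joined.split(Pattern.quote(DELIMITER));
    }
}
```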

I think this may be helpful for you.
First replace the tag-like delimiter with a single special character, and then split on that character:
String inputString = "regex1<!>regex2<!>regex3";
// Replace every <...> delimiter with '-'.
// Note: this only works if none of the expressions contains '-' or '<...>' itself.
String noHTMLString = inputString.replaceAll("\\<.*?>", "-");
String[] splitString1 = noHTMLString.split("[-]+");
for (String string : splitString1) {
    System.out.println(string);
}

Related

How to validate String in Java by matches?

To validate a String in Java I can use String.matches(). I would like to validate against a simple pattern "*.txt", where "*" means anything. Input such as test.txt is correct, but test.tt is not, because of ".tt". I tried to use matches("[*].txt"), but it doesn't work. How can I fix this match? Thanks.
Do not use code you don't understand!
For your simple problem you could totally avoid using a regular expression and just use
yourString.endsWith(".txt")
and if you want to perform this comparison case-insensitively (i.e. allow ".TXT" or ".tXt") use
yourString.toLowerCase().endsWith(".txt")
If you want to learn more about regular expressions in Java, I'd recommend a tutorial. For example this one.
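As a small sketch of the endsWith approach wrapped in a helper (the names TxtSuffixCheck and hasTxtSuffix are made up; passing Locale.ROOT is my addition so that lower-casing does not depend on the default locale):

```java
import java.util.Locale;

class TxtSuffixCheck {
    /** Returns true if the name ends with ".txt", ignoring case. */
    static boolean hasTxtSuffix(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".txt");
    }

    public static void main(String[] args) {
        System.out.println(hasTxtSuffix("test.txt"));
        System.out.println(hasTxtSuffix("test.tt"));
    }
}
```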
You may try this for txt files:
"file.txt".matches("^.*[.]txt$")
Basically, ^ means the start of your string. .* greedily matches anything, i.e. as much as it can while still letting the whole expression match. [.] matches the dot character. The suffix txt is just the literal text txt. Finally, $ is the anchor for the end of the string, which ensures that the string does not contain anything more.
Use .+, which matches one or more of any character. This rejects inputs that consist only of .txt
matches(".+[.]txt")
FYI: [.] simply matches with the dot character.
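To make the difference between .* and .+ concrete, here is a short sketch (class and method names are made up). Since String.matches() only succeeds when the whole string matches, the ^ and $ anchors can be left out:

```java
class TxtMatches {
    /** Accepts any name ending in ".txt", including the bare ".txt". */
    static boolean anyPrefix(String s) {
        return s.matches(".*[.]txt");
    }

    /** Requires at least one character before ".txt". */
    static boolean nonEmptyPrefix(String s) {
        return s.matches(".+[.]txt");
    }

    public static void main(String[] args) {
        System.out.println(anyPrefix(".txt"));
        System.out.println(nonEmptyPrefix(".txt"));
        System.out.println(nonEmptyPrefix("test.txt"));
    }
}
```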

Informatica - replace characters in target

I'm working on a project in which we create hundreds of XML files in Informatica every day, and all the data in the XML should be filtered, e.g. by removing all kinds of special characters like * and +. You get the idea.
Adding regular expressions for every port is too complicated and not feasible given the large number of mappings we have.
I've added a custom property to the session, XMLAnyTypeToString=Yes; and now I get some of the characters in their usual presentation (" + , ..) instead of &abcd.
I'm hoping for some custom property or a change in the XML target to remove these characters completely.
Any ideas?
You can make use of String.replaceAll method:
String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular expression with the given replacement.
You can create a set of symbols you want to remove using regex, for example:
[*+"#$%&\(",.\)]
and then apply it to your string:
String myString = "this contains **symbols** like these $#%#$%";
String cleanedString = myString.replaceAll("[*+\"#$%&]", "");
Now "cleanedString" is free of the symbols you've chosen.
By the way, you can test your regular expression on this excellent site:
http://www.regexplanet.com/advanced/java/index.html
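If the same cleanup has to run over many strings, one option is to compile the character class once and reuse it. This is only a sketch; SymbolStripper and strip are hypothetical names, and the set of symbols is the one from the example above:

```java
import java.util.regex.Pattern;

class SymbolStripper {
    // Same character class as in the replaceAll example, compiled once for reuse.
    private static final Pattern SYMBOLS = Pattern.compile("[*+\"#$%&]");

    /** Removes every character of the class from the input. */
    static String strip(String input) {
        return SYMBOLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("this contains **symbols** like these $#%#$%"));
    }
}
```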

Java Splitting a String

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
that I have been trying to split in Java. The file is comma-delimited, but some of the strings have commas within them and I don't want them to get split up. Currently, in the above example,
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how best to split this up by commas? Not all of the strings have the " ", for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.
Use a CSV library such as opencsv: http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.
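A rough sketch of that character-by-character approach (QuoteAwareSplitter is a made-up name). It assumes quotes are only used to wrap whole fields and does not handle escaped quotes inside a field:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    /** Splits on commas, ignoring commas that sit between double quotes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle state; the quote itself is dropped
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);              // ordinary character, or comma inside quotes
            }
        }
        fields.add(current.toString());         // last field has no trailing comma
        return fields;
    }
}
```

With the first sample line from the question this gives nine fields, and the fifth one is Auxotrophies, carbon and (in one piece, without the surrounding quotes).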
I would think that this would be a multi-step process. First, find all the commas inside quotes in your original string and replace each with something like {comma}. You can do this with some regex. Then split the new string on the comma symbol (,). Finally, go through your list and replace each {comma} with the comma symbol (,).
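The steps above could look roughly like this. It is a sketch, not a full CSV parser: PlaceholderSplitter is a made-up name, and it assumes the data never contains the literal text {comma} or escaped quotes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitter {
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");
    private static final String PLACEHOLDER = "{comma}";

    static String[] split(String line) {
        // Step 1: inside each quoted section, replace commas with the placeholder
        // (the surrounding quotes are dropped at the same time).
        Matcher m = QUOTED.matcher(line);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            String inner = m.group(1).replace(",", PLACEHOLDER);
            m.appendReplacement(masked, Matcher.quoteReplacement(inner));
        }
        m.appendTail(masked);

        // Step 2: split on the remaining commas; -1 keeps trailing empty fields.
        String[] fields = masked.toString().split(",", -1);

        // Step 3: put the real commas back.
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace(PLACEHOLDER, ",");
        }
        return fields;
    }
}
```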

Java Regex - exclude empty tags from xml

Let's say I have three XML strings:
String logToSearch = "<abc><number>123456789012</number></abc>";
String logToSearch2 = "<abc><number xsi:type=\"soapenc:string\" /></abc>";
String logToSearch3 = "<abc><number /></abc>";
I need a pattern which finds the number tag only if the tag contains a value, i.e. a match should be found only in logToSearch.
I'm not looking for the number itself; rather, the matcher.find method should return true only for the first string.
For now I have this:
Pattern pattern = Pattern.compile("<(" + patternString + ").*?>",
        Pattern.CASE_INSENSITIVE);
where patternString is simply "number". I tried "<(" + patternString + ")[^/>].*?>", but it didn't work because in [^/>] each character is treated separately.
Thanks
This is absolutely the wrong way to parse XML. In fact, if you need more than just the basic example given here, there's provably no way to solve the more complex cases with regex.
Use an easy XML parser, like XOM. Then, using XPath, query for the elements and filter out those without data. I can only imagine that this question is a precursor to future headaches unless you modify your approach right now.
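To illustrate the parser-plus-XPath idea without an extra dependency, here is a sketch that uses the DOM and XPath APIs shipped with the JDK instead of XOM. NumberTagChecker and hasNonEmptyNumber are made-up names. It relies on the default DocumentBuilderFactory not being namespace-aware, so the undeclared xsi: prefix in the second sample string is treated as part of a plain attribute name. Unlike the CASE_INSENSITIVE regex, the XPath name test is case-sensitive:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NumberTagChecker {
    /** Returns true if the XML contains a number element with non-blank text. */
    static boolean hasNonEmptyNumber(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // A non-empty node set converts to true when BOOLEAN is requested.
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate("//number[normalize-space(.) != '']", doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the XML", e);
        }
    }
}
```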
So a search for "<number[^/>]*>" would find the opening tag. If you want to be sure it isn't empty, try "<number[^/>]*>[^<]" or "<number[^/>]*>[0-9]"
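A quick sketch that runs the second of these patterns against the three strings from the question (NumberTagRegex and hasValue are hypothetical names; the tag name is hard-coded here for brevity):

```java
import java.util.regex.Pattern;

class NumberTagRegex {
    // Opening <number ...> tag that is not self-closing, followed by a non-'<' character.
    private static final Pattern NON_EMPTY_NUMBER =
            Pattern.compile("<number[^/>]*>[^<]", Pattern.CASE_INSENSITIVE);

    static boolean hasValue(String log) {
        return NON_EMPTY_NUMBER.matcher(log).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValue("<abc><number>123456789012</number></abc>"));
        System.out.println(hasValue("<abc><number xsi:type=\"soapenc:string\" /></abc>"));
        System.out.println(hasValue("<abc><number /></abc>"));
    }
}
```

Only the first string produces a match, because in the other two the / stops [^/>]* before a > can follow.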

How to build a regular expression to get the value between two single quotes, and if there is no single quote, extract between commas

Problem that I face:
-I have an input string, a SQL statement, that I need to parse
-extract the values to be inserted, based on the column names specified
-I can extract a value that is wrapped between 2 single quotes, but:
--?what about values that have no single quotes wrapped around them? (like: integer or double)
--?what if the value itself already contains single quotes? (like: 'James''s dictionary')
Below is the sample input string:
INSERT INTO LJS1_DX (base, doc, key1, key2, no, sq, eq, ln, en, date, line)
VALUES ('GET','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary')
The Java code I have below matches values between two single quotes only:
Pattern p = Pattern.compile("'.*?'");
String columnValues = "'GET0','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary'";
Matcher m = p.matcher(columnValues); // get a matcher object
while (m.find()) {
    logger.trace(m.group());
}
I'd appreciate it if anyone could provide a guideline or sample for this question.
Thank you!!
I agree with gnibbler that this is a job for a csv parser.
A regex that works on your example would be
'(?:''|[^'])*'|[^',\s]+
which looks challenging to debug and maintain, doesn't it?
Explanation:
'          # First alternative: match an "opening" '
(?:        # followed by either...
  ''       # two ' in a row (escaped ')
 |         # or...
  [^']     # any character that is not a '
)*         # zero or more times,
'          # then match a "closing" '
|          # or (second alternative):
[^',\s]+   # match any run of characters except ', comma or whitespace
It also works if there is whitespace around the values/commas (and will leave that out of the match).
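For illustration, this is roughly how the regex could be applied in Java. SqlValueExtractor and extract are made-up names, and stripping the outer quotes and collapsing '' to ' is an extra step I added on top of the match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SqlValueExtractor {
    // Quoted value (with '' as an escaped quote) or an unquoted run such as a number.
    private static final Pattern VALUE = Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");

    static List<String> extract(String columnValues) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(columnValues);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("'")) {
                // Drop the surrounding quotes and unescape doubled quotes.
                token = token.substring(1, token.length() - 1).replace("''", "'");
            }
            values.add(token);
        }
        return values;
    }
}
```

On the VALUES list from the question this yields eleven entries, one per column, with the last one being Street'James's dictionary.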
Regexes are not really suitable for this. You will always find cases that fail.
A CSV parser such as opencsv is probably a better option.
In general, when you need to parse complex languages, regexps are not the best tool - there's too much context to make sense of. So, if reading XML use an XML parser, if reading C code, use a C language parser and if reading SQL ...
There's a Java SQL parser here; I would use something like this.
For other languages it may be best to use a "YACC"-like parser, for example JACK.
Instead, you can get all the values by taking the substring after the VALUES keyword. In the same way you can get the column names. You will then have two comma-separated strings, which can be converted to arrays, giving you one array of names and one of values. You can then check which parameter has which value.
Hope this helps.
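A sketch of that substring idea (InsertStatementSplitter and columnsToValues are made-up names). It assumes a single INSERT ... VALUES (...) statement, that the word VALUES does not appear in the table or column names, and that no value contains a comma, since it uses a plain split. Values are returned as written, quotes included:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class InsertStatementSplitter {
    static Map<String, String> columnsToValues(String sql) {
        int valuesAt = sql.toUpperCase(Locale.ROOT).indexOf("VALUES");
        if (valuesAt < 0) {
            throw new IllegalArgumentException("No VALUES keyword found");
        }
        // Column names sit in the parentheses before VALUES, values in those after it.
        String[] names = betweenParentheses(sql.substring(0, valuesAt)).split(",");
        String[] values = betweenParentheses(sql.substring(valuesAt)).split(",", -1);
        if (names.length != values.length) {
            throw new IllegalArgumentException("Column and value counts differ");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            result.put(names[i].trim(), values[i].trim());
        }
        return result;
    }

    private static String betweenParentheses(String s) {
        return s.substring(s.indexOf('(') + 1, s.lastIndexOf(')'));
    }
}
```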
I think Tim had the right idea; it just needs to be implemented more efficiently. Here's a much more efficient version:
'[^']*+(?:''[^']*+)*+'|[^',\s]++
It uses Friedl's "unrolled loop" technique to avoid excessive reliance on alternations that match one or two characters at a time (I think that's what did you in, Tim), plus possessive quantifiers throughout.
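As a sanity check, the two versions can be run side by side; for the sample values they should find the same tokens. This is only a sketch (UnrolledValuePattern and tokens are made-up names) and it says nothing about timing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledValuePattern {
    static final Pattern SIMPLE = Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");
    // Unrolled loop with possessive quantifiers, as given above.
    static final Pattern UNROLLED = Pattern.compile("'[^']*+(?:''[^']*+)*+'|[^',\\s]++");

    /** Collects every match of the pattern in the input, in order. */
    static List<String> tokens(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```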
Regular expressions are not easy to use with this (but everything is possible).
I would suggest parsing it yourself, or using a library to do the parsing. By writing the parser yourself you are certain that it works exactly as you need it to.
