Splitting long error message - java

I am currently trying to add a few error messages to my application. For that I am using JOptionPane.showMessageDialog(...);
Basically everything is working as I'd expect it to. But one thing is a bit of a pain.
I am using e.getMessage() to receive the description of the occurring error.
In the case of an SQL connection error this message is so long that it can't possibly fit on the screen. So I thought I would split it after every sentence, using split("[\\.]").
This is working as well, BUT: the message includes a part like this
Error: "java.net.SocketTimeoutException: Receive timed out"., which, of course ends up in:
Error: "java
net
SocketTimeoutException: Receive timed out"
How could I avoid this behaviour? Or is there possibly a better way to split the error message?

Why not just split on every space that has dot before it?
Try maybe split("(?<=[.])\\s+")
(?<=[.]) is a positive lookbehind. It makes sure that the run of whitespace \\s+ has a dot before it without including that dot in the match, so the dot stays untouched after the split while the whitespace is removed.
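To see the effect of that pattern, here is a minimal sketch. The class name, method name and sample message are made up for illustration; only the regex comes from the answer above.

```java
class SentenceSplit {
    // Split only where whitespace follows a dot. The dots inside
    // "java.net.SocketTimeoutException" have no whitespace after them,
    // so the class name stays in one piece.
    static String[] splitSentences(String message) {
        return message.split("(?<=[.])\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical message, modelled on the one in the question.
        String msg = "The connection failed. "
                + "Error: \"java.net.SocketTimeoutException: Receive timed out\". "
                + "Check the host.";
        // One sentence per line, e.g. to pass to JOptionPane.showMessageDialog.
        System.out.println(String.join("\n", splitSentences(msg)));
    }
}
```

Joining the parts with "\n" gives a multi-line string that a message dialog can display as separate lines.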

Not sure until your input and expected result are posted in full, but you could use "lookarounds" for that purpose.
For instance:
String input = "Error: \"java.net.SocketTimeoutException: Receive timed out\".";
System.out.println(Arrays.toString(input.split("(?<!\\w)\\.(?!\\w)")));
Output
[Error: "java.net.SocketTimeoutException: Receive timed out"]
Explanation
It splits the String based on (escaped) dot Patterns neither preceded nor followed by any word character
It prints the split Array (here, only 1 element since the package-delimiting dots do not match the Pattern as expected)

An alternative to regexes is WordUtils.wrap from the Apache Commons Lang library (commons.lang). Using a regex has the advantage of not requiring an additional library, but makes the code a wee bit less readable. In your case that is not really a big issue, but as an added benefit, commons.lang contains a good deal of useful stuff which might come in handy in your project.
It is one of the libraries which is pretty much a constant in my tool-belt.
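If pulling in a library is not an option, the idea of wrapping at a fixed width can be sketched in plain Java. This is not WordUtils.wrap and makes no claim to match its behaviour; it is a simple greedy stand-in, and the class and method names are invented for the example.

```java
class SimpleWrap {
    // Greedy word wrap: breaks only at whitespace, so a token such as
    // "java.net.SocketTimeoutException" is never cut in the middle.
    // A single word longer than width is left on a line of its own.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLen = 0;
        for (String word : text.trim().split("\\s+")) {
            if (lineLen > 0 && lineLen + 1 + word.length() > width) {
                out.append('\n');          // word does not fit: start a new line
                lineLen = 0;
            } else if (lineLen > 0) {
                out.append(' ');           // word fits: separate with one space
                lineLen++;
            }
            out.append(word);
            lineLen += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String msg = "Error: \"java.net.SocketTimeoutException: Receive timed out\".";
        System.out.println(wrap(msg, 40));
    }
}
```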

Related

Regex for commas and periods allowed

I tried searching for an answer to this question and also reading the Regex Wiki but I couldn't find what I'm looking for exactly.
I have a program that validates a document. (It was written by someone else).
If certain lines or characters don't match the regex then an error is generated. I've noted that a few false errors are always generated and I want to correct this. I believe I have narrowed down the problem to this:
Here is an example:
This error is flagged by the program logic:
ERROR: File header immediate origin name is invalid: CITIBANK, N.A.
Here is the code that causes that error:
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9, ]+")){
}else{
JOptionPane.showMessageDialog(null, "ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
errorFound=true;
fileHeaderErrorFound=true;
bw.write("ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
bw.newLine();
}
I believe the error is raised at runtime because the text contains a period and a comma. I am unsure how to allow these in the regex.
I have tried using this
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9,,,. ]+")){
and it seemed to work. I just wanted to make sure that this is the correct way, because it doesn't look right.
You're right in your analysis, the match failed because there was a dot in the text that isn't contained in the character class.
However, you can simplify the regex - no need to repeat the commas, they don't have any special meaning inside a class:
if(strLine.substring(63,86).matches("[A-Za-z0-9,. ]+"))
Are you sure that you'll never have to match non-ASCII letters or any other kind of punctuation, though?
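A quick way to check the simplified character class against the value from the question. The class and method names are invented for the example, and the padded sample strings only assume that the fixed-width field is filled with spaces.

```java
class OriginNameCheck {
    // Letters, digits, comma, period and space; commas between the
    // ranges are not needed inside a character class.
    static boolean isValidOriginName(String field) {
        return field.matches("[A-Za-z0-9,. ]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidOriginName("CITIBANK, N.A.         ")); // true
        System.out.println(isValidOriginName("CITIBANK & CO          ")); // false: '&' is not allowed
        // The original class has no '.', which is why the false error appeared.
        System.out.println("CITIBANK, N.A.".matches("[A-Z,a-z,0-9, ]+")); // false
    }
}
```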
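Letters and digits (a-zA-Z0-9) can be replaced by \w, which denotes "word" characters; note that \w also matches the underscore.
The period and comma don't need escaping inside a character class and can be used as is. Hence this regex might come in handy (the space and the + quantifier are kept from the original pattern, and the backslash is doubled for a Java string):
"[\\w,. ]+"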
Hope this helps. :)

Regex Pattern Catastrophic backtracking

I have the regex shown below in one of my old Java systems, and it has been causing backtracking issues lately.
Quite often the backtracking threads drive the machine's CPU to its upper limit, and it does not recover until the application is restarted.
Could anyone suggest a better way to rewrite this pattern, or a tool which would help me to do so?
Pattern:
^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$
Values working:
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]
Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.
One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.
But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:
^\[\p{N}+\](?:,\[\p{N}+\])*$
After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.
That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.
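The restructured pattern can be tried against both inputs from the question. This is only a sketch; the class and method names are made up, while the regex and the sample values are the ones shown above.

```java
import java.util.regex.Pattern;

class BracketList {
    // One mandatory [digits] token, then zero or more ",[digits]" tokens.
    // Every token is required in full, so a malformed input fails at once.
    private static final Pattern LIST =
            Pattern.compile("^\\[\\p{N}+\\](?:,\\[\\p{N}+\\])*$");

    static boolean isValid(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        String good = "[1234567],[89023432],[124534543],[4564362],[1234543],"
                + "[12234567],[124567],[1234567],[1234567]";
        System.out.println(isValid(good));        // true
        System.out.println(isValid(good + "]"));  // false, returns immediately
        System.out.println(isValid("[]"));        // false: empty brackets are rejected
    }
}
```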

How to resolve a replaceAll of a replaceAll

I have a little problem.
I have a text that I have to read in the browser several times.
Every time I open this text, a replaceAll that I wrote runs automatically.
It's very simple and basic, but the problem is that when the replacement runs again (every time I read this text), I get a replaceAll of a replaceAll.
For example, I have in the text:
XIII
I want to replace it with
<b>XIII</b>
using:
txt.replaceAll("XIII","<b>XIII</b>")
The first time everything is fine, but when I read the text again, it becomes:
<b><b>XIII</b></b>
It's a stupid problem, but I am just starting out with Java.
I read that it is possible to use a regex. Could someone post a little example?
Thanks, and excuse me for my poor English.
You need negative lookbehind to prevent a match on an already marked-up string:
txt.replaceAll("(?<!>)XIII","<b>XIII</b>");
This expression looks a bit convoluted, but this is how it decomposes:
(?<! ... ) is the template for the negative lookbehind;
> is the specific character we want to make sure doesn't occur in front of your string.
I should also warn you that fixing up HTML with regexes usually turns into a diabolic cycle of upgrading the regex to handle yet another special case, only to see it fail on the next one. It ends up as a monster that nobody can read, let alone improve.
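A small check that the negative lookbehind makes the replacement safe to run repeatedly. The class and method names are invented for the example. Note the limitation that follows from the pattern: any XIII directly preceded by ">" is skipped, whether or not the tag in front of it is <b>.

```java
class BoldOnce {
    // Wrap XIII in <b>...</b> unless it is directly preceded by '>',
    // which is the case once it has already been wrapped.
    static String markUp(String txt) {
        return txt.replaceAll("(?<!>)XIII", "<b>XIII</b>");
    }

    public static void main(String[] args) {
        String once = markUp("Chapter XIII begins here");
        String twice = markUp(once);
        System.out.println(once);   // Chapter <b>XIII</b> begins here
        System.out.println(twice);  // unchanged on the second pass
    }
}
```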
There's a really fast solution: do the opposite replacement before doing your own.
Let me show:
txt.replaceAll("<b>XIII</b>","XIII").replaceAll("XIII","<b>XIII</b>")
So you first turn your <b>XIII</b> back into plain text and then wrap it in <b> again; this achieves the same result without adding a new level of <b>.
What about this:
txt = txt.replaceAll("XIII", "<b>XIII</b>")
    .replaceAll("<b><b>", "<b>").replaceAll("</b></b>", "</b>");
I think <b><b> and </b></b> do not make much sense in HTML, so it is fine to remove the duplicates even in other places.

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shouldn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
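To get a feel for this on your own machine, the example can be timed against an equivalent pattern with a single quantifier (a*a*a*a*a*a*b and a*b accept exactly the same strings). This is only a sketch with invented class and method names; the timings depend on the device, so none are quoted here.

```java
import java.util.regex.Pattern;

class BacktrackDemo {
    // Six adjacent quantifiers competing for the same characters.
    static final Pattern SLOW = Pattern.compile("a*a*a*a*a*a*b");
    // Same language, but only one way to consume the run of 'a's.
    static final Pattern FAST = Pattern.compile("a*b");

    static long timeNanos(Pattern p, String input) {
        long start = System.nanoTime();
        p.matcher(input).matches();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters, no trailing 'b'
        System.out.println("six quantifiers: " + timeNanos(SLOW, input) + " ns");
        System.out.println("one quantifier:  " + timeNanos(FAST, input) + " ns");
    }
}
```

Lengthening the input or adding more quantifiers makes the gap grow quickly, which is one way to check whether a find() call that "never returns" is really just backtracking.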
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
How long is the string you are trying to parse?
How long and how complicated is the regex you are trying to match?
Have you tried to break your regex down into simpler bits? Adding the bits back one after another will let you see when it breaks, and maybe why.
Make a simple regex such as [a-zA-Z]* and pass it as the argument to compile(); this example allows only lowercase and uppercase letters.
Read my blogpost on android validation for more info.
I had the same issue and solved it by replacing every wildcard . with [\s\S]. I really don't know why it worked for me, but it did. I come from the JavaScript world, and I know that there this expression is evaluated faster.

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why this happens and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since it sounds like you are asking for:
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
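Here is how the lookbehind version might be used to replace the right-hand side. The class and method names are made up, and Matcher.quoteReplacement is added only so that a new value containing $ or \ is taken literally.

```java
import java.util.regex.Matcher;

class ReplaceValue {
    // Replace everything after '=' up to the end of the string.
    // The lookbehind anchors the match right after '=', so the engine
    // has only one place to start and nothing to backtrack over.
    static String replaceValue(String line, String newValue) {
        return line.replaceAll("(?<=\\=)([\\s\\w\\-.]*)$",
                Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
        System.out.println(replaceValue("paginationInput=5", "10"));
    }
}
```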
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
