How to resolve a replaceAll of a replaceAll - java

I have a small problem.
I have a text that I need to read in the browser several times.
Every time I open this text, a replaceAll that I wrote runs automatically.
It's very simple, but the problem is that the replacement runs again every time I read the text, so I end up with a replaceAll of a replaceAll.
For example, the text contains:
XIII
I want to replace it with
<b>XIII</b>
using:
txt.replaceAll("XIII","<b>XIII</b>")
The first time everything is fine, but when I read the text again, it becomes:
<b><b>XIII</b></b>
It's a silly problem, but I'm just starting out with Java.
I read that it's possible to use a regex. Could someone post a small example?
Thanks, and excuse my poor English.

You need negative lookbehind to prevent a match on an already marked-up string:
txt = txt.replaceAll("(?<!>)XIII", "<b>XIII</b>");
This expression looks a bit convoluted, but this is how it decomposes:
(?<! ... ) is the template for the negative lookbehind;
> is the specific character we want to make sure doesn't occur in front of your string.
I should also warn you that fixing up HTML with regex's usually turns into a diabolic cycle of upgrading the regex to handle yet another special case, only to see it fail on the next one. It ends up with a monster that nobody can read, let alone improve.
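As a rough check of the lookbehind idea, here is a minimal sketch showing that a second pass leaves the text unchanged. The class and method names (BoldOnce, bold) are made up for illustration, and it assumes the only character that can precede an already marked-up XIII is the > of the opening <b> tag; an XIII that follows any other tag, such as <i>XIII, would be skipped as well.

```java
final class BoldOnce {
    // Wrap "XIII" in <b>...</b> unless it is directly preceded by '>'.
    // Hypothetical helper; the '>' check is the negative lookbehind (?<!>).
    static String bold(String txt) {
        return txt.replaceAll("(?<!>)XIII", "<b>XIII</b>");
    }

    public static void main(String[] args) {
        String once = bold("Chapter XIII begins");
        String twice = bold(once); // simulates reading the text a second time
        System.out.println(once);  // Chapter <b>XIII</b> begins
        System.out.println(twice); // Chapter <b>XIII</b> begins
    }
}
```

Note that String is immutable, so the result of replaceAll has to be assigned or returned; calling it without using the result changes nothing.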

There's a really quick solution: do the opposite replace before doing your own.
Let me show:
txt = txt.replaceAll("<b>XIII</b>","XIII").replaceAll("XIII","<b>XIII</b>");
So you first turn every <b>XIII</b> back into plain XIII and then wrap it in <b> again, which gives the same result without adding a new level of <b>.
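A small sketch of this undo-then-redo approach, with a made-up class name (Rebold). It swaps replaceAll for String.replace, which also replaces every occurrence but treats both arguments as literal text, so no regex escaping is involved; that swap is my choice, not part of the answer above.

```java
final class Rebold {
    // Hypothetical helper: strip any earlier markup, then apply it exactly once.
    static String bold(String txt) {
        return txt.replace("<b>XIII</b>", "XIII")   // undo a previous pass, if any
                  .replace("XIII", "<b>XIII</b>");  // wrap every XIII once
    }

    public static void main(String[] args) {
        String once = bold("Chapter XIII begins");
        System.out.println(once);        // Chapter <b>XIII</b> begins
        System.out.println(bold(once));  // Chapter <b>XIII</b> begins
    }
}
```

Unlike the lookbehind version, this one still wraps an XIII that sits right after some other tag, since it never inspects the surrounding characters.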

What about this:
txt = txt.replaceAll("XIII", "<b>XIII</b>")
.replaceAll("<b><b>", "<b>").replaceAll("</b></b>", "</b>");
I think <b><b> and </b></b> do not make much sense in HTML, so it is fine to remove duplicates even in other places.

Related

Simple Java regular expression matching fails

Before y'all jump on me for posting something similar to previous questions asked, yes, there seem to be a number of regex related questions but nothing which seems to help me, or at least that I can see.
I am trying to parse strings in Java using Pattern and Matcher and am really having no joy. My regular expression seems to match my input string when I use a few of the online regular expression testing websites, but Java simply does not match my expression.
My input string is:
"Big apple" title="Little Apple" type="Container" url="http://malcolm.com/testing"
The regular expression I am using to match is ".*" title="(.*)" type="Container" url="(.*)"
Essentially I want to pull out the text within the second and the fourth set of quotes. There will always be 4 sets of quotes with text within and around.
I am coding as follows:
Variable XMLSubstring contains the string above (including the quotes) and is as stated, even when I print it out.
Pattern p = Pattern.compile(".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");
Matcher m = p.matcher(XMLSubstring);
It doesn't appear to be rocket science I'm attempting but I'm pulling my hair out trying to debug the bloody thing.
Is there something wrong with my regex pattern?
Is there something wrong with the code I am using?
Am I simply a moron and should stop coding with immediate effect?
EDIT & UPDATE: I have found the problem. My string had a space at the end of it which was breaking the parser! How silly, and I think based on that, I need to accept the third suggestion of mine and give up programming. Thanks all for your assistance.
Try this,
String str="\"Big apple\" title=\"Little Apple\" type=\"Container\" url=\"http://malcolm.com/testing\"";
Pattern p=Pattern.compile(".* title=\\\".*\\\" type=\\\"Container\\\" url=\\\".*\\\"");
Matcher m=p.matcher(str);
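To illustrate the trailing-space problem from the update, here is a hedged sketch. It assumes the original code called matches(), which requires the whole input to match, whereas find() looks for the pattern anywhere in the input. The class name AttrExtract and its methods are invented for the example, and the [^"]* groups are my substitution for the question's (.*) groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttrExtract {
    // The pattern from the question, used here for whole-string matching.
    private static final Pattern WHOLE = Pattern.compile(
            ".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");

    // A narrower pattern for find(): [^"]* cannot run past a closing quote.
    private static final Pattern ATTRS = Pattern.compile(
            "title=\"([^\"]*)\" type=\"Container\" url=\"([^\"]*)\"");

    // True only if the entire input matches, so a trailing space breaks it.
    static boolean matchesWhole(String input) {
        return WHOLE.matcher(input).matches();
    }

    // Returns {title, url}, or null when the pattern is not found anywhere.
    static String[] extract(String input) {
        Matcher m = ATTRS.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String s = "\"Big apple\" title=\"Little Apple\" type=\"Container\" "
                 + "url=\"http://malcolm.com/testing\" "; // note the trailing space
        System.out.println(matchesWhole(s));        // false
        System.out.println(matchesWhole(s.trim())); // true
        String[] parts = extract(s);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Either trimming the input before matches() or switching to find() should sidestep the stray whitespace.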

vimrc brackets/parenthesis java,c indentation new line

If the code we have looks like
for(...){
}
after reformatting I'd like it to look like
for(...)
{
}
and likewise for all functions, methods, classes, etc.
I found something similar in another article on Stack Overflow, but it was a regular expression that had to be typed into the Vim command line every time. I am looking for something to put in the vimrc file (if possible) so that it works every time I open Vim.
Well this is the one I've found:
:%s/^(\s*).*\zs{\s*$/\r\1{/
in http://stackoverflow.com/questions/4463211/is-there-a-way-to-reformat-braces-automatically-with-vim but the thing is, it adds a new line even if the bracket is already in the right place... and I still don't know how to map it to a key combination.
(edited with a more accurate pattern)
This should do the trick:
nnoremap <F9> :%s/^\(\s*\).\+\zs{\ze\s*$/\r\1{<cr>
But it doesn't really sound "safe" to me.
Instead, you could do:
nnoremap <F9> :%s/^\(\s*\).\+\zs{\ze\s*$/\r\1{/c<cr>
which will ask for a confirmation for each match.
Or record a macro and play it back using :global.
edit
Your pattern, :%s/^(\s*).*\zs{\s*$/\r\1{/, is wrong because:
the capture parentheses are not properly escaped: (\s*) instead of \(\s*\)
.* matches any number of any character, including zero, which is why the substitution also applies to lines that contain only a single {.

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.
For now, I've got something like this:
Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
So, when value of "string" is variable*func()*20
printout is:
variable
func
Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:
Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
if(matcher.group(3).isEmpty()) {
System.out.println(matcher.group(2));
}
}
It works, the printout is correct, but I don't like the additional check. Any ideas? Please?
EDIT (2011-04-12):
Thank you for all the answers. There were questions about why I would need something like this. And you are right: for bigger, more complicated scripts, the only sane solution would be to parse them. In my case, however, that would be excessive. The scraps of JS I'm working on are intended to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need the list of variables to check whether they can be initialized at this point (and whether they are initialized at all). I realize that all of it can be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped in a bigger script and evaluated in a web browser, so it's more convenient this way.
This may be a bit dirty, but it's assumed that whoever writes these formulas (probably me, most of the time) knows what they are doing and is able to check that the formulas work correctly.
Anyone who finds this question and wants to do something similar should know the risks/difficulties. I do, at least I hope so ;)
Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):
Pattern regex = Pattern.compile(
"\\b # word boundary\n" +
"[A-Za-z]# 1 ASCII letter\n" +
"\\w* # 0+ alnums\n" +
"\\b # word boundary\n" +
"(?! # Lookahead assertion: Make sure there is no...\n" +
" \\s* # optional whitespace\n" +
" \\( # opening parenthesis\n" +
") # ...at this position in the string",
Pattern.COMMENTS);
This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
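Here is the same lookahead idea written as a compact, single-line pattern and wrapped in a helper that collects the matches. The class name FormulaVariables and the method find are made up for the sketch; it assumes the formulas are as simple as described in the question (no strings, comments, or property access).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormulaVariables {
    // An identifier that is NOT followed by optional whitespace and "(".
    private static final Pattern VARIABLE =
            Pattern.compile("\\b[A-Za-z]\\w*\\b(?!\\s*\\()");

    // Collects every identifier that does not look like a function call.
    static List<String> find(String formula) {
        List<String> names = new ArrayList<>();
        Matcher m = VARIABLE.matcher(formula);
        while (m.find()) {
            names.add(m.group()); // group() is the whole match, i.e. group(0)
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(find("variable*func()*20")); // [variable]
        System.out.println(find("(a+b)/2"));            // [a, b]
    }
}
```

The trailing \b matters: without it the engine could backtrack to "fun" and report that as a variable, because "fun" is followed by "c" rather than by a parenthesis.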
If you are rethinking the use of regex and wondering what else you could do, you could consider using an AST to access your source programmatically. This answer shows how you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do something similar for JavaScript.
A regex won't cut it in this case because Java isn't regular. Your best bet is to get a parser that understands Java syntax and build on that. Luckily, ANTLR has a Java 1.6 grammar (and a 1.5 grammar).
For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve, but this will probably be your best bet for a quick and accurate solution.
It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags
As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shouldn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
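As a rough illustration of that example, the sketch below runs the six-quantifier pattern next to two rewrites that accept the same strings: a single a*, and a possessive a*+ that never gives characters back once it has consumed them. The class name BacktrackingDemo and the helper timeFind are invented here; the timings are printed only for comparison and will vary by machine, so none are claimed.

```java
import java.util.regex.Pattern;

final class BacktrackingDemo {
    static final Pattern SLOW = Pattern.compile("a*a*a*a*a*a*b"); // from the example above
    static final Pattern FAST = Pattern.compile("a*b");           // same language, one quantifier
    static final Pattern POSSESSIVE = Pattern.compile("a*+b");    // a*+ does not backtrack

    // Runs find() once and returns the elapsed time in nanoseconds.
    static long timeFind(Pattern p, String input) {
        long start = System.nanoTime();
        p.matcher(input).find();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters, no 'b', so every pattern fails
        System.out.println("slow:       " + timeFind(SLOW, input) + " ns");
        System.out.println("fast:       " + timeFind(FAST, input) + " ns");
        System.out.println("possessive: " + timeFind(POSSESSIVE, input) + " ns");
    }
}
```

When a long real-world regex hangs, looking for adjacent or nested quantifiers that can match the same characters is usually a reasonable first step.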
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
How long is the string you are trying to parse?
How long and how complicated is the regex you are trying to match?
Have you tried breaking your regex down into simpler bits? Adding the bits back one after another will let you see when it breaks and maybe why.
Write a regex such as [a-zA-Z]* and pass it as the argument to compile(); this example allows only lowercase and uppercase letters.
Read my blogpost on android validation for more info.
I had the same issue and solved it by replacing every wildcard . with [\s\S]. I really don't know why it worked for me, but it did. I come from the JavaScript world, and I know that there this expression is faster to evaluate.

Regex exclusion behavior

Ok, so I know this question has been asked in different forms several times, but I am having trouble with specific syntax. I have a large string which contains html snippets. I need to find every link tag that does not already have a target= attribute (so that I can add one as needed).
^((?!target).)* will give me text leading up to 'target', and <a.+?>[\w\W]+?</a> will give me a link, but that's where I'm stuck. An example:
<a href="http://www.someSite.com">Link</a> (This should be a match)
Link (this should not be a match).
Any suggestions? Using DOM or XPATH are not really options since this snippet is not well-formed html.
You are being wilfully evil by trying to parse HTML with Regexes. Don't.
That said, you are being extra evil by trying to do everything in one regexp. There is no need for that; it makes your code regex-engine-dependent, unreadable, and quite possibly slow. Instead, simply match tags and then check your first-stage hits again with the trivial regex /target=/. Of course, that character string might occur elsewhere in an HTML tag, but see (1)... you have already thrown good practice out of the window, so why not at least make things un-obfuscated so everyone can see what you're doing?
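A hedged sketch of that two-stage idea in Java: first find each opening <a ...> tag, then test the tag text on its own for target=. The class name LinkTargets, the method addTarget, and the "_blank" value in the usage are illustrative. It assumes no > appears inside attribute values and, as noted above, it will be fooled by "target=" appearing inside another attribute's value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkTargets {
    // Stage 1: an opening <a ...> tag (assumes no '>' inside attribute values).
    private static final Pattern OPEN_A =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Stage 2: a trivial check for an existing target attribute within one tag.
    private static final Pattern HAS_TARGET =
            Pattern.compile("\\btarget\\s*=", Pattern.CASE_INSENSITIVE);

    // Adds target="..." to every <a> tag that does not already have one.
    static String addTarget(String html, String target) {
        Matcher m = OPEN_A.matcher(html);
        StringBuffer out = new StringBuffer(); // StringBuffer works on older Java versions too
        while (m.find()) {
            String tag = m.group();
            if (!HAS_TARGET.matcher(tag).find()) {
                // Drop the closing '>' and re-add it after the new attribute.
                tag = tag.substring(0, tag.length() - 1) + " target=\"" + target + "\">";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling LinkTargets.addTarget(html, "_blank") leaves tags that already carry a target untouched and rewrites the rest.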
If you insist on doing it with Regex a pattern such as this should help...
<a(?![^>]*target=) [^>]*>.*?</a>
It's by no means 100% perfect; technically speaking, a tag can contain a > in places other than the end, so it won't work for all HTML tags.
NB. I work with PHP, you may have to make slight syntax adjustments for Java.
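For what it's worth, the pattern above appears to need no syntax changes for Java, since it contains no backslashes to double inside a string literal. A minimal sketch follows; the class name LinksWithoutTarget is made up, and the CASE_INSENSITIVE and DOTALL flags are my additions (so that <A ...> and links spanning several lines are also found), not part of the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinksWithoutTarget {
    // <a ...>...</a> where the opening tag contains no "target=" before its '>'.
    private static final Pattern LINK = Pattern.compile(
            "<a(?![^>]*target=) [^>]*>.*?</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every link element whose opening tag lacks a target attribute.
    static List<String> find(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

The lookahead (?![^>]*target=) can only scan up to the tag's closing >, so a target= elsewhere in the document does not suppress a match.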
You could try a negative lookahead like this:
<a(?!.*?target.*?).*?>[\w\W]+?</a>
I didn't test this and spent about a minute writing it, but for your specific example if you can do it on the client-side, try this via the DOM:
var links = document.getElementsByTagName("a");
for (var linkIndex = 0; linkIndex < links.length; linkIndex++) {
var link = links[linkIndex];
if (link.href && !link.target) {
link.target = "someTarget";
// or link.setAttribute("target", "someTarget");
}
}
