Regex exclusion behavior

Ok, so I know this question has been asked in different forms several times, but I am having trouble with specific syntax. I have a large string which contains html snippets. I need to find every link tag that does not already have a target= attribute (so that I can add one as needed).
^((?!target).)* will give me the text leading up to 'target', and <a.+?>[\w\W]+?</a> will give me a link, but that's where I'm stuck. An example:
<a href="http://www.someSite.com>Link</a> (This should be a match)
Link (this should not be a match).
Any suggestions? Using DOM or XPATH are not really options since this snippet is not well-formed html.

You are being wilfully evil by trying to parse HTML with Regexes. Don't.
That said, you are being extra evil by trying to do everything in one regexp. There is no need for that; it makes your code regex-engine-dependent, unreadable, and quite possibly slow. Instead, simply match tags and then check your first-stage hits again with the trivial regex /target=/. Of course, that character string might occur elsewhere in an HTML tag, but see (1)... you have already thrown good practice out of the window, so why not at least make things un-obfuscated so everyone can see what you're doing?
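The two-stage idea described above could look roughly like this in Java. This is only a sketch: the class and method names and the `_blank` value are illustrative assumptions, and the tag pattern is deliberately naive (it assumes no `>` inside attribute values and does not handle self-closing `<a ... />` tags).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageTargetFixer {
    // Stage 1: any opening <a ...> tag (naive: assumes no '>' inside attribute values).
    private static final Pattern OPEN_A_TAG =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Stage 2: the trivial check for an existing target attribute.
    private static final Pattern HAS_TARGET =
            Pattern.compile("target\\s*=", Pattern.CASE_INSENSITIVE);

    static String addMissingTargets(String html, String target) {
        Matcher m = OPEN_A_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            if (!HAS_TARGET.matcher(tag).find()) {
                // Insert the attribute just before the closing '>'.
                tag = tag.substring(0, tag.length() - 1) + " target=\"" + target + "\">";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a href=\"b.html\" target=\"_top\">B</a>";
        System.out.println(addMissingTargets(html, "_blank"));
    }
}
```

Keeping the two checks separate means each pattern stays readable on its own, which is the point the answer is making.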

If you insist on doing it with Regex a pattern such as this should help...
<a(?![^>]*target=) [^>]*>.*?</a>
It's by no means 100% perfect: technically speaking, a tag can contain a > in places other than the end, so it won't work for all HTML tags.
NB. I work with PHP, you may have to make slight syntax adjustments for Java.
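For Java, the pattern above can be used unchanged as a string literal because it contains no backslashes. A small sketch follows; the class name, method name, and the `DOTALL` flag (so that link text may span lines) are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinksWithoutTarget {
    // The pattern from the answer. DOTALL is an added assumption so that
    // ".*?" can also cross newlines inside the link text.
    private static final Pattern NO_TARGET_LINK =
            Pattern.compile("<a(?![^>]*target=) [^>]*>.*?</a>", Pattern.DOTALL);

    static List<String> find(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = NO_TARGET_LINK.matcher(html);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">One</a> "
                + "<a href=\"two.html\" target=\"_blank\">Two</a>";
        // Only the first link has no target attribute.
        System.out.println(find(html));
    }
}
```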

You could try a negative lookahead like this:
<a(?!.*?target.*?).*?>[\w\W]+?</a>
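One caveat to check before relying on this pattern: the `.*?` inside the lookahead is not limited to the current tag, so a `target` anywhere later on the same line makes the lookahead fail. The sketch below compares it with a variant whose lookahead stops at the first `>`; the class name and the bounded variant are illustrative assumptions, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadScopeDemo {
    // Pattern from the answer: the lookahead's '.' can run past the end of the tag.
    static final Pattern UNBOUNDED =
            Pattern.compile("<a(?!.*?target.*?).*?>[\\w\\W]+?</a>");
    // Assumed variant: the lookahead cannot look beyond the tag's closing '>'.
    static final Pattern BOUNDED =
            Pattern.compile("<a(?![^>]*target)[^>]*>[\\w\\W]+?</a>");

    static int countMatches(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two links on one line; only the second one has a target attribute.
        String html = "<a href=\"one.html\">One</a> "
                + "<a href=\"two.html\" target=\"_blank\">Two</a>";
        System.out.println("unbounded: " + countMatches(UNBOUNDED, html)); // 0
        System.out.println("bounded:   " + countMatches(BOUNDED, html));   // 1
    }
}
```

With both links on one line, the unbounded lookahead sees the second link's `target` and rejects the first link as well.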

I didn't test this and spent about a minute writing it, but for your specific example if you can do it on the client-side, try this via the DOM:
var links = document.getElementsByTagName("a");
for (var linkIndex = 0; linkIndex < links.length; linkIndex++) {
    var link = links[linkIndex];
    if (link.href && !link.target) {
        link.target = "someTarget";
        // or link.setAttribute("target", "someTarget");
    }
}

Related

Regex for IP and string

I'm using this online regex test site.
Here is the regex I'm using:
\{"ip":"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$","iphone":"admin/ios","dev":\{"action":"CUS","from":"REG","CUSA":"ADVERT"\}\}
And I'm trying to match it to:
{"ip":"192.168.50.5","iphone":"admin/ios","dev":{"action":"CUS","from":"REG","CUSA":"ADVERT"}}
When I run the test, it doesn't match. I need it to match on the site above for validation reasons.

A different perspective: it seems that it is already pretty hard to come up with a regex that works for you in the first place. What does this tell you about how hard it will be to maintain this regex in the future, and maybe extend it?
What I am saying is: regexes are a good tool, but sometimes overrated. This looks like a string in JSON format. Wouldn't it be better to just take it as that, and use a garden-variety JSON parser instead of trying to build your own regex?
You see, what will be more robust over time: your home-baked regex, or some standard library that millions of people are using?
One place to read about JSON parsers would be this question here.

This will be enough for your context.
"ip":"(\d+)\.(\d+)\.(\d+)\.(\d+)"
Edit:
Regex is not for structured data processing, but most of the time you need a solution that just works. When the sample data changes and no longer matches, you update the regex string to match it again.
Since you want to get four numbers inside a quote pair after a key called "ip", this regex will definitely do it.
If you want something else, please provide more context. Thanks!
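As a rough Java sketch of matching just the "ip" field, with the dots escaped and the octet range check from the question kept: the class, field, and method names are illustrative assumptions. It also shows one detail of the question's regex that would prevent any match: the `$` placed directly before the closing quote asserts end of input, so the `"` after it can never match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IpFieldDemo {
    // Octet sub-pattern copied from the question.
    static final String OCTET = "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    static final String FOUR_OCTETS =
            OCTET + "\\." + OCTET + "\\." + OCTET + "\\." + OCTET;

    // Key, opening quote, four octets, closing quote.
    static final Pattern IP_FIELD =
            Pattern.compile("\"ip\":\"" + FOUR_OCTETS + "\"");
    // Same, but with the '$' from the question left in before the closing quote.
    static final Pattern WITH_DOLLAR =
            Pattern.compile("\"ip\":\"" + FOUR_OCTETS + "$\"");

    static String extractIp(String json) {
        Matcher m = IP_FIELD.matcher(json);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "." + m.group(2) + "." + m.group(3) + "." + m.group(4);
    }

    public static void main(String[] args) {
        String json = "{\"ip\":\"192.168.50.5\",\"iphone\":\"admin/ios\","
                + "\"dev\":{\"action\":\"CUS\",\"from\":\"REG\",\"CUSA\":\"ADVERT\"}}";
        System.out.println(extractIp(json));                      // 192.168.50.5
        System.out.println(WITH_DOLLAR.matcher(json).find());     // false
    }
}
```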

How to extract links from a web content?

I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative ones. For example, we have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>
or
<a href="http://stackoverflow.com/" />
so after reading the file, what should I do?

This isn't that complicated to do, if you want to use the builtin regex system from Java. The hard bit is finding the right regex to match URLs[1][2]. For the sake of the answer, I'm gonna just assume you've done that, and stored that as a Pattern with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
and some way of iterating through each line. What you'll want to do is define an ArrayList<String>:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have some loop to iterate through your file (assuming each line is a <? extends CharSequence> line), and inside you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());
What this does is create a Matcher for your line and the URL-matching Pattern from before. Then, it loops until #find() returns false (i.e., there are no more matches) and adds the match (with #group()) to the list, urlsFound.
At the end of your loop, urlsFound will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will get quite big, and you'll be creating and ditching a lot of Matchers.
1: I found a few good sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure that the entire URL is captured with a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
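Putting the pieces above together, a self-contained sketch might look like the following. The URL pattern here is an assumption made for illustration only: it captures the value of a double-quoted `href` or `src` attribute, which covers the two examples in the question but not every way a link can appear in HTML. The class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Assumed pattern: group 1 is the value of a double-quoted href or src attribute.
    private static final Pattern URL = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            // find() advances to the next match until none are left on this line.
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group(1));
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
                "<a href=\"http://stackoverflow.com/\" />");
        System.out.println(extractLinks(lines));
    }
}
```

Because the whole URL sits in a single capturing group, `group(1)` returns it without the attribute name or quotes.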

Java library for cleaning up user-entered title to make it show up in a URL?

I am doing a web application. I would like to have a SEO-friendly link such as the following:
http://somesite.org/user-entered-title
The above user-entered-title is extracted from user-created records that have a field called title.
I am wondering whether there is any Java library for cleaning up such user-entered text (remove spaces, for example) before displaying it in a URL.
My target text is something such as "stackoverflow-is-great" after cleanup from user-entered "stackoverflow is great".
I am able to write code to replace spaces in a string with dashes, but not sure what are other rules/ideas/best practices out there for making text part of a url.
Please note that user-entered-title may be in different languages, not just English.
Thanks for any input and pointers!
Regards.

What you want is to "slugify" the phrase into a URL so that it is SEO-friendly.
Once I had that problem, and I ended up using a solution provided at maddemcode.com. Below you'll find its adapted code.
The trick is to properly use the JDK's Normalizer class with a little additional cleanup. The usage is simple:
// casingchange-aeiouaeiou-takesexcess-spaces
System.out.println(slugify("CaSiNgChAnGe áéíóúâêîôû takesexcess spaces "));
// these-are-good-special-characters-sic
System.out.println(slugify("These are good Special Characters šíč"));
// some-exceptions-123-aeiou
System.out.println(slugify(" some exceptions ¥123 ã~e~iõ~u!##$%¨&*() "));
// gonna-accomplish-yadda
System.out.println(slugify("gonna accomplish, yadda, 완수하다, 소양양)이 있는 "));
Function code:
public static String slugify(String input) {
    return Normalizer.normalize(input, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "")
            .replaceAll("[^ \\w]", "").trim()
            .replaceAll("\\s+", "-").toLowerCase(Locale.ENGLISH);
}
In the source page (http://maddemcode.com/java/seo-friendly-urls-using-slugify-in-java/) you can take a look at where this comes from. The small snippet above, though, works the same.
As you can see, there are some exceptional chars that aren't converted. To my knowledge, everyone who translates them uses some kind of map, like Django's urlify (see example map here). If you need them, I believe your best bet is making one.
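A minimal sketch of the map idea, assuming a hand-made and deliberately tiny table: the entries below are illustrative only and are not taken from Django's urlify. They cover a few characters that NFD normalization does not split into an ASCII letter plus a combining mark, so the slugify function above would otherwise drop them.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class MappedSlugifier {
    // Hypothetical replacement table, applied before normalization.
    private static final Map<String, String> EXTRA = new LinkedHashMap<>();
    static {
        EXTRA.put("\u00df", "ss"); // sharp s
        EXTRA.put("\u00c6", "AE"); // capital ae ligature
        EXTRA.put("\u00e6", "ae"); // small ae ligature
        EXTRA.put("\u00d8", "O");  // capital o with stroke
        EXTRA.put("\u00f8", "o");  // small o with stroke
    }

    static String slugify(String input) {
        String mapped = input;
        for (Map.Entry<String, String> e : EXTRA.entrySet()) {
            mapped = mapped.replace(e.getKey(), e.getValue());
        }
        // Same steps as the slugify function shown above.
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "")
                .replaceAll("[^ \\w]", "").trim()
                .replaceAll("\\s+", "-").toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // "Strasse" written with a sharp s, then a word using the ae ligature and o with stroke.
        System.out.println(slugify("Stra\u00dfe \u00c6r\u00f8")); // strasse-aero
    }
}
```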

It seems you want to URL-encode a string. It's possible in core Java, without using external libraries. URLEncoder is the class you need.
Languages other than English shouldn't be a problem as the class allows you to specify the character encoding, which takes care of special characters like accents, etc.
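For reference, a short sketch of what URLEncoder produces; the class and method names are illustrative. Note that it performs form-style percent-encoding, so spaces become `+` rather than dashes, and non-ASCII characters become `%XX` escapes instead of being transliterated.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    static String encode(String title) {
        try {
            return URLEncoder.encode(title, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("stackoverflow is great")); // stackoverflow+is+great
        System.out.println(encode("caf\u00e9"));              // caf%C3%A9
    }
}
```

The output is safe to place in a URL, but it is not the dash-separated form asked for in the question, so some extra replacement would still be needed for that.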

What is the canonical way to test generated HTML code?

The two approaches I usually follow are:
Convert the HTML to a string, and then test it against a target string. The problem with this approach is that it is too brittle, and there'll be very frequent false negatives due to say, things like extra whitespace somewhere.
Convert the HTML to a string and parse it back as an XML, and then use XPath queries to assert on specific nodes. This approach works well but not all HTML comes with closing tags and parsing it as XML fails in such cases.
Both these approaches have serious flaws. I imagine there must be a well-established approach (or approaches) for this sort of tests. What is it?

You could use jsoup or JTidy instead of XML parsing and use your second strategy.

How resolve a replaceAll of a replaceAll

I have a little problem.
I have a text that I have to display in the browser several times.
Every time I open this text, a replaceAll that I wrote runs automatically.
It's very simple, but the problem is that the next time the replace runs (every time I read this text), I get a replaceAll of a replaceAll.
For example i have in the text:
XIII
I want to replace it with
<b>XIII</b>
with:
txt.replaceAll("XIII","<b>XIII</b>")
The first time everything is fine, but when I read the text again, it becomes:
<b><b>XIII</b></b>
It's a silly problem, but I'm just starting out with Java.
I read that it is possible to use a regex. Could someone post a little example?
Thanks, and excuse my poor English.

You need negative lookbehind to prevent a match on an already marked-up string:
txt.replaceAll("(?<!>)XIII","<b>XIII</b>");
This expression looks a bit convoluted, but this is how it decomposes:
(?<! ... ) is the template for the negative lookbehind;
> is the specific character we want to make sure doesn't occur in front of your string.
I should also warn you that fixing up HTML with regexes usually turns into a diabolic cycle of upgrading the regex to handle yet another special case, only to see it fail on the next one. It ends up as a monster that nobody can read, let alone improve.
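A small sketch showing that the lookbehind version can be applied repeatedly without nesting the tags. The class and method names are illustrative assumptions. One limitation to keep in mind: an XIII that directly follows any other tag, such as `<p>XIII`, is also skipped, because the lookbehind only checks for a preceding `>`.

```java
class BoldOnce {
    static String markUp(String txt) {
        // Skip any XIII that directly follows '>' (for example one already inside <b>...</b>).
        return txt.replaceAll("(?<!>)XIII", "<b>XIII</b>");
    }

    public static void main(String[] args) {
        String once = markUp("Chapter XIII");
        String twice = markUp(once);
        System.out.println(once);  // Chapter <b>XIII</b>
        System.out.println(twice); // Chapter <b>XIII</b>
    }
}
```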

There's a really fast solution. Do the opposite Replace before doing your own.
Let me show:
txt.replaceAll("<b>XIII</b>","XIII").replaceAll("XIII","<b>XIII</b>")
So you first turn your <b>XIII</b> back into plain text and then wrap it in <b> again, which achieves the same result without adding a new level of <b>.

What about this:
txt = txt.replaceAll("XIII", "<b>XIII</b>")
         .replaceAll("<b><b>", "<b>").replaceAll("</b></b>", "</b>");
I think <b><b> and </b></b> do not have much sense in HTML, so it is fine to remove duplicates even in other places.
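The two replace-chain suggestions above (undo first, or collapse duplicates afterwards) can be checked side by side with a sketch like this one. The class and method names are illustrative; the replacement strings are the ones from the answers.

```java
class BoldReplaceChains {
    // Undo any earlier markup, then apply it again.
    static String undoThenRedo(String txt) {
        return txt.replaceAll("<b>XIII</b>", "XIII")
                  .replaceAll("XIII", "<b>XIII</b>");
    }

    // Apply the markup, then collapse doubled opening and closing tags.
    static String replaceThenCollapse(String txt) {
        return txt.replaceAll("XIII", "<b>XIII</b>")
                  .replaceAll("<b><b>", "<b>")
                  .replaceAll("</b></b>", "</b>");
    }

    public static void main(String[] args) {
        String plain = "Chapter XIII";
        String marked = "Chapter <b>XIII</b>";
        System.out.println(undoThenRedo(plain));         // Chapter <b>XIII</b>
        System.out.println(undoThenRedo(marked));        // Chapter <b>XIII</b>
        System.out.println(replaceThenCollapse(plain));  // Chapter <b>XIII</b>
        System.out.println(replaceThenCollapse(marked)); // Chapter <b>XIII</b>
    }
}
```

Since none of these search strings use regex syntax, `String.replace` would behave the same here and avoids compiling a pattern on every call.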

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
