Regex matches only part of a URL - why?

I am very weak in regex, and the regex I am using (found on the internet) only partially solves my problem. I need to wrap a URL found in text input in an anchor tag using Java. Here is my code:
String text ="Hi please visit www.google.com";
String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";
String s = text.replaceAll(reg, "<a href='$1'>$1</a>");
System.out.println(""+s);
The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?
I need to parse the text and display any URL entered in a text field as a hot link in a JSP page. The expected output would be
Hi please visit <a href='www.google.com'>www.google.com</a>
Edit
Following regex
(http(s)?://)?(www(\.\w+)+[^\s.,"']*)
works like a charm for URLs ending in .com but fails for other extensions like .jsp. Is there any way to make it work for all sorts of extensions?

To answer your question why the regex doesn't work: It doesn't observe Java's regex syntax rules.
Specifically:
[^[:punct:]\s]
doesn't work as you expect it to because Java doesn't recognize POSIX shorthands like [:punct:]. Instead, it treats that as a nested character class. That again leads to the ^ becoming illegal in that context, so Java ignores it, leaving you with a character class that matches the same as
[:punct\s]
which only matches the c of com, therefore ending your match there.
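For comparison, here is a minimal sketch of the asker's pattern with only the unsupported [:punct:] swapped for \p{Punct}, which is how java.util.regex spells the POSIX punctuation class. The class and method names (PunctFix, linkify) are made up for illustration, and the rest of the pattern is left exactly as posted, so it inherits that pattern's other limitations.

```java
import java.util.regex.Pattern;

class PunctFix {
    // The asker's pattern, with [^[:punct:]\s] rewritten as [^\p{Punct}\s].
    private static final Pattern URL = Pattern.compile(
        "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^\\p{Punct}\\s]|/)))");

    // Hypothetical helper: wraps every match in an anchor tag.
    static String linkify(String text) {
        // $1 is the whole URL, as in the original replaceAll call.
        return URL.matcher(text).replaceAll("<a href='$1'>$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
    }
}
```

With this change the final character of the URL only has to be something other than punctuation or whitespace, so the match is no longer cut short after the c.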
As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry Detecting URLs in a block of text. You'll need to decide for yourself how sensitive and how specific you want to make your regex.
For example, the solution proposed at the end of the post would translate to Java as
String resultString = subjectString.replaceAll(
    "(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "   [-A-Z0-9+&@\\#/%=~_|$?!:,.])*\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "   [A-Z0-9+&@\\#/%=~_|$])", "$0");

Java recognises posix expressions (see javadoc), but the syntax is a little different. It looks like this instead:
\p{Punct}
But I would simplify your regex for a URL to:
(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))
And elaborate it only if you find a test case that breaks it.
As a Java line it would be:
text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "$3");
Note the neat capture of the "s" in "https" (if found) that is restored if required.
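As a rough sketch of how the captured "s" (group 2) and the scheme-less URL (group 3) could be put back together inside an anchor tag: the replacement template below, the decision to prepend http:// to URLs that had no scheme, and the names SimpleLinkifier and linkify are all assumptions for illustration.

```java
import java.util.regex.Pattern;

class SimpleLinkifier {
    // Same simplified pattern as above, compiled once.
    private static final Pattern URL = Pattern.compile(
        "(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))");

    // Hypothetical helper: turns each URL into a link with an explicit scheme.
    static String linkify(String text) {
        // $2 is "s" when the text said https and expands to nothing otherwise;
        // $3 is the URL without its scheme.
        return URL.matcher(text).replaceAll("<a href='http$2://$3'>http$2://$3</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        System.out.println(linkify("Docs at https://www.example.com/index.jsp."));
    }
}
```

The trailing (?<!\.) is what keeps a sentence-ending full stop out of the link in the second call.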

Related

Remove White Spaces between Specific Substring in a String [duplicate]

This question already has answers here:
Which is the best library for XML parsing in java [closed]
(7 answers)
Closed 5 years ago.
What I want is for all the spaces inside the <abc> tag to be removed, while keeping the spaces inside the <efg> tag.
<abc>this is between abc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>this is between abc</abc>
I want this output:
<abc>thisisbetweenabc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>thisisbetweenabc</abc>
I tried string = string.replaceAll("<abc> </abc>", ""); but it's not working for me.

Brief
I urge you to use an XML parser!!! Anyway, if it's a limited, known set of HTML, you can use the following regex (as per my original comment).
Note: This solution only works on a limited, known set of HTML. If your input differs from what you posted in your question, it is likely this solution will not work. See Pshemo's comment below your question.
Note 2: The OP changed the format of the input, thus my original answer will no longer work. See original input below. (Exactly why I put a limited, known set of HTML). In the Code section I've added a second regex that works on the OP's newly added input.
Code
See regex in use here
(?:^(<abc>)|\G(?!^))(\S+)[ \t]*
Replace with $1$2
With the new input format, the following regex can be used (as seen in use here):
(?:^(<abc>)|\G(?!^))([^\s<]+)[ \t]*
Results
Input
<abc>this is between abc</abc>
<efg>this is between efg</efg>
<abc>this is between abc</abc>
<efg>this is between efg</efg>
Output
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
Explanation
(?:^(<abc>)|\G(?!^)) Match either of the following
    ^(<abc>) Match the following
        ^ Assert position at the start of the line
        (<abc>) Capture <abc> literally into capture group 1
    \G(?!^) Assert position at the end of the previous match
(\S+) Capture any non-whitespace character one or more times into capture group 2
[ \t]* Match space or tab characters any number of times
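In Java, the first regex could be applied roughly as follows. This is a sketch for the original one-tag-per-line input only; Pattern.MULTILINE is needed so that ^ matches at the start of every line, and the names AbcLineCompactor and compact are invented for the example.

```java
import java.util.regex.Pattern;

class AbcLineCompactor {
    // ^(<abc>) starts a chain at the beginning of a line; \G(?!^) continues it
    // from the end of the previous match, so only tokens on <abc> lines are hit.
    private static final Pattern TOKEN = Pattern.compile(
        "(?:^(<abc>)|\\G(?!^))(\\S+)[ \\t]*", Pattern.MULTILINE);

    // Hypothetical helper: removes spaces and tabs on lines that start with <abc>.
    static String compact(String input) {
        // $1 is only set for the first token of a line; for later tokens it
        // expands to nothing, so each match is replaced by the token alone.
        return TOKEN.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String input = "<abc>this is between abc</abc>\n"
                     + "<efg>this is between efg</efg>";
        System.out.println(compact(input));
    }
}
```

Because the chain can only start at the beginning of a line, an <abc> element that appears later on a line is left untouched by this approach.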

Simple, just do:
String xml = "<abc>this is between abc</abc>"; // your overall string with <abc> and </abc> in it
int start = xml.indexOf("<abc>");
int end = xml.indexOf("</abc>");
// substring takes (beginIndex, endIndex), so no separate length is needed
String abcOnly = xml.substring(start, end);
abcOnly = abcOnly.replace(" ", "");
This is only a rough sketch, but you can easily adapt it. You may also have to tweak the indexes by plus or minus a few characters; I am not in front of your code to test it, but you should be able to get what you need from this.
Disclaimer: Using an XML parser is a far better way to handle this than manipulating strings, but I'll assume you have your reasons, so I'll answer the question you asked instead of telling you to go get an XML parser, lol. Good luck.
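The same idea can be extended to every <abc> element in the string, wherever it sits on the line. The sketch below swaps the indexOf calls for a Matcher loop; it assumes <abc> elements are never nested and carry no attributes, and the names AbcSpaceStripper and stripSpacesInAbc are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbcSpaceStripper {
    // Lazily captures whatever sits between <abc> and the next </abc>.
    private static final Pattern ABC = Pattern.compile("<abc>(.*?)</abc>", Pattern.DOTALL);

    // Hypothetical helper: removes spaces inside every <abc>...</abc> element.
    static String stripSpacesInAbc(String input) {
        Matcher m = ABC.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String compacted = "<abc>" + m.group(1).replace(" ", "") + "</abc>";
            // quoteReplacement keeps any $ or \ in the content literal.
            m.appendReplacement(out, Matcher.quoteReplacement(compacted));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripSpacesInAbc(
            "<efg>this is between efg</efg><abc>this is between abc</abc>"));
    }
}
```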

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a regex method that will replace these fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.

You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern as a raw pattern, so don't forget to escape the double quotes and double the backslashes before using it in a Java string.
The main point of this pattern is that the URLs are compared without the leading http://, so that more links are matched.
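Escaped for a Java string literal, the pattern could be used along these lines. This is a sketch only: the class and method names are invented, and the unnecessary \/ escape has been written as a plain /.

```java
import java.util.regex.Pattern;

class Delinkifier {
    // The raw pattern from above with quotes escaped and backslashes doubled.
    private static final Pattern SELF_LINK = Pattern.compile(
        "<a (?>[^h>]++|\\Bh|h(?!ref\\b))*href\\s*=\\s*[\"']?(http://)?([^\\s\"']++)[\"']?[^>]*>"
        + "\\s*+(?:http://)?\\2\\s*+</a\\s*+>");

    // Hypothetical helper: replaces links whose text equals their href with the bare URL.
    static String delinkify(String html) {
        // $1 restores "http://" if the href had it; $2 is the rest of the URL.
        return SELF_LINK.matcher(html).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(delinkify(
            "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
    }
}
```

A link whose text differs from its href fails the \2 back-reference and is left as it was.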

First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
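A small sketch of the back-reference idea with a named group follows. It assumes the links have exactly the shape <a href="URL">URL</a> with double quotes and no other attributes; the group name url and the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // (?<url>...) captures the href value; \k<url> requires the link text to repeat it.
    private static final Pattern SELF_LINK = Pattern.compile(
        "<a href=\"(?<url>[^\"]+)\">\\k<url></a>");

    // Hypothetical helper: strips the anchor only when href and text are identical.
    static String unlink(String html) {
        // ${url} in the replacement refers to the named group.
        return SELF_LINK.matcher(html).replaceAll("${url}");
    }

    public static void main(String[] args) {
        System.out.println(unlink(
            "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
    }
}
```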

Regular Expressions to match an <a> tag

I am writing a small java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of having 2 tags on the same line that is read in, it only matches the second one.
Here is a link that has the regex included, along with a simple set of test data:
Regex Test Link.
In my java program I have the following code:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());
Matcher m = null;
while ((line = input.readLine()) != null) {
    m = p.matcher(line);
    while (m.find()) {
        System.out.println("Matches: " + m.group(1));
    }
}
The goal is to extract the href value as long as it starts with http:// and the URL either names no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).
My regex works for every case above, but doesn't match in the special case of two tags on the same line.
Any help is appreciated.

Just use a proper HTML parsing library, such as HTML cleaner. It is theoretically impossible to properly parse HTML with a regex - there are so many constructs that will confound it. For example:
<![CDATA[ > bar ]]>
This is not a link. This is literal text in XHTML.
baz
This is only one link.
<a rel="next" href="bar?2">Next</a>
This is a realistic example of a link with a relation attribute and a relative URI.
<a name="foo">The href="http://example.com" part is the link destination...</a>
This is a named anchor, not a link. However your regex would parse out the literal text here as a link.
Foo
Does your regex handle line-spanning links properly?
There are all kinds of other Fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.
Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.

This worked for me in that regex tester page
<a[^>]*>[^<]*</a>

Regex Solution
So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was at the beginning my .* was causing everything to match up until the last tag, and therefore it was really only matching once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once that regex was added to java, my loop code worked fine.
Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!
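The effect is easy to reproduce. The asker's actual regex is not shown, so the two patterns below are hypothetical stand-ins that differ only in the leading .* versus .*?; the class and method names are likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical patterns: identical except for the greedy/lazy prefix.
    static final String GREEDY = ".*<a[^>]*href=\"(http://[^\"]*)\"";
    static final String LAZY = ".*?<a[^>]*href=\"(http://[^\"]*)\"";

    // Collects group 1 of every match, mirroring the find() loop in the question.
    static List<String> hrefs(String regex, String line) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(line);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://a.example\">A</a> "
                    + "<a href=\"http://b.example/index.html\">B</a>";
        System.out.println(hrefs(GREEDY, line)); // the greedy prefix swallows the first tag
        System.out.println(hrefs(LAZY, line));   // the lazy prefix stops at each tag in turn
    }
}
```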

You would have to look through all the matches you got per line and find which one looks like a url (like with some more regex ;))

Stripping off urls' in a java string

I've tried this for a couple of hours and wasn't able to do this correctly; so I figured I'd post it here. Here's my problem.
Given a string in java :
"this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text"
Now I want to strip the link tags out of this string using regular expressions, so the resulting string should look like:
"this is one \nlink some text two \nlink extra text"
I've tried all kinds of things with Java regular expressions (capturing groups, greedy quantifiers, you name it) and still can't get it to work quite right. If there's only one link tag in the string, I can get it to work easily. However, my string can have multiple URLs embedded in it, which is what's preventing my expression from working. Here's what I have so far: (?s).*(<a.*>(.*)</a>).*
Note that the string inside the link can be of variable length, which is why I have the .* in the expression.
If somebody can give me a regular expression that'll work, I'll be extremely grateful. Short of looping through each character and removing the links, I can't find a solution.

Sometimes it's easier to do it in 2 steps:
String s = "this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text";
s = s.replaceAll("<a[^>]*>", "").replaceAll("</a>", "");
Result: "this is one \nlink some text two \nlink extra text"

Here's the way I usually match tags:
<a .*?>|</a>
and replace with an empty string.
Alternatively, instead of removing the tag, you might comment it out. The match pattern would be the same, but the replacement would be:
<!--\0-->
or
<!--$0-->
If you want to have a reference to the anchor text, use this match pattern:
<a .*?>(.*?)</a>
and the replacement would be an index of 1 instead of 0.
Note: Sometimes you have to use a language-specific flag to let . match across line breaks. In Java that flag is Pattern.DOTALL (Pattern.MULTILINE only changes where ^ and $ match). Here's a Java example:
Pattern aPattern = Pattern.compile(regexString, Pattern.DOTALL);
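Applied to the string from the question, the capture-and-replace variant might look like the sketch below. Because the anchor text contains a line break, . has to be allowed to match line terminators, which in java.util.regex is what Pattern.DOTALL does. The names AnchorStripper and strip are illustrative.

```java
import java.util.regex.Pattern;

class AnchorStripper {
    // Lazy quantifiers keep each match inside a single <a ...>...</a> pair.
    private static final Pattern ANCHOR = Pattern.compile("<a .*?>(.*?)</a>", Pattern.DOTALL);

    // Hypothetical helper: removes the tags but keeps the anchor text (group 1).
    static String strip(String html) {
        return ANCHOR.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String s = "this is <a href='something'>one \nlink</a> some text "
                 + "<a href='fubar'>two \nlink</a> extra text";
        System.out.println(strip(s));
    }
}
```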

Off the top of my head
"<a [^>]*>|</a>"

How to change this regex to properly extract tag attributes - should be simple

I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.
A sample of XML that I need to work with is
<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->
I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.
My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"
I am trying to grab the attribute value together with its quotation marks, since I figure this is the easiest and most general pattern to match. Well, it just doesn't work; any help will help me keep my hair.

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:
This worked for me:
String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}
This prints:
- content goes here -

Your search is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (added ?). Please note that this one will not work correctly with nested <editable> elements.
If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.
Edit: should be
Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );

I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of match-anything patterns. I'm not sure how similar Java's regexes are, but instead of the expression \".*\" try \"[^\"]*\". That matches only characters inside the attribute value that aren't a quote, so the expression can't run past the end of the attribute value.
Hope that helps
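In Java the negated-class version could be tried like this. It is a sketch that assumes the attribute is always written as name="..." with double quotes; EditableNames and names are invented identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableNames {
    // [^"]* cannot cross a quote, so each match stays inside one attribute value.
    private static final Pattern NAME = Pattern.compile("<editable name=\"([^\"]*)\">");

    // Hypothetical helper: returns the name attribute of every <editable> tag found.
    static List<String> names(String text) {
        Matcher m = NAME.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(names(
            "<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->"));
    }
}
```

Unlike \".*\", this keeps working when several editable blocks appear in the same string.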

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
You may find the answer using TagSoup helpful.
