Stripping URLs out of a Java string - java

I've tried this for a couple of hours and wasn't able to do it correctly, so I figured I'd post it here. Here's my problem.
Given a string in Java:
"this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text"
Now I want to strip the link tags out of this string using regular expressions, so the resulting string should look like:
"this is one \nlink some text two \nlink extra text"
I've tried all kinds of things with Java regular expressions (capturing groups, greedy quantifiers, you name it) and still can't get it to work quite right. If there's only one link tag in the string, I can get it to work easily. However, my string can have multiple URLs embedded in it, which is what's preventing my expression from working. Here's what I have so far: (?s).*(<a.*>(.*)</a>).*
Note that the string inside the link can be of variable length, which is why I have the .* in the expression.
If somebody can give me a regular expression that'll work, I'll be extremely grateful. Short of looping through each character and removing the links, I can't find a solution.

Sometimes it's easier to do it in 2 steps:
String s = "this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text";
// Remove every opening <a ...> tag first, then every closing </a> tag.
s = s.replaceAll("<a[^>]*>", "").replaceAll("</a>", "");
Result: "this is one \nlink some text two \nlink extra text"

Here's the way I usually match tags:
<a .*?>|</a>
and replace with an empty string.
Alternatively, instead of removing the tag, you might comment it out. The match pattern would be the same, but the replacement would be:
<!--\0-->
or
<!--$0-->
If you want to have a reference to the anchor text, use this match pattern:
<a .*?>(.*?)</a>
and the replacement would reference group 1 instead of group 0 ($1 in Java).
Note: Sometimes you have to use programming-language-specific flags to let . match across line breaks. In Java that flag is Pattern.DOTALL (or the inline form (?s)); Pattern.MULTILINE only changes how ^ and $ behave. Here's a Java example:
Pattern aPattern = Pattern.compile(regexString, Pattern.DOTALL);
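As a rough sketch of the capture-group variant above, the following keeps the anchor text and drops the tags in one pass. The class and method names (StripLinks, stripAnchors) are made up for illustration, and the inline (?s) flag is an addition so that . can cross the \n inside the link text; it is an assumption about the input rather than part of the pattern given in the answer.

```java
class StripLinks {
    // Hypothetical helper: replaces each <a ...>text</a> with just its text.
    static String stripAnchors(String html) {
        // (?s) lets '.' match line breaks; (.*?) is reluctant, so each match
        // ends at the nearest </a> instead of the last one in the string.
        return html.replaceAll("(?s)<a\\b[^>]*>(.*?)</a>", "$1");
    }

    public static void main(String[] args) {
        String s = "this is <a href='something'>one \nlink</a> some text "
                + "<a href='fubar'>two \nlink</a> extra text";
        System.out.println(stripAnchors(s));
    }
}
```

The reluctant quantifier is what makes this work with several links: the greedy .* in the question's attempt runs to the last </a> and swallows everything in between.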

Off the top of my head
"<a [^>]*>|</a>"

Related

extract information from xml using regular expression

I need to extract the author from the text using regex. I also need the index of every tag and author. I tried a few parsers, but none of them preserves the indexes correctly, so the only solution is regex. I have the following regex, and it has a problem with the "[^]" part.
How could I fix this regex:
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
in order to extract the author from the following text:
<post author="luckylindyslocale" datetime="2012-03-03T04:52:00" id="p7">
<img src="http://img.photobucket.com/albums/v303/lucky196/siggies/ls1.png"/>
Grams thank you, for this wonderful tag and starting this thread. I needed something to encourage me to start making some new tags.
<img src="http://img.photobucket.com/albums/v303/lucky196/holidays/stpatlucky.jpg"/>
Cruelty is one fashion statement we can all do without. ~Rue McClanahan
</post>
Why couldn't regex:
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
extract the author in following text.
Because
[^</post>]*
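is a character class: it matches any single character except <, /, p, o, s, t, and >, zero or more times.
Your post body contains those characters (the <img .../> tags, and ordinary letters such as p, o, s, and t), so the class stops matching long before the closing tag, and the literal </post> that must come next is never found. As for how to fix it, consider using the following regex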
<post\s*author=\"([^\"]+?)\"[^>]+>(.|\s)*?<\/post>
// obviously, escape appropriate characters in Java String literals
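Since (.|\s) also matches \n and \r, this pattern spans lines as written. A simpler alternative is .*? compiled with Pattern.DOTALL (Pattern.MULTILINE only affects ^ and $).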
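Since the question also asks for indexes, here is a minimal sketch of how the author and its position could be read with a Matcher. It swaps (.|\s)*? for .*? with Pattern.DOTALL, and the class name, the helper name, and the "author@offset" output format are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostAuthors {
    // DOTALL lets '.' cross line breaks, so (.*?) can span a multi-line post body.
    static final Pattern POST = Pattern.compile(
            "<post\\s+author=\"([^\"]+)\"[^>]*>(.*?)</post>", Pattern.DOTALL);

    // Hypothetical helper: returns "author@offset" for each <post> element found.
    static List<String> authorsWithOffsets(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            // start(1) is the index of the author value within the original text.
            found.add(m.group(1) + "@" + m.start(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "<post author=\"luckylindyslocale\" datetime=\"2012-03-03T04:52:00\" id=\"p7\">\n"
                + "<img src=\"ls1.png\"/>\nGrams thank you.\n</post>";
        System.out.println(authorsWithOffsets(text));
    }
}
```

m.start() and m.end() without an argument give the span of the whole <post>...</post> element, which may cover the "index of every tag" requirement.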
You can just do it like the following
/<post author="(.*?)"/
The comments are correct, though: regex is not the best tool for parsing HTML. But this should do what you are looking for.

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern as a raw pattern, so don't forget to escape the double quotes and double the backslashes before using it in a Java string literal.
The main point of this pattern is that the URLs are compared without the leading http://, so that more links match.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
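To make the back-reference idea concrete, here is a minimal sketch. It assumes the stored links have the simple form of an a-element with a single double-quoted href attribute and nothing else, which may not hold for real data; the class and method names are invented for the example.

```java
class Unlinkify {
    // Hypothetical helper: unwraps links whose text is identical to their href.
    static String unlinkify(String text) {
        // Group 1 captures the href value; \1 then requires the link text
        // to be exactly the same string, otherwise the link is left alone.
        return text.replaceAll("<a\\s+href=\"([^\"]*)\"\\s*>\\1</a>", "$1");
    }

    public static void main(String[] args) {
        String same = "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book.";
        String different = "See <a href=\"http://www.amazon.com/\">this shop</a> for books.";
        System.out.println(unlinkify(same));
        System.out.println(unlinkify(different));
    }
}
```

Links whose text differs from the href do not match at all, so they pass through unchanged, which is the behaviour the question asks for.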

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point); then, after I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want: it will output This Extended, because the greedy .* runs from the first & to the last ; in the string. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can have huge performance bottlenecks as a result.
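The difference between the two patterns can be checked side by side. This is only a small sketch; the class and method names are made up for the comparison.

```java
class EntityStrip {
    // Greedy: '.*' runs from the first '&' to the LAST ';' in the string.
    static String greedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: each match stops at the first ';' after its '&'.
    static String negated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(greedy(input));   // This Extended
        System.out.println(negated(input));  // This String Extended
    }
}
```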
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

How to extract Substring from a String in Java

I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to access whatever is between src=" and ">. I have written some code like the following:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and = or space between " and > and in this case my code will not work so I was thinking to use Regex. But I am not able to come up with any Regular expression. Do you have any suggestions?
This will work, but you should look into Regular Expressions, they provide a powerful way to spot patterns and extract text accordingly.
If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0]; // note: the result still includes the surrounding double quotes
Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with the String.split() instead of the 5 lines of code you were using.
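You can also try this regex: src\s*[=]\s*"(.*?)"\s*>
Let's break it down:
src matches the literal text src
\s* allows zero or more whitespace characters
[=] matches the equals sign
"(.*?)" captures the text between the double quotes, stopping at the first " that is followed by optional whitespace and >
\s*> allows optional whitespace before the closing >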
Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers
If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
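A minimal sketch of applying that extended expression with a Matcher follows. The class and method names are assumptions for the example, and returning null when nothing matches is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcExtractor {
    // Optional whitespace around '=', then everything up to the next double quote.
    static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    // Hypothetical helper: returns the first src value, or null if there is none.
    static String extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "<script language=\"JavaScript\" type=\"text/javascript\" "
                + "src=\"http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;"
                + "myzon=north_west;|en;tile=10;ord=7jkllk456?\"></script>";
        System.out.println(extractSrc(str));
    }
}
```

Because the pattern stops at the closing quote rather than at ">, a space between " and > no longer matters.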
Just a word of warning: HTML isn't a regular language and thus it can't be parsed using regular expressions. For simple cases like this it is ok but don't fall into the trap and think you can parse more complex html structures.

Need regular expr. for html element where order of attributes doesn't matter

I need a regular expression to detect a span element where the order of id and class doesn't matter. The name of the class is always the same, the id is always a fixed number of digits, for example:
<span class="className" id="123">
or
<span id="321" class="className" >
My approach for a regular expression in java was:
String pattern = "<span class=\"className\" id=\"\\d*\">";
but this way I can match only one version. Can somebody help?
Thanks, hansa
Don't parse HTML with regular expressions. HTML isn't regular.
This should do it:
String r = "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>";
The lookahead confirms that the span is of the desired class without consuming any characters. Then the rest of the regex, starting from the same position, searches for the id attribute and captures its value. The [^<>]* takes care of any other attributes that might be present, while ensuring that all matching occurs within the tag. (Technically, angle brackets can appear in attribute values, but you probably don't have to worry about that.)
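Here is a short sketch of that lookahead pattern in use, checked against both attribute orders from the question. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanId {
    // The lookahead checks for class="className" anywhere inside the tag;
    // the rest of the pattern then finds and captures the numeric id.
    static final Pattern SPAN = Pattern.compile(
            "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>");

    // Hypothetical helper: returns the id of the first matching span, or null.
    static String idOf(String html) {
        Matcher m = SPAN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(idOf("<span class=\"className\" id=\"123\">"));   // 123
        System.out.println(idOf("<span id=\"321\" class=\"className\" >"));  // 321
        System.out.println(idOf("<span id=\"5\" class=\"other\">"));         // null
    }
}
```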
I would do a two step version, first finding the span tag with:
<span[^>]*class=\"className\"[^>]*>
And then dig out the id from the tags that match the first pattern with
id=\"(\d+)\"
As others have pointed out, it's not a good idea to parse HTML with regular expressions. But for dirty data processing, this is how I would do it.
