I have a puzzle that requires your help: I need to replace certain words with links in an HTML text.
For example, I have to replace "word" with "<a href="...">word</a>".
The difficulty is twofold:
1. not to add links inside tag attributes
2. not to add links inside other links (nested links).
I found a solution for case (1), but I cannot handle case (2).
Here is my simplified code:
String text="sample text <a>sample text</a> sample <a href='http://www.sample.com'>a good sample</a>";
String wordToReplace="sample";
String pattern="\\b"+wordToReplace+"\\b(?![^<>]*+>)"; //the lookahead is here to solve problem (1)
String link="["+wordToReplace+"]"; //for more clarity, the generated link is replaced by [...]
System.out.println(text.replaceAll(pattern,link));
The result is:
[sample] text <a>[sample] text</a> [sample] <a href='http://www.sample.com'>a good [sample]</a>
Problem: there is a link inside another link.
Do you have any idea how to solve this problem?
Thank you in advance.
Parsing HTML with regex is always a bad idea, precisely because of odd cases such as this. It would be better to use an HTML parser. Java has a built-in HTML parser in Swing (the javax.swing.text.html package) that you might want to look into.
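If a parser is not an option, one regex-only workaround for case (2) is to let the pattern consume whole links and tags first and only then look for the word. The sketch below is an illustration, not the asker's code: the class and method names are made up, it assumes links are never nested in the input, and it keeps the asker's [word] placeholder instead of a real link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedLinkSafeReplace {
    // Hypothetical helper: wraps `word` in [...] unless it sits inside a tag or an existing link.
    static String linkWord(String html, String word) {
        // Alternation order matters: a whole <a>...</a> element or a single tag is
        // consumed first, so group 1 only ever captures the word in plain text.
        Pattern p = Pattern.compile(
                "<a\\b[^>]*>.*?</a>|<[^>]*>|\\b(" + Pattern.quote(word) + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is null when a link or tag was matched; copy those through unchanged.
            String replacement = m.group(1) != null ? "[" + m.group(1) + "]" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "sample text <a>sample text</a> sample "
                + "<a href='http://www.sample.com'>a good sample</a>";
        // prints: [sample] text <a>sample text</a> [sample] <a href='http://www.sample.com'>a good sample</a>
        System.out.println(linkWord(text, "sample"));
    }
}
```

This still inherits the usual weaknesses of regex on HTML (comments, scripts, a > inside an attribute value), so a parser remains the safer choice.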
A regex question in Java.
I'm scraping ID numbers from the href attribute of a elements. I have a bunch of links like these in a string:
Whatever
After the 'pdf' and slash comes an ID number, which I'm interested in.
So I must get all the IDs from multiple occurrences of this kind of URL in the string. What would be the best regex for it?
Thanks in advance.
If you know that the url will be exactly this, your regex can just be:
someplacelol\\.com/pdf/([0-9]+)/
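As a rough sketch of how that pattern could be applied to collect every ID, the code below loops over Matcher.find() and reads capture group 1. The class name and the sample URLs in main are invented for illustration; only the regex comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfIdExtractor {
    // Regex from the answer above; group 1 captures the digits after "/pdf/".
    private static final Pattern PDF_ID =
            Pattern.compile("someplacelol\\.com/pdf/([0-9]+)/");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = PDF_ID.matcher(html);
        while (m.find()) {          // find() advances to the next occurrence each time
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical input: the real link format was not shown in the question.
        String html = "<a href=\"http://someplacelol.com/pdf/123/a.pdf\">Whatever</a> "
                + "<a href=\"http://someplacelol.com/pdf/4567/b.pdf\">Whatever</a>";
        System.out.println(extractIds(html)); // prints [123, 4567]
    }
}
```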
I'm no regex artist but you should be able to get the url out of the element with:
\<a\s.*?href=(?:\"([\w\.:/?=&#%_\-]*)\"|([^\"][\w\.:/?=&#%_\-]*[^\"\>])).*?\>
The first group will contain the URL when the attribute value is double-quoted; the second group will contain it when the value is unquoted.
From there you should be able to extract the number without too much difficulty. I tested that regex on the source of this page and it correctly identified all of the hrefs in all of the a elements.
Please don't comment and say it breaks for <a id="<<<>><><<>>href=" href="<a href="> because the OP has stated in his description of the problem that ridiculous abuses of the HTML standard such as this one will not be present in his test cases.
Also, if for some weird reason an element has two hrefs, only the first will be grabbed. You could probably address that if you cared.
Edit: added a whitespace requirement after <a so it won't match things like <asdffsdfsfg href="lol">.
I'm writing an app for a friend, but I ran into a problem: the website has these
<span style="display:none">&0000000000000217000000</span>
We have no idea what they are, but I need them removed because my app is outputting their value.
Is there any way I can check whether this is in the Elements and remove it? I have a for-each loop doing the parsing, but I can't figure out how to effectively remove this element.
Thanks
If you want to remove those spans completely based on the style attribute, try this code:
String html = "<span style=\"display:none\">&0000000000000217000000</span>";
html += "<span style=\"display:none\">&1111111111111111111111111</span>";
html += "<p>Test paragraph should not be removed</p>";
Document doc = Jsoup.parse(html);
doc.select("span[style*=display:none]").remove();
System.out.println(doc);
Here is the output:
<html>
<head></head>
<body>
<p>Test paragraph should not be removed</p>
</body>
</html>
Just try this:
//Assuming you have all the data in a Document called doc:
String cleanData = doc.select("query").text();
The text() method strips all HTML tags and decodes entities into human-readable content. There is also the ownText() method, which might help as well. I can't say which will best fit your purposes.
You can use Jsoup to access the inner HTML of the elements, remove the escaped characters, and write the inner HTML back:
Elements elements = doc.select("span");
for (Element e : elements) {
    e.html(e.html().replaceAll("&amp;", ""));
}
The example above selects all span elements and then, in each element's inner HTML, replaces &amp; with the empty string (or whatever character you wish).
Additionally, you should know that &amp; is the escape code for the & character. Without escaping & characters, you may have HTML validation issues. In your case, without additional information, I'm assuming you really just want to eliminate them. If not, this will help get you started. Good luck!
If you need to remove the trailing numbers:
// eliminate the ampersand and all trailing digits
e.html(e.html().replaceAll("&amp;[0-9]*", ""));
For more information on regular expressions, see the Javadoc for java.util.regex.Pattern.
I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag might be complete, partial (without a closing tag), or itself malformed (missing the closing angle bracket, as in the 3rd example).
Thanks in advance
Case 3 is not possible with regex or a parser. It might represent legitimate content. So forget it.
As to the concrete question, which covers cases 1 and 2, just use an HTML parser. My favourite is Jsoup.
String text = Jsoup.parse(html).text();
That's it. By the way, it also has an HTML cleaner, if that is what you're actually after.
Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined among your HTML (which could open XSS holes).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags will then not be interpreted, but just displayed as plain text.
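To make the effect of escaping concrete, here is a hand-rolled Java sketch of the same idea. It is not the JSTL implementation; the class name and the exact entity choices for quotes are assumptions made for illustration.

```java
class HtmlEscapeSketch {
    // Replaces the five characters that are significant in HTML markup with entities,
    // so the browser shows them as text instead of interpreting them.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: test string &lt;i&gt;in italics&lt;/i&gt; continues
        System.out.println(escape("test string <i>in italics</i> continues"));
    }
}
```

In a JSP page the built-in <c:out> and fn:escapeXml() remain preferable to a custom helper like this one.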
<\/?font(\s\w+(\=\".*\")?)*\>
I used this little gem about a week ago to strip a variety of 12-year-old html tags, and it worked pretty great. Just replace 'font' with whatever tag you're looking for, or with \w* to get rid of all of them.
Edit: removed the '?' from the end of my regex after realizing that it could remove non-tag data from a file. Basically, this will consistently find cases 1 and 2, but if it is used for case 3 (with the '?' appended to the end of the regex), take care to ensure that what is removed really is a tag.
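A minimal Java sketch of this approach follows. Note one deliberate change from the regex above: the attribute value uses [^"]* instead of the greedy .* so a match cannot run across several tags on one line. The tag name is generalized to \w+, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class TagStripSketch {
    // Opening or closing tag with optional name="value" attributes (double quotes only).
    private static final Pattern TAG =
            Pattern.compile("</?\\w+(\\s\\w+(=\"[^\"]*\")?)*>");

    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("test string <i>in italics</i> continues")); // case 1
        System.out.println(stripTags("test string <i>in italics"));               // case 2
        System.out.println(stripTags("test string <i"));                          // case 3: left as-is
    }
}
```

Self-closing tags such as <br/> and attributes that use single quotes or hyphenated names are not matched by this pattern, which is one more reason to prefer a parser for anything beyond simple cleanup.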
Possible duplicate: RegEx matching HTML tags and extracting text
I need to get the text between HTML tags such as <p></p>, or any other tag. My pattern is this:
Pattern pText = Pattern.compile(">([^>|^<]*?)<");
Does anyone know a better pattern? This one is not very useful. I need it to index the content of a web page.
Thanks
SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
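It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Inside a character class, | and a non-leading ^ are literal characters, so your class also excludes | and ^. Just specify the characters that you don't want to match: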
Pattern pText = Pattern.compile(">([^<>]*?)<");
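For illustration, this is roughly how that pattern could be used to collect the text chunks. The class name and the decision to skip blank chunks are assumptions of this sketch, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextBetweenTags {
    // Same pattern as above: text between a '>' and the next '<'.
    private static final Pattern TEXT = Pattern.compile(">([^<>]*?)<");

    static List<String> extract(String html) {
        List<String> chunks = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            String chunk = m.group(1).trim();
            if (!chunk.isEmpty()) {   // adjacent tags such as "</p><p>" yield empty matches
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello</p><p>World</p>")); // prints [Hello, World]
    }
}
```

Text before the first tag or after the last one is not captured, because the pattern requires a > on the left and a < on the right.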
Don't use regular expressions when parsing HTML.
Use XPath instead (if your HTML is well formed). You can reference text nodes using the text() function very easily.
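A small sketch of the XPath route using only the JDK's built-in XML APIs is shown below. It assumes the input is well-formed XML (for example XHTML); the class name, method name, and sample markup are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTextSketch {
    // Evaluates an XPath expression and returns the values of the selected text nodes.
    static List<String> textNodes(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Malformed markup ends up here; real HTML often is not well formed.
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello</p><p>World</p></body></html>";
        System.out.println(textNodes(xhtml, "//p/text()")); // prints [Hello, World]
    }
}
```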
I've tried this for a couple of hours and wasn't able to do it correctly, so I figured I'd post it here. Here's my problem.
Given a string in Java:
"this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text"
Now I want to strip the link tags out of this string using regular expressions, so the resulting string should look like:
"this is one \nlink some text two \nlink extra text"
I've tried all kinds of things in Java regular expressions (capturing groups, greedy quantifiers, you name it) and still can't get it to work quite right. If there's only one link tag in the string, I can get it to work easily. However, my string can have multiple URLs embedded in it, which is what's preventing my expression from working. Here's what I have so far: (?s).*(<a.*>(.*)</a>).*
Note that the string inside the link can be of variable length, which is why I have the .* in the expression.
If somebody can give me a regular expression that works, I'll be extremely grateful. Short of looping through each character and removing the links, I can't find a solution.
Sometimes it's easier to do it in 2 steps:
String s = "this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text";
// step 1 removes the opening tags, step 2 removes the closing tags
String result = s.replaceAll("<a[^>]*>", "").replaceAll("</a>", "");
Result: "this is one \nlink some text two \nlink extra text"
Here's the way I usually match tags:
<a .*?>|</a>
and replace with an empty string.
Alternatively, instead of removing the tag, you might comment it out. The match pattern would be the same, but the replacement would be, depending on the regex flavor:
<!--\0-->
or (in Java)
<!--$0-->
If you want to have a reference to the anchor text, use this match pattern:
<a .*?>(.*?)</a>
and the replacement would reference group 1 instead of group 0 ($1 in Java).
Note: Sometimes you have to use programming-language-specific flags to allow . to match across line breaks. In Java that flag is Pattern.DOTALL (Pattern.MULTILINE only changes how ^ and $ behave). Here's a Java example:
Pattern aPattern = Pattern.compile(regexString, Pattern.DOTALL);
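Putting the pieces of this answer together, a Java sketch could look like the following. It relies on Pattern.DOTALL, which is the Java flag that lets . match line terminators, and on $0 and $1 as Java's replacement references. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorUnwrap {
    // DOTALL lets '.' match the "\n" inside the link text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a .*?>(.*?)</a>", Pattern.DOTALL);
    private static final Pattern ANCHOR_TAGS =
            Pattern.compile("<a .*?>|</a>", Pattern.DOTALL);

    // Keeps only the anchor text: $1 refers to capture group 1.
    static String unwrap(String s) {
        return ANCHOR.matcher(s).replaceAll("$1");
    }

    // Wraps each anchor tag in an HTML comment: $0 is the whole match.
    static String commentOut(String s) {
        return ANCHOR_TAGS.matcher(s).replaceAll("<!--$0-->");
    }

    public static void main(String[] args) {
        String s = "this is <a href='something'>one \nlink</a> some text "
                + "<a href='fubar'>two \nlink</a> extra text";
        System.out.println(unwrap(s));
        System.out.println(commentOut("x <a href='u'>y</a> z"));
    }
}
```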
Off the top of my head
"<a [^>]*>|</a>"