Java get specific part of HTML - java

I'm looping through a load of HTML and I'm trying to just extract the parts I need.
I need to just get 'THISISTHEBITIWANT' from the html below.
<li class="aClass">
<a href="/example/THISISTHEBITIWANT">example</a>
</li>
<li class="aClass">
<a href="/example/THISISTHEBITIWANT">example2</a>
</li>
Each time I only want to get the 'THISISTHEBITIWANT' and the text in the link will change.
I've looked at string replace - but as I don't know what 'example' or 'example2' is going to be each time, I can only remove up until 'example/' at the moment.
This was my Java code:
html = inputLine.replace("<li class=\"aClass\"><a href=\"/example/", "");
If anyone could offer any advice, it would be much appreciated!

While the standard way for processing HTML would be to use an HTML parsing library, as the two comments suggest, if you are really only interested in getting the bit you want out, it may suffice to use a regular expression.
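import java.util.regex.*;

public class Regular {
    public static void main(String[] args) {
        // Sample input: one list item whose link target ends with the part to extract.
        String original = "<li class=\"aClass\">\n<a href=\"/example/THISISTHEBITIWANT\">example2</a>\n</li>";
        // Capture everything after "example/" up to the closing quote of the href.
        // \\s* allows optional whitespace or a line break between <li> and <a>.
        Pattern mypattern = Pattern.compile("<li class=\"aClass\">\\s*<a href=\"/?example/([^\"]+)\"");
        Matcher matcher = mypattern.matcher(original);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}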

Related

Regex pattern to identify attribute value

I have a source code file which I am trying to read using an automatic regex processor class in Java.
However, I am unable to form a correct regex pattern that returns every value when it appears multiple times in a line.
The input text is:
<input name="id" type="radio" bgcolor="<bean:write name='color'/>" value="<bean:write name='nameProp' property='nameVal'/>" <logic:equal name="checkedStatus" value="0">checked</logic:equal>>
And I want the matcher.find to output following terms:
<bean:write name='color'/>
<bean:write name='nameProp' property='nameVal'/>
Kindly help to form the regex pattern for this scenario.
Use this regex to find those terms:
<bean:write[^\/]*\/>
It will search for the words <bean:write and then everything up until a />
Use it like this:
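import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// inputText holds the line to scan
List<String> matches = new ArrayList<>();
Matcher m = Pattern.compile("<bean:write[^\\/]*\\/>")
        .matcher(inputText);
while (m.find()) {
    matches.add(m.group()); // the whole <bean:write .../> tag
}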
Regex101 Tested
I would caution you against parsing HTML with regex, though. If you need anything more complicated than this, you should probably consider using an XML parser instead. See this famous answer.

Jsoup remove ONLY html tags

What is proper way to remove ONLY html tags (preserve all custom/unknown tags) with JSOUP (NOT regex)?
Expected input:
<html>
<customTag>
<div> dsgfdgdgf </div>
</customTag>
<123456789/>
<123>
<html123/>
</html>
Expected output:
<customTag>
dsgfdgdgf
</customTag>
<123456789/>
<123>
<html123/>
I tried to use Cleaner with Whitelist.none(), but it removes custom tags as well.
I also tried:
String str = Jsoup.parse(html).text();
But that removes custom tags too.
This answer doesn't work for me, because the number of possible custom tags is unbounded.
You might want to try something like this:
String[] tags = new String[]{"html", "div"};
Document thing = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
for (Element elem : thing.getElementsByTag(tag)) {
elem.parent().insertChildren(elem.siblingIndex(),elem.childNodes());
elem.remove();
}
}
System.out.println(thing.getElementsByTag("body").html());
Please note that <123456789/> and <123> are not valid XML tags (an XML name cannot start with a digit), so they get escaped. Another downside is that you have to explicitly list every tag you want removed (i.e. all HTML tags), and it may be slow. I haven't measured how fast this runs.

Using OutputRaw in Java Tapestry

I have a web application running Java Tapestry, with a lot of user-inputted content. The only formatting that users may input is linebreaks.
I read a text string from a database and output it into a template. The string contains line breaks as \r, which I replace with <br>. However, the tags are escaped on output, so they show up as literal text on the page, e.g. text text<br>text text<br>text. I think I can use outputRaw or writeRaw to fix this, but I can't find any information on how to add outputRaw or writeRaw to a Tapestry class or template.
The class is:
public String getText() {
KMedium textmedium = getTextmedium();
return (textmedium == null || textmedium.getTextcontent() == null) ? "" : textmedium.getTextcontent().replaceAll("\r", "<br>");
}
The tml is:
<p class="categorytext" id="${currentCategory.id}">
${getText()}
</p>
Where would I add the raw output handling to have my line breaks display properly?
To answer my own question, this is how to output the result of ${getText()} as raw HTML:
Change the tml from this:
<p class="categorytext" id="${currentCategory.id}">
${getText()}
</p>
To this:
<p class="categorytext" id="${currentCategory.id}">
<t:outputraw value="${getText()}"/>
</p>
Note that this is quite dangerous as you are likely opening your site to an XSS attack. You may need to use jsoup or similar to sanitize the input.
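If the line break is the only markup that should survive, one way to reduce the XSS risk without an extra library is to escape the user text first and only then insert the <br> tags. The following is a minimal sketch of that idea, not Tapestry API: the class name SafeLineBreaks and both method names are made up for illustration, and getText() would have to return the result of toHtmlWithBreaks(...) for <t:outputraw> to render it.

```java
class SafeLineBreaks {
    // Hypothetical helper: escapes the characters that are significant in HTML.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: escape first, then turn each line break into <br>,
    // so the only tags in the result are the ones added here.
    static String toHtmlWithBreaks(String text) {
        if (text == null) {
            return "";
        }
        return escapeHtml(text).replaceAll("\r\n|\r|\n", "<br>");
    }

    public static void main(String[] args) {
        System.out.println(toHtmlWithBreaks("a<b\rc & d"));
    }
}
```

The order matters: escaping after the replacement would turn the inserted <br> tags back into visible text.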
An alternative might be:
<p class="categorytext" id="${currentCategory.id}">
<t:loop source="textLines" value="singleLine">
${singleLine} <br/>
</t:loop>
</p>
This assumes a getTextLines() method that returns a List or array of Strings; it could use the same logic as your getText() but split the result on CRs. This would do a better job when the text lines contain unsafe characters such as & or <, because each line is still escaped on output. With a little more work, you could add the <br> only between lines (not after each line) ... and this feels like it might be a nice component as well.
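The splitting step could look roughly like this. It is only a sketch: the class TextLines and the method splitLines are hypothetical names, and in the page class getTextLines() would pass textmedium.getTextcontent() to it instead of a literal string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TextLines {
    // Hypothetical helper: splits stored text into lines on CR, LF or CRLF.
    // The lines are returned unescaped; the template escapes them on output.
    static List<String> splitLines(String content) {
        if (content == null || content.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(content.split("\r\n|\r|\n"));
    }

    public static void main(String[] args) {
        for (String line : splitLines("first\rsecond & <third>")) {
            System.out.println(line);
        }
    }
}
```

Note that String.split drops trailing empty strings, so text ending in a line break does not produce an extra empty line at the end.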

Finding fragment in a string (a href) with regex

I have the following snippet:
What I want is just the someurl. However, there are variations such as the following:
<a target=blank href="$click_tracking_url$&landing_url=someurl" alt=""></a>
I had this regex, but it doesn't work for the variations:
<a href=\".*?landing_url=(.*?)\">
How can I fix it, or is there an easier way to do it?
Your regex does not match all the variations because it does not allow for other attributes between <a and href. Try this instead:
Pattern p = Pattern.compile("<a[^>]+href=[\\'\\\"].+&landing_url=(.+?)[\\'\\\"]");
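For reference, here is a small self-contained sketch that runs this pattern against the variation from the question. The class and method names are made up for the example, and the second input (an <a> tag with only an href attribute) is an assumed form of the simple case, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LandingUrlDemo {
    // Same pattern as above: any attributes after <a, then href, then landing_url=...
    private static final Pattern LINK =
            Pattern.compile("<a[^>]+href=[\\'\\\"].+&landing_url=(.+?)[\\'\\\"]");

    // Hypothetical helper: returns the landing_url value, or null if no link matches.
    static String extractLandingUrl(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withExtraAttributes =
                "<a target=blank href=\"$click_tracking_url$&landing_url=someurl\" alt=\"\"></a>";
        String hrefOnly = "<a href=\"$click_tracking_url$&landing_url=someurl\">"; // assumed simple form
        System.out.println(extractLandingUrl(withExtraAttributes));
        System.out.println(extractLandingUrl(hrefOnly));
    }
}
```

Both calls should yield someurl. The pattern still expects landing_url to be preceded by at least one other query parameter and an &, as in the examples.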

JSP Text Processing with Regex

I have a large number (>1500) of JSP files that I am trying to convert to JSPX. I am using a tool that will parse well-formed JSPs and convert to JSPX, however, my JSPs are not all well-formed :)
My solution is to pre-process the JSPs and convert untidy code so the tool will parse them correctly. The main problem I am trying to resolve is that of unquoted attribute values. Examples:
<INPUT id="foo" size=1>
<input id=body size="2">
My current regex for finding these is (in Java string format):
"(\\w+)=([^\"' >]+)"
And my replacement string is (in Java string format):
"$1=\"$2\""
This works well, EXCEPT for a few patterns, both of which involve inline scriptlets. For example:
<INPUT id=foo value="<%= someBean.method("a=b") %>">
In this case, my pattern matches the string literal "a=b", which I don't want to do. What I'd like to have happen is that the regex would IGNORE anything between <% and %>. Is there a regular expression that will do what I am trying to do?
EDIT:
Changed the title to clarify that I am NOT trying to parse HTML / JSP with regexes... I am doing a simple syntactic transformation to prepare the input for parsing.
If a sentence contains an arbitrary number of matching tokens such as double quotes, then this sentence belongs to a context-free language, which simply cannot be parsed with Regex designed to handle regular languages.
Either you make some simplifying assumptions (e.g. there are no unmatched double quotes and there is only a certain number of them) that would permit the use of regex, or you need to think about using (or creating) a lexer/parser for a context-free language. ANTLR is a good tool for this.
Based on the assumption that there are NO unquoted attribute values inside the scriptlets, the following construct might work for you:
Note: this approach is fragile. Just for your reference.
import java.util.regex.*;
public class test{
public static void main(String args[]){
String s = "<INPUT id=foo abbr='ip ' name = bar color =\"blue\" value=\" <%= someBean.method(\" a = b \") %>\" nickname =box >";
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(\\w+[^\"'\\s])");
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println("Return Value :"+m.group(1)+"="+m.group(2));
}
}
}
Output:
Return Value :id=foo
Return Value :name=bar
Return Value :nickname=box
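For the replacement itself, another option is to let the regex match scriptlets and quoted values first and copy them through unchanged, so that only what remains is a candidate for quoting. The sketch below assumes scriptlets are always closed with %> and are not nested; the class and method names are invented for the example, and it is as fragile as any regex-based approach on real JSPs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAttributes {
    private static final Pattern TOKEN = Pattern.compile(
            "<%.*?%>"                        // a scriptlet: copied through as-is
            + "|\"(?:<%.*?%>|[^\"])*\""      // double-quoted value, may contain a scriptlet
            + "|'(?:<%.*?%>|[^'])*'"         // single-quoted value, may contain a scriptlet
            + "|(\\w+)=([^\"'\\s>]+)",       // name=value with an unquoted value
            Pattern.DOTALL);

    // Hypothetical helper: adds double quotes around unquoted attribute values.
    static String quoteAttributes(String jsp) {
        Matcher m = TOKEN.matcher(jsp);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is only set when the last alternative matched.
            String replacement = (m.group(1) != null)
                    ? m.group(1) + "=\"" + m.group(2) + "\""
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteAttributes(
                "<INPUT id=foo value=\"<%= someBean.method(\"a=b\") %>\">"));
        System.out.println(quoteAttributes("<input id=body size=\"2\">"));
    }
}
```

Because the alternatives are tried left to right at each position, the a=b inside the scriptlet is consumed as part of the quoted value and never reaches the last alternative.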
