Jsoup parser remove words with '<' and '>' - java

I'm using Jsoup.parse() to remove HTML tags from a String. But my string also has a word like <name>.
The problem is that Jsoup.parse() removes that too, because the text has < and >. I can't just remove < and > from the text either. How can I do this?
String s1 = Jsoup.parse("<p>Hello World</p>").text();
//s1 is "Hello World". Correct
String s2 = Jsoup.parse("<name>").text();
//s2 is "". But it should be <name> because <name> is not a html tag

I'm using the Jsoup.parse() to remove html tags from a String.
You want to use the Jsoup#clean method. You'll also need a little manual work after because Jsoup will still see <name> as an HTML tag.
// Define the list of words to preserve...
String[] myExceptions = new String[] { "name" };
int nbExceptions = myExceptions.length;
// Build a whitelist for Jsoup...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);
// Let Jsoup remove any html tags...
String s2 = Jsoup.clean("<name>", myWhiteList);
// Complete the initial html tags removal...
for (int i = 0; i < nbExceptions; i++) {
s2 = s2.replaceAll("<" + myExceptions[i] + ">.+?</" + myExceptions[i] + ">", "<" + myExceptions[i] + ">");
}
System.out.println(">>" + s2);
OUTPUT
>><name>
References
How to remove HTML tags from a string with Jsoup?
Whitelist javadoc
clean method javadoc

Related

I am not able to make regex for the following String [duplicate]

I have a string like this:
"core/pages/viewemployee.jsff"
From this code, I need to get "viewemployee". How do I get this using Java?
Suppose that you have that string saved in a variable named myString.
String myString = "core/pages/viewemployee.jsff";
String newString = myString.substring(myString.lastIndexOf("/")+1, myString.indexOf("."));
But you need to check the indexes before calling substring here: if those characters are missing, lastIndexOf() or indexOf() returns -1, and a -1 end index makes substring() throw a StringIndexOutOfBoundsException.
I suggest reading the Javadoc for these methods.
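The index check described above could look like the following. This is only a minimal sketch, not part of the original answer: the class name PathNames and the method baseName are made up for illustration, and it deliberately uses the last dot after the last slash (rather than the first dot), which is an assumption about what you want when a folder name contains a dot.

```java
// Hypothetical helper (names are illustrative): returns the file name
// without its extension, tolerating a missing '/' or '.'.
final class PathNames {

    static String baseName(String path) {
        // lastIndexOf returns -1 when there is no '/', so +1 yields 0
        // and the whole string is treated as the file name.
        int start = path.lastIndexOf('/') + 1;

        // Accept the last '.' only if it lies inside the file name part.
        int dot = path.lastIndexOf('.');
        int end = (dot >= start) ? dot : path.length();

        return path.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(baseName("core/pages/viewemployee.jsff"));
        System.out.println(baseName("viewemployee"));
    }
}
```

Because both indexes are validated before substring() is called, inputs without a slash or without a dot are returned as-is instead of throwing.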
You can solve this with regex (given you only need a group of word characters between the last "/" and "."):
String str="core/pages/viewemployee.jsff";
str=str.replaceFirst(".*/(\\w+).*","$1");
System.out.println(str); //prints viewemployee
You can first split the string on "/" to separate each folder and the file name. For this example, you get "core", "pages" and "viewemployee.jsff". I assume you need the file name without the extension, so apply the same split with "." as the separator to the last token and drop the final piece. That leaves the file name without its extension.
String myStr = "core/pages/viewemployee.bak.jsff";
String[] tokens = myStr.split("/");
String[] fileNameTokens = tokens[tokens.length - 1].split("\\.");
String fileNameStr = "";
for(int i = 0; i < fileNameTokens.length - 1; i++) {
fileNameStr += fileNameTokens[i] + ".";
}
fileNameStr = fileNameStr.substring(0, fileNameStr.length() - 1);
System.out.print(fileNameStr); //--> "viewemployee.bak"
These are file paths. Consider using File.getName(), especially if you already have the File object:
File file = new File("core/pages/viewemployee.jsff");
String name = file.getName(); // --> "viewemployee.jsff"
And to remove the extension:
String res = name.split("\\.[^\\.]*$")[0]; // --> "viewemployee"
With this we can handle strings like "../viewemployee.2.jsff".
The regex matches the last dot, zero or more non-dots, and the end of the string. String.split() then treats this match as a delimiter and discards it. The array will have one element unless the name consists only of a dot followed by non-dots (for example "." or ".jsff"), in which case the array is empty and [0] throws.
The below will get you viewemployee.jsff:
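// replace() takes literal text; replaceAll("\\", ...) would fail because a lone backslash is an invalid regex
int idx = fileName.replace("\\", "/").lastIndexOf("/");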
String fileNameWithExtn = idx >= 0 ? fileName.substring(idx + 1) : fileName;
To remove the file Extension and get only viewemployee, similarly:
idx = fileNameWithExtn.lastIndexOf(".");
String filename = idx >= 0 ? fileNameWithExtn.substring(0,idx) : fileNameWithExtn;

GWT RegExp - multiple matches

I want to find all the "code" matches in my input string (With GWT RegExp). When I call the "regExp.exec(inputStr)" method it only returns the first match, even when I call it multiple times:
String inputStr = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
String patternStr = "<code[^>]*>(.+?)</code\\s*>";
// Compile and use regular expression
RegExp regExp = RegExp.compile(patternStr);
MatchResult matcher = regExp.exec(inputStr);
boolean matchFound = (matcher != null); // equivalent to regExp.test(inputStr);
if (matchFound) {
// Get all groups for this match
for (int i=0; i<matcher.getGroupCount(); i++) {
String groupStr = matcher.getGroup(i);
System.out.println(groupStr);
}
}
How can I get all the matches?
Edit: As greedybuddha noted, a regex is not really suited to parsing (X)HTML. I gave jsoup a try and it is much more convenient than a regex. My code with jsoup now looks like this; I am renaming all code tags and applying a CSS class to them:
String input = "ff<code>myCode</code>ff<code>myCode2</code>";
Document doc = Jsoup.parse(input); // the second argument of parse(String, String) is a base URI, not a charset
Elements links = doc.select("code"); // all <code> elements
for(Element link : links){
System.out.println(link.html());
link.tagName("pre");
link.addClass("prettify");
}
System.out.println(doc);
Compile the regular expression with the "g" flag, for global matching.
RegExp regExp = RegExp.compile(patternStr,"g");
I think you will also want "m" for multiline matching, "gm".
That being said, for HTML/XML parsing you should consider using JSoup or another alternative.
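For comparison, here is the same "keep matching until nothing is left" loop written with the JDK's java.util.regex instead of GWT's RegExp. This is a sketch of the analogous technique, not the GWT API; the class and method names are illustrative. In GWT, the equivalent would be calling exec() in a loop on a RegExp compiled with "g" until it returns null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative JDK-only analogue; GWT client code would use RegExp/MatchResult instead.
final class CodeMatches {

    static List<String> findAll(String input) {
        // Same pattern as in the question; group 1 is the text inside <code>...</code>.
        Pattern pattern = Pattern.compile("<code[^>]*>(.+?)</code\\s*>");
        Matcher matcher = pattern.matcher(input);

        List<String> contents = new ArrayList<>();
        // Each find() resumes after the previous match, so the loop visits every match.
        while (matcher.find()) {
            contents.add(matcher.group(1));
        }
        return contents;
    }

    public static void main(String[] args) {
        String input = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
        System.out.println(findAll(input));
    }
}
```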

Replace every word with tag

JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}
Try to use something like
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove first closing word-tag
This should work, though you may have to adjust it.
var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But you have to keep in mind: .replace(" ",foo) only replaces the space once. For multiple replaces you have to use .replace(/\s+/g,foo)
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the better solution is the first.
JSFiddle example: http://jsfiddle.net/c6ftq/4/
inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
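A quick way to try the one-liner above is to wrap it in a small method. This is a sketch under assumptions: the class and method names are invented for illustration, and the input is assumed to contain only bare tags. Since the pattern only looks at the characters immediately around each word, it would also wrap attribute names and values inside tags such as <a href="...">, so it is not a general HTML solution.

```java
// Illustrative wrapper around the regex from the answer above.
final class WordWrapper {

    static String wrapWords(String html) {
        // (?<!</?)   : the word must not start right after "<" or "</" (a tag name)
        // \\w++      : possessive word match, so it cannot backtrack into part of a tag name
        // (?!\\s*>)  : the word must not be followed by optional spaces and ">"
        return html.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<html>Hello every one, cheers</html>"));
    }
}
```

The possessive quantifier matters here: without it, the regex engine could give up the last letter of a tag name and wrap the remaining fragment.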
You can try following code,
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}

how to remove anchor tag and make it text

String k= <html>
<a target="_blank" href="http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&amp;id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&amp;aa=">number S.O.I93(E), dated the 30th March, 1999
</html>
I'm getting this HTML in a String, and I want to remove the anchor tag so that the link data is removed as well.
I just want to display it as text, not as a link.
How can I do this? I have tried many things without success. I am creating an Android app, and I run into this issue in a WebView.
Use Jsoup and Jsoup.parse().
You can use the following example (I don't remember where I found it, but it works), which uses a replace method to modify the string before showing it:
k = replace ( k, "<a target=\"_blank\" href=", "");
String replace(String _text, String _searchStr, String _replacementStr) {
// String buffer to store str
StringBuffer sb = new StringBuffer();
// Search for search
int searchStringPos = _text.indexOf(_searchStr);
int startPos = 0;
int searchStringLength = _searchStr.length();
// Iterate to add string
while (searchStringPos != -1) {
sb.append(_text.substring(startPos, searchStringPos)).append(_replacementStr);
startPos = searchStringPos + searchStringLength;
searchStringPos = _text.indexOf(_searchStr, startPos);
}
// Create string
sb.append(_text.substring(startPos,_text.length()));
return sb.toString();
}
To replace the entire opening anchor tag with an empty string:
k = replace ( k, "<a target=\"_blank\" href=\"http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&aa=\">", "");
No escape is needed for slash.
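If the exact href is not known in advance, the literal replacement above will not match. One possible alternative, shown here only as a rough sketch and not taken from the answers, is to strip every opening and closing anchor tag with a regular expression. The class and method names are hypothetical, and the pattern assumes that no attribute value contains a '>' character; for arbitrary HTML a real parser such as Jsoup is the safer choice.

```java
// Hypothetical helper: removes <a ...> and </a> tags but keeps the text between them.
final class AnchorStripper {

    static String stripAnchors(String html) {
        // (?i)   : case-insensitive, so <A HREF=...> is handled too
        // </?a   : opening or closing anchor tag
        // \\b    : word boundary, so tags like <abbr> are left alone
        // [^>]*> : everything up to the end of the tag
        return html.replaceAll("(?i)</?a\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String k = "<html><a target=\"_blank\" href=\"http://example.com/?x=1&amp;y=2\">some text</a></html>";
        System.out.println(stripAnchors(k));
    }
}
```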

Modify large string

I have a large string in the following format -
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
I'd like to store all occurrences of the value that occurs before .html. So the above HTML becomes something like 12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
Do I need a regular expression? or some kind of replace method.
Thanks
You don't actually need a regular expression, but you could use the underlying Matcher class:
final String searchString = "12345.html";
final String txt =
"<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">\n"
+ "<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">";
final Matcher matcher = Pattern.compile(searchString, Pattern.LITERAL).matcher(txt);
final StringBuilder sb = new StringBuilder();
while(matcher.find()){
if(sb.length() > 0) sb.append(',');
sb.append(matcher.group());
}
System.out.println(sb.toString());
Output:
12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
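The Matcher loop above only works when the file name is known up front. If the names vary, a capture group can collect whatever precedes ".html" in each href. The following is a minimal sketch under the assumption that every link has the form href="NAME.html" with double quotes; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper: collects the part of each href that comes before ".html".
final class HrefValues {

    static List<String> namesBeforeHtml(String txt) {
        // Group 1 captures everything between href=" and the trailing .html"
        Pattern pattern = Pattern.compile("href=\"([^\"]*)\\.html\"");
        Matcher matcher = pattern.matcher(txt);

        List<String> names = new ArrayList<>();
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String txt = "<a href=\"12345.html\"><a href=\"678.html\">";
        // String.join produces the comma-separated form asked for in the question.
        System.out.println(String.join(",", namesBeforeHtml(txt)));
    }
}
```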
You can use an HTML parser like Jsoup.
Document doc = Jsoup.parse(yourString);
Elements els = doc.select("a");
for(Element el: els){
//this only if needs the number without the HTML
//if not, only el.attr("href")
if(el.attr("href").contains(".html")){
String[] parts = el.attr("href").split("\\.html"); // split() takes a regex, so the dot must be escaped
System.out.println(parts[0]);
}
}
Don't use regex to parse HTML.
If you are accessing this string inside the Java code, you can split the string on the "=" delimiter. It will result in a bunch of strings. One string will look like "
So the steps are:
1. split the string which will result in string array.
2. Iterate over the resulting array and look for the pattern ">
