Using Jsoup.clean(), jsoup turns the title attribute of a HTML link from:
TEST
into:
TEST
This is the demo application:
Whitelist whitelist = new Whitelist();
whitelist.addTags("a");
whitelist.addAttributes("a", "href", "title");
String input = "TEST";
System.out.println("input: " + input);
String output = Jsoup.clean(input, whitelist);
System.out.println("output: " + output);
which prints:
input: TEST
output: TEST
I tried to add OutputSettings with EscapeMode:
OutputSettings outputSettings = new OutputSettings();
outputSettings.escapeMode(EscapeMode.xhtml);
EscapeMode.base and EscapeMode.extend have no effect. EscapeMode.xhtml prints the following:
input: TEST
output: TEST
Any idea how to keep jsoup from altering the title attribute?
This is a known issue/behavior: https://github.com/jhy/jsoup/issues/684 (marked as "won't fix" by the jsoup team).
There's not a bug here.
When serializing (i.e. in your example, when you're printing out XML/HTML), we escape as few characters as necessary. That is why the > is not escaped to &gt;: because it's in a quoted attribute, there's no ambiguity that it's closing a tag, so it doesn't get escaped.
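To illustrate why the unescaped > is harmless, here is a minimal sketch that uses the JDK's built-in XML parser rather than jsoup. It only checks that both spellings of the attribute decode to the same value. The class name AttributeEscapeDemo, the helper titleOf, and the href value "#" are made up for this example and are not taken from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeEscapeDemo {

    // Parses a single-element snippet and returns the decoded value of its title attribute.
    static String titleOf(String markup) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)))
                    .getDocumentElement();
            return root.getAttribute("title");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + markup, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical inputs: the same link with an escaped and an unescaped '>'.
        String escaped = "<a href=\"#\" title=\"&gt;\">TEST</a>";
        String unescaped = "<a href=\"#\" title=\">\">TEST</a>";

        // Both parse cleanly and yield the same attribute value.
        System.out.println(titleOf(escaped).equals(titleOf(unescaped))); // prints "true"
    }
}
```

A parser reads the same title either way, so the shorter serialization loses no information.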
For the input text:
<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
I run the following code:
Whitelist list = Whitelist.simpleText().addTags("br");
// Some other code...
// plaintext is the string shown above
retVal = Jsoup.clean(plaintext, StringUtils.EMPTY, list,
new Document.OutputSettings().prettyPrint(false));
I get the output:
Arbit string <b>of</b>
text. <em>What</em> to <strong>do</strong> with it?
I don't want Jsoup to convert the <br> tags to line breaks, I want to keep them as-is. How can I do that?
Try this:
Document doc2deal = Jsoup.parse(inputText);
doc2deal.select("br").append("br"); //or append("<br>")
This is not reproducible for me. Using Jsoup 1.8.3 and this code:
String html = "<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?";
String cleaned = Jsoup.clean(html,
"",
Whitelist.simpleText().addTags("br"),
new Document.OutputSettings().prettyPrint(false));
System.out.println(cleaned);
I get the following output:
Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
Your problem must be somewhere else I guess.
I got the following regex working to search for video links in a page
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately, it does not stop at the end of the link if there is another match right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after applying the regex, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!
Here is how you can do something similar with JSoup parser.
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of some page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]");//find all anchors with href attribute
for (Element el : elements) {
    URL url = new URL(el.attr("href"));
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: "
                + new File(url.getPath()).getName());
        System.out.println("------");
    }
}
I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href=" followed by either a double-quote or single-quote character, then lazily capture everything up to the file extension, which must be followed by the same quote character that opened the attribute (the \1 backreference). Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
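The extraction described above can be sketched as follows. This is only an illustration: the class name HrefVideoLinks, the helper extract, and the second sample link (example.com) are invented for the example. Note that the lazy (.*?) can still run across a preceding non-video href, so this is a sketch for simple input, not a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefVideoLinks {

    // Group 1: opening quote, group 2: URL without extension, group 3: extension.
    // \1 requires the closing quote to be the same character as the opening one.
    private static final Pattern VIDEO_HREF =
            Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");

    // Returns every href value that ends in one of the video extensions.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = VIDEO_HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2) + "." + matcher.group(3));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>"
                + "<a href='http://example.com/clip.mp4'>clip</a>";
        // prints [http://somevideo.flv, http://example.com/clip.mp4]
        System.out.println(extract(html));
    }
}
```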
Your regex is greedy.
Limit its greediness by making the quantifiers reluctant (append ? to them):
(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)
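A minimal sketch comparing the greedy pattern from the question with a variant in which both quantifiers before the extension are reluctant. The class name GreedyVsReluctant and the helper firstMatch are made up for the example, and the sample input assumes the link from the question was an ordinary anchor tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Pattern from the question: [^/]+ and \S+ grab as much as they can.
    static final Pattern GREEDY =
            Pattern.compile("(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)");

    // Same pattern with reluctant quantifiers: stops at the first extension.
    static final Pattern RELUCTANT =
            Pattern.compile("(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)");

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(firstMatch(GREEDY, html));    // http://somevideo.flv">somevideoname.avi
        System.out.println(firstMatch(RELUCTANT, html)); // http://somevideo.flv
    }
}
```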
I am parsing pages for email data. How would I get a hidden email address that is generated using JavaScript on the page I am parsing?
If you take a look at the HTML source (using Firebug or something else), you will see that it is a link tag generated inside a div named sobi2Details_field_email that is set to display:none.
This is my code for now, but the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5=doc.getElementById("sobi2Details_field_email");
if(e5!=null)
{
emaildata=e5.child(1).absUrl("href").toString();
}
System.out.println (emaildata);
You need to do several steps because Jsoup doesn't allow you to execute JavaScript.
I reverse engineered it and this is what came out:
public static void main(final String[] args) throws IOException
{
final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
final Document doc = Jsoup.connect(url).get();
final Element e5 = doc.getElementById("sobi2Details_field_email");
System.out.println("--- this is how we start");
System.out.println(e5 + "\n\n\n\n");
// remove the xml encoding
System.out.println("---Remove XML encoding\n");
String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
System.out.println(email + "\n\n\n\n");
// remove the concatenation with ' + '
System.out.println("--- Remove concatenation (all: ' + ')");
email = email.replaceAll("' \\+ '", "");
System.out.println(email + "\n\n\n\n");
// extract the email address variables
System.out.println("--- Remove useless lines");
Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group();
System.out.println(email + "\n\n\n\n");
// get the two strings enclosed by '' and concatenate them
System.out.println("--- Extract the email address");
matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group(1) + matcher.group(2);
System.out.println(email);
}
If something is generated dynamically with JavaScript on the client side after the response from the server is complete, there are only two options:
Reverse engineering: figure out what the script does and implement the same behaviour yourself.
Download the JavaScript from the processed page and use Java's JavaScript processor to execute the script and get the result (yes, it is possible, and I was forced to do such a thing).
As the title suggests, I only want to replace content that starts with # and skip content that starts with !. Here is the code snippet; it does not skip the word that starts with !#.
String test = "Hello #Admin Welcome this is Your welcome page !#Admin This is #Admin";
NOTE: It must skip !#Admin when replacing.
String out = test.replaceAll("#Admin", "MyAdministrator");
log.debug("OutPut: "+out);
OutPut: Hello MyAdministrator Welcome this is Your welcome page !MyAdministrator This is MyAdministrator
How can I ignore the word that starts with an exclamation mark?
THANKS.
Use a negative lookbehind?
String test = "Hello #Admin Welcome this is Your welcome page !#Admin This is #Admin";
String out = test.replaceAll("(?<!!)#Admin", "MyAdministrator");
System.out.println("OutPut: "+out);
The lookbehind is (?<!!).
Try this:
String out = test.replaceAll("(?<!!)#Admin", "MyAdministrator");
This is called a negative lookbehind; see the Pattern API documentation.
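The negative lookbehind suggested in the answers can be wrapped in a small helper, sketched below. The class name SkipEscapedTokens and the method replaceUnescaped are hypothetical; Pattern.quote and Matcher.quoteReplacement are used so that the token and the replacement are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipEscapedTokens {

    // Replaces every occurrence of token that is NOT immediately preceded by '!'.
    static String replaceUnescaped(String text, String token, String replacement) {
        // (?<!!) is a negative lookbehind: the match fails if the previous character is '!'.
        String regex = "(?<!!)" + Pattern.quote(token);
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String test = "Hello #Admin Welcome this is Your welcome page !#Admin This is #Admin";
        // prints: Hello MyAdministrator Welcome this is Your welcome page !#Admin This is MyAdministrator
        System.out.println(replaceUnescaped(test, "#Admin", "MyAdministrator"));
    }
}
```

The lookbehind is zero-width, so the ! itself is never consumed or removed.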
I have a requirement in my project.
I generate a comment string in javascript.
Coping Option: Delete all codes and replace
Source Proj Num: R21AR058864-02
Source PI Last Name: SZALAI
Appl ID: 7924675; File Year: 7924675
I send this to server where I store it as a string in db and then after that I retrieve it back and show it in a textarea.
I generate it in javascript as :
codingHistoryComment += 'Source Proj Num: <%=mDefault.getFullProjectNumber()%>'+'\n';
codingHistoryComment += 'Source PI Last Name: <%=mDefault.getPILastName()%>'+'\n';
codingHistoryComment += 'Appl ID: <%=mDefault.getApplId()%>; File Year: <%=mDefault.getApplId()%>'+'\n';
In Java I am trying to replace the \n with <br>:
String str = soChild2.getChild("codingHistoryComment").getValue().trim();
if(str.contains("\\n")){
str = str.replaceAll("(\r\n|\n)", "<br>");
}
However, the textarea still gets populated with the "\n" characters:
Coping Option: Delete all codes and replace\nSource Proj Num: R21AR058864-02\nSource PI Last Name: SZALAI\nAppl ID: 7924675; File Year: 7924675\n
Thanks.
In Java I am trying to replace the \n with <br>
Don't replace the "\n". A JTextArea will parse that as a new line string.
Trying to convert it to a "br" tag won't help either since a JTextArea does not support html.
I always just use code like the following to populate a text area with text:
JTextArea textArea = new JTextArea(5, 20);
textArea.setText("1\n2\n3\n4\n5\n6\n7\n8\n9\n0");
// automatically wrap lines
textArea.setLineWrap( true );
// break lines on word, rather than character boundaries.
textArea.setWrapStyleWord( true );
Here is a test that works, try it out:
String str = "This is a test\r\n test.";
if(str.contains("\r\n")) {
System.out.println(str);
}
Assuming JavaScript (since you try to replace with an HTML line break):
A newline in an HTML textarea should be the newline character \n, not the HTML line break <br>. Instead of your current if statement and replace, use the code below to remove the extra backslashes. Don't forget to assign the value to the textarea after the replacement.
Try:
str = str.replaceAll("\\n", "\n");
I think your problem is here:
if(str.contains("\\n")){
Instead of "\\n" you just need "\n"
Then instead of "\n" you need "\\n" here:
str = str.replaceAll("(\r\n|\n)", "<br>");
By the way, the if(str.contains(...)) check is not really needed, because running replaceAll does no harm when there are no "\n" characters.
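The textarea output in the question shows a visible \n, which suggests (this is an assumption) that the stored comment holds the two characters backslash and n rather than a real line break. A minimal Java sketch of the difference follows; the class name LineBreakUnescape and the helper unescapeNewlines are invented for illustration.

```java
public class LineBreakUnescape {

    // Turns the literal two-character sequence backslash + n into a real newline.
    // String.replace works on plain text, so no regex escaping is involved.
    static String unescapeNewlines(String s) {
        return s.replace("\\n", "\n");
    }

    public static void main(String[] args) {
        String real = "line1\nline2";     // contains one newline character
        String literal = "line1\\nline2"; // contains backslash followed by n

        // "\\n" in Java source is backslash + n, so it only matches the literal form.
        System.out.println(real.contains("\\n"));    // false
        System.out.println(literal.contains("\\n")); // true

        // After unescaping, both strings are identical and display as two lines.
        System.out.println(unescapeNewlines(literal).equals(real)); // true
    }
}
```

A string that already contains real newlines passes through unescapeNewlines unchanged, so the call is safe to apply unconditionally.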