Java jcabi xpath returns unescaped text - java

Consider the following:
String s = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.</tag>";
final XML xml = new XMLDocument(s);
String extractedText = xml.xpath("//tag/text()").get(0);
System.out.println(extractedText); // Output: This has a <a href="#">link<a>.
System.out.println(s.contains(extractedText)); // Output: false!
System.out.println(s.contains("This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.")); // Output: true
I have an XML file given as a string with some escaped HTML. Using the jcabi library, I get the text of the relevant elements (in this case everything inside <tag> elements). However, what I get isn't actually what's in the original string: I'm expecting &lt; and &gt; but am getting < and > instead. The original string paradoxically does not contain the substring that I extracted from it.
How can I get the actual text and not an unescaped version?
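One possible workaround, sketched here with the JDK only: XPath text() yields the parsed character data, so entities are already decoded by the time the value is returned. To compare against the raw source, the extracted text can be re-escaped. The helper name escapeXmlText is hypothetical (it is not part of jcabi), and the sketch assumes the source stores the markup as &lt; and &gt; entities and that only &, < and > were escaped.

```java
final class EscapeXmlTextSketch {

    /** Hypothetical helper: re-escapes the characters that XML text content must encode. */
    static String escapeXmlText(String text) {
        // '&' goes first so the entities added afterwards are not escaped twice.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String s = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.</tag>";
        // The decoded value that the XPath query returns in the question.
        String extractedText = "This has a <a href=\"#\">link<a>.";

        System.out.println(s.contains(extractedText));                // false
        System.out.println(s.contains(escapeXmlText(extractedText))); // true
    }
}
```

Re-escaping is only a faithful round trip if the source used exactly these entities; numeric character references or CDATA sections in the original would not be reproduced.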

Related

GWT - extract text in between two characters

In GWT I have a servlet that returns an image from the database to the client. I need to extract part of the returned string to show the image properly. What is returned in Chrome, Firefox, and IE has a backslash in the src part, e.g. String s = "src=\"";, which is not visible in the string below. Maybe the backslash is adding extra quotes around the http string; I'm not sure.
What is returned in those three browsers is: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
Edge doesn't have the backslash in the src, so my method for extracting the image doesn't work in Edge.
What Edge returns:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
Problem: I need to extract the string below.
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
either with src= or src=\
What I tried, which works with the browsers that return "src=\":
String s = "src=\"";
int index = returned.indexOf(s) + s.length();
image.setUrl(returned.substring(index, returned.indexOf("\"", index + 1)));
But this fails in Edge because it doesn't return a backslash.
I do not have access to Pattern and Matcher in GWT.
How can I extract the following, keeping in mind that the entityId number will change:
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
out of the returned string above?
EDIT:
I need a generic way to extract http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
when the string might look like either of these:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
The other three browsers return: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
public static void main(String[] args) {
    String toParse = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
    String delimiter = "src=\"";
    // Start right after the opening quote of the src attribute.
    int index = toParse.indexOf(delimiter) + delimiter.length();
    // Everything up to the next double quote is the URL.
    System.out.println(toParse.substring(index, toParse.length()).split("\"")[0]);
}
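The snippet above assumes a straight double quote after src=. A minimal sketch that also tolerates the typographic quotes shown in the Edge string is given below. It uses only indexOf, charAt and substring, which should be available in GWT's emulated JRE, and no regex classes. The class and method names (SrcExtractor, extractSrc) are hypothetical, and the sketch assumes the URL itself contains no spaces or quote characters.

```java
final class SrcExtractor {

    // Quote characters that may wrap the value: straight double, single, and typographic.
    private static final String QUOTES = "\"'\u201C\u201D";

    /** Hypothetical helper: returns the src value, or null when no src attribute is found. */
    static String extractSrc(String html) {
        String marker = "src=";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        // Skip one opening quote of any supported kind, if present.
        if (start < html.length() && QUOTES.indexOf(html.charAt(start)) >= 0) {
            start++;
        }
        // Read until a closing quote, a space, or the end of the tag.
        int end = start;
        while (end < html.length()) {
            char c = html.charAt(end);
            if (QUOTES.indexOf(c) >= 0 || c == '>' || c == ' ') {
                break;
            }
            end++;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String others = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
        String edge = "<img src=\u201Dhttp://localhost:8080/dashboardmanager/downloadfile?entityId=4886\u201D>";
        System.out.println(extractSrc(others));
        System.out.println(extractSrc(edge));
    }
}
```

Both calls print the same URL, so the result can be passed to image.setUrl(...) regardless of which browser produced the string.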

How to keep link title attribute with jsoup?

Using Jsoup.clean(), jsoup turns the title attribute of a HTML link from:
TEST
into:
TEST
This is the demo application:
Whitelist whitelist = new Whitelist();
whitelist.addTags("a");
whitelist.addAttributes("a", "href", "title");
String input = "TEST";
System.out.println("input: " + input);
String output = Jsoup.clean(input, whitelist);
System.out.println("output: " + output);
which prints:
input: TEST
output: TEST
I tried to add OutputSettings with EscapeMode:
OutputSettings outputSettings = new OutputSettings();
outputSettings.escapeMode(EscapeMode.xhtml);
EscapeMode.base and EscapeMode.extend have no effect. EscapeMode.xhtml prints the following:
input: TEST
output: TEST
Any idea how to stop jsoup from altering the title attribute?
This is a known issue/behavior: https://github.com/jhy/jsoup/issues/684 (marked as "won't fix" by the jsoup team).
There's not a bug here.
When serializing (i.e. in your example when you're printing out XML/HTML), we escape as few characters as necessary. That is why the > is not escaped to &gt;; because it's in a quoted attribute, there's no ambiguity that it's closing a tag, so it doesn't get escaped.

Preserving the <br> tags when cleaning with Jsoup

For the input text:
<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
I run the following code:
Whitelist list = Whitelist.simpleText().addTags("br");
// Some other code...
// plaintext is the string shown above
retVal = Jsoup.clean(plaintext, StringUtils.EMPTY, list,
        new Document.OutputSettings().prettyPrint(false));
I get the output:
Arbit string <b>of</b>
text. <em>What</em> to <strong>do</strong> with it?
I don't want Jsoup to convert the <br> tags to line breaks, I want to keep them as-is. How can I do that?
Try this:
Document doc2deal = Jsoup.parse(inputText);
doc2deal.select("br").append("br"); //or append("<br>")
This is not reproducible for me. Using Jsoup 1.8.3 and this code:
String html = "<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?";
String cleaned = Jsoup.clean(html,
        "",
        Whitelist.simpleText().addTags("br"),
        new Document.OutputSettings().prettyPrint(false));
System.out.println(cleaned);
I get the following output:
Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
Your problem must be somewhere else I guess.

javascript error with carriage return and newline

I have the following string in Java:
"this is text1\r\nthis is text2\r\nthis is text3"
I am replacing \r\n with <br/> as follows:
String temp = "this is text1\r\nthis is text2\r\nthis is text3";
temp = temp.replaceAll("[\r\n]+", "<br/>");
which produces the following string: "this is text1<br/>this is text2<br/>this is text3"
Now I want to pass it to a JavaScript element as follows:
var desc_str = "<%=temp%>";
document.getElementById('proc_desc').value = desc_str;
The output from Java is fine, but after passing it to the HTML element I get the JavaScript error "unterminated String literal". I can't find the cause; please help.
It sounds like you haven't gotten all of the line breaks out of your String.
You might find this post helpful:
How do I replace all line breaks in a string with <br /> tags?
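As a hedged sketch of one way to catch every line break on the Java side: the regex \R (Java 8 and later) matches any Unicode line-break sequence, including \r\n, a lone \r or \n, and U+2028/U+2029, which a character class like [\r\n] misses. The helper name toJsStringBody is hypothetical. It also escapes backslashes and double quotes, on the assumption that the value is emitted inside a double-quoted JavaScript literal as in the question.

```java
final class JsStringSketch {

    /** Hypothetical helper: makes text safe to emit inside a double-quoted JS string literal. */
    static String toJsStringBody(String text) {
        return text.replace("\\", "\\\\")        // escape backslashes first
                   .replace("\"", "\\\"")        // then double quotes
                   .replaceAll("\\R", "<br/>");  // \R matches any Unicode line break
    }

    public static void main(String[] args) {
        String temp = "this is text1\r\nthis is text2\r\nthis is text3";
        System.out.println(toJsStringBody(temp));
        // prints: this is text1<br/>this is text2<br/>this is text3
    }
}
```

Unlike "[\r\n]+", this emits one <br/> per line break, so blank lines are preserved rather than collapsed.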

string.matches(".*") returns false

In my program, I have a string (obtained from an external library) which doesn't match any regular expression.
String content = // extract text from PDF
assertTrue(content.matches(".*")); // fails
assertTrue(content.contains("S P E C I A L")); // passes
assertTrue(content.matches("S P E C I A L")); // fails
Any idea what might be wrong? When I print content to stdout, it looks ok.
Here is the code for extracting text from the PDF (I am using iText 5.0.1):
PdfReader reader = new PdfReader(source);
PdfTextExtractor extractor = new PdfTextExtractor(reader,
        new SimpleTextExtractingPdfContentRenderListener());
return extractor.getTextFromPage(1);
By default, the . does not match line breaks. So my guess is that your content contains a line break.
Also note that matches will match the entire string, not just a part of it: it does not do what contains does!
Some examples:
String s = "foo\nbar";
System.out.println(s.matches(".*")); // false
System.out.println(s.matches("foo")); // false
System.out.println(s.matches("foo\nbar")); // true
System.out.println(s.matches("(?s).*")); // true
The (?s) in the last example will cause the . to match line breaks as well. So (?s).* will match any string.
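For completeness, a small sketch of the two alternatives using java.util.regex directly: Matcher.find() looks for the pattern anywhere in the input (the regex counterpart of contains), and Pattern.DOTALL is the flag form of (?s). The class and method names, and the sample content string, are made up for illustration.

```java
import java.util.regex.Pattern;

final class MatchVsFind {

    /** True if the regex matches somewhere inside the input. */
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    /** True if the regex matches the whole input, with '.' also matching line breaks. */
    static boolean matchesAcrossLines(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF page; real content will differ.
        String content = "header\nS P E C I A L\nfooter";

        System.out.println(content.matches(".*"));                               // false
        System.out.println(containsMatch("S P E C I A L", content));             // true
        System.out.println(matchesAcrossLines(".*S P E C I A L.*", content));    // true
    }
}
```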
