Jsoup.parse(String) and document.toString() and document.html() not the same

I need to change some tags in an HTML String, so I am using Jsoup. As a first check I only parse the HTML and convert it back, without editing anything:
First I load the HTML String from a URL => str1
Then I make a Document from str1 so I can edit it with Jsoup:
Document doc = Jsoup.parse(str1);
Then I call doc.html() or doc.toString() to convert doc back to a String => str2
I load str1 and str2 into a WebView with loadDataWithBaseURL
and see that str2 does not render the same as str1 (for example, a video frame does not fit the screen when I use str2).
Why does this happen, and how can I fix it?

Jsoup resolves relative URLs in the input to absolute URLs using the base URI you provide. The Javadoc for the method you are calling says:
org.jsoup.Jsoup.parse(String)
Parse HTML into a Document. As no base URI is specified, absolute URL
detection relies on the HTML including a <base href> tag.
You likely need to add the base href in your input content or call this method instead:
org.jsoup.Jsoup.parse(String content, String baseUri)
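
To illustrate what resolving against a base URI means, here is a minimal sketch that uses java.net.URI from the standard library instead of jsoup itself. The example.com URLs and the class name BaseUriDemo are made up for illustration; jsoup performs comparable resolution when it knows a base URI, for example through the two-argument parse method above.

```java
import java.net.URI;

final class BaseUriDemo {
    // Hypothetical helper: resolve a possibly relative reference against a base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "https://example.com/videos/page.html"; // assumed base URI
        System.out.println(resolve(base, "clip.mp4"));          // relative to the page's folder
        System.out.println(resolve(base, "/css/site.css"));     // relative to the host root
        System.out.println(resolve(base, "https://cdn.example.com/player.js")); // already absolute
    }
}
```

Without a base URI there is nothing to resolve a relative reference such as clip.mp4 against, which is why the one-argument parse depends on a <base href> tag in the HTML.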

Related

JSoup not reading content from URL with anchor

I'm using JSoup to read content from the following page:
https://www.astrology.com/horoscope/daily/aries.html#Monday
This is the code that I'm using:
String test1 = "https://www.astrology.com/horoscope/daily/aries.html#Monday";
String test2 = "https://www.astrology.com/horoscope/daily/aries.html#Tuesday";
Document document = Jsoup.connect(test1).get();
Element content = document.getElementById("content");
Element p = content.child(0);
String myTest = p.text();
In the URL I can pass the day as an anchor (see the test1 and test2 variables), but in both cases it returns the same content. It looks like JSoup is simply ignoring the anchor and just using the base URL: https://www.astrology.com/horoscope/daily/aries.html. Is there a way for JSoup to read a URL with an anchor?

Jsoup ignores the anchor because the relevant information is rendered with JavaScript, which Jsoup cannot execute. If you examine the page with your browser's dev tools, you'll see that the daily info comes from a JSON file, such as https://www.astrology.com/horoscope/daily/all/aries/2021-03-23/, so you can change the date and sign in that URL to get whatever you like.
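
The anchor (the part after #) is handled by the browser and is not part of the HTTP request, so test1 and test2 fetch the same page. The sketch below shows that, and builds a data URL following the pattern quoted above. The class and method names are hypothetical, and the URL pattern is taken from the example in this answer on the assumption that it still holds.

```java
import java.net.URI;
import java.time.LocalDate;

final class HoroscopeUrls {
    // Returns the part after '#', or null when the URL has no fragment.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // Returns the URL as an HTTP client requests it: without the fragment.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Builds a data URL from a sign and a date (pattern assumed from the answer's example).
    static String dailyUrl(String sign, LocalDate date) {
        // LocalDate.toString() yields the ISO form yyyy-MM-dd
        return "https://www.astrology.com/horoscope/daily/all/" + sign + "/" + date + "/";
    }

    public static void main(String[] args) {
        String test1 = "https://www.astrology.com/horoscope/daily/aries.html#Monday";
        System.out.println(fragmentOf(test1));
        System.out.println(withoutFragment(test1));
        System.out.println(dailyUrl("aries", LocalDate.of(2021, 3, 23)));
    }
}
```

The string returned by dailyUrl can then be passed to whatever HTTP client you already use to fetch the data.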

android getText from string resource by keeping html tags

In my app, I want to display a string with HTML markup in a TextView.
This gets the string:
String mystring = getResources().getString(R.string.Terms_And_Conditions);
except that it won't keep the HTML tags. Forum posts say to use getText() instead, but that doesn't work:
mystring = getResources().getText(R.string.Terms_And_Conditions);
I get the error
error: incompatible types: CharSequence cannot be converted to String
In all the examples of using HTML in a TextView, they just had a simple hard-coded string, and that works fine. But when I pull the string from the resources, it eats all the HTML: the tags are gone as if they weren't there, and the markup isn't applied either. Everyone says I need to use getText() instead. So how can I use getText() to get a string resource, or how can I display the HTML in my string resource? Or do I just have to hard-code the string directly in Java?
Thank you in advance.

If you want to use getText() you have to proceed like this:
String mystring = activity.getResources().getText(R.string.Terms_And_Conditions).toString();

To allow html characters, you can wrap the resource value in <![CDATA[ ... ]]>
For example, if I want to display the string <button></button>:
strings.xml:
<string name="button_button"><![CDATA[<button></button>]]></string>
MainActivity.kt:
val mystring:String = getResources().getText(R.string.button_button).toString();
Log.wtf("mystring", mystring)
Log output:
E/mystring: <button></button>

Text (URL) converted to wrong Encoding

This is a URL which I have in a text file. When my application reads the file and converts it to a string, it ends up with strange characters added.
Before:
<p><a href="http://www.w3schools.com" target="iframe_a">W3Schools.com</a></p>
After:
"%3Cp%3E%3Ca%20href%3D%22http%3A//www.w3schools.com%22%20target%3D%22iframe_a%22%3EW3Schools.com%3C/a%3E%3C/p%3E"
I'm aware that this is probably an encoding issue. My question is: why does it happen, and why only to URLs? How can I stop it from happening in Java and on iOS?

The string you posted is not a URL, it's HTML that contains a URL.
You can't treat HTML as if it is a valid URL because it's not.

As @Duncan_C said: there's HTML code around your URL. You can use the Jsoup library to get rid of that. Once you've done that, it will encode properly.
Here's how to do that:
String html = "<p><a href=\"http://www.w3schools.com\" target=\"iframe_a\">W3Schools.com</a></p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String url = link.attr("href");
System.out.println(url);
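
The "strange characters" are percent-encoding applied to the whole HTML snippet. As a quick check, the sketch below decodes the "After" string from the question with java.net.URLDecoder, which gives back the HTML including its <a> tag. The class name is hypothetical, and the Charset overload of decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class PercentDecodeDemo {
    // Reverses percent-encoding such as %3C -> '<' and %20 -> ' '.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The "After" string exactly as posted in the question
        String after = "%3Cp%3E%3Ca%20href%3D%22http%3A//www.w3schools.com%22"
                + "%20target%3D%22iframe_a%22%3EW3Schools.com%3C/a%3E%3C/p%3E";
        System.out.println(decode(after));
    }
}
```

Note that URLDecoder also turns + into a space, which does not matter for this input but can for others.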

Jsoup: How to get the returned Document's URL if a redirect was involved between request and response

I have a Java web crawler. It opens URLs of this type:
http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=
The final URL is different from this one, which I guess means that a redirect is involved. I can get and parse the returned Document, but is there any way to get the "final", "real" URL too?

That URL is not doing an HTTP redirect; it returns a page that contains this meta tag:
<meta http-equiv="refresh" content="0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158">
You can see your "final" url there.
You can parse the document for this tag with (for example) select("meta[http-equiv=refresh]")
And then parse the content attribute.
Summing up:
Document doc = Jsoup.connect("http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=").get();
Elements select = doc.select("meta[http-equiv=refresh]");
String content = select.first().attr("content");
String prefix = "url=";
String url = content.substring(content.indexOf(prefix) + prefix.length());
System.out.println(url);
This prints the URL you want:
https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158
I hope it will help.
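
The substring approach assumes the attribute contains a lowercase "url=" and no quotes around the target. As a slightly more defensive sketch, the hypothetical helper below accepts any letter case, optional whitespace, and optional quotes, and returns an empty Optional when the content attribute holds only a delay. It works on the attribute value alone, so it can be fed the content string obtained from the Jsoup call above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaRefresh {
    // Shape handled here: "<delay>; url=<target>", with ',' also accepted as separator.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*[\\d.]+\\s*[;,]\\s*url\\s*=\\s*['\"]?(.*?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Extracts the target URL from a meta refresh content value, if there is one.
    static Optional<String> targetUrl(String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches() || m.group(1).isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1));
    }

    public static void main(String[] args) {
        String content = "0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx"
                + "?PartnerId=30033&SiteID=5635&type=mail&jobId=722158";
        System.out.println(targetUrl(content).orElse("(no refresh target)"));
        System.out.println(targetUrl("30").orElse("(no refresh target)"));
    }
}
```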

extract text from HTML segment using standard java

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text.
For example: <b>hello world</b> ----> hello world
Is there a way to extract the text using the Java standard library?
Maybe something more efficient than replacing open/close tags with an empty string via regex?
thanks,

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.
A regex will usually work on the first test, then grow more and more complex until it becomes impossible to adapt.

I will also say it: don't use regex with HTML. ;-)
You can give JTidy a shot.

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
parserDelegator.parse(in, harvester, true);
Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you override the appropriate callback method:
@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {
    // called for every start tag; only <a> and <area> tags are of interest here
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {
        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);
        /* ... */
    }
}
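
For the original question (getting only the text out of an HTML segment), the same JDK parser can be used with a callback that overrides handleText instead of handleStartTag. This is a minimal sketch under a few assumptions: the class name HtmlTextExtractor is made up, text chunks are joined with a single space, and the exact whitespace the parser hands to the callback is not guaranteed, so the result is normalized before use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlTextExtractor {
    // Collects the text chunks reported by the JDK's Swing HTML parser.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate chunks coming from different elements
                }
                text.append(data);
            }
        };
        try {
            // third argument true: ignore any charset declared inside the HTML
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        // collapse runs of whitespace so the result does not depend on parser details
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>hello world</p>"));
    }
}
```

This avoids a third-party dependency, at the cost of relying on an old parser that targets HTML 3.2; for modern or messy markup one of the dedicated libraries mentioned in the other answers is likely the safer choice.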

You can use HTMLParser, which is open source.
