How to parse a file containing html using JSOUP?

I have files containing HTML, and I am trying to parse each file and then tokenise the text of the body.
I do this with:
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());
The code above works, but text that sits outside the HTML tags, without any enclosing tag, is also printed as part of the body.
I need a way to stop this text outside the HTML tags from being read.
Please help, this is a time-sensitive question!

You can select and remove the unwanted nodes from your document.
doc.select("body > :matchText").remove();
This statement removes all text nodes that are direct children of the body element. The :matchText selector is fairly new, so make sure to use a reasonably recent version of jsoup (1.11.3 definitely works, but 1.10.2 does not).
You can find more information on the selector syntax at https://jsoup.org/cookbook/extracting-data/selector-syntax

Related

How to fix hanging html tags in HTML fragment?

I am getting a possibly malformed HTML fragment from an external source:
<p>Include all the information someone would need to answer your <i><i>question<p>
How can I make it safe for rendering within a bigger HTML document by closing all hanging HTML tags in Java?

You can try to parse the incoming string as XML; there are plenty of tools that do that. If parsing fails, the HTML is malformed (for instance, not all tags are correctly closed).
If you need stricter validation, you can additionally validate it against an XSD.
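
A minimal sketch of the parse-as-XML check described above, using only the DOM parser that ships with the JDK. The class and method names are hypothetical, and wrapping the fragment in a synthetic <root> element is an assumption made here so that fragments with several top-level elements can still be parsed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the fragment parses as well-formed XML once wrapped in a root element.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException | IOException e) {
            // Any parse failure is treated as "not well-formed".
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>fine <i>text</i></p>"));
        System.out.println(isWellFormed("<p>Include all the information <i><i>question<p>"));
    }
}
```

This check is stricter than HTML: unclosed void elements such as <br> and entities such as &nbsp; are rejected as well, so it detects problems but does not repair them.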

You can achieve that by writing your own custom parser in Java and fixing the tags.
The idea is this: collect all opening tags and look for the matching closing tag of each one in the string.
You can replace with if there is no closing tag found.
You need to handle duplicate tags and the valid tags that come before and after.
Alternatively, you can try one of these handy open-source parsers, which help with this:
http://java-source.net/open-source/html-parsers
http://htmlcleaner.sourceforge.net/ looks like a good option.
Hope this helps.
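
The custom-parser idea could be sketched roughly as follows: scan the tags with a regular expression, keep a stack of the ones still open, and append a closing tag for each leftover. This is only an illustration under simplifying assumptions, not an HTML5-compliant repair: the class name, the short list of void elements, and the decision to append all closers at the very end are choices made here. A real parser would, for example, treat a <p> inside a <p> differently.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HangingTagCloser {
    // Elements that never take a closing tag (deliberately incomplete list).
    private static final List<String> VOID =
            Arrays.asList("br", "hr", "img", "input", "meta", "link");
    // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for a self-closed tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    static String closeOpenTags(String fragment) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(fragment);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosed = !m.group(3).isEmpty();
            if (closing) {
                // Drops the most recently opened tag with this name, if there is one.
                open.removeFirstOccurrence(name);
            } else if (!selfClosed && !VOID.contains(name)) {
                open.push(name);
            }
        }
        StringBuilder out = new StringBuilder(fragment);
        // Close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeOpenTags(
                "<p>Include all the information someone would need to answer your <i><i>question<p>"));
    }
}
```

A dedicated HTML cleaner such as the ones linked above is likely to handle many more edge cases than this sketch.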

JSOUP Deleting closing and/or opening divs only

Hello, I have been googling for hours and can't find an answer (or something close to it).
What I am trying to do is this. Let's say I have this code (very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
I want to delete a specific number of the closing elements, let's say 2 of them, so the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or I want to delete the opening elements (again a specific number of them, let's say 2) without knowing their full names (so if the real name is id="one_54486464", I only know that it starts with one_).
After deleting, I would get this result:
<div id="three"></div></div></div>
Can anyone suggest a way to achieve these results? It does not have to involve JSOUP; any better, simpler or more efficient way is welcome :) (But I am using JSOUP to parse the document to get to the point where I am left with )
I hope I have explained myself clearly. If you have any questions, please ask. Thanks :)
EDIT: The elements that I want to delete are at the very end of the HTML document (so nothing comes after them: no body tag, no html tag, nothing).
Please keep in mind that the HTML document has many of them throughout the code, and I want to delete only a specific number at the end of the document.
The opening divs are at the very beginning of my HTML document and nothing comes before them. So I need to remove a specific number from the beginning without knowing their full IDs, only the start of each. Each of these divs also has a closing tag somewhere in the document, and I want to keep that closing tag.

For the first case, you can get the element's HTML (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
String trimmed = e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This removes the last 2 closing div tags. To change the number of tags to remove, just change the number between the curly brackets {n}.
(This is just an example and is probably unreliable; you should use some other String methods to decide which parts to discard.)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element whose ID starts with some String, you can use e.select("div[id^=two]").
You can find more details on how to select elements in the jsoup selector-syntax documentation.
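
As a small, self-contained illustration of the first case, the sketch below applies a slightly simplified form of the same regular expression to a plain String, with the number of tags as a parameter. The class and method names are hypothetical, and like the one-liner above it assumes the closing tags are the very last thing in the input.

```java
class TrailingDivStripper {
    // Removes the last `count` closing div tags, provided they end the string
    // (optionally separated or followed by whitespace).
    static String stripTrailingClosingDivs(String html, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        return html.replaceAll("(?:\\s*</div>){" + count + "}\\s*$", "");
    }

    public static void main(String[] args) {
        String html = "<div id=\"one\"><div id=\"two\"><div id=\"three\"></div></div></div>";
        System.out.println(stripTrailingClosingDivs(html, 2));
    }
}
```

If fewer than `count` closing tags end the string, the pattern does not match and the input is returned unchanged.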

After Titus suggested regular expressions, I decided to write a regex for deleting the opening divs too.
So I converted the Jsoup Document to a String, did the processing on the String, and then converted it back to a Jsoup Document so that I could use the Jsoup functions.
ADD: What I was doing was parsing two pages and joining them into one seamless page. So in the end there was no missing opening or closing div, my HTML code stayed free of errors, and I was therefore able to convert it back to a Jsoup Document without complications.
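
The regex for the opening divs is not shown in the post. One possible shape, as a sketch with hypothetical names, removes a given number of opening div tags from the very start of the string, whatever their ids are:

```java
class LeadingDivStripper {
    // Removes the first `count` opening div tags, provided they start the string
    // (optionally preceded or separated by whitespace).
    static String stripLeadingOpeningDivs(String html, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        return html.replaceFirst("^(?:\\s*<div\\b[^>]*>){" + count + "}", "");
    }

    public static void main(String[] args) {
        String html = "<div id=\"one\"><div id=\"two\"><div id=\"three\"></div></div></div>";
        System.out.println(stripLeadingOpeningDivs(html, 2));
    }
}
```

Because the pattern is anchored at the start and only counts tags, the matching closing tags later in the document are left in place, which is what the question asks for.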

Read a list of websites, get rid of HTML tags and write it all into a txt file

I am trying to read a list of websites one at a time and print their content to a single file. I would also like the HTML tags to be stripped, and I plan to use jsoup for the HTML parsing. How would I do this before writing the content to the file?

The exception is quite self-explanatory.
There is no next element because, quoting the API:
"if no more tokens are available"
Wrap your assignment in a while (myScanner.hasNext()) loop after initializing your Scanner.
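
A minimal sketch of that loop, with hypothetical class and method names. The Scanner reads from a String here so that the example is self-contained; in the question it would wrap the file holding the list of websites.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class UrlListReader {
    // Collects every whitespace-separated token without ever calling next() on an exhausted Scanner.
    static List<String> readUrls(Scanner scanner) {
        List<String> urls = new ArrayList<>();
        while (scanner.hasNext()) {
            urls.add(scanner.next());
        }
        return urls;
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner("http://example.com\nhttp://example.org\n");
        System.out.println(readUrls(scanner));
    }
}
```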

correcting parsed URLs in java

I am creating an HTML parser that gets the HTML from a given URL, finds the HTML of the navigation menu, and puts it into a String. The URLs in the HTML that is copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I find the existing URLs in the String and add the missing part to them so that they work?
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">

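Try using this regular expression with the replaceAll() method:
str = subString.replaceAll("<a href=\"(.*?)\">", "<a href=\"http://www.stackoverflow.com$1\">");
The captured path already starts with a slash, so none is added after the host name. The group is non-greedy so that several links on one line are matched one at a time.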
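
A slightly more defensive variant, as a sketch with hypothetical names: it only rewrites href values that begin with a single slash, so links that are already absolute are left alone. The base URL is passed in as a parameter and is an assumption of this example.

```java
import java.util.regex.Matcher;

class HrefPrefixer {
    // Prefixes root-relative href values ("/path") with the given base URL.
    // Protocol-relative values ("//host/path") and absolute URLs are not touched.
    static String prefixRootRelativeLinks(String html, String base) {
        return html.replaceAll(
                "href=\"(/(?!/)[^\"]*)\"",
                "href=\"" + Matcher.quoteReplacement(base) + "$1\"");
    }

    public static void main(String[] args) {
        String menu = "<a href=\"/questions/11744851.cfm\">";
        System.out.println(prefixRootRelativeLinks(menu, "http://www.stackoverflow.com"));
    }
}
```

Matcher.quoteReplacement keeps any "$" or "\" in the base URL from being interpreted as a group reference.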

If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a/@href, where /body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/) that you may want to try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If, on the other hand, the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/ - it might work, or it might not; on the websites we've tried, it did pretty well.
Edit: adding the missing part can be done with simple concatenation once the interesting parts have been found.
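
A sketch of the XPath route combined with the concatenation step, using only the XPath API bundled with the JDK. The class name, the //a/@href expression and the base URL parameter are assumptions of this example, and it only works when the menu markup is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MenuLinkExtractor {
    // Returns every href in the markup, with root-relative ones made absolute.
    static List<String> absoluteLinks(String xhtml, String base) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs;
        try {
            hrefs = (NodeList) xpath.evaluate(
                    "//a/@href",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            String href = hrefs.item(i).getNodeValue();
            // Simple concatenation for paths that start at the site root.
            links.add(href.startsWith("/") ? base + href : href);
        }
        return links;
    }

    public static void main(String[] args) {
        String menu = "<div><a href=\"/questions/11744851.cfm\">Question</a></div>";
        System.out.println(absoluteLinks(menu, "http://www.stackoverflow.com"));
    }
}
```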

Java (Android) regular expression to strip out HTML paragraph

I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the text of the middle paragraph. How can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.

RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
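
The wrap-and-parse suggestion could look roughly like this with the SAX parser that is part of the Java platform, so no external library is needed. The class name and the synthetic <root> wrapper element are assumptions of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SecondParagraphReader {
    // Wraps the sibling <p> elements in one root element and collects the text of the 2nd <p>.
    static String secondParagraph(String fragment) {
        String wrapped = "<root>" + fragment + "</root>";
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int paragraphs = 0;
            private boolean inTarget = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("p".equals(qName) && ++paragraphs == 2) {
                    inTarget = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(qName)) {
                    inTarget = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTarget) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wrapped)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String fragment = "<p class=\"x\">Some text...</p>\n<p>Some more text</p>\n<p>Some final text</p>";
        System.out.println(secondParagraph(fragment));
    }
}
```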

If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.

Just doing a split (http://developer.android.com/reference/java/lang/String.html#split(java.lang.String))
on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly.
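
A sketch of the split approach with hypothetical names. Since String.split takes a regular expression, the separator here also tolerates whitespace or line breaks between the paragraphs, which is an adjustment made for the sample in the question.

```java
class ParagraphSplitter {
    // Splits between adjacent paragraphs and returns the second piece.
    static String middleParagraph(String html) {
        String[] parts = html.split("</p>\\s*<p>");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected three adjacent paragraphs");
        }
        return parts[1];
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n<p>Some more text</p>\n<p>Some final text</p>";
        System.out.println(middleParagraph(html));
    }
}
```

The separator only matches an attribute-less <p>, so this relies on the second and third paragraphs having no attributes, as in the sample.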

The regex would probably look something like: .*?>(.*?)<.*
And you access the grouped content by calling group(1) on the Matcher object.
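
Applied to the sample from the question with find(), the generic pattern above captures the text of the first tag it meets ("Some text..."). The sketch below therefore anchors on the attribute-less <p> instead, which in the sample is the middle paragraph. The class name and the adjusted pattern are assumptions of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphRegex {
    // DOTALL lets the paragraph text span several lines.
    private static final Pattern PLAIN_P = Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);

    // Returns the text of the first <p> without attributes, or null if there is none.
    static String firstPlainParagraph(String html) {
        Matcher matcher = PLAIN_P.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n<p>Some more text</p>\n<p>Some final text</p>";
        System.out.println(firstPlainParagraph(html));
    }
}
```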

If you are going to parse an XML file downloaded from a website, then this has nothing to do with Android.
