Parsing HTML and getting all the nodes in Java

I need to parse an HTML file in Java. Unlike XML, there are no repetitive tags, so I need code that can parse the HTML file and reach every node, including nested tags. The HTML is not fixed: given any HTML input, I need to reach all of its tags.

Try this HTML Parser:
http://htmlparser.sourceforge.net/samples.html

I think you need this (note that it is JavaScript running in a browser, not Java):
var els = document.getElementsByTagName("*");
for (var i = 0; i < els.length; i++) document.write(els[i].nodeName + "<br />");
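For a Java-only take on the original question, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser). The class name TagLister and the method listTags are made up for illustration. The parser may also report tags it synthesizes itself (such as html, head and body), so treat the result as a list of everything the parser sees rather than only what was typed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagLister {
    // Hypothetical helper: collects the name of every tag the parser reports, in document order.
    static List<String> listTags(String html) {
        List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString()); // container tags such as <div> or <p>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString()); // empty tags such as <br> or <img>
            }
        };
        try {
            // The last argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(listTags("<div><p>Hi <b>there</b></p><br></div>"));
    }
}
```

Nested tags need no special handling here, because the parser invokes the callback for each tag as it encounters it, however deep it sits.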

Related

How to parse HTML without a library in Java?

I need to parse an HTML document, get all the URLs and the content of the page, and save them to a database. I don't want to use any library. I can identify link tags by looking for <a, but how can I extract all the content or useful text from the HTML tags?

You can try this one: https://docs.oracle.com/javase/8/docs/api/javax/swing/text/html/parser/Parser.html
Sample of usage:
How to extract info from HTML with Java's own Parser?
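As a rough sketch of how that JDK parser could be used for this question, the following collects the href of every <a> tag and the text the parser reports. The class LinkAndTextExtractor and its fields are hypothetical names, and joining text chunks with a single space is my own simplification; saving to a database is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkAndTextExtractor {
    final List<String> links = new ArrayList<>();   // href values, in document order
    final StringBuilder text = new StringBuilder(); // text chunks joined by single spaces

    // Hypothetical helper: parses the HTML once and returns the collected links and text.
    static LinkAndTextExtractor parse(String html) {
        LinkAndTextExtractor result = new LinkAndTextExtractor();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        result.links.add(href.toString());
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (result.text.length() > 0) {
                    result.text.append(' ');
                }
                result.text.append(data);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        LinkAndTextExtractor r = parse(
                "<html><body><p>See <a href=\"http://example.com/a\">A</a> and <a href=\"/b\">B</a>.</p></body></html>");
        System.out.println(r.links);
        System.out.println(r.text);
    }
}
```

Relative links such as /b come back exactly as written, so they would still need to be resolved against the page URL before being stored.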

Avoid removal of spaces and newline while parsing html using jsoup

I have sample code as below.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?

You can disable pretty printing of the document to get the output you want, but you also have to change .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

When HTML is rendered, runs of whitespace characters are normally collapsed into a single space. jsoup's text() follows the same convention and normalizes whitespace, which is why the line breaks in the sample disappear.
I don't think you can switch that normalization off. You could add a preprocessing step where you replace multiple whitespace characters with non-breaking spaces (&nbsp;), which will not collapse. The side effect, of course, is that those spaces would be non-breaking, which doesn't matter if you just want the rendered text, as in doc.body().text().

JSoup check if <HTML>,<HEAD> and <BODY> tags are present

Hi, I am using jsoup to parse an HTML file. After parsing, I want to check whether the file contains the <html> tag. I am using the following code to check that:
htmlDom = parser.parse("<p>My First Heading</p>clk");
Elements pe = htmlDom.select("html");
System.out.println("size "+pe.size());
The output I get is "size 1" even though there is no <html> tag present. My guess is that this happens because the <html> tag is not mandatory and is therefore implicit. The same is true for the <head> and <body> tags. Is there any way I could check for sure whether these tags are present in the input file?
Thank you.

It does not return 1 because the tag is implicit, but because it is present in the Document object htmlDom after you have parsed the custom HTML.
That is because jsoup tries to conform to the HTML5 parsing rules, and thus adds missing elements and tries to fix a broken document structure. I'm quite sure you would get a 1 in return if you were to run the following as well:
Elements pe = htmlDom.select("head");
System.out.println("size "+pe.size());
To parse the HTML without jsoup trying to clean it or make it valid, you can instead use the included XML parser, as below, which will parse the HTML as it is.
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.
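If you would rather not depend on how any parser repairs the document, a cruder alternative is to look at the raw source text before parsing. This is only a sketch using the JDK's regex classes. ExplicitTagCheck and hasExplicitTag are hypothetical names, and the check assumes the tag text does not also appear inside comments, scripts or attribute values.

```java
import java.util.regex.Pattern;

class ExplicitTagCheck {
    // Hypothetical helper: true if the raw source contains a literal <name ...> start tag.
    static boolean hasExplicitTag(String rawHtml, String name) {
        // Matches "<name>", "<name attr=...>" and "<name/>", but not longer names such as "<header>".
        Pattern startTag = Pattern.compile(
                "<" + Pattern.quote(name) + "(\\s[^>]*)?/?>",
                Pattern.CASE_INSENSITIVE);
        return startTag.matcher(rawHtml).find();
    }

    public static void main(String[] args) {
        String input = "<p>My First Heading</p>clk";
        System.out.println("html present: " + hasExplicitTag(input, "html"));
        System.out.println("p present: " + hasExplicitTag(input, "p"));
    }
}
```

Because it works on the input string itself, the result reflects what the author of the file actually typed, not what a parser added afterwards.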

How to parse JSP pages into an XML file?

I am trying to convert a JSP page into an XML file. I have been using jsoup, and it reads the whole content well except for the server tags, but I can't understand how the whole HTML can be converted to XML tags. I mean, how can I fetch the data line by line?
My Code:
File Html=new File("genXML.jsp");
Document doc=Jsoup.parse(Html,"UTF-8","http://www.example.com");
System.out.println(doc.html());
Any assistance would be great

First of all, converting JSP to XML is not the same as converting HTML to XML. I suppose you want to translate the HTML generated from a JSP into XML. Second, you don't want to do this line by line: an HTML block usually does not begin and end on a single line.
Anyway, you could use a tool like TagSoup to convert HTML code to XHTML, which is actually XML. I don't know if it has a useful API, but at the very least it could be called from your code as an external process using something like this:
Process tr = Runtime.getRuntime().exec(new String[]{ "..." } );
Then, if you want to transform it to a target XML schema, you could apply an XSLT transformation. This can be done programmatically using JAXP.
Hope I helped!
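To illustrate the JAXP step, here is a minimal sketch using the javax.xml.transform classes from the JDK. The stylesheet, the target element names (paragraphs, item) and the class XhtmlToXml are invented for the example. It also assumes the XHTML elements are not in a namespace; for namespaced XHTML the match patterns would need a prefix bound to that namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XhtmlToXml {
    // Illustrative stylesheet: turns every <p> into an <item> of a made-up target schema.
    static final String XSLT =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:template match=\"/\">"
          + "<paragraphs>"
          + "<xsl:for-each select=\"//p\"><item><xsl:value-of select=\".\"/></item></xsl:for-each>"
          + "</paragraphs>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XHTML string and returns the resulting XML.
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

The input must already be well-formed XML, which is why the HTML-to-XHTML conversion has to happen first.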

Extract text between html tags parsed from xml

Can anyone help me extract the text from within HTML tags as plain text?
I have parsed an XML document and got some output as the body, which contains HTML tags. Now I want to remove the tags and use the text.
Thanks in advance!

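You can use an HTML parser like jsoup.
For example, if the HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
you can select the element and read its HTML or its plain text using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String inner = div.html(); // markup inside the div
String text = div.text(); // text content with the tags removed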

Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.
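A minimal sketch of such a decoding step, using only String.replace from the JDK, might look like this. EntityDecoder and decodeBasic are hypothetical names, and only four basic entities are handled; a full decoder would need the complete entity table plus numeric references.

```java
class EntityDecoder {
    // Decodes &lt;, &gt;, &quot; and &amp; only.
    // &amp; is replaced last so that "&amp;lt;" becomes "&lt;" and not "<".
    static String decodeBasic(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasic("&lt;div&gt;a &amp; b&lt;/div&gt;"));
    }
}
```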

Considering your requirements, you might try the Jericho HTML Parser.
Take a look at the TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".

If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
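Wrapped into something runnable, that regex approach could be sketched as follows. TagStripper and stripTags are hypothetical names, and collapsing the leftover whitespace is my own addition. Keep in mind the limits of the technique: script and style content is kept as text, entities are not decoded, and a > inside an attribute value ends the match early.

```java
class TagStripper {
    // Replaces every <...> run with a space, then collapses whitespace and trims the ends.
    static String stripTags(String input) {
        return input
                .replaceAll("(?s)<.*?>", " ") // (?s) lets a tag span several lines
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
        System.out.println(stripTags(html)); // prints "test: example test1:example1"
    }
}
```

If the original line breaks matter, drop the second replaceAll and replace tags with an empty string instead, so the whitespace of the source is left untouched.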
