How to parse HTML without a library in Java?

I need to parse an HTML document, extract all URLs and the page content, and save them to a database. I don't want to use any library. I can identify links by looking for the <a tag, but how can I extract all the content or useful text from the HTML?

You can try this one: https://docs.oracle.com/javase/8/docs/api/javax/swing/text/html/parser/Parser.html
For sample usage, see: How to extract info from HTML with Java's own Parser?
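To make the suggestion concrete, here is a minimal sketch of how the JDK's own parser (the javax.swing.text.html.parser package linked above) could collect link targets and visible text. The class name SwingHtmlExtractor and its Collector helper are made up for illustration. Note that this parser predates HTML5, so treat its handling of modern markup as approximate.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
public class SwingHtmlExtractor {

    /** Collects href values and text chunks as the parser walks the document. */
    static class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> links = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Separate text chunks with a single space.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }
    }

    static Collector parse(String html) {
        Collector collector = new Collector();
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector c = parse("<html><body><p>Hello <a href=\"http://example.com/x\">link</a></p></body></html>");
        System.out.println("Links: " + c.links);
        System.out.println("Text:  " + c.text);
    }
}
```

The collected links and text can then be written to the database with plain JDBC, which keeps the whole pipeline free of third-party code.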

Related

Parse meta tag and get HTML content from body with Tika

I parse files with the great Apache Tika library. I want to extract the meta tags with my own parser and then get only the content of the <body> tag as HTML and store it in a database.
I have tried this now for hours/days :-(, but cannot find a solution:
When I use the ToHTMLContentHandler after the <body>-tag I get exceptions with an invalid namespace without the <html>-tag.
BodyContentHandler just returns the body text without HTML tags.
The tika-app seems to use a TransformerHandler to get HTML (I had never heard of this kind of handler before). Can I use it to get just the HTML from the <body> tag and parse the meta tags myself? Is this better than using a ToHTMLContentHandler?
Check whether the following links help:
Content Detection, Metadata and Content Extraction with Apache Tika
Parsing HTML with Apache Tika

Parsing HTML and get all the nodes

I need to parse an HTML file in Java. Unlike XML, there are no repetitive tags, so I need code that can parse the HTML file and reach all nodes, including nested tags. The HTML is not fixed; in other words, given any HTML I need to reach all of its tags.
Try this HTML Parser:
http://htmlparser.sourceforge.net/samples.html
I think you need this...
// Browser JavaScript, not Java: writes the name of every element in the page.
var els = document.getElementsByTagName("*");
for (var i = 0; i < els.length; i++) document.write(els[i].nodeName + "<br />");
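The browser snippet above only runs inside a web page. Since the question asks for Java, here is a rough equivalent that lists every tag in document order using the JDK's built-in Swing HTML parser rather than the HTML Parser library linked above. The class TagLister is a made-up name, and the parser may also report tags it implies itself (such as head), so take the output as a sketch of the document structure.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
public class TagLister {

    /** Returns the names of all tags the parser reports, in document order. */
    static List<String> listTags(String html) {
        List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString()); // container elements such as <div> or <p>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString()); // empty elements such as <br> or <img>
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        for (String tag : listTags("<html><body><div><p>text</p><br></div></body></html>")) {
            System.out.println(tag);
        }
    }
}
```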

How to strip all html tags and extract content using java?

I need to strip all HTML tags from a string and extract only the content. The input will be HTML content, for example:
<html><body><input type='text' value='Hello World' size='50' /> <div> This is a basic example </div><br/><span align='center'>Hello Sam!!!</span></body></html>
I need the output as below :
Hello World. This is a basic example.
Hello Sam!!!
I have tried to use HtmlCleaner and even JSoup. First of all, I could not find a full sample application for either of them. I was able to extract
This is a basic example.
Hello Sam!!!
using HtmlCleaner, but could not extract the textbox value as it’s an attribute. Please help.
Here's an example, using JSoup, that shows how to extract attribute values from elements.

Extract text between html tags parsed from xml

Can anyone help me extract the text inside HTML tags as plain text?
I have parsed an XML document and got a body value that contains HTML tags. Now I want to remove the tags and use the text.
Thanks in advance!
You can use an HTML parser like JSoup.
For example, if the HTML is:
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the div's inner HTML using:
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String inner = div.html(); // inner HTML of the div; div.text() returns the text without tags
Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.
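As a rough illustration of that decoding step, the sketch below unescapes only the most common entities using plain String.replace. The class EntityDecoder is a made-up name, and other named or numeric entities are left untouched, so a full entity table would be needed for general input.

```java
// Hypothetical helper class; handles only a handful of common entities.
public class EntityDecoder {

    static String decodeBasicEntities(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                // "&amp;" goes last so that "&amp;lt;" decodes to "&lt;" and not to "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"));
        // prints: <b>Tom & Jerry</b>
    }
}
```

After decoding, the string contains real tags again and can be passed to whichever parser or tag-stripping step you use.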
Considering your requirements, you might try Jericho HTML Parser.
Take a look at its TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
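Here is a small runnable version of that one-liner, with whitespace collapsed afterwards so the replaced tags don't leave runs of spaces. TagStripper is a made-up class name. Keep in mind this is a blunt tool: script and style contents stay in the output as text, and inline tags in the middle of a word (like <b>O</b>ne) end up splitting it.

```java
// Hypothetical helper class wrapping the regex approach.
public class TagStripper {

    static String stripTags(String input) {
        return input
                .replaceAll("(?s)<.*?>", " ") // same pattern as above; < and > need no escaping
                .replaceAll("\\s+", " ")      // collapse the whitespace left behind
                .trim();
    }

    public static void main(String[] args) {
        String html = "<div> This is a basic example </div><br/>"
                + "<span align='center'>Hello Sam!!!</span>";
        System.out.println(stripTags(html));
        // prints: This is a basic example Hello Sam!!!
    }
}
```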

Is there a simple java program that can extract URL & title of html files

Hi, I am looking for a simple URL and title extractor for HTML files in Java. I am trying to parse bookmarks.html (IE, Firefox, etc.) and add the title and URL to a DB. I need to do this in Java (no third-party libraries allowed), so I probably have to use SAX/DOM/regex.
You can load the file into a DOM document and then use an XPath expression to find all instances of the <a> tag. Extracting the HREF attribute and the tag contents should give you what you want. The XPath would probably be something as simple as '//A'.
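A minimal sketch of the DOM plus XPath idea, using only JDK classes. BookmarkExtractor is a made-up name, and the code assumes the input is well-formed XML. Exported bookmark files often are not (unclosed tags are common), in which case the DOM parser will reject them and the file would need cleaning up first, or a more forgiving parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes well-formed (XHTML-like) input.
public class BookmarkExtractor {

    /** Returns a map of URL to link title, in document order. */
    static Map<String, String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // XML names are case-sensitive, so match both <a> and <A>.
            NodeList anchors = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a | //A", doc, XPathConstants.NODESET);

            Map<String, String> bookmarks = new LinkedHashMap<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                String href = anchor.hasAttribute("href")
                        ? anchor.getAttribute("href")
                        : anchor.getAttribute("HREF");
                bookmarks.put(href, anchor.getTextContent().trim());
            }
            return bookmarks;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><body>"
                + "<A HREF=\"http://example.com/\">Example</A>"
                + "<a href=\"http://example.org/\">Org</a>"
                + "</body></html>";
        extract(sample).forEach((url, title) -> System.out.println(title + " -> " + url));
    }
}
```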
