Java: Parse html file and extract text

I want to parse an HTML file and store the bold text (the text inside <b> tags). One approach is to read the file line by line and split each line or apply a regular expression. Does that mean I have to store the entire page in a String variable? If I process the file line by line instead, there is no guarantee that the start of a tag and its end are on the same line.
What solution do you suggest?

Use jsoup to parse the contents:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

It is a project I have for university.
Use HTMLEditorKit.ParserCallback.
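A minimal sketch of the HTMLEditorKit.ParserCallback suggestion, using only classes that ship with the JDK (the java.desktop module). The names BoldTextExtractor and extractBold are made up for this illustration. The parser reads from a Reader, so the page does not have to be held in one String, and a <b> element that spans several lines is handled the same way as one on a single line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
final class BoldTextExtractor {

    /** Returns the text found inside each outermost <b> element, in document order. */
    static List<String> extractBold(Reader html) {
        List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // greater than 0 while inside a <b> element
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B) {
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data); // collect text only while inside <b>
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B && depth > 0) {
                    depth--;
                    if (depth == 0) {
                        result.add(current.toString());
                        current.setLength(0);
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Plain <b>first</b> and <b>second</b> text</p></body></html>";
        System.out.println(extractBold(new StringReader(html)));
    }
}
```

To read from a file instead of a String, pass something like Files.newBufferedReader(path) as the Reader.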

Related

How to write data to pdf file which contains html tags using itext lib in Java

I have a String that contains some HTML tags. It comes from a database, and I want to write it to a PDF file with the styling that the HTML tags describe. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the "
+ "point (2,-3) and making an angle of -45<sup>2</sup> with the positive "
+ "X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new StringReader(html));
But it only reads the text that is inside an HTML tag (in this case only the 2) and simply ignores the rest of the string. I want the entire String written with its HTML formatting.
It works perfectly with HTMLWorker, but that class is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

Avoid removal of spaces and newline while parsing html using jsoup

I have some sample code, shown below.
String sample = "<html>\n"
+ "<head>\n"
+ "</head>\n"
+ "<body>\n"
+ "This is a sample on parsing html body using jsoup\n"
+ "This is a sample on parsing html body using jsoup\n"
+ "</body>\n"
+ "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as:
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as:
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do this in Java?

You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespace characters with non-breakable spaces (&nbsp;), which will not collapse. The side effect, though, would of course be that those would be, well, non-breakable (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).

Delete tabulation character from text retrieved with Jsoup

I'm parsing an HTML file using Jsoup. When I get the text of an h1, it also retrieves tab and newline characters.
'NAME' is what I'm trying to retrieve from here:
<h1>\n\t\t\tNAME\n\t\t</h1>
I'm trying to get rid of these characters this way:
String name = document.select( "div header > h1" ).first().ownText().replaceAll( "[^a-zA-Z]+", "" ).trim().toUpperCase();
But this is the result:
NTTTTNAMETNTTT
How can I get the text without all the tab and newline characters?

It seems that the html really contains the strings "\t" and "\n" literally. In that case you probably should clean the source prior to parsing. Something like this should do:
String html = Jsoup.connect(URL).userAgent("Mozilla/5.0").execute().body();
html = html.replaceAll("\\\\[nt]", "");
Document doc = Jsoup.parse(html);
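The regular expression above can be tried without Jsoup. The sketch below assumes, as this answer does, that the source contains a backslash followed by n or t rather than real control characters. The class and method names are made up for the illustration. It also shows why the attempt in the question leaves stray letters behind: "[^a-zA-Z]+" removes the backslashes but keeps the letters n and t.

```java
// Hypothetical helper class, not part of Jsoup.
final class LiteralEscapeCleaner {

    /** Removes the two-character sequences backslash-n and backslash-t. */
    static String stripLiteralEscapes(String html) {
        // The Java literal "\\\\[nt]" is the regex \\[nt]:
        // one literal backslash followed by either n or t.
        return html.replaceAll("\\\\[nt]", "");
    }

    public static void main(String[] args) {
        // In Java source, "\\n" is a backslash plus the letter n, not a newline.
        String ownText = "\\n\\t\\t\\tNAME\\n\\t\\t";

        // The approach from the question: backslashes go, the letters n and t stay.
        System.out.println(ownText.replaceAll("[^a-zA-Z]+", "").trim().toUpperCase()); // prints NTTTNAMENTT

        // Removing backslash and letter together leaves only the wanted text.
        System.out.println(stripLiteralEscapes(ownText)); // prints NAME
    }
}
```

If the source held real tab and newline characters instead, this pattern would not match them, and String.trim() on the extracted text would already remove them.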

Parsing HTML and get all the nodes

I need to parse an HTML file in Java. Unlike XML, there are no repetitive tags, so I need code that can parse the HTML file and reach all nodes, including nested tags. The HTML is not fixed. In other words, given any HTML code, I need to reach all the tags in it.

Try this HTML Parser:
http://htmlparser.sourceforge.net/samples.html

I think you need this (JavaScript, run in the browser):
var els = document.getElementsByTagName("*");
for (var i = 0; i < els.length; i++) document.write(els[i].nodeName + "<br />");
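The snippet above is JavaScript, while the question asks for Java. One possible Java counterpart, sketched here with the parser that ships with the JDK (javax.swing.text.html), records the name of every tag it encounters, nested ones included. The names TagLister and listTags are made up for this illustration. This parser may also report tags it considers implied, such as html or body, when the input omits them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
final class TagLister {

    /** Returns the names of all tags in the order the parser reports them. */
    static List<String> listTags(Reader html) {
        List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString()); // container elements such as <div> or <p>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags unknown to the parser are reported here for both their start
                // and their end; skip the end so each element is listed once.
                if (!attrs.containsAttribute(HTML.Attribute.ENDTAG, "true")) {
                    tags.add(tag.toString()); // empty elements such as <br> or <img>
                }
            }
        };
        try {
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head>"
                + "<body><div><p>a<br>b</p></div></body></html>";
        for (String tag : listTags(new StringReader(html))) {
            System.out.println(tag);
        }
    }
}
```

A tree-building parser such as jsoup would give access to parent and child relationships as well; this callback style only reports tags as a flat sequence.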

Extract text between html tags parsed from xml

Can anyone help me extract the plain text from within HTML tags?
I have parsed an XML document and get a body as output that contains HTML tags. Now I want to remove the tags and use the text.
Thanks in advance!

You can use an HTML parser like jsoup.
For example, if the HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
you can get the element's content using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String inner = div.html(); // inner HTML, tags included
String text = div.text(); // plain text, tags removed

Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.

Considering your requirements, you might try Jericho HTML Parser.
Take a look at the TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".

If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
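A runnable sketch of this one-liner, applied to the sample div from the earlier answer. The names TagStripper and stripTags are made up for the illustration, and the extra whitespace cleanup is an addition, not part of the original suggestion. A regular expression like this is only a rough tool: it will misfire on a > inside an attribute value, it keeps the contents of script and style elements, and it leaves entities such as &amp; undecoded.

```java
// Hypothetical helper class, not part of any library.
final class TagStripper {

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String input) {
        // (?s) lets . match line breaks, so a tag split across lines is still removed.
        // .*? is reluctant, so each match ends at the first > after the <.
        String withoutTags = input.replaceAll("(?s)<.*?>", " ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
        System.out.println(stripTags(html)); // prints test: example test1:example1
    }
}
```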
