HTML Parser to extract text out of the body (in java) - java

I am working on a project that requires some text manipulation on text that I obtain from web pages.
The first step is to find a parser that extracts the required body text and ignores the redundant information. I am not sure how to do this, since I am extremely new to programming. I would really appreciate any help I could get.
Thanks in advance.

I found this HTML parser very useful. It also provides sample code: http://jericho.htmlparser.net/docs/index.html

I am doing exactly this right now using HTMLParser, available on SourceForge:
http://sourceforge.net/projects/htmlparser/
Seems very easy and straightforward, but since you claim to be new at this, here is an example with source code:
http://kickjava.com/src/org/htmlparser/parserapplications/StringExtractor.java.htm
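For readers who want to try the idea before picking a library, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) instead of Jericho or HTMLParser. It is a swapped-in technique, not the API of either library above. The class name BodyTextExtractor and the sample markup are made up for illustration, and the Swing parser targets old HTML, so results on modern pages may be rough.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found between <body> and </body>.
class BodyTextExtractor {

    static String extractBodyText(String html) {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inBody = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.BODY) {
                    inBody = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.BODY) {
                    inBody = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text outside the body (for example the page title) is skipped.
                if (inBody) {
                    text.append(data).append(' ');
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse the runs of whitespace introduced between text chunks.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Page title</title></head>"
                + "<body><h1>Heading</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractBodyText(html));
    }
}
```

Because a space is added after every text chunk, a word split by inline markup (such as wor<b>ld</b>) would come out as two words. A dedicated library handles such cases better.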

Related

get part with regex

I need to get everything between
onmouseout="this.style.backgroundColor='#fff'">
and the following <
in this case:
onmouseout="this.style.backgroundColor='#fff'">example<
I would like to get the word example.
Here is a more complicated example of where it should work as well:
onmouseout="this.style.backgroundColor='#fff'">going to drink?<br></span><span title="Juist!" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">Exactly!</span></span></div></div>
So here I need two of them back (and not joined).
Could someone help? I am bad at regex.
Someone edited my tag to javascript.
I need a solution to use in Java; I just get a file as plain text. So JavaScript or HTML solutions are not really helpful.
Regex with HTML? Well, if you only have to parse a few lines, then OK. But in general it is better to use an HTML parser (because HTML is not a regular language).
This is pure gold: https://stackoverflow.com/a/1732454/434171
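For the narrow case in the question (a fixed attribute followed by text up to the next <), a minimal java.util.regex sketch could look like the following. The class name MouseoutTextExtractor is invented for illustration, and the approach assumes the onmouseout attribute always appears exactly as quoted in the question. It is not a general way to parse HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the exact attribute text quoted in the question.
class MouseoutTextExtractor {

    // The literal text that precedes each wanted fragment.
    private static final String PREFIX =
            "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Pattern.quote escapes the dots and other special characters in PREFIX.
    // Group 1 captures everything up to (not including) the next '<'.
    private static final Pattern PATTERN =
            Pattern.compile(Pattern.quote(PREFIX) + "([^<]*)<");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1)); // each match stays a separate entry
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "onmouseout=\"this.style.backgroundColor='#fff'\">going to drink?<br></span>"
                + "<span title=\"Juist!\" onmouseover=\"this.style.backgroundColor='#ebeff9'\" "
                + "onmouseout=\"this.style.backgroundColor='#fff'\">Exactly!</span></span></div></div>";
        System.out.println(extract(sample)); // prints [going to drink?, Exactly!]
    }
}
```

If the attribute value or quoting style varies between pages, the pattern stops matching, which is one more reason to prefer a real HTML parser for anything beyond a few known lines.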

How to print a document in Java with a customized page size (SE)

How can I print a document (taken from a database or from the current fields of a form) in Java with a customized page size? The most important thing is that I can customize the page to my requirements (text alignment may also be needed). I am not a hardcore Java coder, so any help would mean a lot to me.
Thanks.
It is not clear what the part in parentheses (taken from a database or from the current form fields) means. I suggest going through the 2D Graphics tutorial, which describes printing in Java in detail.
Everywhere I've worked that wanted well-formatted output from a Java back end, we've deployed Apache FOP (http://xmlgraphics.apache.org/fop/), which allowed us to use XSLT to convert XML to PDF. It works really well, but has a pretty steep learning curve.
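As a rough illustration of the plain java.awt.print route mentioned in the 2D Graphics tutorial suggestion, the sketch below builds a PageFormat with a custom paper size and prints one horizontally centered line. The class name CustomPagePrinting, the 5 x 7 inch size, the half-inch margin, and the sample text are all placeholders, and the printer driver may adjust the requested size.

```java
import java.awt.Font;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.print.PageFormat;
import java.awt.print.Paper;
import java.awt.print.Printable;
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;

// Hypothetical helper showing a custom page size with java.awt.print.
class CustomPagePrinting {

    // The Java printing API measures in points: 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    static PageFormat customPage(double widthInches, double heightInches, double marginInches) {
        double width = widthInches * POINTS_PER_INCH;
        double height = heightInches * POINTS_PER_INCH;
        double margin = marginInches * POINTS_PER_INCH;

        Paper paper = new Paper();
        paper.setSize(width, height);
        // The imageable area is the part of the paper we are allowed to draw on.
        paper.setImageableArea(margin, margin, width - 2 * margin, height - 2 * margin);

        PageFormat format = new PageFormat();
        format.setOrientation(PageFormat.PORTRAIT);
        format.setPaper(paper);
        return format;
    }

    // Draws a single line of text, centered horizontally, on the first page only.
    static Printable centeredText(String text) {
        return (Graphics g, PageFormat pf, int pageIndex) -> {
            if (pageIndex > 0) {
                return Printable.NO_SUCH_PAGE;
            }
            Graphics2D g2 = (Graphics2D) g;
            g2.translate(pf.getImageableX(), pf.getImageableY());
            g2.setFont(new Font(Font.SERIF, Font.PLAIN, 12));
            int textWidth = g2.getFontMetrics().stringWidth(text);
            int x = (int) ((pf.getImageableWidth() - textWidth) / 2);
            g2.drawString(text, x, g2.getFontMetrics().getAscent());
            return Printable.PAGE_EXISTS;
        };
    }

    public static void main(String[] args) throws PrinterException {
        PrinterJob job = PrinterJob.getPrinterJob();
        // validatePage lets the printer adjust the format to what it supports.
        PageFormat format = job.validatePage(customPage(5, 7, 0.5));
        job.setPrintable(centeredText("Hello, printer"), format);
        if (job.printDialog()) {
            job.print();
        }
    }
}
```

The text to print would come from the database row or form fields in a real application. Other alignments follow the same idea: measure the string with FontMetrics and compute the x offset from the imageable width.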

Pretty print ("indentation-only") HTML documents in Java (no JTidy)

We're generating HTML files with Apache's Velocity generic template engine. The generated HTML is rather ugly and not correctly indented.
In my case the HTML is stored in a String, which I want to transform so that it looks pretty printed.
I've already given JTidy a try, but it changes the HTML source code when I pipe the raw HTML through it. Sometimes it adds or removes HTML tags.
My question:
Is there a Java library or something else out there which (only!) pretty prints my HTML code without adding or removing tags in my HTML document? It should only do the indentation, so that it looks pretty printed! Nothing more, nothing less. Any ideas? :-)
Also code suggestions, hints or tips are welcome.
Best regards
Maybe a little too late, but I found a solution to this with Jsoup.
You can get the "pretty" version of the HTML by using only the parser, and (if needed) avoid the generation of extra HTML elements by using a "custom parser".
I got the answer from this Jsoup question.
Here it is:
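import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
// Using the XML parser keeps Jsoup from adding html/head/body wrapper tags.
public static String formatHTML(String html) {
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    return doc.toString();
}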
I hope this helps.
Regards
Find any SAX parser example in Java. Do indent++ for opening tags and indent-- for closing tags, and write the content with the counted indentation.
Why don't you write a simple Java parser to pretty print the HTML yourself? Here is a sketch:
Track opening and closing tags, and keep a counter for the current indentation level.
Perhaps use a stack to push and pop the indentation level.
Iterate through the HTML string and push the current indentation level onto the stack when you see a tag.
If you see a nested tag, increment the indentation level and keep going.
When you see an end tag, pop the stack to go back to the previous indentation level.
This is only a rough idea that you can use as a starting point. I have written many Perl-based pretty printers; you could also use Perl to script a parser fairly quickly.
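The counter idea from the two answers above can be sketched in plain Java roughly as follows. This is a naive illustration, not a real HTML parser: the class name NaiveHtmlIndenter is made up, a simple depth counter stands in for the stack, and the code assumes no > characters inside attribute values or comments. It also trims text nodes, so whitespace-sensitive content such as pre blocks would change.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical indentation-only formatter: it never adds or removes tags.
class NaiveHtmlIndenter {

    // A token is either one tag (<...>) or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    // Elements that have no closing tag and so must not increase the depth.
    private static final Set<String> VOID_TAGS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"));

    static String indent(String html) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            String token = matcher.group().trim();
            if (token.isEmpty()) {
                continue; // whitespace-only text between tags
            }
            if (token.startsWith("</")) {
                depth = Math.max(0, depth - 1); // closing tag: step back first
                appendLine(out, depth, token);
            } else if (token.startsWith("<") && !isSelfContained(token)) {
                appendLine(out, depth, token);  // opening tag: print, then nest
                depth++;
            } else {
                appendLine(out, depth, token);  // text, void tag, comment, doctype
            }
        }
        return out.toString();
    }

    private static boolean isSelfContained(String tag) {
        if (tag.startsWith("<!") || tag.startsWith("<?") || tag.endsWith("/>")) {
            return true;
        }
        String name = tag.substring(1).split("[\\s>/]", 2)[0].toLowerCase();
        return VOID_TAGS.contains(name);
    }

    private static void appendLine(StringBuilder out, int depth, String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
    }

    public static void main(String[] args) {
        System.out.print(indent("<div><p>Hi<br>there</p></div>"));
    }
}
```

Unclosed tags in the input simply leave the following lines indented one level deeper, since nothing is repaired. That matches the "indentation only" requirement but means the output is only as well nested as the input.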

Retrieving well formed HTML using Jericho HTML parser in Java

I've looked at jTidy for converting a snippet of malformed/real-world HTML into well-formed HTML/XHTML. However, there's a bug in the latest version that prevents me from using it. I'm looking at Jericho since it has a lot of positive reviews around the net.
However, it's not immediately obvious to me how one would go about implementing a method like:
public String getValidHTML(String messedUpHTML)
For instance, if it was passed <div>bar, it would return <div>bar</div>
Any pointers would be helpful.
Thanks in advance!
Jericho's HTMLSanitiser sample might be a good start.
Keep in mind, though, that Jericho's key strength is its ability to parse and manipulate malformed HTML while keeping the original "bad" formatting. Still, it would be interesting to see how the library performs such a task.

Is there a validating HTML parser implemented in Java?

I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?
You can find a collection of HTML parsers under "HTML Parsers". I don't remember exactly, but I think TagSoup parses the file without applying corrections...
I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
Source source=new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger);
source.getSourceFormatter().writeTo(new NullWriter());
// myListeningLogger has now had all the HTML flaws written to it
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!
You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.
You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It does not seem to try to fix the markup. See the API documentation for more.
