How do you parse HTML documents in Java? I've read a lot of articles about parsing, but haven't found the best way to do it.
Try using Jsoup (https://jsoup.org/). It is one of the most widely used HTML parsers.
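For example, a minimal sketch (class and method names are mine; assumes the jsoup jar is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    // Parse an HTML fragment and return the first link's href (empty string if none).
    static String firstHref(String html) {
        Document doc = Jsoup.parse(html);  // tolerant of malformed, real-world HTML
        return doc.select("a[href]").attr("href");
    }

    // Strip all tags, keeping only the text content.
    static String textOnly(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        System.out.println(firstHref("<p>See <a href=\"https://example.com\">here</a></p>"));
        System.out.println(textOnly("<p>Hello <b>world</b></p>"));
    }
}
```

The main reason people reach for Jsoup over the built-in XML parsers is exactly that tolerance: it does not require the input to be well-formed.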
I have a bunch of web documents and want to remove the HTML tags from them. I saw some posts on Stack Overflow on how to do it in Java, ranging from regex to HtmlCleaner and Jsoup.
I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for performance.
Thanks in advance for any answers.
My advice is to use stream/SAX processing as much as possible:
1) it uses less memory
2) it is fast
3) it can be parallelized more easily (a consequence of the low memory consumption)
Those factors matter (from my point of view) for your use case, where you have millions of documents.
See the Wikipedia article on SAX.
If your HTML is strict HTML or XHTML, use XSLT; here is a tutorial on transforming XML (XHTML) using SAX: XSLT+SAX+Java.
Finally, if you DON'T have XML-valid HTML, look at this: Java: Replace Strings in Streams, Arrays, Files etc., which makes use of streams (and PushbackReader).
HTH
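A minimal sketch of the SAX idea for the well-formed (XHTML) case — the class and method names are mine. Only text nodes are collected, so memory stays proportional to the text, never the whole tree:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TagStripper extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);  // collect text nodes only; tags are simply skipped
    }

    // Strip tags from a well-formed XHTML string.
    static String strip(String xhtml) throws Exception {
        TagStripper handler = new TagStripper();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("<html><body><p>Hello <b>world</b></p></body></html>"));  // Hello world
    }
}
```

Because each document is processed independently with a tiny memory footprint, you can run one such handler per thread across your millions of documents.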
1) If the HTML is proper XML, then you can create its document object and remove the nodes.
2) If it is not proper XML, then read the entire HTML as a string and use a replace function to remove the HTML substrings.
If the HTML is not proper XML, then regex is the fastest way to do the replacement in a string.
It seems like Java regex is the fastest solution. However, it degrades the quality of the extracted text.
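The trade-off can be seen in a one-line sketch (class and method names are mine):

```java
public class RegexStrip {
    // Crude but fast: remove anything that looks like a tag.
    // This is where the quality loss comes from - it mishandles comments,
    // <script> bodies, CDATA sections, and '>' inside attribute values.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));  // Hello world
    }
}
```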
<a href="www.google.com/map">look at the Google map</a>
Is there any parser to get the link (www.google.com/map) from the <a> tag above?
Or is the best way just to write a custom one?
jQuery, for instance:
var href = $('a.more-link').attr('href');
There are many third-party solutions, but I am not sure which exist for Java; maybe the HTML Agility Pack exists in a version for Java.
But another solution would be to use a regex:
/<a\s+[^<]*?href\s*=\s*(?:(['"])(.+?)\1.*?|(.+?))>/
Fixed the regex to handle the problems suggested in the comments.
I looked up some real HTML parsers for Java in case you find you need more than the regex approach:
http://htmlparser.sourceforge.net/
http://jericho.htmlparser.net/docs/index.html
http://jsoup.org/
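A stdlib-only sketch of the regex approach, simplified to double-quoted href values (the full regex above also handles single quotes and unquoted values; class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Simplified: captures only double-quoted href values.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collect every href value from all <a> tags in the input.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hrefs("<a href=\"www.google.com/map\">Google map</a>"));  // [www.google.com/map]
    }
}
```

For anything beyond simple, predictable markup, one of the real parsers listed above is the safer choice.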
I am trying to parse this huge 25GB+ Wikipedia XML file. Any solution that helps would be appreciated. Preferably a solution in Java.
A Java API to parse Wikipedia XML dumps: WikiXMLJ (last updated in Nov 2010).
Also, there is a live mirror that is Maven-compatible, with some bug fixes.
Of course it's possible to parse huge XML files with Java, but you should use the right kind of XML parser - for example a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.
It's impossible to give you a complete solution because your question is very general and superficial - what exactly do you want to do with the data?
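To illustrate the SAX approach, here is a minimal sketch that counts `<page>` elements (the element MediaWiki dumps wrap each article in); the class and method names are mine:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PageCounter extends DefaultHandler {
    private int pages = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("page".equals(qName)) {  // MediaWiki dumps wrap each article in a <page> element
            pages++;                 // only the current element is in memory, never the whole tree
        }
    }

    // Count <page> elements without ever building a full document tree.
    static int countPages(String xml) throws Exception {
        PageCounter counter = new PageCounter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), counter);
        return counter.pages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page><page><title>B</title></page></mediawiki>";
        System.out.println(countPages(xml));  // 2
    }
}
```

For the real 25GB dump you would pass a (possibly decompressing) InputStream instead of a small string, but the handler code stays the same.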
Here is an active Java project that may be used to parse Wikipedia XML dump files:
http://code.google.com/p/gwtwiki/. There are many examples of Java programs to transform Wikipedia XML content into HTML, PDF, text, etc.: http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport
Massi
Yep, right. Do not use DOM. If you only want to read a small amount of data and store it in your own POJO, then you can also use an XSLT transformation.
Transform the data into XML format, which is then converted to a POJO using Castor/JAXB (XML-to-object libraries).
Please share how you solved the problem so others can take a better approach.
Thanks.
--- Edit ---
Check the links below for a better comparison between the different parsers. It seems that StAX is the better choice, because you have control over the parser and it pulls data from the parser only when needed.
http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html
http://tutorials.jenkov.com/java-xml/sax-vs-stax.html
If you don't intend to write or change anything in that XML, consider using SAX. It keeps one node at a time in memory (unlike DOM, which tries to build the whole tree in memory).
I would go with StAX, as it provides more flexibility than SAX (which is also a good option).
I had this problem a few days ago and found out that the wiki parser provided by https://github.com/Stratio/wikipedia-parser does the job.
It streams the XML file and reads it in chunks, which you can then capture in callbacks.
This is a snippet of how I used it in Scala:

val parser = new XMLDumpParser(new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))

parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})
It streams the Wikipedia dump, parses it, and every time it finds a revision (article), it gets the title and text and prints the article's text. :)
--- Edit ---
Currently I am working on https://github.com/idio/wiki2vec which I think does part of the pipeline which you might need.
Feel free to take a look at the code
I've got an XML document that is in either a pre- or post-FO-transformed state, and I need to extract some information from it. In the pre case, I need to pull out two tags that represent the pageWidth and pageHeight, and in the post case I need to extract the page-height and page-width parameters from a specific tag (I forget which one it is off the top of my head).
What I'm looking for is an efficient, easily maintainable way to grab these two elements. I'd like to read the document only a single time, fetching the two things I need.
I initially started writing something that would use BufferedReader + FileReader, but then I'm doing string searching, and it gets messy when the tags span multiple lines. I then looked at DOMParser, which seems like it would be ideal, but I don't want to read the entire file into memory if I can help it, as the files could potentially be large and the tags I'm looking for will nearly always be close to the top of the file. I then looked into SAXParser, but it seems like a big pile of complicated overkill for what I'm trying to accomplish.
Anybody have any advice? Or simple implementations that would accomplish my goal? Thanks.
Edit: I forgot to mention that due to various limitations, whatever I use has to be built into core Java; I can't use and/or download any third-party XML tools.
While XPath is very good for querying XML data, I am not aware of a good and fast XPath implementation for Java (they all use the DOM model under the hood).
I would recommend you stick with StAX. It is extremely fast even for huge files, and its cursor API is rather trivial:
XMLInputFactory f = XMLInputFactory.newInstance();
XMLStreamReader r = f.createXMLStreamReader(new FileInputStream("my.xml"));
try {
    while (r.hasNext()) {
        r.next();
        // ...
    }
} finally {
    r.close();
}
Consult the StAX tutorial and the XMLStreamReader javadocs for more information.
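Fleshing out the cursor loop for the pre-FO case described above — this sketch (class and method names are mine; element names taken from the question) grabs the two values and then stops, so the rest of a large file is never read:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PageSizeReader {
    // Scan until both values are found, then stop - ideal when the tags sit near the top of the file.
    static String[] readSize(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        String width = null, height = null;
        try {
            while (r.hasNext() && (width == null || height == null)) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    if ("pageWidth".equals(r.getLocalName())) {
                        width = r.getElementText();
                    } else if ("pageHeight".equals(r.getLocalName())) {
                        height = r.getElementText();
                    }
                }
            }
        } finally {
            r.close();
        }
        return new String[] { width, height };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><pageWidth>8.5in</pageWidth><pageHeight>11in</pageHeight><body>...</body></doc>";
        String[] size = readSize(xml);
        System.out.println(size[0] + " x " + size[1]);  // 8.5in x 11in
    }
}
```

Everything here is in `javax.xml.stream`, which ships with core Java, so it satisfies the "no third-party tools" constraint.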
You can use XPath to search for your tags. Here is a tutorial on forming XPath expressions. And here is an article on using XPath with Java.
An easy-to-use parser (DOM, SAX) is dom4j. It is quite a bit easier to use than the built-in SAXParser.
Try "XMLDog".
It uses SAX to evaluate XPaths.
What would be the correct way to find a string like this in a large XML file:
<ser:serviceItemValues>
<ord1:label>Start Type</ord1:label>
<ord1:value>Loop</ord1:value>
<ord1:valueCd/>
<ord1:activityCd>iactn</ord1:activityCd>
</ser:serviceItemValues>
First, in this XML there will be a lot of repeats of the element above with different values (Loop, etc.), plus other XML elements in the document. Mainly what I am concerned with is whether there is a serviceItemValues element that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:
private static Pattern LOOP_REGEX =
Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);
Thanks
Regular expressions are not the best option for parsing large amounts of HTML or XML.
There are a number of ways you could handle this without relying on regular expressions. Depending on the libraries at your disposal, you may be able to find the elements you're looking for using XPaths.
Here's a helpful tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html
Look up XPath, which is kinda like regex for XML. Sort of.
With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.
I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:
"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
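A stdlib-only sketch of evaluating such an expression with `javax.xml.xpath` (class and method names are mine). It uses `local-name()` to sidestep registering the `ser:`/`ord1:` namespace prefixes, which the expression above would otherwise require:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NonLoopFinder {
    // Count serviceItemValues blocks whose <value> child is not 'Loop'.
    static int countNonLoop(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() matches elements regardless of their namespace prefix.
        NodeList hits = (NodeList) xpath.evaluate(
                "//*[local-name()='serviceItemValues'][*[local-name()='value'][text()!='Loop']]",
                doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><serviceItemValues><value>Loop</value></serviceItemValues>"
                   + "<serviceItemValues><value>Other</value></serviceItemValues></root>";
        System.out.println(countNonLoop(xml));  // 1
    }
}
```

Note that this builds a DOM in memory; for a truly large file you would pair the same test with a streaming parser instead.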
A regular expression is not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and will probably take you less time than coming up with this regular expression did.
I recommend using JDOM. It has an easy syntax. An example can be found here:
http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html
If the documents you will be parsing are large, you should use a SAX parser instead; I recommend Xerces.
When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX-parsing-based routine to check the relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
Of course, if you're trying to validate the document's contents somehow, you should probably use a schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).
As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have had very good experience with this tool in solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. MacPorts includes the package as well, I think.