Is there a solution to parse wikipedia xml dump file in Java? - java

I am trying to parse this huge 25GB Plus wikipedia XML file. Any solution that will help would be appreciated. Preferably a solution in Java.

A Java API to parse Wikipedia XML dumps: WikiXMLJ (last updated in November 2010).
Also, there is a live mirror that is Maven-compatible and includes some bug fixes.

Of course it's possible to parse huge XML files with Java, but you should use the right kind of XML parser: for example, a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.
It's impossible to give you a complete solution because your question is very general and superficial - what exactly do you want to do with the data?
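To make the SAX suggestion above concrete, here is a minimal sketch using only the JDK's built-in parser. It assumes the dump has the usual <page><title>...</title></page> layout; the class name SaxTitleReader and the choice to collect titles in a list are illustrative only. For a real 25 GB dump you would process each title as it arrives rather than store them all.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitleReader {

    /** Collects the text of every <title> element; no document tree is built. */
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several calls, so append.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> readTitles(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(in, handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a dump; a real run would pass a FileInputStream.
        String xml = "<mediawiki><page><title>Alpha</title></page>"
                   + "<page><title>Beta</title></page></mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTitles(in));
    }
}
```

The handler only ever holds the text of the element it is currently inside, which is why memory use stays flat regardless of file size.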

Here is an active Java project that can be used to parse Wikipedia XML dump files:
http://code.google.com/p/gwtwiki/. There are many examples of Java programs that transform Wikipedia XML content into HTML, PDF, text, ...: http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport
Massi

Yep, right. Do not use DOM. If you only want to read a small amount of data and store it in your own POJOs, you can also use an XSLT transformation:
transform the data into a smaller XML document, which is then converted to POJOs using Castor or JAXB (XML-to-object libraries).
Please share how you solve the problem so others can benefit from the approach.
Thanks.
--- Edit ---
Check the links below for a better comparison between the different parsers. It seems that StAX is better because the application stays in control of the parser and pulls data from it when needed.
http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html
http://tutorials.jenkov.com/java-xml/sax-vs-stax.html

If you don't intend to write or change anything in that XML, consider using SAX. It keeps only one node in memory at a time (unlike DOM, which tries to build the whole tree in memory).

I would go with StAX as it provides more flexibility than SAX (also good option).
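A rough sketch of what the StAX "pull" style looks like with the JDK's event iterator API. The class name StaxTitleReader and the <title> element name are assumptions for illustration, not part of any Wikipedia tooling.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxTitleReader {

    static List<String> readTitles(Reader source) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(source);
        List<String> titles = new ArrayList<>();
        try {
            // The application asks for the next event; nothing is pushed to it.
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "title".equals(event.asStartElement().getName().getLocalPart())) {
                    // Reads the text content and consumes the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<mediawiki><page><title>Alpha</title><id>1</id></page>"
                   + "<page><title>Beta</title><id>2</id></page></mediawiki>";
        System.out.println(readTitles(new StringReader(xml)));
    }
}
```

Compared with SAX there is no handler class and no state to carry between callbacks: the loop itself decides what to read next and can simply stop whenever it has what it needs.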

I had this problem some days ago and found that the wiki parser provided by https://github.com/Stratio/wikipedia-parser does the job.
It streams the XML file and reads it in chunks, which you can then capture in callbacks.
This is a snippet of how I used it in Scala:
val parser = new XMLDumpParser(new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))
parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  // Called once for every revision (article) found in the dump.
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})
It streams the Wikipedia dump, parses it, and every time it finds a revision (article) it gets its title and text and prints the article's text. :)
--- Edit ---
Currently I am working on https://github.com/idio/wiki2vec, which I think covers part of the pipeline you might need.
Feel free to take a look at the code.

Related

What's the right way to produce a XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have had to manipulate XML documents in Java, I'm still a little confused. The XML content is currently built with String concatenation, and that seems wrong to me. It is like concatenating Strings to produce a JSON object instead of using a JSONObject class. This is how the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research, almost everything I read pointed to SAX as the best solution, but I never found anything really useful to read about it. Almost everything just states that we have to create a handler and override the startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change the node texts? Without the values filled in, my XML would look like this:
<msg:restenv xmlns:msg="{url}"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="{schemaLocation}">
  <autenticacao>
    <usuario></usuario>
    <senha></senha>
  </autenticacao>
  <parametros>
    <parametro>
      <p_vl_gnre></p_vl_gnre>
      <p_cnpj_destinatario></p_cnpj_destinatario>
      <p_num_ped_compra></p_num_ped_compra>
      <p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
    </parametro>
  </parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template engine, since I don't actually have to serialize the XML. That's an approach I really like, because I've already worked with this framework and it's really simple.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issue with concatenation is twofold: you have to escape the XML itself to fit it into a String constant (which is fiddly), and you also have to escape the values you are inserting so that they are correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.
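If you would rather not pull in a template engine, the escaping problem described above can also be handled by the JDK's XMLStreamWriter, which escapes text for you. This is only a sketch: it writes a reduced subset of the question's structure, and the class and method names (RequestXmlWriter, build, writeTextElement) are made up for the example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class RequestXmlWriter {

    static final String MSG_NS = "http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req";

    /** Builds a reduced version of the request; values are escaped by the writer. */
    static String build(String usuario, String senha, String nfeXml) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        w.writeStartElement("msg", "restenv", MSG_NS);
        w.writeNamespace("msg", MSG_NS);

        w.writeStartElement("autenticacao");
        writeTextElement(w, "usuario", usuario);
        writeTextElement(w, "senha", senha);
        w.writeEndElement(); // autenticacao

        w.writeStartElement("p_xml_sefaz");
        // Assumption: the payload never contains "]]>", which would end the CDATA early.
        w.writeCData(nfeXml);
        w.writeEndElement(); // p_xml_sefaz

        w.writeEndElement(); // msg:restenv
        w.flush();
        w.close();
        return out.toString();
    }

    private static void writeTextElement(XMLStreamWriter w, String name, String value)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(value); // '<' and '&' in the value are escaped here
        w.writeEndElement();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(build("joao & cia", "secret", "<nfe>...</nfe>"));
    }
}
```

The trade-off against a template is readability: the document shape is spread across method calls instead of being visible at a glance, but no value can break the markup.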
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant memory: If you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is known as the Document Object Model (DOM). One of the better-known implementations of this approach is Apache Xerces. There are other good implementations as well.
Scarce memory: If you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

Efficient way to read a small part of a BIG XML file in Java

We have a new requirement:
Some BIG XML files keep coming into our system, and we need to process them immediately and quickly using Java. Each file is huge, but the information required for our processing sits inside a single, very small element.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing. If we try to load the entire file, we will get out of memory error immediately due to size. What is the efficient way in Java that I can use to get the ..data..data..data.. data element without loading or reading the file line by line. Is there any SAX Parser that I can use to get this done?
Thank you
SAX parsers are event-based and much faster because they do what you need: they never load the entire XML document into memory. There is a SAXParser available in the Java distributions.
I had to parse huge files in a previous project (1-2 GB) and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer keeping a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read a part of a file, you will need to read each line of the file to be able to identify the part of the file of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
StAX is another option based on streaming the data, like SAX, but it benefits from a friendlier approach (IMO) to processing the data: you "pull" what you want rather than having it "pushed" to you.
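For the "small element inside a huge file" case, a common SAX idiom is to abort parsing as soon as the element has been read by throwing an exception from the handler. A minimal sketch with the built-in parser follows; FirstElementFinder, FoundException and the element name passed in are all hypothetical names for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class FirstElementFinder {

    /** Thrown on purpose to stop the parser once the wanted element is complete. */
    static class FoundException extends SAXException {
        final String value;

        FoundException(String value) {
            super("element found, parsing stopped on purpose");
            this.value = value;
        }
    }

    /** Returns the text of the first element with the given name, or null if absent. */
    static String findFirst(InputStream in, final String elementName) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside the wanted element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (elementName.equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (text != null && elementName.equals(qName)) {
                    throw new FoundException(text.toString());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (FoundException e) {
            return e.value; // the rest of the file is never parsed
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><data>small payload</data><bulk>imagine gigabytes here</bulk></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findFirst(in, "data"));
    }
}
```

The parser still has to read everything that comes before the element, so this helps most when the element sits near the start of the file.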

Android XML parser for simple xml node strings

I need to parse a series of simple XML nodes (String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seems slightly overkill to do it this way.
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.
I'd use the DOM Parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and moving.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html
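For short XML strings like the ones in the question, the DOM route might look roughly like this. The <message> element and its attributes are invented sample data, and DomStringParser is just an illustrative name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomStringParser {

    /** Parses one small XML string and returns its root element. */
    static Element parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource + StringReader lets the parser read from a String instead of a file.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element msg = parse("<message type=\"chat\" from=\"alice\">hello</message>");
        System.out.println(msg.getTagName());          // element name
        System.out.println(msg.getAttribute("type"));  // attribute value
        System.out.println(msg.getTextContent());      // text inside the element
    }
}
```

It is not dot syntax, but getAttribute and getTextContent come fairly close for nodes this small.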
You might want to take a look at the XPath library. It is a simpler way of parsing XML, similar to building SQL queries or regexes.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
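A small sketch of the XPath approach using javax.xml.xpath from the standard library. The sample XML and the class name XPathStringQuery are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathStringQuery {

    /** Evaluates an XPath expression against an XML string and returns the result as text. */
    static String query(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The XML is parsed into an in-memory tree first, so keep the input small.
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<message type=\"chat\"><from>alice</from><body>hello</body></message>";
        System.out.println(query(xml, "/message/@type"));
        System.out.println(query(xml, "/message/body"));
    }
}
```

Each call re-parses the string, which is fine for occasional lookups; for many queries on the same node it would be cheaper to parse once and evaluate against the resulting Document.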
I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse an entire document into objects, instead the parser performs a single uni-directional pass over the document and triggers events as it goes through.
It's actually surprisingly easy to implement: for instance take a look at Working with XML on Android by IBM. It's only listings 5 and 6 that are the actual implementation of their SAX parser so it's not a lot of code.
You can try to use Konsume-XML: SAX/STAX/Pull APIs are too low-level and hard to use; DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull and therefore it's extremely efficient, yet the API is higher-level and much easier to use.

Parsing a xml file using Java

I need to parse an XML file using Java and create a bean from it after parsing.
I need this while using Spring JMS, where the producer produces an XML file. First I need to read the XML file and then act accordingly.
I have read a bit about parsing and came up with these options:
XPath
DOM
Which would be the best option to parse the XML file?
Did you check JAXB?
There are three ways of parsing an XML file: SAX, DOM and StAX.
DOM will parse the whole file and build up a tree in memory - great for small files, but obviously if the file is huge then you don't want the entire tree just sitting in memory! SAX is event based - it doesn't load anything into memory per se but just fires off a series of events as it reads through the file. StAX is a middle ground between the two: the application moves the cursor forward as it needs, grabbing the data as it goes (so no event firing or huge memory consumption).
Which one you use will really depend on your application - all have had built-in libraries since Java 6.
Looks like, you receive a serialized object via Java messaging. Have a look first, how the object is being serialized. Usually this is done with a library (jaxb, axis, ...) and you could use the very same library to create a deserializer.
You will need:
The xml schema (a xsd file)
The Java bean class (very helpful, it should exist)
Then, usually the library will create all helper classes and files and you don't have to care about parsing.
If you need to create an object, just extract the needed properties and go on...
I recommend using StAX; see this tutorial for more information.
Um... there are several ways you can parse an XML document into memory and work with it. You mentioned DOM. DOM loads the whole document into memory and then allows you to move between different branches of the XML document.
On the other hand, you could use StAX. Like DOM, it lets your own code drive the traversal. The difference is that it streams the content of the XML document instead of loading it all, which keeps memory usage low. The downside is that it does not retain information that has already been read.
Look at : http://download.oracle.com/javaee/5/tutorial/doc/bnbem.html It gives details about both parsing methods and example code. Hope that helps.

What's the best way to retrieve two pieces of data from an XML file?

I've got an XML document that is in either a pre or post FO transformed state that I need to extract some information from. In the pre-case, I need to pull out two tags that represent the pageWidth and pageHeight and in the post case I need to extract the page-height and page-width parameters from a specific tag (I forget which one it is off the top of my head).
What I'm looking for is an efficient/easily maintainable way to grab these two elements. I'd like to only read the document a single time fetching the two things I need.
I initially started writing something that would use BufferedReader + FileReader, but then I'm doing string searching and it gets messy when the tags span multiple lines. I then looked at the DOMParser, which seems like it would be ideal, but I don't want to have to read the entire file into memory if I could help it as the files could potentially be large and the tags I'm looking for will nearly always be close to the top of the file. I then looked into SAXParser, but that seems like a big pile of complicated overkill for what I'm trying to accomplish.
Anybody have any advice? Or simple implementations that would accomplish my goal? Thanks.
Edit: I forgot to mention that due to various limitations I have, whatever I use has to be "builtin" to core Java, in which I can't use and/or download any 3rd party XML tools.
While XPath is very good for querying XML data, I am not aware of a good, fast XPath implementation for Java (they all use a DOM model at the least).
I would recommend you stick with StAX. It is extremely fast even for huge files, and its cursor API is rather trivial:
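import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

XMLInputFactory f = XMLInputFactory.newInstance();
// createXMLStreamReader takes a stream or reader, not a file name.
XMLStreamReader r = f.createXMLStreamReader(new FileInputStream("my.xml"));
try {
    while (r.hasNext()) {
        r.next();
        . . .
    }
} finally {
    r.close();
}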
Consult StAX tutorial and XMLStreamReader javadocs for more information.
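To flesh out the loop above for this particular question, here is one possible sketch that reads the two values and stops as soon as both are found. It assumes the pre-transform tags are literally named pageWidth and pageHeight, as the question describes them; PageSizeReader is a made-up class name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PageSizeReader {

    /** Reads pageWidth and pageHeight in a single pass and stops once both are known. */
    static Map<String, String> readPageSize(InputStream in) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Map<String, String> found = new HashMap<>();
        try {
            // Leaving the loop early means the rest of the file is never parsed.
            while (r.hasNext() && found.size() < 2) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if ("pageWidth".equals(name) || "pageHeight".equals(name)) {
                        found.put(name, r.getElementText());
                    }
                }
            }
        } finally {
            r.close(); // closes the reader only; the caller still owns the stream
        }
        return found;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<doc><pageWidth>8.5in</pageWidth><pageHeight>11in</pageHeight>"
                   + "<body>a very long document</body></doc>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readPageSize(in));
    }
}
```

For the post-transform case, where the values are attributes on one tag, the same loop would call r.getAttributeValue(null, "page-width") on the matching start element instead of getElementText().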
You can use XPath to search for your tags. Here is a tutorial on forming XPath expressions. And here is an article on using XPath with Java.
An easy-to-use parser (DOM, SAX) is dom4j. It would be considerably easier to use than the built-in SAXParser.
Try "XMLDog".
It uses SAX to evaluate XPath expressions.
