Parse out Wikipedia markup from files in a directory

I used Lucene's ExtractWikipedia tool to extract a bz2 dump of the latest English wiki pages. The resulting .txt files still contain Wikipedia markup. Is there a tool or Python script that I can run over the directory to strip the markup from each file? (i.e., modify the files so that they contain only content, no markup)
Alternatively, is there a Java library or package that can accomplish this? I'm hoping to integrate it into the Lucene class ExtractWikipedia.

You can try Wikiprep, a ready-made Perl script (you will need to install Perl first) that:
removes the wiki markup
generates hierarchical categories
removes redirects
generates an XML format that is easy to parse
http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/
It may take a couple of hours to run over the whole Wikipedia dump
and may need a lot of memory, about 6 GB of RAM.
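If a pure Java step is preferred, a rough starting point is a regex pass over each file. The sketch below is not Wikiprep and not part of Lucene: the class name WikiMarkupStripper and its methods are hypothetical, and the patterns only cover a few common constructs (simple templates, refs, links, headings, bold/italic). Nested templates, tables, and image links are left untouched, so treat it as an illustration rather than a complete markup parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Crude regex-based stripper; a hypothetical helper, not an existing library class. */
final class WikiMarkupStripper {

    /** Removes a handful of common wiki constructs from one page of text. */
    static String strip(String text) {
        String s = text;
        // Drop simple, non-nested templates such as {{citation needed}}.
        s = s.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        // Drop <ref>...</ref> footnotes, then self-closing tags such as <ref name="x" />.
        s = s.replaceAll("(?s)<ref[^>/]*>.*?</ref>", "");
        s = s.replaceAll("<ref[^>]*/>", "");
        // [[target|label]] -> label, [[target]] -> target.
        s = s.replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]|]*)\\]\\]", "$1");
        // [http://example.org label] -> label.
        s = s.replaceAll("\\[https?://[^\\s\\]]+\\s+([^\\]]*)\\]", "$1");
        // == Heading == -> Heading.
        s = s.replaceAll("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$", "$1");
        // '''bold''' and ''italic'' -> plain text.
        s = s.replaceAll("'{2,}", "");
        return s;
    }

    /** Rewrites every .txt file directly inside dir in place and returns the files touched. */
    static List<Path> stripDirectory(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".txt"))
                    .collect(Collectors.toList());
        }
        for (Path file : files) {
            String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.write(file, strip(raw).getBytes(StandardCharsets.UTF_8));
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WikiMarkupStripper <directory with .txt files>");
            return;
        }
        // The argument is assumed to be the output directory of ExtractWikipedia.
        stripDirectory(Paths.get(args[0]));
    }
}
```

Because the files are overwritten in place, it is worth trying this on a copy of the directory first.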

Related

Create .vsdx files (Microsoft Visio) in Java

I'm looking for some info on how to create a .vsdx file in Java without any commercial libraries. According to other questions it seems to be pretty tough.
As a source we have a different, probably unknown file format called .epml that contains graphical information of EPCs, which we should be able to convert to an .xml file. As far as I understand the .vsdx format so far, that's one of many files required in the unzipped .vsdx. I'd be glad if anyone could tell me about my options for implementing/creating all the other files.
EDIT: The goal here is to be able to convert the graphic information of the .epml file so Visio is able to read & display it as in the source. Therefore, it doesn't have to be a .vsdx file if there are other possible options.
Thanks!

EPML is not an unknown format; it is an interchange format for EPC tools. Just try to google it :)
I would suggest you convert your .epml files to .svg (there are free open source converters available, like epml2svg). Visio can read and display .svg files. That means writing code does not seem to be required to achieve your goal (converting .epml files to something Visio can show). AFAIR there is an online version of the tool as well: you upload an EPML file, get back an SVG, and just open it in Visio. That's it.
Side note: there are companies, bpm-x for example, that specialize in BPM tool-to-tool diagram conversion. Maybe they already have a solution for your original tool.
The .vsdx file uses an "Office XML" package format, which is also open and documented. But you are right, it's pretty tough to generate a file from scratch. In principle you could start with any code that is capable of handling Open XML packages. Microsoft has the OpenXML SDK, but that's .NET (the MSDN HOWTO assumes you are using .NET, but it explains the basics of what an Open XML package consists of).
AFAIK, there are no open source Visio libraries for Java that you could use. Java and Visio seem to live in parallel universes. The only viable commercial option I've heard of seems to be Aspose.
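Since such a package is a ZIP archive, the Java standard library is enough to at least look inside one. The sketch below lists the parts of a .vsdx; the class name VsdxParts is hypothetical, and the "visio/" prefix reflects where Visio parts are assumed to live inside the package. It only inspects an existing file and does not generate a valid drawing.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Lists parts of a ZIP-based package; a hypothetical helper for exploring a .vsdx. */
final class VsdxParts {

    /** Returns the names of all entries whose path starts with the given prefix. */
    static List<String> partsUnder(Path vsdx, String prefix) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(vsdx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith(prefix)) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: VsdxParts <file.vsdx>");
            return;
        }
        // "visio/" is the assumed folder for document, page, and master parts.
        for (String name : partsUnder(Paths.get(args[0]), "visio/")) {
            System.out.println(name);
        }
    }
}
```

Running it against a drawing saved from Visio shows which XML parts a generated file would have to contain.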

Interesting - whilst I cannot give a final answer, here are some thoughts:
Question 1: Why would you want to avoid commercial tools, when the final result file will require some - namely "Visio"?
1) Creating Visio files from XML:
Create template XMLs from a VSDX. Identify the files that you need to edit. From what I've seen, these should be the masters and the pages files. Since you are able to produce XML from EPML, you should also know how to adapt it to a new structure.
This solution is probably by far the most tedious and the least flexible.
2) Use Visio automation:
Presuming that the final document will need more than just graphics, namely shape data as well, an easier solution would consist of creating the graphics first
a) as SVG and import into Visio
b) even easier: automated drawing using Visio's automation capabilities (VBA, .NET, ...). The shapes to drop would already have been prepared as masters with all the relevant data and behaviour settings.
Then you would populate the data by means of one of the many data linking features (Wizard, Standard data linking, ODBC connections, etc.)

How do we get to know whether a macro is present or not in a word document?

Using Java, how can I find out whether a macro is present in a Microsoft Word document? I looked for a command-line switch for WinWord.exe, but there is no switch that reports this.

Use a library that can parse Word documents. Apache POI is a good choice as long as the documents aren't too big.
The library allows you to load the document. Afterwards, you can examine the various parts.
Bug 52949 has an attachment with sample code showing how to extract macro code. This should get you started.
If you're using the new XML format (.docx / OOXML), then the Word file is in fact a ZIP archive that you can unpack using the standard Java library. Inside, you will find a lot of XML files. The macros should be there as well.
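For the ZIP-based format, a minimal sketch of that check with only the standard library could look like this. It assumes the VBA project is stored in an entry whose name ends with vbaProject.bin (conventionally word/vbaProject.bin in macro-enabled files); the class name MacroDetector is hypothetical, and the legacy binary .doc format is not covered.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.zip.ZipFile;

/** Checks a ZIP-based Word file for an embedded VBA project; a hypothetical helper. */
final class MacroDetector {

    /** Returns true if any entry name ends with "vbaProject.bin" (case-insensitive). */
    static boolean hasVbaProject(Path ooxmlFile) throws IOException {
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            return zip.stream()
                    .map(entry -> entry.getName().toLowerCase(Locale.ROOT))
                    .anyMatch(name -> name.endsWith("vbaproject.bin"));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: MacroDetector <file.docx or file.docm>");
            return;
        }
        System.out.println(hasVbaProject(Paths.get(args[0])) ? "macro project found" : "no macro project");
    }
}
```

This only reports that a VBA project is present; it does not read or validate the macro code itself.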

How to create .mpp file in java?

I am able to create an .mpx file using the MPXJ library in Java.
I need to write (create) an .mpp file in Java. Can anyone suggest how to do this?

I maintain MPXJ, and the short answer to your enquiry is that, at present, MPXJ does not write MPP files.
The main reason for this is simply that despite the effort which has gone into understanding the MPP file structure, there is still a great deal of it which is not well understood, hence it is difficult to reliably generate. The other issue is that even if I was to produce some code which could generate an MPP file, the features it could write to that file are likely to lag behind what MPXJ supports in the MSPDI file format, again due to my incomplete understanding of the MPP format.
My suspicion is that the next version of MS Project (Project 15? Project 2013?) may offer a ".mppx" file format, similar to the ".docx" etc. formats used by other applications in the MS Office suite. This will be XML-based and will be more straightforward to generate than the binary MPP file format currently is... let's see what Microsoft come up with!
Jon

Visit http://www.mpxj.org/faq/
Can I use MPXJ to write MPP files?
Not at present. Although it is technically feasible to generate an MPP file, the knowledge we have of the file structure is still relatively incomplete, despite the amount of data we are able to correctly extract. It is therefore likely to take a considerable amount of development effort to make this work, and it is conceivable that we will not be able to write the full set of attributes that MPXJ supports back into the MPP file, simply because we don't understand the format well enough. You are therefore probably better off using MSPDI, which does support the full range of data items present in an MPP file.
You can try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML.
But it is not free.

I think by "mpp" you probably mean "Microsoft Project", correct?
Q: Why do you think MPXJ (Microsoft Project Exchange/Java) can't do this?
http://www.mpxj.org/
Welcome to MPXJ! This library provides a set of facilities to allow
project information to be manipulated in Java and .Net. MPXJ supports
a range of data formats: Microsoft Project Exchange (MPX), Microsoft
Project (MPP,MPT), Microsoft Project Data Interchange (MSPDI XML),
Microsoft Project Database (MPD), Planner (XML), Primavera (PM XML,
XER, and database), and Asta Powerproject (PP, MDB).

Parsing structured documents in Java

I would like to parse some legal documents with a Java library into pieces of text that represent headers, paragraphs, etc. Legal documents are usually well-structured, so I would like to use something a bit easier than JavaCC (or other parser generators). Are there any libraries that would (nearly) automatically detect such a structure?
Thanks.

I think there is no tool that can "nearly automatically" extract such structures. If it is really easy to extract the structure, you do not need any tool; you can easily code it yourself. If it is not so easy, you need a tool that is powerful enough (JavaCC, ANTLR ...).
I think parsing the text yourself with custom code is the best way. Maybe read a bit about parsing beforehand (recursive descent, lexer/parser separation...). For simple structures it is not hard to get a working solution quickly.
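To illustrate how small such custom code can be for simple structures, here is a hedged sketch of a line-based parser that splits text into headers and paragraphs. The heading patterns ("Article 1 ...", "Section 2 ...", "Chapter 3 ...", or numbered titles like "1.1 Scope") are assumptions for illustration and would need adapting to the real documents; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Splits plain text into header and paragraph blocks; a hypothetical hand-written parser. */
final class StructuredTextParser {

    enum Kind { HEADER, PARAGRAPH }

    static final class Block {
        final Kind kind;
        final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Assumed heading shapes: "Article 1 ...", "Section 2 ...", "Chapter 3 ...", or "1.1 Title".
    private static final Pattern HEADER = Pattern.compile(
            "^(?:(?:Article|Section|Chapter)\\s+\\d+|\\d+(?:\\.\\d+)*\\.?\\s+\\p{Lu}).*");

    /** Headers are single lines; consecutive other lines are joined into one paragraph. */
    static List<Block> parse(String document) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String line : document.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                flush(paragraph, blocks);            // a blank line ends the current paragraph
            } else if (HEADER.matcher(trimmed).matches()) {
                flush(paragraph, blocks);
                blocks.add(new Block(Kind.HEADER, trimmed));
            } else {
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(trimmed);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(Kind.PARAGRAPH, paragraph.toString()));
            paragraph.setLength(0);
        }
    }
}
```

Once the heading rules stop fitting into one or two patterns, that is usually the point where a real grammar (JavaCC, ANTLR) starts to pay off.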

Apache POI - the Java API for Microsoft Documents
Apache PDFBox - Java PDF Library
An easier option is Apache Tika, a content analysis toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
It uses PDFBox and POI internally.
Usage: java -jar tika-app-0.9.jar [option] [file]
With the -t option, it will parse the file(s) specified on the
command line and output the extracted text content.

How do I use Apache POI to read a .DOC file in Java to separate images from text?

I need to read a Word .doc file from Java that has text and images. I need to recognize the images & text and separate them into 2 files.
I've recently heard about "Apache POI." How can I use Apache POI to read Word .doc files?

The examples and sample code on Apache's site are pretty good. I recommend you start there.
http://poi.apache.org/hwpf/quick-guide.html
To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.
Here for an example of extracting an image. Here for the latest revision as of this writing.
And of course, the Javadocs
Note that, according to the POI site,
HWPF is still in early development.

It's not free (or even cheap!) but Aspose.Words should be able to do this. Their evaluation download will let you play with small files.
Do the destination files also have to be Docs? You could open the docs in Office and save them out as HTML. Then the separation becomes trivial. RTF is also a viable option, but I can't recommend a good RTF parser off the top of my head.
Edit to say: I just remembered another possible solution: Jacob, but you'll need an instance of Office running on the same machine. It's short for Java COM Bridge and it lets you make calls to the COM libraries in Office to manipulate the documents. I'm sure it's not as scary as it might sound!
