Sample code to use cTakes AggregatePlaintextUMLSProcessor - java

I'm new to Java and UIMA, and I can't find a comprehensive sample to use the AggregatePlaintextUMLSProcessor from code and print results in a proper format.
I managed to run cTakes from the command line and I see it uses FileWriterCasConsumer.xml to write the output, but I want to know what other formats I can get.
I've got the code for apache-ctakes-3.2.2, and it builds on a Windows 10 machine.

Stock cTakes supports writing data to:
XMI,
CSV,
Plain text and
HTML files.
Take a look at the subpackages under org.apache.ctakes.core.cc.* in the ctakes-core module.


Preprocess OpenDoPE Word file (Macro or docx4j)

I have recently discovered the OpenDoPE project. From what I understand from the walkthrough, .docx files must be preprocessed, for example to expand repeatable contents.
If I understand correctly, there are two ways to do it:
Using docx4j
Using a Macro
I am developing a Rails web platform, and I'd prefer the preprocessing to be done client-side, so with the macro. But if I can only do it with Java, I'll go with that.
Problem: when I click the "inject macro" button in the OpenDoPE add-in in Word 2010, nothing happens.
So there are two possible answers:
Explain how I can install this macro in the document
Explain how I can get docx4j to preprocess the document, i.e. from a Linux terminal, what command with what parameters should I type to preprocess a document.docx file containing repeatable contents?
I tried clicking the "inject macro" button in my Word 2010, and it worked, that is:
it prompted me to save a .docm file
when I opened the .docm file in Word, the macro ran
I couldn't open the macro in Word's VBA editor, though. It seems I obfuscated it :-(
I do have the source files floating around, which I'd be happy to put on GitHub.
Please note, however, that it is four-year-old, unmaintained, proof-of-concept-level code (whereas the docx4j code is actively maintained and used by a variety of companies).
For non-interactive processing using Java, see samples/ContentControlBindingExtensions.java
To invoke it from a Linux command line, that sample would need modifying slightly; you also need, of course, to pass a suitable class path.
The other way you could do it is by installing this simple web app in, say, Tomcat.

Is there CMU Sphinx local lmtool for java?

I want to convert words to their Arpabet transcriptions.
Something like:
HELLO HH AH L OW
But I want to do it programmatically in Java. Sphinx offers a web tool here: http://www.speech.cs.cmu.edu/tools/lmtool.html. I know I could call this tool from Java over sockets and pick out the returned .dic file, but I cannot rely on that because not all users of my app have an internet connection.
I also checked out the logios package of Sphinx, but it is written in Perl and batch files. I could use it, but I want my app to be platform-independent, and I think bundling a Perl interpreter with my project would be overkill.
Is there any Java library or algorithm I can reuse, so that I can just call something like ConvertToSphinxArpabet("HELLO") and get the string "HH AH L OW" back?
Please check the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialdict
For example, you can use the g2p code from FreeTTS, which is written in Java:
http://cmusphinx.sourceforge.net/projects/freetts
Or OpenMary, a Java TTS:
http://mary.dfki.de/
For a FreeTTS example, see our code in the long audio aligner branch:
http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/long-audio-aligner/Aligner/src/edu/cmu/sphinx/linguist/dictionary/AllWordDictionary.java?revision=11092&view=markup
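For words that are already in a pronunciation dictionary, a plain offline lookup may be enough before reaching for g2p at all. The following is only a minimal sketch, assuming you ship a dictionary file whose lines look like the example in the question ("HELLO HH AH L OW"); the class name ArpabetDictionary, the method names, and the skipping of ";;;" comment lines are my own assumptions, not part of Sphinx or FreeTTS. Words missing from the file return null and would still need one of the g2p options above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: an in-memory lookup over a local pronunciation dictionary.
class ArpabetDictionary {
    private final Map<String, String> entries = new HashMap<>();

    // Reads lines of the form "WORD PH1 PH2 ...", separated by spaces or tabs.
    // Blank lines and lines starting with ";;;" (assumed comment marker) are skipped.
    static ArpabetDictionary load(Reader source) {
        ArpabetDictionary dict = new ArpabetDictionary();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith(";;;")) {
                    continue;
                }
                String[] parts = line.split("\\s+", 2);
                if (parts.length < 2) {
                    continue; // a word without phones: ignore it
                }
                String word = parts[0].toUpperCase(Locale.ROOT);
                String phones = parts[1].trim().replaceAll("\\s+", " ");
                // If the same key appears twice, keep the first pronunciation.
                dict.entries.putIfAbsent(word, phones);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dict;
    }

    // Returns the phone string for the word, or null if it is not in the dictionary.
    String convertToSphinxArpabet(String word) {
        return entries.get(word.toUpperCase(Locale.ROOT));
    }
}
```

Typical use would be to load the file once at startup, for example with ArpabetDictionary.load(Files.newBufferedReader(path)), and then call convertToSphinxArpabet("HELLO") per word.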

How can I pretty-print Java source code as a PDF?

I'm planning to put some Java code in an appendix to my report. The report is a PDF document, and I use Eclipse for Java.
How can I present it best and do this easily? Any recommendations?
For this purpose, I created a LaTeX doclet. This is a Javadoc doclet, which converts the javadoc comments to LaTeX code, and (if wanted) also includes a pretty-printed version of the source code of the documented methods.
You can then convert the generated LaTeX document to PDF, and append it to your report.
If you use Windows, install CutePDF. It adds a "printer" that, when you print to it, asks for a file name and then writes the output to a .pdf document on your hard drive. It is a pseudo printer: it acts like a printer, but is really a PDF file writer.
I don't know of solutions for other operating systems...
I usually prefer to install a PDF "pseudo" printer in whatever OS I am using. That way I can use the print facilities of whatever app I am using (Eclipse, for example) and get the result as a PDF file.
EDIT:
Here is one example of a pseudo printer, this one for the Windows platform. Mac OS X has a built-in "print to PDF file" capability.
You can use doxygen to generate documentation for your project, which can include a formatted source file listing in addition to the Javadoc. doxygen can generate both HTML and PDF output. You'll need LaTeX to generate the PDF output.
Another way to pretty-print is with IntelliJ IDEA. It also works with the Community Edition.
It's advisable to install a PDF printer, in order to try printouts without wasting a lot of paper. Once you're satisfied with the result, you can print on the real printer. On Windows you can use CutePDF, on Linux Ubuntu install the package cups-pdf with sudo apt-get install cups-pdf.
Note that IntelliJ prints the theme's background, so it's advisable to be on a white background to avoid wasting ink.
To print, click File -> Print. The printer selection appears in the next dialog, after you press the Print button.
Interestingly you can also print only the selected text, which is useful if you don't want to print import statements.
Other options include the possibility to add line numbers, syntax highlighting and colour printing. On Linux IntelliJ 14.0.3, the default font was a huge size 14, so you might want to change that too.
You could just copy & paste into Word (2007+) and save as PDF. It's a little more straightforward than the file printer, and you can format your code for best results in Word.
You could just copy & paste into OpenOffice/LibreOffice and export to PDF.

Extract text from PDF (google app engine)

Is there any free Java library for extracting text from PDF, that is compatible with Google Application Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps other way how to extract text from PDF? I tried http://www.pdfdownload.org/, unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PdfBox does not run on GAE. It uses not-allowed java classes.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PdfBox (0.7.3) to be GAE compliant. Now I'm able to extract text from a PDF (a whole page or a rectangular area). I only modified the minimum needed for PDF text extraction, not the whole of PdfBox. :)
The idea was to remove references to java.awt.Rectangle and similar classes, using my own "rectangle" class instead.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-SNAPSHOT) version to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.
Following a simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, as Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
but I've never tested it.
Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. I don't know whether Xpdf would work from Java, but you can try it.

How to read windows .exe file version?

I need to parse the file version and product version from Windows .exe and .msi files.
Could you point me to the file specification or to a library (preferably in Java) that does that?
UPDATE:
It turns out I cannot use the WinAPI, as the code needs to run on Linux as well...
You could use the GetFileVersionInfoSize and GetFileVersionInfo functions to get the file version and product version. I'm no Java guru, but as far as I know it is possible to call WinAPI functions from Java.
I have a Delphi program which can analyze PE/NE headers in Windows EXE files, but I don't have it at hand right now. I think it can be ported to Java easily, as it does a binary analysis of the files.
Of course, using JNA calls to the Windows API could do the trick on Windows.
Edit: I found it, but there are some minor glitches:
The original aim of the program was to extract resources from EXE (PE/NE) files, as at the time the available resource editors only worked with the PE format (NE is used by Win3.1)
The UI does not display the version info, but the record structure is there for it
The UI is entirely in Hungarian, I can provide translation if needed
Some of the code comments are in Hungarian, except some record structure descriptions, which are in English
I don't know whether it still compiles on its own today.
The ZIP includes a compiled Win32 RESXPLOR.EXE
The code has some buffer overrun bugs here and there, which should be easy to fix in a Java port
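As a rough illustration of the "binary analysis" route that also runs on Linux, here is a minimal sketch in plain Java. It does not walk the PE resource directory the way a full port would. Instead it assumes the file contains a VS_FIXEDFILEINFO structure and simply scans for its signature (0xFEEF04BD), then reads the four version DWORDs that follow. The class and method names are hypothetical, and the scan is a heuristic that could in principle hit a false match. It is aimed at .exe files; .msi files use a different container format and are not covered here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: finds VS_FIXEDFILEINFO by signature instead of parsing PE headers.
class FixedFileInfoScanner {
    // dwSignature of VS_FIXEDFILEINFO, stored little-endian in the file.
    static final int SIGNATURE = 0xFEEF04BD;

    // Returns {fileVersion, productVersion} as "a.b.c.d" strings,
    // or null if the signature is not found.
    static String[] findVersions(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        // 24 bytes needed: dwSignature, dwStrucVersion, then
        // dwFileVersionMS/LS and dwProductVersionMS/LS.
        for (int i = 0; i + 24 <= data.length; i++) {
            if (buf.getInt(i) == SIGNATURE) {
                return new String[] {
                    format(buf.getInt(i + 8), buf.getInt(i + 12)),
                    format(buf.getInt(i + 16), buf.getInt(i + 20))
                };
            }
        }
        return null;
    }

    // Each DWORD packs two 16-bit version parts: high word first, then low word.
    static String format(int ms, int ls) {
        return (ms >>> 16) + "." + (ms & 0xFFFF) + "."
             + (ls >>> 16) + "." + (ls & 0xFFFF);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the .exe file to inspect.
        String[] v = findVersions(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(v == null
            ? "no version info found"
            : "file " + v[0] + ", product " + v[1]);
    }
}
```

Reading the whole file into memory keeps the sketch short; for large installers you would want to scan in chunks, or do the proper resource-directory walk that the Delphi program performs.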
