I am trying to read Word files into R in order to parse their text. After researching for a little while I found that Apache POI is the way to go for me, because it appears to be the most flexible w.r.t. handling different Word formats.
I tried to follow what the R packages xlsx, CommonJavaJars and xlsxjars do. Unfortunately I was not able to create a few lines of R that work analogously.
E.g.:
inputStream <- .jnew("java/io/FileInputStream", path.expand(file))
wbFactory <- .jnew("org/apache/poi/ss/usermodel/WorkbookFactory")
What I get from this is that an input stream is created first (which I was able to do for a Word file as well). Then the WorkbookFactory is created from the Apache POI library using another .jnew. Looking for similar functionality for Word, I found this part of the POI package and tried:
wdoc <- .jnew("org/apache/poi/hwpf/HWPFDocument")
All I got is a java.lang.ClassNotFoundException. POI packages other than the Excel-relevant ones should be available, as there's a poi-3.9-20121203.jar in the source code of xlsxjars, which contains the .jars xlsx depends on.
I also tried to use the package CommonJavaJars and ran the function
loadJars("poi")
without an error, but did not succeed with subsequent calls. Can someone get me started here?
EDIT:
I am obviously missing a package here. Can I load additional jars directly into my R session, or do I have to build a package to add new jars?
Apache POI provides a handy page listing the POI components, their jars and their dependencies. If you look at that, you'll see that to use HWPF you need both the main poi jar and the poi-scratchpad jar.
So, assuming you're sticking with poi-3.9 (and not using the latest version, which is 3.10 beta 2 as of writing), you'll need to list poi-3.9-20121203.jar and poi-scratchpad-3.9-20121203.jar on your classpath. Once both are there, you should be fine to use HWPF.
Since you're using R, if you decide to use the CommonJavaJars library you should refer to the R loadJars documentation for details about how to load all of the jars you need in one go.
Alternatively, if you want to skip CommonJavaJars and do it all by hand, the following snippet shows how to extract the text from a Word document from R. Note - it's not pretty, because the R Java interface is decidedly low level...
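library(rJava)
.jinit()
# Put both the main POI jar and the scratchpad jar on the classpath
.jaddClassPath("poi-3.10-beta3-20131022.jar")
.jaddClassPath("poi-scratchpad-3.10-beta3-20131022.jar")
# Open the .doc file as a Java input stream
inputStream <- .jnew("java/io/FileInputStream", path.expand("test.doc"))
# The HWPFDocument constructor expects an InputStream, so cast explicitly
wdoc <- .jnew("org/apache/poi/hwpf/HWPFDocument",
              .jcast(inputStream, "java/io/InputStream"))
# Wrap the document in an extractor and pull out the plain text
wext <- .jnew("org/apache/poi/hwpf/extractor/WordExtractor", wdoc)
text <- .jcall(wext, "Ljava/lang/String;", "getText")
print(text)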
If you want to use other components of Apache POI, be sure to look at the components page to review any dependencies for them (some have more than others)
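To see which classes the JVM can actually find before wiring things up from R, a small plain-Java probe can help. This is only a minimal sketch and not part of POI or rJava: the names ClasspathCheck and isOnClasspath are made up for illustration, and it assumes you start it with the same jars on the classpath that you intend to use. Inside an R session rJava manages its own class loader, so treat the result as a hint rather than proof.

```java
// Sketch: report whether the classes you expect are visible to the JVM.
final class ClasspathCheck {

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class itself or something it links against is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default probes: one class from each of the two jars HWPF needs.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.poi.hwpf.HWPFDocument",               // poi-scratchpad jar
            "org.apache.poi.poifs.filesystem.POIFSFileSystem" // main poi jar
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

After compiling, run it with both jars on the classpath, for example: java -cp .:poi-3.9-20121203.jar:poi-scratchpad-3.9-20121203.jar ClasspathCheck (use ; instead of : on Windows). A MISSING line for HWPFDocument would point at the scratchpad jar not being picked up.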
I recently tried to add the Google Tink library to Eclipse and it always fails with a "com.google.protobuf.GeneratedMessageV3$ cannot be resolved" error. I normally never have any problems adding libraries to my project. From what I can tell it has something to do with the key template files, since the error only occurs when I try to generate a new KeysetHandle with any key template, and it only starts once I enter the key template file: https://github.com/Gameidite/testProject
The Protobuf library can generate Java classes for you. You need to find where these .class files have been output to (e.g. there should be a GeneratedMessageV3$.class somewhere) and make sure that they are included on your classpath. There's presumably somewhere in Eclipse where you can configure where it looks for class files - you'll need to add the generated files there.
If the generated class files don't exist yet you need to figure out what to do to generate them. It might be easier to use Maven or Gradle as suggested in the Tink documentation rather than directly adding things to Eclipse.
I think it's probably because Eclipse cannot find the protobuf Java runtime. Have you tried adding Tink to your project with Maven or Gradle?
I'm testing the example code from this page:
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/
But inside the file CreateSignatureBase.java, specifically in the functions getMDPPermission and setMDPPermission, it uses a property that doesn't exist anymore: COSName.DOCMDP. I perused the PDFBox page and its migration guide, and neither mentions this property or how to replace it. I also looked into the PDFBox source code (specifically the file COSName.java) and it doesn't have that property, even though this file:
https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/cos/COSName.java?view=markup does have it.
I checked both pdfbox-2.0.4.jar and pdfbox-app-2.0.4.jar by adding them to the NetBeans project where I'm testing the Java files from the PDFBox examples. Neither of them has the property COSName.DOCMDP in the COSName class.
Both jars and the pdfbox sourcecode are downloaded from here:
https://pdfbox.apache.org/download.cgi#20x
How can I replace the property COSName.DOCMDP in the CreateSignatureBase class? Am I getting the right jars?
It will appear in version 2.1.0:
https://issues.apache.org/jira/browse/PDFBOX-3017
https://issues.apache.org/jira/browse/PDFBOX-3699
https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/cos/COSName.java?annotate=1786065
If you need it for testing purposes, you may download its SNAPSHOT version from https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/
Alternatively, you can look at this example as it ships with the current stable version: just download the 2.0.4 jar and browse the examples.
I need to write a Java plugin for Rhapsody that will draw on the attributes. What do you recommend for that? Where should I start? I have not written plug-ins before.
The first place to start is to look at the samples provided by IBM. You can find them (on Windows 7, version 7.5.3 of Rhapsody) in:
C:\Users\\IBM\Rational\Rhapsody\7.5.3\Samples\ExtensibilitySamples
There are 3 types you can create:
1. A plugin (what you are asking about)
2. A Check plugin (ties into the model check sub-system)
3. Event callback plugin (don't know much about this one)
I've written 1 and 2.
There should be a how-to document in or around that directory that walks you through creating a simple plugin. If not, it is probably available in the Rhapsody help (from within the tool).
Basically, you write your Java plugin to conform to a specific interface that IBM provides (com.telelogic.rhapsody.core.RPUserPlugin), create a .hep file that describes the details of that, and then drop the .hep file into the .rpy folder of your project. You then create a new profile in your model with the same name as your .hep file and that should link to the .hep information.
A sample .hep file looks like this:
[Helpers]
numberOfElements=1
#REM: Tranformer Generation plug-in
name1=Generate Transformers
JavaMainClass1=sida.jni.transformerplugin.TransformerPlugin
JavaClassPath1=..\TransformerPlugin\DefaultConfig
isPlugin1=1
isVisible1=1
DLLServerCompatible1=1
Take special note of the numbers added to the end of the attribute names:
ex. isPlugin1, isVisible1
You will want to match that to the name# attribute in the file.
Then make sure your Java plugin class files are on the classpath or (better yet) co-located with your .rpy folder. For example, our plugins sit in a folder right next to (at the same level as) our .rpy folder.
If all goes well, you should see an initialization string spit out in the Rhapsody console window for the plugin.
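The numbering rule for .hep attributes described above can be checked mechanically. The following is a minimal sketch of such a check, not something Rhapsody provides: the class HepCheck and its methods are hypothetical, the parsing is deliberately naive, and it assumes that numberOfElements is the count of helpers numbered from 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lint for .hep files; the rules beyond "suffix must match a name#" are assumptions.
final class HepCheck {
    // Splits a key such as "isPlugin1" into a prefix and a trailing number.
    private static final Pattern INDEXED_KEY = Pattern.compile("^(.*?)(\\d+)$");

    /** Reads key=value lines, skipping blanks, # comments and [section] headers. */
    static Map<String, String> parse(String hepText) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String raw : hepText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("[")) continue;
            int eq = line.indexOf('=');
            if (eq > 0) entries.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return entries;
    }

    static List<String> findProblems(String hepText) {
        Map<String, String> entries = parse(hepText);
        List<String> problems = new ArrayList<>();
        // Rule from the text: every numbered attribute needs a name entry with the same number.
        for (String key : entries.keySet()) {
            Matcher m = INDEXED_KEY.matcher(key);
            if (m.matches() && !entries.containsKey("name" + m.group(2))) {
                problems.add(key + " has no matching name" + m.group(2));
            }
        }
        // Assumption: numberOfElements counts the helpers, numbered 1..N.
        String declared = entries.get("numberOfElements");
        if (declared != null && declared.matches("\\d{1,6}")) {
            int count = Integer.parseInt(declared);
            for (int i = 1; i <= count; i++) {
                if (!entries.containsKey("name" + i)) problems.add("name" + i + " is missing");
            }
        } else {
            problems.add("numberOfElements is missing or not a number");
        }
        return problems;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "[Helpers]",
            "numberOfElements=1",
            "#REM: Tranformer Generation plug-in",
            "name1=Generate Transformers",
            "JavaMainClass1=sida.jni.transformerplugin.TransformerPlugin",
            "JavaClassPath1=..\\TransformerPlugin\\DefaultConfig",
            "isPlugin1=1",
            "isVisible1=1",
            "DLLServerCompatible1=1");
        List<String> problems = findProblems(sample);
        System.out.println(problems.isEmpty() ? "no problems found" : String.join("\n", problems));
    }
}
```

Lines are parsed by hand rather than with java.util.Properties because Properties treats the backslashes in Windows paths as escape characters.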
Hope this gets you started...
EDIT This question is not about how to solve dependencies using Ant / Maven / Gradle or whatnots.
I'm trying to use Neo4j and I'm a bit confused by the docs as to what I need to embed a very simple "Hello, world!" Neo4j example in an app.
I've read in several places that Neo4j was lightweight and that only one (and now two) jars were needed.
For example here: http://highscalability.com/neo4j-graph-database-kicks-buttox
we can read: "Small footprint. Neo4j is a single <500k jar with one dependency (the Java Transaction API)."
This is precisely one of the reasons I'm interested in embedding Neo4j...
So I downloaded the community edition (GPL) of Neo4j and read the explanation here:
http://docs.neo4j.org/chunked/stable/tutorials-java-embedded-setup.html
which says: "Extract a Neo4j download zip/tarball, and use the jar files found in the lib/ directory."
Now that's more than concise, and I've found old messages saying that the "wording was changed". At one point all that Neo4j needed was apparently one jar (which is one of the reasons I was interested in embedding Neo4j, btw). But now it's apparently two, because there's a dependency on some Java Transaction API (which one? a .jar shipped with Neo4j?)
The problem is that if I look in that lib/ dir I've got quite some things:
1115454 lib/neo4j-kernel-1.6.1.jar
153707 lib/neo4j-graph-algo-1.6.1.jar
222791 lib/neo4j-shell-1.6.1.jar
8865464 lib/scala-library-2.9.0-1.jar
43530 lib/neo4j-jmx-1.6.1.jar
590503 lib/neo4j-kernel-1.6.1-tests.jar
23954 lib/neo4j-community-1.6.1.jar
28023 lib/neo4j-udc-1.6.1.jar
1517975 lib/neo4j-cypher-1.6.1.jar
51662 lib/neo4j-graph-matching-1.6.1.jar
16030 lib/geronimo-jta_1.1_spec-1.1.1.jar
143177 lib/neo4j-lucene-index-1.6.1.jar
1466301 lib/lucene-core-3.5.0.jar
118875 lib/server-api-1.6.1.jar
92850 lib/org.apache.servicemix.bundles.jline-0.9.94_1.jar
And in system/lib:
27461 system/lib/blueprints-neo4j-graph-1.1.jar
72650 system/lib/jettison-1.3.jar
628626 system/lib/rrd4j-2.0.7.jar
17985 system/lib/asm-analysis-3.2.jar
177174 system/lib/jetty-util-6.1.25.jar
109043 system/lib/commons-io-1.4.jar
755981 system/lib/neo4j-server-1.6.1.jar
35910 system/lib/gremlin-java-1.4.jar
46367 system/lib/jsr311-api-1.1.1.jar
36551 system/lib/asm-util-3.2.jar
206035 system/lib/commons-beanutils-core-1.8.0.jar
227122 system/lib/jackson-core-asl-1.8.3.jar
33094 system/lib/asm-commons-3.2.jar
17308 system/lib/jcl-over-slf4j-1.6.1.jar
21878 system/lib/asm-tree-3.2.jar
12359 system/lib/log4j-over-slf4j-1.6.1.jar
.
. (skipped a few jars from system/lib here)
.
If my Emacs-fu is strong enough, the jars above weigh in at nearly 17 MB (not that "embeddable")... And I didn't even paste all the jars from system/lib/.
So what is the minimum set of .jar files (how many, and which ones) that I need in order to embed Neo4j and run a simple "Hello, world!" example?
I'm confused by the official doc saying: "... use the jar files found in the lib/ directory".
Surely I don't need all of them right?
Basically, you need only neo4j-kernel-1.6.1.jar (and the mentioned transaction API geronimo-jta_1.1_spec).
However, this will give you only the basic functionality. If you want to use other parts, like indexing, querying, management tools, etc., you would need other jars.
The absolute minimum to run the kernel is
neo4j-kernel.jar
jta.jar
The rest is Cypher, Lucene indexing and other stuff.
I need to transform one XML document into another using XSLT (for now from the command line). I have to use Java 1.4.2. Based on that, someone recommended using Saxon and provided the XSLT. It seems simple and it should work, but I am lost.
I come more from a .NET environment, and have worked with XML and XSLT but not with Saxon, and I am not that strong in Java.
Let me start by explaining what my problem is and what I have tried so far:
The Error:
C:\Projects\new_saxon_download>java net.sf.saxon.Transform -s:source.xml -xsl:style.xsl -o:output.xml
Exception in thread "main" java.lang.NoClassDefFoundError: org/xml/sax/ext/DefaultHandler2
at net.sf.saxon.Configuration.(Configuration.java:2047)
at net.sf.saxon.Transform.setFactoryConfiguration(Transform.java:81)
at net.sf.saxon.Transform.doTransform(Transform.java:133)
at net.sf.saxon.Transform.main(Transform.java:66)
Steps that led me here:
I downloaded Saxon-B by following a link from this page
I also found some information on a dependency about SAX2 from this page and thus obtained that as well.
Set the CLASSPATH in my session:
set CLASSPATH=.;C:\Projects\new_saxon_download\saxon9.jar;C:\Projects\new_saxon_download\sax2r2.jar
Tried the transformation:
java net.sf.saxon.Transform -s:source.xml -xsl:style.xsl -o:output.xml
Then I get the error shown above. I have tried multiple Google searches, but nothing has helped.
Any advice or solution would be very helpful.
GOT IT - the description of how to fix the dependency issue is crap (sorry).
This file sax2r2.jar isn't the one you have to add to the classpath. It contains another jar (sax.jar) and that's the library you actually need. Just extract the sax2r2.jar and put sax.jar on the classpath, then it should work.
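To confirm that a downloaded archive is really a wrapper around another jar, you can list the .jar entries it contains (the standard jar tf command shows the same information). The sketch below does this with java.util.zip; the class name NestedJarFinder is made up for illustration, and the code uses Java 7+ syntax, so it assumes you have a newer JDK available for diagnosis even if the transformation itself must run on 1.4.2.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch: list jar files that are packed inside another archive.
final class NestedJarFinder {

    /** Returns the names of all entries ending in .jar inside the given archive. */
    static List<String> nestedJars(String archivePath) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archivePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().toLowerCase().endsWith(".jar")) {
                    found.add(entry.getName());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NestedJarFinder <archive>");
            return;
        }
        List<String> jars = nestedJars(args[0]);
        if (jars.isEmpty()) {
            System.out.println("no nested jars; the archive itself may be the library");
        } else {
            // These entries must be extracted and put on the classpath themselves.
            for (String name : jars) System.out.println("nested jar: " + name);
        }
    }
}
```

Run against sax2r2.jar, a nested sax.jar entry would indicate that the outer file is only a container and that the inner jar is what belongs on the CLASSPATH.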
Give this a try: Apache XML Commons includes xml-apis.jar. I can't tell if this is usable with Java 1.4.2 but it's worth a try.
Binary releases can be found here. Download one of the xml-commons-external archives, extract xml-apis.jar and add that to your classpath.