Write/Read Apache Parquet Files in Java/Spring - java

I'm just trying to write a Parquet file, but every example I find on Google either uses deprecated methods or simply doesn't work.
Besides, there doesn't seem to be any official documentation with examples.
It would be interesting to see a writing example and a reading-to-POJO example as well.

Related

Read Parquet Files using Apache Arrow

I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't find any good, working examples or any useful documentation for Java (only for Python). Some examples also don't list all the Maven dependencies they need. I also don't want to use a Hadoop file system; I just want to use local files.
Note: I also found out that I can't use "Apache Avro" because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro names.
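To make the Avro restriction concrete, here is a minimal sketch that checks column names against the naming rule in the Avro specification (a name starts with [A-Za-z_] and continues with [A-Za-z0-9_]). The class name AvroNameCheck and the sample column names are made up for illustration; this only mirrors the rule and does not call the Avro library itself.

```java
import java.util.List;
import java.util.regex.Pattern;

class AvroNameCheck {
    // Avro spec: names start with [A-Za-z_] and then contain only [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the column name would be accepted as an Avro field name.
    static boolean isValidAvroName(String name) {
        return AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Hypothetical column names; the last two contain [, ] or $.
        for (String column : List.of("price", "items[0]", "$total")) {
            System.out.println(column + " -> " + isValidAvroName(column));
        }
    }
}
```

Any column that fails this check would have to be renamed before an Avro-based reader could map it to a field.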
Also, can you please provide Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
It's somewhat overkill, but you can use Spark.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Extract JSON-LD from HTML using Apache Any23

My aim is to extract structured data from webpages. I'm using the code mentioned in this SO question. I'm using Apache Any23 CLI library dependency in my Spring project.
With this, I'm able to extract the HTML5 Microdata (Schema.org) from webpages, but I can't extract the JSON-LD data present in them. According to Apache Any23's documentation, the JSON-LD format is supported, but I couldn't find any further documentation on it.
Usually, if you create a new Any23 extractor with new Any23(), it should work out of the box. If you use another constructor like Any23(String... extractorNames), you have to make sure that the correct extractor for embedded JSON-LD is included, which is "html-embedded-jsonld".
Now if there are any errors in the extraction process, Any23 drops them silently. (It's great, I know!)
I found it is possible to set a breakpoint in the notifyIssue method of org.apache.any23.extractor.ExtractionResultImpl. With this you may be able to find a more detailed reason for your problems.

Write parquet in java

I want to write a Parquet file in standalone Java, on the local filesystem (not on Hadoop).
How can I do this?
I know I can do this easily with Spark, but I need to do it in standalone Java, so no Hadoop, Spark, etc.
See this blog post: http://blog.antlypls.com/blog/2015/12/02/how-to-write-data-into-parquet/
The short version is that you need to define or provide a schema and use the appropriate ParquetWriter.
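As a rough sketch of that "schema plus ParquetWriter" approach, the following writes one record to a local file and reads it back through the Avro binding. It assumes the org.apache.parquet:parquet-avro and org.apache.hadoop:hadoop-common artifacts are on the classpath; the Hadoop classes are used only for configuration and local file access, not a cluster. The record layout (User with name and age) and the class name ParquetRoundTrip are made up for illustration, and this is not taken from the linked blog post.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

class ParquetRoundTrip {
    // Hypothetical schema: a record with a string and an int column.
    static final Schema SCHEMA = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .requiredInt("age")
            .endRecord();

    // Writes a single record to the given local path, replacing any existing file.
    static void write(String file) throws IOException {
        Configuration conf = new Configuration();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(HadoopOutputFile.fromPath(new Path(file), conf))
                .withSchema(SCHEMA)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .build()) {
            GenericRecord user = new GenericData.Record(SCHEMA);
            user.put("name", "Ada");
            user.put("age", 36);
            writer.write(user);
        }
    }

    // Reads every record back and collects the "name" column.
    static List<String> readNames(String file) throws IOException {
        Configuration conf = new Configuration();
        List<String> names = new ArrayList<>();
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(HadoopInputFile.fromPath(new Path(file), conf))
                .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                // Avro returns its own Utf8 type for strings, so convert explicitly.
                names.add(record.get("name").toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        write("users.parquet");
        System.out.println(readNames("users.parquet"));
    }
}
```

On Windows, Hadoop's local file system support may need extra setup (for example native helper binaries), so treat this as a starting point rather than a guaranteed recipe.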

Java Jackcess Library Documentation?

I need to read and write some data in an .mdb Access file, and on the web I found the Jackcess library, which does exactly that.
Unfortunately I couldn't find any documentation on how to use it. On the library website there are a couple of examples, but no real documentation. Can anyone tell me if there's some sort of documentation somewhere?
The javadoc is intended to be fairly explanatory. The primary classes would be Database and Table. The library is also heavily unit tested, so you can dig into the unit test code to see many examples. There isn't currently a great "getting started" document. It has been discussed before, but, unfortunately no one has picked up the ball on actually writing it. That said, the help forum is actively monitored.
UPDATE:
There is now a cookbook, which is the beginning of more comprehensive user-level documentation.
You can use jackcess-orm, which uses the DAO pattern and annotated POJOs.

quartz: documentation for xml files?

Where is the documentation for quartz xml files (specifically jobs.xml)? I found the javadoc online, but I can't seem to find the documentation for how to write an xml file, just some brief examples e.g. this one from O'Reilly.
edit: apparently the java class that reads the jobs.xml is JobInitializationPlugin, but I don't see the docs for the xml format there either.
This is really poorly documented. Beyond the brief mention at the OpenSymphony site, the only documentation comes in the form of a Document Type Definition (DTD) and an XML Schema. If you're familiar with these formats, you can use them to figure out what tags are available.
If you download the full Quartz distribution, they are located at /quartz/src/main/resources/org/quartz/xml/. You can also find them inside of quartz-1.7.3.jar at /org/quartz/xml/. The files are named job_scheduling_data_1_5.dtd and job_scheduling_data_1_5.xsd.
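If reading the XSD by hand is tedious, a small sketch like the one below can list the element names it declares, which gives a quick overview of the available tags. It uses only the JDK's XML parser. The class name XsdElementLister is made up, and the classpath location in main is simply the path quoted above, assuming the Quartz jar is on the classpath.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XsdElementLister {
    private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Returns the distinct names of all element declarations, in document order.
    static List<String> elementNames(InputStream xsd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on the XSD namespace
        Document doc = factory.newDocumentBuilder().parse(xsd);

        NodeList nodes = doc.getElementsByTagNameNS(XSD_NS, "element");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Elements declared with ref="..." have no name attribute and are skipped.
            String name = ((Element) nodes.item(i)).getAttribute("name");
            if (!name.isEmpty() && !names.contains(name)) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumes the Quartz jar mentioned above is on the classpath.
        String resource = "/org/quartz/xml/job_scheduling_data_1_5.xsd";
        try (InputStream in = XsdElementLister.class.getResourceAsStream(resource)) {
            if (in == null) {
                System.err.println("Schema not found on classpath: " + resource);
                return;
            }
            elementNames(in).forEach(System.out::println);
        }
    }
}
```

This only lists names; nesting and cardinality still have to be read from the schema itself.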
