I have a directory of text-based, compressed log files, each containing many records. In older versions of Hadoop I would extend MultiFileInputFormat to return a custom RecordReader that decompressed the log files, and continue from there. But I'm trying to use Hadoop 0.20.2.
In the Hadoop 0.20.2 documentation, I notice MultiFileInputFormat is deprecated in favor of CombineFileInputFormat. But to extend CombineFileInputFormat, I have to use the deprecated classes JobConf and InputSplit. What is the modern equivalent of MultiFileInputFormat, or the modern way of getting records from a directory of files?
o.a.h.mapred.* has the old API, while o.a.h.mapreduce.* has the new API. Some of the input/output formats have not been migrated to the new API; MultiFileInputFormat/CombineFileInputFormat are among those still missing in 0.20.2. I remember a JIRA being opened to migrate the missing formats, but I don't remember the JIRA #.
But to extend CombineFileInputFormat, I have to use the deprecated classes JobConf and InputSplit.
For now it should be OK to use the old API. Check this response in the Apache forums. I am not sure of the exact plans for dropping support for the old API, but I don't think many have started using the new API yet, so I think it will be supported for the foreseeable future.
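To make that concrete, here is a rough sketch of the old-API route for the compressed-log scenario in the question. It is only an illustration under assumptions: the class names are made up, each per-file reader just delegates to the old-API LineRecordReader (which picks up the configured compression codec for each file), and it assumes each compressed file is packed whole into a combined split.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small compressed log files into combined splits (old mapred API).
public class CombinedLogInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // CombineFileRecordReader creates one PerFileLineReader per file in the split.
        return new CombineFileRecordReader<LongWritable, Text>(
                job, (CombineFileSplit) split, reporter, (Class) PerFileLineReader.class);
    }

    // Reads a single file of the combined split; CombineFileRecordReader
    // requires exactly this constructor signature.
    public static class PerFileLineReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate;

        public PerFileLineReader(CombineFileSplit split, Configuration conf,
                                 Reporter reporter, Integer index) throws IOException {
            FileSplit fileSplit = new FileSplit(split.getPath(index),
                    split.getOffset(index), split.getLength(index), split.getLocations());
            // LineRecordReader consults the compression codec factory, so
            // gzip/bzip2 log files are decompressed transparently.
            delegate = new LineRecordReader(conf, fileSplit);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            return delegate.next(key, value);
        }
        public LongWritable createKey() { return delegate.createKey(); }
        public Text createValue() { return delegate.createValue(); }
        public long getPos() throws IOException { return delegate.getPos(); }
        public float getProgress() throws IOException { return delegate.getProgress(); }
        public void close() throws IOException { delegate.close(); }
    }
}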
I would like to use commons-compress to work with various compression/archive formats.
However, at first glance it seems commons-compress only supports detecting the type of a file based on its first few bytes.
Is there a way I can use commons-compress to automatically detect file-types based on file extension? I surely can build this myself, but it would be nice to have this provided by the compression library itself.
After some more digging, I found that there are a few classes that help here, namely FileNameUtil, BZip2Utils, GzipUtils, ...; so for each supported format there is a *Utils class which allows detecting that type by extension.
See e.g. http://commons.apache.org/proper/commons-compress/javadocs/api-1.10/org/apache/commons/compress/compressors/bzip2/BZip2Utils.html
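For example, a small sketch along those lines (assuming commons-compress 1.x is on the classpath; the detectByExtension helper and the set of formats checked are just illustrative):

import org.apache.commons.compress.compressors.bzip2.BZip2Utils;
import org.apache.commons.compress.compressors.gzip.GzipUtils;
import org.apache.commons.compress.compressors.xz.XZUtils;

public class ExtensionDetector {

    // Guess the compression format from the file name alone; no bytes are read.
    public static String detectByExtension(String fileName) {
        if (GzipUtils.isCompressedFilename(fileName)) {
            return "gzip";
        } else if (BZip2Utils.isCompressedFilename(fileName)) {
            return "bzip2";
        } else if (XZUtils.isCompressedFilename(fileName)) {
            return "xz";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detectByExtension("logs.txt.gz"));   // gzip
        System.out.println(detectByExtension("logs.txt.bz2"));  // bzip2
    }
}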
I am starting a new project where I have a third-party XSD. My Java application needs to generate and read XML messages that conform to this XSD. In the past I have used Apache XMLBeans for this. It looks like XMLBeans has been retired.
What is a good replacement for XMLBeans now that it has been retired? I have used XStream on other projects but I don't recall that XStream has the ability to generate Java classes from an XSD so I am thinking that it is not a good choice over XMLBeans for this use case. I have hundreds of types defined in the XSD and would really prefer not to have to create the hundreds of Java classes to represent them in Java by hand.
In other words, using the XStream example, I have a Person type (and 99 others) defined in the XSD. Using XMLBeans I can generate the Java classes to represent these objects, but using XStream I would need to create the Java classes (e.g. Person) by hand or using some other tool. What tool should I use in this case?
Have you looked at JAXB? I haven't done anything with either of these, but googling for "alternative to XMLBeans" brings up lots of references to this package. Here's an article that compares them...
http://blog.bdoughan.com/2012/01/how-does-jaxb-compare-to-xmlbeans.html
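For what it's worth, the JAXB workflow looks roughly like the sketch below, assuming you first run the third-party XSD through the xjc binding compiler (the Person class and file names are placeholders for whatever xjc generates from your schema):

// Generate the classes once from the schema, e.g.:
//   xjc -d src/main/java -p com.example.generated schema.xsd
// This produces one Java class per XSD type (Person below stands in for one of them).
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;

public class JaxbRoundTrip {
    public static void main(String[] args) throws Exception {
        // One context covering the generated classes reachable from Person.
        JAXBContext context = JAXBContext.newInstance(Person.class);

        // Read an XML message that conforms to the XSD.
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Person person = (Person) unmarshaller.unmarshal(new File("person.xml"));

        // Generate XML from the same object model.
        Marshaller marshaller = context.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(person, new File("person-out.xml"));
    }
}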
XMLBeans has been unretired:
The Apache POI project has unretired the XMLBeans codebase and is maintaining it as a sub-project. Until now the XMLBeans codebase was held in the Apache Attic where former Apache projects are kept for the public good.
The latest release as of August 2020 is:
3.1.0 (March 26, 2019)
Having said that, I am not sure I would recommend using it, given its history of retirement. Other solutions, such as JAXB, might be preferred, since they will probably be better maintained in the future.
I am working on my project to integrate Apache Avro into my MapR program. However, I am very confused by the usage of the new mapreduce packages compared to mapred. The latter has detailed instructions on how to use it in different situations, while much less information is given for the new one. What I do know is that they correspond to the new and old interfaces of Hadoop.
Does anyone have any experience or examples of using the mapreduce interfaces for jobs whose input is non-Avro data (such as a TextInputFormat file) and whose output is an Avro file?
The two packages represent input / output formats, mapper and reducer base classes for the corresponding Hadoop mapred and mapreduce APIs.
So if your job uses the old (mapred) package APIs, then you should use the corresponding mapred avro package classes.
Avro has an example word count adaptation that uses Avro output format, which should be easy to modify for the newer mapreduce API:
http://svn.apache.org/viewvc/avro/trunk/doc/examples/mr-example/src/main/java/example/AvroWordCount.java?view=markup
Here's some gist with the modifications: https://gist.github.com/chriswhite199/6755242
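For reference, here is a rough sketch of that kind of map-only job on the newer mapreduce API (not taken from the gist; the schema, class names and paths below are made up): plain text comes in via TextInputFormat and Avro records go out via AvroKeyOutputFormat.

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextToAvro {

    // Trivial schema for illustration: one string field per input line.
    static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
            + "[{\"name\":\"text\",\"type\":\"string\"}]}");

    public static class LineMapper
            extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("text", value.toString());
            context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-to-avro");
        job.setJarByClass(TextToAvro.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                         // map-only job

        AvroJob.setOutputKeySchema(job, SCHEMA);          // new-API AvroJob
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}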
I have to write a very large XLS file, I have tried Apache POI but it simply takes up too much memory for me to use.
I had a quick look through StackOverflow and I noticed some references to the Cocoon project and, specifically, the HSSFSerializer. It seems that this is a more memory-efficient way to write XLS files to disk (from what I've read; please correct me if I'm wrong!).
I'm interested in the use case described here: http://cocoon.apache.org/2.1/userdocs/xls-serializer.html . I've already written the code to write out the file in the Gnumeric format, but I can't seem to find how to invoke the HSSFSerializer to convert it to XLS.
On further reading it seems like the Cocoon project is a web framework of sorts. I may very well be barking up the wrong tree, but:
Could you provide an example of reading in a file, running the HSSFSerializer on it and writing that output to another file? It's not clear how to do so from the documentation.
My friend, the HSSF serializer is part of POI. You are just setting certain attributes in the XML to be serialized (but you need a whole process to create it). Also, setting up a whole pipeline using this framework just to create an XLS seems odd, as it changes the app's architecture. Is that your decision?
From the docs:
An alternate way of generating a spreadsheet is via the Cocoon serializer (yet you'll still be using HSSF indirectly). With Cocoon you can serialize any XML datasource (which might be a ESQL page outputting in SQL for instance) by simply applying the stylesheet and designating the serializer.
If memory is an issue, try XSSF or SXSSF in POI.
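If you go the SXSSF route, a minimal sketch looks something like this (note that SXSSF writes .xlsx rather than the old binary .xls; the file name and row count are placeholders, and only a small window of rows is kept in memory while earlier rows are flushed to temporary files):

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class LargeSheetWriter {
    public static void main(String[] args) throws Exception {
        SXSSFWorkbook workbook = new SXSSFWorkbook(100); // keep only 100 rows in memory
        Sheet sheet = workbook.createSheet("data");

        for (int r = 0; r < 1000000; r++) {
            Row row = sheet.createRow(r);
            Cell cell = row.createCell(0);
            cell.setCellValue("row " + r);
        }

        FileOutputStream out = new FileOutputStream("big.xlsx");
        try {
            workbook.write(out);
        } finally {
            out.close();
        }
        workbook.dispose(); // removes the temporary files backing the stream
    }
}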
I don't know if by "XLS" you mean a specific, prior to Office 2007, version of this "Horrible SpreadSheet Format" (which is what HSSF stands for), or just anything you can open with a recent version of MS Office, OpenOffice, ...
So depending on your client requirements (i.e. those that will open your Excel file), another option might be available: generating a .XLSX file.
It comes down to producing an XML file in the proper grammar, which seems to fit your situation, as you seem to have already done that with the Gnumeric XML-based file format without technical trouble and without hitting memory-efficiency issues.
Please note other XML-based spreadsheet formats exist, that Excel and other clients would be able to use. You might want to dig into the open document file formats.
As to whether to use Apache Cocoon or something else:
Cocoon can certainly host the XSL processing; batch (Cocoon CLI) processing is available if you require Cocoon but need it not to run as a webapp (though, as far as I remember, the CLI feature was broken in the latest builds of the 2.1 series); and Cocoon comes with a load of features and technologies that could address further requirements.
Cocoon might be overkill if it just comes down to running an XSL transformation, for which there is a bunch of well-known, lighter tools you can pick from.
I want to poll a directory to check whether a new file has been added to it. If any new file is added, I want to read that file.
Can anybody give me an idea how to do that?
Java 7 has a file watcher API
JNotify will do it as well.
If you are using Java 7, you can use the filesystem watch service (a new feature in Java 7).
See Oracle's tutorial that explains how to use it.
Otherwise (if you're using an older Java version) you can use a library such as Apache Commons IO. Look at the package org.apache.commons.io.monitor - it has classes to check for changes in files and directories.
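A rough sketch of the Java 7 watch-service approach (the directory path is a placeholder; note that ENTRY_CREATE can fire before the writer has finished the file, so real code may need to wait or retry before reading):

import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.List;

public class DirectoryWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/incoming");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();               // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    Path newFile = dir.resolve((Path) event.context());
                    // Read the newly added file (assumes the writer is done with it).
                    List<String> lines = Files.readAllLines(newFile, StandardCharsets.UTF_8);
                    System.out.println(newFile + ": " + lines.size() + " lines");
                }
            }
            if (!key.reset()) {                          // directory no longer accessible
                break;
            }
        }
    }
}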
JNotify would be useful.
See Also
directory-listener-in-java
Why not Apache Camel?
Here's what the code will look like:
from("file://pollingfolder?delete=true").to("bean:handleOrder");
That reads the files from "pollingfolder", deletes them upon read, and sends them to a bean called "handleOrder". Just one line!
There's an easy way to configure it with Spring-boot automagically if you use Spring, but it can be used in plain Java as well.
Source: Apache Camel
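For the plain-Java case mentioned above, a rough sketch (Camel 2.x style; OrderHandler is just a stand-in for your own handler bean) might look like this:

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;
import org.apache.camel.impl.SimpleRegistry;

public class PollingFolderRoute {
    public static void main(String[] args) throws Exception {
        // Register the bean the route refers to by name.
        SimpleRegistry registry = new SimpleRegistry();
        registry.put("handleOrder", new OrderHandler());

        CamelContext context = new DefaultCamelContext(registry);
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("file://pollingfolder?delete=true").to("bean:handleOrder");
            }
        });

        context.start();
        Thread.sleep(60000); // keep the JVM alive while the route polls the folder
        context.stop();
    }

    // Stand-in handler: Camel binds the file contents to the String parameter.
    public static class OrderHandler {
        public void process(String body) {
            System.out.println("Got file contents: " + body);
        }
    }
}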