Created Hive table with LZO compression, can't locate file with extension .lzo - java

I created a Hive table after setting the following properties at the Hive command prompt:
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
Create table statement:
create external table dept_comp1(id bigint,code string,name string) LOCATION '/users/JOBDATA/comp' ;
insert overwrite table dept_comp select * from src__1;
When I go to the location /users/JOBDATA/comp, I find a file named 000000_0.deflate.
I am not sure whether this is the compressed file, although it is unreadable when I download it. If it is, why does it not have a .lzo extension?
If it is not, where can I find the .lzo file?
Lastly, how can I decompress it using Java?
Thanks

If your goal is to save disk space on HDFS, you can use SnappyCodec compression instead. Some compressed formats, such as .bz2, are splittable. Compression is enabled by setting Hive properties such as:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
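On the Java part of the question: the .deflate suffix is the default extension of Hadoop's DefaultCodec, which suggests the LZO codec setting did not take effect for that insert. Assuming the downloaded file is a plain zlib/DEFLATE stream from a text-format table (an assumption, not verified against this cluster), the JDK's java.util.zip.InflaterInputStream should be able to read it. The class name DeflateReader, the deflate helper, and the sample row are hypothetical and exist only to make the sketch self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class DeflateReader {

    // Reads a zlib/DEFLATE stream to the end and returns the decompressed bytes.
    // Assumption: *.deflate files written by Hadoop's DefaultCodec use this format.
    static byte[] inflate(InputStream compressed) {
        try (InflaterInputStream in = new InflaterInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: produces zlib-compressed bytes so the sketch runs without a cluster.
    static byte[] deflate(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical sample row standing in for the table data.
        byte[] original = "1,HR,Human Resources\n".getBytes(StandardCharsets.UTF_8);
        byte[] restored = inflate(new ByteArrayInputStream(deflate(original)));
        System.out.print(new String(restored, StandardCharsets.UTF_8));
    }
}
```

For a file copied out of HDFS, the same inflate method could be given Files.newInputStream(Paths.get("000000_0.deflate")), with the IOException from opening the file handled by the caller. If the table really were written with LzopCodec, this would not apply, since LZO is not supported by java.util.zip.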

Related

jDBI createStatement from a file

I want to put my query commands in a sql file, and then using createStatement read the query from the file and do the binding.
Doing h.createStatement("SOME LONG QUERY WITH BUNCH OF JOINS AND WHERES IS HARD TO READ IN JAVA") is not very legible.
What's the best way, other than using File to open and read the file?
Jdbi provides the ClasspathSqlLocator class to read files on the classpath.
For example, this returns the content of the file query.sql which is inside the folder jdbiTest on the classpath:
String query = ClasspathSqlLocator.findSqlOnClasspath("jdbiTest.query");
Link to the documentation: http://jdbi.org/#_classpathsqllocator

Java - Open CSV - .csv to .xls extension

We have generated a .csv file using the OpenCSV library in Java. Our requirement is to change the extension from .csv to .xls.
When we changed the extension blindly by renaming the file to .xls in Java code, the data was no longer aligned or formatted properly.
When we open the .csv file in Excel, the values are aligned properly in the table. But when we change the extension to .xls and open it, every row appears as comma-separated values in a single column, i.e., the values are not placed under their respective columns.
Why not open the .csv file in Excel, choose "Save As", and select Excel spreadsheet as the file type?
That is the part you are missing. Changing the extension does not change the file type; it only changes how most programs treat the file. Open a real Excel spreadsheet in a text editor and I assure you that you will see a lot more than comma-separated values.
You could look at VBS scripts. I do the opposite (xlsx to csv) with a script that I found, so I would guess the reverse conversion is possible too. I hope you find your solution there!
Here is a script that converts an xlsx to a csv:
If WScript.Arguments.Count < 2 Then
    WScript.Echo "Error! Please specify the source path and the destination. Usage: XlsToCsv SourcePath.xls Destination.csv"
    WScript.Quit
End If
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(WScript.Arguments.Item(0))
' 6 is the xlCSV file format constant
oBook.SaveAs WScript.Arguments.Item(1), 6
oBook.Close False
oExcel.Quit
I think you need to use Apache POI ("the Java API for Microsoft Documents") to write a real .xls file.

Save image file to HDFS using Spark

I have an image file:
image = JavaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary data to HDFS using Spark. Something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? I am not asking whether it is this simple, just whether it can be done, and if so, how. I would like to keep a one-to-one mapping if possible, preserving the extension and type, so that if I download the file directly with the hdfs command line it is still a valid image file.
Yes, it is possible, but you need a data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume each image is represented as binary data (byte[]) in your program, so the images can be a Dataset<byte[]>.
You can save it using:
datasetOfImages.write()
.format("com.databricks.spark.avro")
.save("hdfs://cluster:port/path/to/images.avro");
images.avro would be a folder containing multiple partitions, and each partition would be an Avro file holding some of the images.
Edit:
It is also possible, but not recommended, to save the images as separate files. You can call foreach on the dataset and use the HDFS API to save each image.
Below is a piece of code written in Scala. You should be able to translate it into Java.
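import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // Build the FileSystem on the executor; the SparkContext is not usable inside this closure.
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    // Placeholder path: each image needs its own unique path.
    val out = fs.create(new Path("/path/to/this/image"))
    try out.write(image) finally out.close()
  }
}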

Java: `A` Archive attribute missing while creating zip programmatically

We are dealing with decompression libraries/utilities that use an attribute to check for the presence of directories/files within the zip.
The problem is that we are not able to set the archive bit for a zip during creation. When we create the zip programmatically, it also wipes out the previous attributes.
We tried to set the archive bit with the steps below, but have not obtained the desired result so far:
1. Parse each zip entry and getExtra byte[].
2. Use Int value=32 and perform bitwise 'OR' operation.
3. setExtra byte[] after 'OR' operation.
Adding some more details:
We tried the following approaches, but the issue is still unresolved.
Using the setAttribute() method on the file system, but the attributes are reset while creating the zip.
Files.setAttribute(file, "dos:archive", true)
Using Files.copy(), which copies the file attributes associated with the file to the target file, but with no success. Even existing attributes are not retained on the target file.
Files.copy(path, path, StandardCopyOption.COPY_ATTRIBUTES)
Using ZipEntry.setExtra(byte[]).
We found some information online that Java doesn't have any direct method to set these attributes, but according to some articles the extra field is used to set the file permissions on Unix and the MS-DOS file attributes. This is an undocumented field and we didn't find any reliable information online. Basically, the initial 2 bytes are used for Unix and the last 2 bytes are used for DOS file attributes. We tried setting the DOS file attributes with different values.
ZipEntry.setExtra(byte[]) - Sets the optional extra field data for the entry.
Using the WinZip command line tool, but this is not an elegant solution.
I assume the file system is DOS (Windows).
With Java 7:
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
File theFile = new File("yourfile.zip");
Path file = theFile.toPath();
Files.setAttribute(file, "dos:archive", true);
See: http://kodejava.org/how-do-i-set-the-value-of-file-attributes/

Java: Marking/Flagging a file

I would like to know whether or not there is some way of marking a file to identify whether or not the file contains x.
Consider the following example:
During a batch conversion process I am creating a log file which lists the success / failure of individual conversions.
So the process is as follows:
start conversion process
create log file named batch_XXX_yyyy_mm_dd.log
try to convert 'a'
write success to log file
try to convert 'b'
write success to log file
...
try to convert 'z'
write success to log file
close and persist log file
What I would like to be able to do is mark a file in some way that identifies whether any of the conversions logged in the file were unsuccessful.
I do not want to change the file name (visibly) and I do not want to open the file to check for a marker.
Does anyone have any ideas on how this could be achieved?
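You can add file attributes in Java 7 through the java.nio.file.Files class, as long as the file system supports user-defined attributes.
So it would be possible to mark whether a file contains X using the Files.setAttribute() method with the "user" attribute view, whose values are byte arrays:
Files.setAttribute(Paths.get("path/to/file"), "user:containsX", "true".getBytes(StandardCharsets.UTF_8));
And then check whether the file contains X using the Files.getAttribute() method:
byte[] containsX = (byte[]) Files.getAttribute(Paths.get("path/to/file"), "user:containsX");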
If your log file is, say,
file.log
create a second file that holds this information, say,
file.log.status
Your status file can then contain all the information you need. This makes it easy to get the conversion status for every log file, and just as easy to map a status file back to its original log.
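The status-file idea could be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name LogStatus, the "<log name>.status" naming convention, and the FAILED/OK markers are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LogStatus {

    // Hypothetical convention: the status file sits next to the log as "<log name>.status".
    static Path statusFileFor(Path log) {
        return log.resolveSibling(log.getFileName() + ".status");
    }

    // Writes (or overwrites) the status file; the log file itself is never touched.
    static void markStatus(Path log, boolean hasFailures) {
        String marker = hasFailures ? "FAILED" : "OK";
        try {
            Files.write(statusFileFor(log), marker.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads only the small status file, so the log never has to be opened.
    static boolean hasFailures(Path log) {
        try {
            byte[] content = Files.readAllBytes(statusFileFor(log));
            return "FAILED".equals(new String(content, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical log name in the temp directory, used only for the demo.
        Path log = Paths.get(System.getProperty("java.io.tmpdir"), "batch_demo.log");
        markStatus(log, true);
        System.out.println(statusFileFor(log) + " reports failures: " + hasFailures(log));
    }
}
```

The batch process would call markStatus once, just before it closes the log, and anything that later needs to triage logs would call hasFailures. The trade-off is that two files have to be kept together when logs are moved or archived.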
