Java - Download sequence file in Hadoop

I have a problem copying binary files (which are stored as sequence files in Hadoop) to my local machine. The binary file I download from HDFS is not the original binary file I generated when running my map-reduce tasks. I Googled similar problems and I guess the issue is that when I copy the sequence files to my local machine, I get the header of the sequence file plus the original file.
My question is: is there any way to avoid downloading the header but still preserve my original binary file?
There are two ways I can think of:
I can transform the binary file into some other format, like Text, so that I can avoid using SequenceFile. After copyToLocal, I transform it back into a binary file.
I can still use the sequence file, but when I generate the binary file I also generate some meta information about the corresponding sequence file (e.g. the length of the header and the original length of the file). Then, after copyToLocal, I use the downloaded binary file (which contains the header, etc.) together with the meta information to recover my original binary file.
I don't know which one is feasible. Could anyone give me a solution? Could you also show some sample code for the solution you suggest?
I highly appreciate your help.

I found a workaround for this question. Since downloading the sequence file gives you the header and other magic words in the binary file, the way I avoid the problem is to transform my original binary file into a Base64 string, store it as Text in HDFS, and, when downloading the encoded binary files, decode it back into my original binary file.
I know this takes extra time, but currently I can't find any other solution to this problem. The hard part about removing the header and other magic words from the sequence file directly is that Hadoop may insert sync markers into the middle of my binary data.
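For illustration, here is a minimal sketch of that encode/decode round trip, assuming Java 8's java.util.Base64; the file names are made up, and the HDFS write/read of the encoded Text is left out:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        // Encode the original binary file to a Base64 string before storing it as Text in HDFS
        byte[] original = Files.readAllBytes(Paths.get("original.bin"));
        String encoded = Base64.getEncoder().encodeToString(original);
        Files.write(Paths.get("encoded.txt"), encoded.getBytes(StandardCharsets.UTF_8));

        // After copyToLocal, decode the downloaded text back into the original binary file
        String downloaded = new String(Files.readAllBytes(Paths.get("encoded.txt")), StandardCharsets.UTF_8);
        Files.write(Paths.get("restored.bin"), Base64.getDecoder().decode(downloaded));
    }
}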
If anyone has a better solution to this problem, I'd be very happy to hear about it. :)

Use a MapReduce job to read the SequenceFile, with SequenceFileInputFormat as the input format. This splits the file into key/value pairs, and the value holds only the binary file contents, which you can use to create your binary file.
Here is a code snippet that splits a sequence file made up of multiple images into the individual binary files and writes them out to the file system.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CreateOrgFilesFromSeqFile {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        if (args.length != 2) {
            System.out.println("Incorrect No of args (" + args.length + "). Expected 2 args: <seqFileInputPath> <outputPath>");
            System.exit(-1);
        }

        Path seqFileInputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CreateSequenceFile");

        job.setJarByClass(CreateOrgFilesFromSeqFile.class);
        job.setMapperClass(CreateOrgFileFromSeqFileMapper.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, seqFileInputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Delete the existing output directory
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : -1);
    }
}

class CreateOrgFileFromSeqFileMapper extends Mapper<Text, BytesWritable, NullWritable, Text> {

    @Override
    public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {

        Path outputPath = FileOutputFormat.getOutputPath(context);
        FileSystem fs = outputPath.getFileSystem(context.getConfiguration());

        // The key of each record is the original file path; keep only the file name
        String[] filePathWords = key.toString().split("/");
        String fileName = filePathWords[filePathWords.length - 1];

        System.out.println("output file: " + outputPath.toString() + "/" + fileName + ", value length: " + value.getLength());

        // Write the raw bytes of this record out as an individual file
        try (FSDataOutputStream fdos = fs.create(new Path(outputPath.toString() + "/" + fileName))) {
            fdos.write(value.getBytes(), 0, value.getLength());
            fdos.flush();
        }

        context.write(NullWritable.get(), new Text(outputPath.toString() + "/" + fileName));
    }
}

Related

Failed to remove the header of multiple csv files

I want to remove the header of multiple csv files. When I try to do this it throws an error, although I am able to remove a single csv file's header this way.
What am I missing, so that I can remove the headers of multiple csv files in one shot? I need help with this.
Note: I believe I have given a correct filename, directory name, or volume label syntax.
package hadoop;

import java.io.IOException;
import java.io.RandomAccessFile;

class RemoveLine {
    public static void main(String... args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("F://sample1/*.csv", "rw");
        // Initial write position
        long writePosition = raf.getFilePointer();
        raf.readLine();
        // Shift the next lines upwards.
        long readPosition = raf.getFilePointer();
        byte[] buff = new byte[1024];
        int n;
        while (-1 != (n = raf.read(buff))) {
            raf.seek(writePosition);
            raf.write(buff, 0, n);
            readPosition += n;
            writePosition += n;
            raf.seek(readPosition);
        }
        raf.setLength(writePosition);
        raf.close();
    }
}
Output:
Exception in thread "main" java.io.FileNotFoundException: F:\sample1\*.csv (The filename, directory name, or volume label syntax is incorrect)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(Unknown Source)
at java.io.RandomAccessFile.<init>(Unknown Source)
at java.io.RandomAccessFile.<init>(Unknown Source)
at hadoop.RemoveLine.main(RemoveLine.java:12)
You are probably thinking of the glob syntax that you use at the command line. Windows cmd and Linux bash take something like *.csv and expand it into a list of all the matching file names.
Java's RandomAccessFile, on the other hand, expects a specific file name and does not understand glob syntax. You must implement that behaviour yourself: first get a list of all the files you want to change, then iterate over that list and perform the actions you want, as in the sketch below.
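For example, here is a minimal sketch of that approach using java.nio.file's DirectoryStream with a glob pattern; removeFirstLine is a placeholder for the RandomAccessFile logic from the question, and the directory path is only illustrative:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RemoveHeaders {
    public static void main(String[] args) throws IOException {
        // Expand the glob ourselves: list every .csv file in the directory,
        // then apply the header-removal logic to each file in turn.
        try (DirectoryStream<Path> csvFiles = Files.newDirectoryStream(Paths.get("F:/sample1"), "*.csv")) {
            for (Path csv : csvFiles) {
                removeFirstLine(csv);
            }
        }
    }

    private static void removeFirstLine(Path file) throws IOException {
        // ... the RandomAccessFile logic from the question, applied to file.toString()
    }
}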

SequenceFile Compactor of several small files in only one file.seq

I am new to HDFS and Hadoop:
I am developing a program which should take all the files of a specific directory, where we can find several small files of any type.
It should take every file and append it to a compressed SequenceFile, where the key is the path of the file and the value is the file that was read. For now my code is:
import java.net.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.BZip2Codec;

public class Compact {
    public static void main(String[] args) throws Exception {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
            Path destino = new Path("/user/cloudera/data/testPractice.seq"); //test args[1]
            if ((fs.exists(destino))) {
                System.out.println("exist : " + destino);
                return;
            }
            BZip2Codec codec = new BZip2Codec();
            SequenceFile.Writer outSeq = SequenceFile.createWriter(conf
                    , SequenceFile.Writer.file(fs.makeQualified(destino))
                    , SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec)
                    , SequenceFile.Writer.keyClass(Text.class)
                    , SequenceFile.Writer.valueClass(FSDataInputStream.class));

            FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); //args[0]
            for (int i = 0; i < status.length; i++) {
                FSDataInputStream in = fs.open(status[i].getPath());
                outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()), new FSDataInputStream(in));
                fs.close();
            }
            outSeq.close();
            System.out.println("End Program");
        } catch (Exception e) {
            System.out.println(e.toString());
            System.out.println("File not found");
        }
    }
}
But after every execution I receive this exception:
java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
File not found
I understand the error must be in the type of object I define for adding to the SequenceFile, but I don't know which one I should use. Can anyone help me?
FSDataInputStream, like any other InputStream, is not intended to be serialized. What should serializing an "iterator" over a stream of bytes even do?
What you most likely want to do is to store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and just get all the bytes out of the FSDataInputStream. One drawback of using a key/value SequenceFile for such a purpose is that the content of each file has to fit in memory. That is fine for small files, but you have to be aware of this issue.
I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?
Thanks a lot for your comments. The problem was the serializer, as you said, and finally I used BytesWritable:
FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); //args[0]
for (int i = 0; i < status.length; i++) {
    FSDataInputStream in = fs.open(status[i].getPath());
    byte[] content = new byte[(int) fs.getFileStatus(status[i].getPath()).getLen()];
    in.readFully(content);
    in.close();
    // note: the writer must now be created with SequenceFile.Writer.valueClass(BytesWritable.class)
    outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()),
            new org.apache.hadoop.io.BytesWritable(content));
}
outSeq.close();
There are probably better solutions in the Hadoop ecosystem, but this problem was an exercise for a degree I am working on, and for now we are reinventing the wheel to understand the concepts ;-).

Get MIME type from dicom files in java

I have tried all the following:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
public class mimeDicom {
    public static void main(String[] argvs) throws IOException {
        String path = "Image003.dcm";
        String[] mime = new String[3];
        File file = new File(path);

        mime[0] = Files.probeContentType(file.toPath());
        mime[1] = URLConnection.guessContentTypeFromName(file.getName());

        InputStream is = new BufferedInputStream(new FileInputStream(file));
        mime[2] = URLConnection.guessContentTypeFromStream(is);

        for (String m : mime)
            System.out.println("mime: " + m);
    }
}
But the result is still mime: null for each of the methods above, and I really want to know whether the file is a DICOM, as such files sometimes have no extension or a different one.
How can I tell whether the file at a given path is a DICOM file?
Note: this is not a duplicate of How to accurately determine mime data from a file?, because the excellent list of magic numbers there doesn't cover DICOM files, and Apache Tika returns application/octet-stream, which doesn't really identify it as an image and isn't useful, since NIfTI files (among others) get exactly the same MIME type from Tika.
To determine whether a file is DICOM, your best bet is to parse the file yourself and check whether it contains the magic bytes "DICM" at file offset 128.
The first 128 bytes are usually 0 but may contain anything.
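For illustration, here is a minimal sketch of that check (the path is the one from the question; files written without the standard preamble would still need a real DICOM parser, e.g. dcm4che):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class DicomSniffer {
    // Returns true if the file carries the standard DICOM preamble:
    // 128 arbitrary bytes followed by the ASCII magic "DICM".
    static boolean isDicom(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (raf.length() < 132) {
                return false; // too short to contain preamble + magic
            }
            byte[] magic = new byte[4];
            raf.seek(128); // skip the 128-byte preamble
            raf.readFully(magic);
            return "DICM".equals(new String(magic, StandardCharsets.US_ASCII));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isDicom("Image003.dcm") ? "application/dicom" : "unknown");
    }
}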

How to get exact size of zipped file before zipping?

I am using the following standalone class to calculate the size of zipped files before zipping.
I am using compression level 0, but I am still getting a difference of a few bytes.
Can you please help me out here, so I can get the exact size?
Quick help will be appreciated.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import org.apache.commons.io.FilenameUtils;
public class zipcode {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        try {
            CRC32 crc = new CRC32();
            byte[] b = new byte[1024];
            File file = new File("/Users/Lab/Desktop/ABC.xlsx");
            FileInputStream in = new FileInputStream(file);
            crc.reset();

            // output file
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream("/Users/Lab/Desktop/ABC.zip"));
            // name the file inside the zip file
            ZipEntry entry = new ZipEntry("ABC.xlsx");
            entry.setMethod(ZipEntry.DEFLATED);
            entry.setCompressedSize(file.length());
            entry.setSize(file.length());
            entry.setCrc(crc.getValue());
            out.setMethod(ZipOutputStream.DEFLATED);
            out.setLevel(0);
            //entry.setCompressedSize(in.available());
            //entry.setSize(in.available());
            //entry.setCrc(crc.getValue());
            out.putNextEntry(entry);

            // buffer size
            int count;
            while ((count = in.read(b)) > 0) {
                System.out.println();
                out.write(b, 0, count);
            }
            out.close();
            in.close();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Firstly, I'm not convinced by your explanation of why you need to do this. There is something wrong with your system design or implementation if it is necessary to know the file size before you start uploading.
Having said that, the solution is basically to create the ZIP file on the server side so that you know its size before you start uploading it to the client:
Write the ZIP file to a temporary file and upload from that.
Write the ZIP file to a buffer in memory and upload from that.
If you don't have either the file space or the memory space on the server side, then:
Create "sink" outputStream that simply counts the bytes that are written to calculate the nominal file size.
Create / write the ZIP file to the sink, and capture the file size.
Open your connection for uploading.
Send the metadata including the file size.
Create / write the ZIP a second time, writing to the socket stream ... or whatever.
These three approaches will all allow you to create and send a compressed ZIP, if that is going to help.
If you insist on trying to do this on-the-fly in one pass, then you are going to need to read the ZIP file spec in forensic detail ... and do some messy arithmetic. Helping you with that is probably beyond the scope of an SO question.
I had to do this myself to write the zip results straight to AWS S3 which requires a file size. Unfortunately there is no way I found to compute the size of a compressed file without performing the computation on each block of data.
One method is to zip everything twice. The first time you throw out the data but add up the number of bytes:
// requires java.util.List, java.util.concurrent.atomic.AtomicLong and java.util.zip.ZipOutputStream
long getSize(List<InputStream> files) throws IOException {
    final AtomicLong counter = new AtomicLong(0L);

    // A "sink" stream that discards the data but counts every byte written to it
    final OutputStream countingStream = new OutputStream() {
        @Override
        public void write(int b) throws IOException {
            counter.incrementAndGet();
        }
    };

    ZipOutputStream zoutcounter = new ZipOutputStream(countingStream);

    // Loop through files or input streams here and do compression
    // ...

    zoutcounter.close();
    return counter.get();
}
The alternative is to do the above, creating an entry for each file but not writing any actual data (don't call write()), so that you compute the total size of just the zip entry headers. This will only work if you turn off compression like this:
entry.setMethod(ZipEntry.STORED);
The size of the zip entries plus the size of each uncompressed file should give you an accurate final size, but only with compression turned off. You don't have to set the CRC values or any of those other fields when computing the zip file size, as those fields always take up the same space in the final entry header. It's only the name, comment, and extra fields of the ZipEntry that vary in size; the other fields, like the file size and CRC, occupy the same space in the final zip file whether or not they were set.
There is one more solution you can try. Guess the size conservatively, add a safety margin, and compress aggressively; then pad the rest of the file until it equals your estimated size (Zip ignores padding). If you implement an output stream that wraps your actual output stream but implements the close operation as a no-op, you can pass it as the output stream for your ZipOutputStream. After you close your ZipOutputStream instance, write the padding to the actual output stream until it reaches your estimated number of bytes, then close it for real. The file will be larger than it could be, but you save the computation of the accurate file size, and the result still benefits from at least some compression.
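For illustration, a minimal sketch of such a close-suppressing wrapper (the class name is made up):
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Turns close() into a flush-only no-op, so closing the ZipOutputStream built
// on top of it finishes the ZIP structure without closing the underlying
// stream; the padding can then be written to the underlying stream afterwards.
class NonClosingOutputStream extends FilterOutputStream {
    NonClosingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void close() throws IOException {
        flush(); // keep the underlying stream open for the padding bytes
    }
}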

Using Hadoop to find files that contain a particular string

I have around 1000 files, and each file is about 1 GB in size. I need to find a string in all of these 1000 files, and also which files contain that particular string. I am working with the Hadoop File System, and all those 1000 files are in HDFS.
All 1000 files are under the real folder, so if I do the following, I get all 1000 files. I need to find which files under the real folder contain a particular string, hello.
bash-3.00$ hadoop fs -ls /technology/dps/real
And this is my data structure in HDFS:
row format delimited
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile
How can I write a MapReduce job for this particular problem, so that I can find which files contain a particular string? Any simple example would be of great help to me.
Update:
Using grep in Unix I can solve the above problem, but it is very, very slow and takes a lot of time to get the actual output:
hadoop fs -ls /technology/dps/real | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done
That is why I was looking for a MapReduce job to solve this kind of problem...
It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):
First, write a mapper that outputs the name of the file being processed if the line being processed contains your search string. I used Python, but any language would work:
#!/usr/bin/env python
import os
import sys
SEARCH_STRING = os.environ["SEARCH_STRING"]
for line in sys.stdin:
    if SEARCH_STRING in line.split():
        print os.environ["map_input_file"]
This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line and check whether the search string matches any of the splits; you could change this to perform a substring search or use regular expressions to check for matches.
Next, run a Hadoop streaming job using this mapper and no reducers:
$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
-D mapred.reduce.tasks=0 \
-input hdfs:///data \
-mapper search.py \
-file search.py \
-output /search_results \
-cmdenv SEARCH_STRING="Apache"
The output will be written in several parts; to obtain a list of matches, you can simply cat the files (provided they aren't too big):
$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml
hdfs://localhost/data/README.txt
...
To get the filename you are currently processing, do:
((FileSplit) context.getInputSplit()).getPath().getName()
As you search your file record by record, emit the above path whenever you see hello (and maybe the line, or anything else you need).
Set the number of reducers to 0; they aren't doing anything here.
Does 'row format delimited' mean that lines are delimited by a newline? If so, TextInputFormat and LineRecordReader work fine here.
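For illustration, here is a minimal sketch of such a mapper using the Java API (the class name and the hard-coded search string are only illustrative); it would be run with TextInputFormat and zero reducers:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits the path of the current split whenever a line contains the search string.
public class GrepFileMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final String SEARCH_STRING = "hello"; // or read it from the Configuration

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        if (value.toString().contains(SEARCH_STRING)) {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(fileName), NullWritable.get());
        }
    }
}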
You can try something like this, though I'm not sure if it's an efficient way to go about it. Let me know if it works - I haven't tested it or anything.
You can use it like this: java SearchFiles /technology/dps/real hello making sure you run it from the appropriate directory of course.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else {
                if (fileContainsString(file, search)) {
                    acc.add(file);
                }
            }
        }
        return acc;
    }

    private static boolean fileContainsString(File file, String search) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                in.close();
                return true;
            }
        }
        in.close();
        return false;
    }
}
