I am working on a Java Maven project, and I have reached a point where I need to determine whether my input from HDFS is a directory of CSV files or a Parquet file. From my understanding, and I could be wrong, HDFS stores Parquet files as directories.
My question is, what might be a good way of determining the difference between these two potential inputs so that I can handle each of them appropriately?
You can use the Hadoop FileSystem API.
If you want to check whether an hdfsPath is a directory or a file, use getFileStatus:
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ...

Path path = new Path(hdfsPath);
FileSystem fs = path.getFileSystem(conf);   // conf is your Hadoop Configuration
FileStatus fileStatus = fs.getFileStatus(path);
if (fileStatus.isFile()) {
    // ... logic for a single file
} else {
    // ... logic for a directory
}
To check whether the directory contains Parquet or CSV files, you can use the listStatus method to list the files under that directory, and for each file you can check its extension to determine its type (.csv or .parquet).
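For example, a minimal sketch, reusing fs and path from the snippet above and assuming the data files keep a .csv or .parquet suffix (adjust the checks to your naming scheme):

FileStatus[] entries = fs.listStatus(path);
boolean hasParquet = false;
boolean hasCsv = false;
for (FileStatus entry : entries) {
    String name = entry.getPath().getName();
    if (name.endsWith(".parquet")) {
        hasParquet = true;
    } else if (name.endsWith(".csv")) {
        hasCsv = true;
    }
}
if (hasParquet) {
    // ... handle the directory as a Parquet dataset
} else if (hasCsv) {
    // ... handle the directory as a directory of CSV files
}

Depending on what wrote the Parquet output, the directory may also contain summary files such as _metadata or _common_metadata, which can serve as another signal.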
I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple mapreduce program where I provide a folder as input to the MapReduce program.
However, I want to use a file as input (the full paths are inside it); this file contains all the other files to be processed by the mapper function.
Below is the file content:
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input path to the MapReduce program as a file, so that it can start processing each file listed inside? Thanks.
In your driver class, you can read in the file, and add each line as a file for input:
// Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);

// Loop through the file names and add each one as an input path
for (String file : files) {
    // This Path is org.apache.hadoop.fs.Path
    FileInputFormat.addInputPath(conf, new Path(file));
}
This is assuming that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on the HDFS.
I strongly recommend that you check that each file exists on HDFS before you add it as input; a sketch covering both points follows.
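A rough sketch, assuming the old mapred API with a JobConf (use Job and the org.apache.hadoop.mapreduce packages instead if that is what your driver uses); the class and method names here are only illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputListHelper {

    // Reads a list of HDFS paths from listFile and adds each one that exists as job input.
    public static void addInputsFromHdfsList(JobConf conf, String listFile) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(listFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;                        // ignore blank lines in the list file
                }
                Path input = new Path(line);
                if (fs.exists(input)) {              // skip entries that are missing on HDFS
                    FileInputFormat.addInputPath(conf, input);
                }
            }
        }
    }
}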
Instead of creating a file with the paths to the other files, you could use globs.
In your example, you could have defined your inputs as -input /tmp/*/file?.txt
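If you build the job in code rather than on the command line, the same glob can be handed straight to FileInputFormat, since Hadoop expands globs in input paths itself. A one-line sketch, assuming a driver with a JobConf named conf as in the answer above:

// Hadoop expands the glob at job submission; no manual listing is needed.
FileInputFormat.addInputPath(conf, new Path("/tmp/*/file?.txt"));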
My Java code lists all code files under a directory of the file system and loads each file one by one:
File[] files = mDir.listFiles();
for (File f : files) {
    System.out.println(f.getPath());
    // load the code file (System.load expects an absolute path as a String)
    System.load(f.getAbsolutePath());
}
The above code logically looks good, but is not suitable for my case.
My case is that I can NOT load them in a loop one by one, because there are dependencies among those code files. I need to load the files in a specific order according to dependencies.
Say, I already know that the following files under the directory mDir should be loaded in the following order:
["dFile", "xFile", "aFile", "hFile"]
and I already have the directory instance mDir.
How can I load the files in the above order efficiently in Java?
If you already know which files you are interested in then just load them in the proper order.
If you have to see which files are available first and then load them in a specific order, use one loop to get the names of the existing files, then process the list by picking the correct files in the correct order; a sketch of this idea follows.
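A rough sketch, assuming the known order from the question and that mDir is the directory instance you already have (the method name is just illustrative):

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

static void loadInOrder(File mDir, String[] loadOrder) {
    // One pass to collect the names that actually exist in the directory.
    String[] names = mDir.list();
    Set<String> available = new HashSet<>(Arrays.asList(names != null ? names : new String[0]));

    // Load in the required dependency order, skipping anything that is missing.
    for (String name : loadOrder) {
        if (available.contains(name)) {
            System.load(new File(mDir, name).getAbsolutePath());
        }
    }
}

This would be called as, for example, loadInOrder(mDir, new String[] {"dFile", "xFile", "aFile", "hFile"});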
I'd suggest just setting the working directory correctly (see Changing the current working directory in Java?) and then doing
for (String fname : fileArray) {
    System.load(new File(fname).getAbsolutePath());   // System.load needs an absolute path as a String
}
(where fileArray is the list of file names) or
for (String fname : fileArray) {
    System.load(new File(mDir, fname).getAbsolutePath());
}
if you're intent on loading from a specific directory.
Other than that, you'd need to divine the dependencies from each file in order, or read the list of files to load from some other source (an array, another file, whatever).
My question is simple: would Java handle a .zip file with about 450,000 files in it? The code I wrote would not load all of the files; only one specific file would be searched for in the zip and read line by line. Each file is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Sorry, I should clarify: uncompressed, each file is about 0.5 MB; zipped, the whole archive is about 250 MB.
The file names in that zip are IDs plus dates (unique). If I have to check a log, I call Java with the ID and date, and Java reads just that one file, never more.
Edit: It works, and it works very well. About 400,000 files in a zip works without any problem, provided you have the memory to zip the files.
Edit 2: It works on Linux file systems without a problem; on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.
Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
    throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    // an empty env map is fine for reading an existing zip
    final Map<String, Object> env = new HashMap<>();
    final URI uri = URI.create("jar:file:" + path.toString());
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
final FileSystem fs = getZipFileSystem("/path/to/the.zip");
final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
StandardCharsets.UTF_8);
) {
// operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
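For instance, inside a try block like the one above (the entry path is the same placeholder):

// Reads the whole entry into memory in one call; fine for small entries.
final List<String> lines = Files.readAllLines(fs.getPath("path/to/entry"), StandardCharsets.UTF_8);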
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that, as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), it gives you an iterator over the entries rather than an array.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
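Putting a few of those pieces together, a small sketch that lists the entries of one "directory" inside the zip and closes the FileSystem when done (the paths are placeholders):

try (
    final FileSystem zipfs = getZipFileSystem("/path/to/the.zip");
    final DirectoryStream<Path> entries = Files.newDirectoryStream(zipfs.getPath("some/dir"));
) {
    for (final Path entry : entries) {
        System.out.println(entry);   // each entry under that "directory" in the zip
    }
}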
I'm not sure that I understand what you're trying to do.
If it's 0.5 MB per file and 450,000 files, you'll need 225 GB. You won't have enough memory to hold all of this in a single in-memory zip, even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.
This is what I'm trying to accomplish:
1) Calculate the checksum of all files to be added to a zip file. Currently using Apache Commons IO as follows:
final Checksum oChecksum = new Adler32();
...
//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log
2) Compress the folder processed as a zip using the Ant zip task.
3) Extract files from the zip one by one to the specified folder (using both Commons IO and Commons Compress for this), and calculate the checksum of each extracted file:
final Checksum oChecksum = new Adler32();
...
ZipFile myZip = new ZipFile("test.zip");
ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
File destFile = new File("/mydir", zipEntry.getName());
destFile.createNewFile();
FileUtils.copyInputStreamToFile(myInputStream, destFile);
long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();
The problem I have is that the value of newChecksum doesn't match the one from the original file. The files' sizes match on disk. The funny thing is that if I run the cksum or md5sum commands on both files directly in a terminal, the results are the same for both files. The mismatch occurs only from Java.
Is this the correct way to approach it, or is there any way to preserve the checksum value after extraction?
I also tried using a CheckedInputStream, but this also gets me different values from Java.
EDIT: This seems related to the Adler32 object used (pre-zip vs. unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.
Are you trying to compute the checksum for all of the files concatenated? If yes, you need to make sure you're reading them in the same order you "checksummed" them in.
If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (if you look at the source) that Adler32 is stateful, which means that during part one you're computing the checksum of the file plus all the preceding ones.
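A small sketch of the second case, staying with the Adler32 and FileUtils from the question (folder here stands for the directory you are archiving in step 1):

final Checksum oChecksum = new Adler32();
for (final File iFile : folder.listFiles()) {
    oChecksum.reset();   // start fresh so each file gets its own checksum
    long lSum = FileUtils.checksum(iFile, oChecksum).getValue();
    // store lSum in the log, keyed by iFile.getName()
}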
The code basically allows the user to input the name of the file that they would like to delete, which is held in the variable 'catName', and then the following code is executed to try to find the path of the file and delete it. However, it doesn't seem to work, as it won't delete the file this way. It does, however, delete the file if I input the whole path.
File file = new File(catName + ".txt");
String path = file.getCanonicalPath();
File filePath = new File(path);
filePath.delete();
If you're deleting files in the same directory that the program is executing in, you don't need to specify a path. But if the file is not in the directory your program is running in, and you're expecting the program to know which directory your file is in, that's not going to happen.
Regarding your code above: the following examples all do the same thing. Let's assume your path is /home/kim/files and that's where you executed the program.
// deletes /home/kim/files/somefile.txt
boolean result = new File("somefile.txt").delete();
// deletes /home/kim/files/somefile.txt
File f = new File("somefile.txt");
boolean result = new File(f.getCanonicalPath()).delete();
// deletes /home/kim/files/somefile.txt
String execPath = System.getProperty("user.dir");
File f = new File(execPath+"/somefile.txt");
f.delete();
In other words, you'll need to specify the path where the deletable files are located. If they are located in different and changing locations, then you'll have to implement a search of your filesystem for the file, which could take a long time if it's a big filesystem. Here's an article on how to implement that.
Depending on what file you want to delete, and where it is stored, chances are that you are expecting Java to magically find the file.
String catName = "test";
File file = new File(catName + ".txt");
If the program is running in, say, C:\TestProg\, then the File object is pointing to a file in the location C:\TestProg\test.txt. Since the File object is more of a helper, it has no issue pointing to a non-existent file (File can be used to create new files).
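A quick way to see where such a File object actually points before calling delete() (just a debugging sketch, reusing catName from above):

File file = new File(catName + ".txt");
System.out.println(file.getAbsolutePath());   // the location Java will try to delete
System.out.println(file.exists());            // false means delete() cannot succeed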
If you are trying to delete a file that is in a specific location, then you need to prepend the folder name to the file path, either canonically, or relative to the execution location.
String catName = "test";
File file = new File("myfiles\\" + catName + ".txt");
Now file is looking in C:\TestProg\myfiles\test.txt.
If you want to find that file anywhere, then you need a recursive search algorithm that will traverse the filesystem, like the sketch below.
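A small sketch of such a search with plain java.io.File (the method name is only illustrative); it returns the first match found under the starting directory, or null:

import java.io.File;

static File findFile(File dir, String name) {
    File[] children = dir.listFiles();
    if (children == null) {
        return null;                      // not a directory, or not readable
    }
    for (File child : children) {
        if (child.isDirectory()) {
            File match = findFile(child, name);
            if (match != null) {
                return match;
            }
        } else if (child.getName().equals(name)) {
            return child;
        }
    }
    return null;                          // nothing found under dir
}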
The piece of code that you provided could be compacted to this:
boolean success = new File(catName + ".txt").delete();
The success variable will be true if the deletion was successful. If you do not provide the full absolute path (e.g. C:\Temp\test for the C:\Temp\test.txt file), your program will assume that the path is relative to its current working directory - typically the directory from where it was launched.
You should either provide an absolute path, or a path relative to the current directory. Your program will not try to find the file to delete anywhere else.