I'm trying to reproduce the Bloom Filtering example from the MapReduce Design Patterns book.
Below I will show only the code of interest:
public static class BloomFilteringMapper extends Mapper<Object, Text, Text, NullWritable>
{
    private BloomFilter filter = new BloomFilter();

    protected void setup( Context context ) throws IOException
    {
        URI[] files = DistributedCache.getCacheFiles( context.getConfiguration() );
        String path = files[0].getPath();
        System.out.println( "Reading Bloom Filter from: " + path );
        DataInputStream strm = new DataInputStream( new FileInputStream( path ) );
        filter.readFields( strm );
        strm.close();
    }
    //...
}
public static void main( String[] args ) throws Exception
{
    Job job = new Job( new Configuration(), "description" );
    URI uri = new URI("hdfs://localhost:9000/user/draxent/comment.bloomfilter");
    DistributedCache.addCacheFile( uri, job.getConfiguration() );
    //...
}
When I try to execute it, I receive the following error:
java.io.FileNotFoundException: /user/draxent/comment.bloomfilter
But executing the command:
bin/hadoop fs -ls
I can see the file:
-rw-r--r-- 1 draxent supergroup 405 2015-11-25 17:12 /user/draxent/comment.bloomfilter
So I am quite sure the problem is on the line:
URI uri = new URI("hdfs://localhost:9000/user/draxent/comment.bloomfilter");
But I have tried several different configurations, like:
"hdfs://user/draxent/comment.bloomfilter"
"/user/draxent/comment.bloomfilter"
"comment.bloomfilter"
None of them works.
I have tried looking at the cfeduke implementation, but I was not able to solve my problem.
Answer comments:
ravindra: URI files[0] contains the string element passed in main;
Manjunath Ballur: yes, you are right. But since the file exists (you can see it from bin/hadoop fs -ls), this means the problem is with the string path passed to FileInputStream. But I'm passing the string to it as always. I checked, the path value is: comment.bloomfilter... so it has to be right.
The Distributed Cache API has been deprecated.
You can achieve the same functionality using the new API. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
In the driver code:
Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());
In the mapper setup method:
Path[] localPaths = context.getLocalCacheFiles();
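For reference, here is a minimal sketch (field and signature assumed to match the question's BloomFilteringMapper) of how the filter could then be loaded in setup(); getLocalCacheFiles() returns the local paths of the files registered with job.addCacheFile(...):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Local copies of the files added via job.addCacheFile(...)
    Path[] localPaths = context.getLocalCacheFiles();
    if (localPaths == null || localPaths.length == 0) {
        throw new IOException("Bloom filter file not found in the distributed cache");
    }
    try (DataInputStream in = new DataInputStream(
            new FileInputStream(localPaths[0].toString()))) {
        filter.readFields(in);
    }
}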
The following should work:
remove the line with URI uri = new URI(... and change the next line to:
DistributedCache.addCacheFile(new Path("/user/draxent/comment.bloomfilter").toUri(), job.getConfiguration());
Related
I am getting a Fortify path manipulation vulnerability for creating a file with the new keyword.
I have tried to sanitize the path before passing it to the File object, but the problem persists.
I tried this link also:
https://www.securecoding.cert.org/confluence/display/java/FIO00-J.+Do+not+operate+on+files+in+shared+directories
public static String sanitizePath(String sUnsanitized) throws URISyntaxException, EncodingException {
    String sSanitized = SAPI.encoder().canonicalize(sUnsanitized);
    return sSanitized;
}
//// the main method code snippet /////
String sSanitizedPath = Utils.sanitizePath(file.getOriginalFilename());
// Fortify scan detects the problem here, in the line below
File filePath = new File(AppInitializer.UPLOAD_LOCATION, sSanitizedPath);
String canonicalPath = filePath.getCanonicalPath();
FileOutputStream fileOutputStream = new FileOutputStream(canonicalPath);
After the sanitizePath call, I thought the scan would no longer pick up the vulnerability, but it did.
This "sUnsanitized" variable comes from user input? Maybe this is your real problem.
Never trust in user input its a number one rule to develpment.
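For what it's worth, Fortify usually stops flagging path manipulation only once the input is validated against the intended base directory (or a whitelist), not merely canonicalized or encoded. A minimal sketch, reusing the names from the question (AppInitializer.UPLOAD_LOCATION and file.getOriginalFilename()):
// Resolve the target inside the upload directory and reject anything that escapes it.
File uploadDir = new File(AppInitializer.UPLOAD_LOCATION);
File candidate = new File(uploadDir, file.getOriginalFilename());
String canonicalPath = candidate.getCanonicalPath();
if (!canonicalPath.startsWith(uploadDir.getCanonicalPath() + File.separator)) {
    throw new SecurityException("Rejected path outside the upload directory: " + canonicalPath);
}
try (FileOutputStream fileOutputStream = new FileOutputStream(canonicalPath)) {
    // write the uploaded content here
}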
I have a problem extracting an archive to the desired directory using Java 10 ProcessBuilder and 7z.exe (18.05) with the command line. The exact same command works as intended when I use Windows CMD, but no longer functions when issued by my JavaFX application using ProcessBuilder:
public static void decompress7ZipEmbedded(File source, File destination) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(
            getSevenZipExecutablePath(),
            EXTRACT_WITH_FULL_PATHS_COMMAND,
            quotifyPath(source.getAbsolutePath()),
            OUTPUT_DIRECTORY_SWITCH + quotifyPath(destination.getAbsolutePath())
    );
    processWithSevenZipEmbedded(pb);
}

private static void processWithSevenZipEmbedded(ProcessBuilder pb) throws IOException, InterruptedException {
    LOG.info("7-zip command issued: " + String.join(" ", pb.command()));
    Process p = pb.start();
    new Thread(new InputConsumer(p.getInputStream())).start();
    System.out.println("Exited with: " + p.waitFor());
}

public static class InputConsumer implements Runnable {
    private InputStream is;

    InputConsumer(InputStream is) {
        this.is = is;
    }

    @Override
    public void run() {
        try {
            int value = -1;
            while ((value = is.read()) != -1) {
                System.out.print((char) value);
            }
        } catch (IOException exp) {
            exp.printStackTrace();
        }
        LOG.debug("Output stream completed");
    }
}

public static String getSevenZipExecutablePath() {
    return FileUtil.quotifyPath(getDirectory() + "7z" + "/" + "7z");
}

public static String quotifyPath(String path) {
    return '"' + path + '"';
}

public class Commands {
    public static final String EXTRACT_COMMAND = "e";
    public static final String EXTRACT_WITH_FULL_PATHS_COMMAND = "x";
    public static final String PACK_COMMAND = "a";
    public static final String DELETE_COMMAND = "d";
    public static final String BENCHMARK_COMMAND = "b";
    public static final String LIST_COMMAND = "l";
}

public class Switches {
    public static final String OUTPUT_DIRECTORY_SWITCH = "-o";
    public static final String RECURSIVE_SWITCH = "-r";
    public static final String ASSUME_YES = "y";
}
The command looks like this:
"C:/Users/blood/java_projects/AppRack/target/classes/7z/7z" x "D:\Pulpit\AppRack Sandbox\test\something\Something 2\Something2.7z" -o"D:\Pulpit\AppRack Sandbox\Something2"
And the output from ProcessBuilder:
7-Zip 18.05 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2018-04-30
Scanning the drive for archives:
1 file, 59177077 bytes (57 MiB)
Extracting archive: D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
--
Path = D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
Type = 7z
Physical Size = 5917Exited with: 0
7077
Headers Size = 373
Method = LZMA2:26 LZMA:20 BCJ2
Solid = +
Blocks = 2
No files to process
Everything is Ok
Files: 0
Size: 0
Compressed: 59177077
It doesn't do ANYTHING. It doesn't create the desired folder, nothing. Using CMD it works like a charm (here is the log from Windows 10 CMD using the same command):
7-Zip 18.05 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2018-04-30
Scanning the drive for archives:
1 file, 59177077 bytes (57 MiB)
Extracting archive: D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
--
Path = D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
Type = 7z
Physical Size = 59177077
Headers Size = 373
Method = LZMA2:26 LZMA:20 BCJ2
Solid = +
Blocks = 2
Everything is Ok
Folders: 1
Files: 5
Size: 64838062
Compressed: 59177077
Do you have any idea what causes the difference here and why it says "No files to process, Everything is Ok" without doing anything? I've already tried creating the folder first using the File class, but that doesn't seem to be the issue, because the results are the same whether the destination folder exists prior to extracting or not.
I've already tried everything that has come to my mind and have run out of ideas at the moment. Please share any suggestions you may have regarding this issue. Thanks a lot.
Thank you very much for your help.
Don’t quote your arguments. Quotes are for the command shell’s benefit. ProcessBuilder is not a command shell; it executes a command directly, so any quotes are seen as part of the argument itself (that is, the file name). Also, pb.inheritIO(); is a better way to see the output of the child process than manually consuming process streams.
Thank you @VGR, that was indeed the issue. After I removed the method that quoted the paths in the command, it works like a charm and extracts the archive without any problem! So the conclusion is that I shouldn't have used quotes in paths while using Java ProcessBuilder.
I've also used pb.inheritIO() and you are right, it is much better and easier to manage it this way.
public static void decompress7ZipEmbedded(File source, File destination) throws IOException {
    ProcessBuilder pb = new ProcessBuilder().inheritIO().command(
            getSevenZipExecutablePath(),
            EXTRACT_WITH_FULL_PATHS_COMMAND,
            source.getAbsolutePath(),
            OUTPUT_DIRECTORY_SWITCH + destination.getAbsolutePath(),
            OVERWRITE_WITHOUT_PROMPT
    );
    processWithSevenZipEmbedded(pb);
}

private static void processWithSevenZipEmbedded(ProcessBuilder pb) throws IOException {
    LOG.info("7-zip command issued: " + String.join(" ", pb.command()));
    pb.start();
}

public class Commands {
    public static final String EXTRACT_WITH_FULL_PATHS_COMMAND = "x";
}

public class Switches {
    public static final String OUTPUT_DIRECTORY_SWITCH = "-o";
    public static final String OVERWRITE_WITHOUT_PROMPT = "-aoa";
}
Double-click on the file 7zip.chm, or start 7-Zip, open the Help, and read the help page Command Line Version - Syntax, whose first line is 7z [...] [...]. It is clearly explained there that the command x must be specified first, next should be the switches like -o (with -- best as the last switch), then the archive file name, and last any further arguments like the names of files/folders to extract. Switches can also be specified after the archive file name, but that is not recommended, although the examples on the help page for -o also have -o at the end.
Thank you @Mofi for the tip. I used the -aoa switch instead of -y and it finally started to work as I wanted, overwriting files without any prompt. I left the rest of the command the way it was, as it works as intended, so it finally looks like this:
C:/Users/blood/java_projects/AppRack/target/classes/7z/7z" x D:\Pulpit\AppRack Sandbox\test\Test\Test 2\Test.7z -oD:\Desktop\AppRack Sandbox\Test 2 -aoa
Thanks a lot for the help once again!
I have a Java web service which executes a batch file. I have finished my controller class. In the controller class, there is a parameter, and the variable is String fileName. I am not sure how to write the code so that fileName carries out its function.
I will show my code and then explain what fileName is supposed to do.
RunBatchFile.java
public ResultFormat runBatch(String fileName) {
    String var = fileName;
    String filePath = ("C:/Users/attsuap1/Desktop" + var);
    try {
        Process p = Runtime.getRuntime().exec(filePath);
        int exitVal = p.waitFor();
        return new ResultFormat(exitVal == 0);
    } catch (Exception e) {
        e.printStackTrace();
        return new ResultFormat(false);
    }
}
BatchFileController.java
private static final String template = "Sum, %s!";

@RequestMapping("/runbatchfileparam/{param}")
public ResultFormat runbatchFile(@PathVariable("param") String fileName) {
    RunBatchFile rbf = new RunBatchFile();
    return rbf.runBatch(fileName);
}
When the user types in http://localhost:8080/runbatchfileparam/test.bat as the URL, the test.bat file must be executed. When the user types in test123.bat instead of test.bat, the test123.bat file must be executed. Therefore I cannot hard-code the String filePath to be "C:/Users/attsuap1/Desktop/test.bat", as that will always execute the test.bat file. I want to allow users to choose the batch file that they want to execute. I think this is simple to achieve, however I am not sure how to do that.
How do I link the String fileName variable so that it carries out what it is supposed to do? I tried some ways, however they do not give the results that I want.
Someone please help me, thank you so much.
I think you have already done the job here.
There are two ways that you can execute the desired batch file.
Don't include ".bat" in the URL param; rather, append ".bat" in your runBatch() method to whatever file name you get in the param, e.g.:
public ResultFormat runBatch(String fileName) {
    String var = fileName;
    String filePath = ("C:/Users/attsuap1/Desktop/" + var + ".bat");
    try {
        Process p = Runtime.getRuntime().exec(filePath);
        // ... the rest stays the same as in your original method
But if you do want to include .bat in your URL, you will have to use the regex mapping below so that Spring does not strip the part of the file name after the dot.
@RequestMapping("/runbatchfileparam/{param:.+}")
public ResultFormat runbatchFile(@PathVariable("param") String fileName) {
    RunBatchFile rbf = new RunBatchFile();
    return rbf.runBatch(fileName);
}
In MapReduce I would extract the input file name as follows:
public void map(WritableComparable<Text> key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();
    System.out.println("File name " + filename);
    System.out.println("Directory and File name " + fileSplit.getPath().toString());
    process(key, value);
}
How can I do something similar with Cascading?
Pipe assembly = new Pipe(SomeFlowFactory.class.getSimpleName());
Function<Object> parseFunc = new SomeParseFunction();
assembly = new Each(assembly, new Fields(LINE), parseFunc);
...

public class SomeParseFunction extends BaseOperation<Object> implements Function<Object> {
    ...
    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
        // how can I get the input file name here ???
    }
}
Thanks,
I don't use Cascading, but I think it should be sufficient to access the context instance using functionCall.getContext(); to obtain the filename you can use:
String filename= ((FileSplit)context.getInputSplit()).getPath().getName();
However, it seems that Cascading uses the old API; if the above doesn't work, you can try:
Object name = flowProcess.getProperty( "map.input.file" );
Thanks Engineiro for sharing the answer. However, when invoking the hfp.getReporter().getInputSplit() method, I got a MultiInputSplit type which can't be cast to FileSplit directly in Cascading 2.5.3. After diving into the related Cascading APIs, I found a way and retrieved the input file names successfully. Therefore, I would like to share this to supplement Engineiro's answer. Please see the following code.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
String fileName = fs.getPath().getName();
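For completeness, a minimal sketch (Cascading 2.x on the Hadoop platform assumed; class name taken from the question) of where that lookup could sit inside the function's operate() method:
public class SomeParseFunction extends BaseOperation<Object> implements Function<Object> {

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
        // Reach the Hadoop Reporter through the Hadoop-specific flow process.
        HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
        MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
        FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
        String fileName = fs.getPath().getName();

        // ... parse the current tuple and emit results, e.g.:
        // functionCall.getOutputCollector().add(new Tuple(fileName, ...));
    }
}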
You would do this by getting the reporter within the buffer class, from the flowProcess argument provided in the buffer's operate call.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
FileSplit fileSplit = (FileSplit) hfp.getReporter().getInputSplit();
// ... the rest of your code
I have been testing all possible variations and permutations, but I can't seem to construct a FileSystemProvider with the zip/jar scheme for a path (URI) that contains spaces. There is a very simplistic test case available at Oracle Docs. I took the liberty of modifying the example and just adding spaces to the URI, and it stops working. Snippet below:
import java.util.*;
import java.net.URI;
import java.nio.file.*;

public class Test {
    public static void main(String[] args) throws Throwable {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        URI uri = new URI("jar:file:/c:/dir%20with%20spaces/zipfstest.zip");
        Path dir = Paths.get("C:\\dir with spaces");
        if (Files.exists(dir) && Files.isDirectory(dir)) {
            try (FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {}
        }
    }
}
When I execute this code (Windows, JDK7u2, both x32 and x64), I get the following exception:
java.lang.IllegalArgumentException: Illegal character in path at index 12: file:/c:/dir with spaces/zipfstest.zip
at com.sun.nio.zipfs.ZipFileSystemProvider.uriToPath(ZipFileSystemProvider.java:87)
at com.sun.nio.zipfs.ZipFileSystemProvider.newFileSystem(ZipFileSystemProvider.java:107)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:322)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:272)
If I use + instead of %20 as the space escape character, a different exception is thrown:
java.nio.file.NoSuchFileException: c:\dir+with+spaces\zipfstest.zip
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:229)
at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:430)
at java.nio.file.Files.newOutputStream(Files.java:170)
at com.sun.nio.zipfs.ZipFileSystem.<init>(ZipFileSystem.java:116)
at com.sun.nio.zipfs.ZipFileSystemProvider.newFileSystem(ZipFileSystemProvider.java:117)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:322)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:272)
I might be missing something very obvious, but would this indicate a problem with the supplied ZIP/JAR file system provider?
EDIT:
Another use case, based on a File object, as requested in the comments:
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.nio.file.FileSystems;
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Test {
    public static void main(String[] args) throws UnsupportedEncodingException {
        try {
            File zip = new File("C:\\dir with spaces\\file.zip");
            URI uri = URI.create("jar:" + zip.toURI().toURL());
            Map<String, String> env = new HashMap<>();
            env.put("create", "true");
            if (zip.getParentFile().exists() && zip.getParentFile().isDirectory()) {
                FileSystems.newFileSystem(uri, env);
            }
        } catch (Exception ex) {
            Logger.getAnonymousLogger().log(Level.SEVERE, null, ex);
            System.out.println();
        }
    }
}
The exception is thrown again as:
java.lang.IllegalArgumentException: Illegal character in path at index 12: file:/C:/dir with spaces/file.zip
at com.sun.nio.zipfs.ZipFileSystemProvider.uriToPath(ZipFileSystemProvider.java:87)
at com.sun.nio.zipfs.ZipFileSystemProvider.newFileSystem(ZipFileSystemProvider.java:107)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:322)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:272)
Actually further analysis does seem to indicate there is a problem with the ZipFileSystemProvider. The uriToPath(URI uri) method contained within the class executes the following snippet:
String spec = uri.getSchemeSpecificPart();
int sep = spec.indexOf("!/");
if (sep != -1)
    spec = spec.substring(0, sep);
return Paths.get(new URI(spec)).toAbsolutePath();
From the JavaDocs of URI.getSchemeSpecificPart() we can see the following:
The string returned by this method is equal to that returned by the getRawSchemeSpecificPart method except that all sequences of escaped octets are decoded.
This same string is then passed back as an argument into the new URI() constructor. Since any escaped octets are de-escaped by getSchemeSpecificPart(), if the original URI contained any escape characters, they will not be propagated to the new URI - hence the exception.
A potential workaround: loop through all the available filesystem providers and get a reference to the one whose scheme equals "jar", then use that provider to create a new filesystem based on the path only.
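A minimal sketch of that workaround (the path is the one from the question, and the wrapping class name is just for illustration); FileSystemProvider.installedProviders() lists the providers, and the provider's newFileSystem(Path, Map) overload avoids building a URI altogether:
import java.nio.file.*;
import java.nio.file.spi.FileSystemProvider;
import java.util.HashMap;
import java.util.Map;

public class ZipProviderWorkaround {
    public static void main(String[] args) throws Exception {
        Path zipPath = Paths.get("C:\\dir with spaces\\zipfstest.zip");
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if ("jar".equalsIgnoreCase(provider.getScheme())) {
                // Create the zip file system directly from the Path, no URI involved.
                try (FileSystem zipfs = provider.newFileSystem(zipPath, env)) {
                    // work with zipfs here
                }
                break;
            }
        }
    }
}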
This is a bug in Java 7 and it has been marked as fixed in Java 8 (see Bug ID 7156873). The fix should also be backported to Java 7, but at the moment it's not determined which update will have it (see Bug ID 8001178).
A jar: URI should have the escaped zip URI in its scheme-specific part, so your jar: URI is simply wrong: it should rightly be double-escaped, as the jar: scheme is composed of the host URI, !/ and the local path.
However, this escaping is only implied and not expressed by the minimal URL "specification" in JarURLConnection. I agree, however, with the bug raised against the JRE that it should still accept single-escaped URIs, although that could lead to some strange edge cases not being supported.
As pointed out by tornike and evermean in another answer, the easiest option is to do FileSystems.newFileSystem(path, null), but this does not work when you want to pass an env with, say, "create"=true.
Instead, create the jar: URI using the component-based constructor:
URI jar = new URI("jar", path.toUri().toString(), null);
This would properly encode the scheme-specific part.
As a JUnit test, which also confirms that this is the escaping used when opening from a Path:
@Test
public void jarWithSpaces() throws Exception {
    Path path = Files.createTempFile("with several spaces", ".zip");
    Files.delete(path);

    // Will fail with FileSystemNotFoundException without env:
    //FileSystems.newFileSystem(path, null);

    // Neither does this work, as it does not double-escape:
    // URI jar = URI.create("jar:" + path.toUri().toASCIIString());

    URI jar = new URI("jar", path.toUri().toString(), null);
    assertTrue(jar.toASCIIString().contains("with%2520several%2520spaces"));

    Map<String, Object> env = new HashMap<>();
    env.put("create", "true");
    try (FileSystem fs = FileSystems.newFileSystem(jar, env)) {
        URI root = fs.getPath("/").toUri();
        assertTrue(root.toString().contains("with%2520several%2520spaces"));
    }

    // Reopen from the now-existing Path to check that the URI is
    // escaped in the same way
    try (FileSystem fs = FileSystems.newFileSystem(path, null)) {
        URI root = fs.getPath("/").toUri();
        //System.out.println(root.toASCIIString());
        assertTrue(root.toString().contains("with%2520several%2520spaces"));
    }
}
(I did a similar test with "with\u2301unicode\u263bhere" to check that I did not need to use .toASCIIString())
There are two methods to create a filesystem:
FileSystem fs = FileSystems.newFileSystem(uri, env);
FileSystem fs = FileSystems.newFileSystem(zipfile, null);
When there is a space in the file name, the first method works together with the above solution for creating the URI. It also works if you use the second method, which doesn't take a URI as an argument.
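A minimal sketch of that second, Path-based overload (the Java 7 signature is FileSystems.newFileSystem(Path, ClassLoader)); note that it cannot take an env map, so the zip file must already exist:
Path zipfile = Paths.get("C:\\dir with spaces\\zipfstest.zip");
// No URI is built here, so spaces in the file name are not an issue.
try (FileSystem fs = FileSystems.newFileSystem(zipfile, null)) {
    System.out.println(fs.getPath("/").toUri());
}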