How to get the input file name in Hadoop Cascading (Java)

In MapReduce I would extract the input file name as follows:
public void map(WritableComparable<Text> key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();
    System.out.println("File name " + filename);
    System.out.println("Directory and file name " + fileSplit.getPath().toString());
    process(key, value);
}
How can I do something similar with Cascading?
Pipe assembly = new Pipe(SomeFlowFactory.class.getSimpleName());
Function<Object> parseFunc = new SomeParseFunction();
assembly = new Each(assembly, new Fields(LINE), parseFunc);
...
public class SomeParseFunction extends BaseOperation<Object> implements Function<Object> {
    ...
    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
        // how can I get the input file name here ???
    }
}
Thanks,

I don't use Cascading, but I think it should be sufficient to access the context instance using functionCall.getContext(); to obtain the filename you can use:
String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
However, it seems that Cascading uses the old API. If the above doesn't work, you can try:
Object name = flowProcess.getProperty( "map.input.file" );

Thanks to Engineiro for sharing the answer. However, when invoking the hfp.getReporter().getInputSplit() method, I got a MultiInputSplit type, which can't be cast to FileSplit directly in Cascading 2.5.3. After diving into the related Cascading APIs, I found a way to retrieve the input file names successfully. Therefore, I would like to share this to supplement Engineiro's answer. Please see the following code.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
String fileName = fs.getPath().getName();
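Putting it together, here is a minimal sketch of a complete function built around those four lines. The class name, declared fields, and output tuple layout are illustrative, and the import paths assume Cascading 2.5.x on the Hadoop (mapred) platform, so they may differ in other versions:
import cascading.flow.FlowProcess;
import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tap.hadoop.io.MultiInputSplit;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import org.apache.hadoop.mapred.FileSplit;

// Hypothetical function that emits each input line together with the name of
// the file it came from.
public class FileNameTaggingFunction extends BaseOperation<Object> implements Function<Object> {
    public FileNameTaggingFunction() {
        super(1, new Fields("line", "filename"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
        // Reach down to the Hadoop reporter to find the split being processed.
        HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
        MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
        FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
        String fileName = fs.getPath().getName();
        // Emit the original line plus the file name.
        String line = functionCall.getArguments().getTuple().getString(0);
        functionCall.getOutputCollector().add(new Tuple(line, fileName));
    }
}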

You would do this by getting the reporter within the buffer class, from the flowProcess argument provided in the buffer's operate call.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
FileSplit fileSplit = (FileSplit) hfp.getReporter().getInputSplit();
// ... the rest of your code

Related

How to fix Fortify Path Manipulation (Input Validation and Representation, Data Flow) vulnerability

I am getting a Fortify path manipulation vulnerability for creating a file with the new keyword.
I have tried to sanitize the path before passing it to the File object, but the problem persists.
Tried this link also:
https://www.securecoding.cert.org/confluence/display/java/FIO00-J.+Do+not+operate+on+files+in+shared+directories
public static String sanitizePath(String sUnsanitized) throws URISyntaxException, EncodingException {
    String sSanitized = SAPI.encoder().canonicalize(sUnsanitized);
    return sSanitized;
}
//// the main method code snippet /////
String sSanitizedPath = Utils.sanitizePath(file.getOriginalFilename());
// -- the Fortify scan detects the problem here, on the line below --
File filePath = new File(AppInitializer.UPLOAD_LOCATION, sSanitizedPath);
String canonicalPath = filePath.getCanonicalPath();
FileOutputStream fileOutputStream = new FileOutputStream(canonicalPath);
After sanitizePath, I thought the scan would no longer flag the vulnerability, but it did.
Does this "sUnsanitized" variable come from user input? Maybe that is your real problem.
Never trust user input; that's the number one rule of development.
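For what it's worth, Fortify generally wants to see explicit validation on the data flow, not just encoding/canonicalization. Below is a minimal sketch of a whitelist-plus-containment check of the kind that usually satisfies the rule; SafePaths and safeResolve are hypothetical names, not from the original post:
import java.io.File;
import java.io.IOException;

public final class SafePaths {
    // Whitelist the file name, then verify the canonical path stays inside
    // the intended base directory before handing the File back to the caller.
    public static File safeResolve(File baseDir, String userFileName) throws IOException {
        if (userFileName == null || !userFileName.matches("[A-Za-z0-9._-]+")) {
            throw new IOException("Illegal file name: " + userFileName);
        }
        File candidate = new File(baseDir, userFileName);
        String canonicalBase = baseDir.getCanonicalPath() + File.separator;
        if (!candidate.getCanonicalPath().startsWith(canonicalBase)) {
            throw new IOException("Path escapes the base directory");
        }
        return candidate;
    }
}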

User Defined Function in Pig Latin

I am using Java to create a User Defined Function (UDF) for Pig Latin in a Hadoop environment. I want to create multiple output files. I have tried to create a Java program to output these CSV files as below:
public String exec(Tuple input)
        throws IOException {
    if (input.equals("age")) {
        outputFile = new FileWriter("C:\\UDF\\output_age.csv");
    } else {
        outputFile = new FileWriter("C:\\UDF\\output_general.csv");
    }
}
But this doesn't work. Is there any alternative method to do that, whether by Java or by Pig Latin itself?
When writing UDFs, you need to take care of the data types. Here the exec method takes a Tuple as input; to read the tuple's values you need the tuple.get(0) notation, i.e.:
public String exec(Tuple input)
        throws IOException {
    String inputAge = input.get(0).toString();
    if (inputAge.equals("age")) {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_age.csv");
    } else {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_general.csv");
    }
}
You can refer to Writing Java UDF in Pig for reference.
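For reference, here is a minimal self-contained sketch of the same idea (the class name is illustrative and the hard-coded C:\UDF paths come from the question; note that on a real cluster the UDF runs on task nodes, so writing to a local path like this only makes sense in local mode):
import java.io.FileWriter;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class AgeRouterUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // Read the first field of the tuple as a string.
        String field = input.get(0).toString();
        String target = "age".equals(field) ? "C:\\UDF\\output_age.csv"
                                            : "C:\\UDF\\output_general.csv";
        // Append the value to the chosen file and return the path used.
        try (FileWriter outputFile = new FileWriter(target, true)) {
            outputFile.write(field + System.lineSeparator());
        }
        return target;
    }
}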

Hadoop MapReduce DistributedCache usage

I'm trying to reproduce the Bloom filtering example from the MapReduce Design Patterns book.
In the following, I will show only the code of interest:
public static class BloomFilteringMapper extends Mapper<Object, Text, Text, NullWritable>
{
    private BloomFilter filter = new BloomFilter();

    protected void setup( Context context ) throws IOException
    {
        URI[] files = DistributedCache.getCacheFiles( context.getConfiguration() );
        String path = files[0].getPath();
        System.out.println( "Reading Bloom Filter from: " + path );
        DataInputStream strm = new DataInputStream( new FileInputStream( path ) );
        filter.readFields( strm );
        strm.close();
    }
    //...
}
public static void main( String[] args ) throws Exception
{
    Job job = new Job( new Configuration(), "description" );
    URI uri = new URI("hdfs://localhost:9000/user/draxent/comment.bloomfilter");
    DistributedCache.addCacheFile( uri, job.getConfiguration() );
    //...
}
When I try to execute it, I receive the following error:
java.io.FileNotFoundException: /user/draxent/comment.bloomfilter
But executing the command:
bin/hadoop fs -ls
I can see the file:
-rw-r--r-- 1 draxent supergroup 405 2015-11-25 17:12 /user/draxent/comment.bloomfilter
So I am quite sure the problem is on the line:
URI uri = new URI("hdfs://localhost:9000/user/draxent/comment.bloomfilter");
But I have tried several different configurations, like:
"hdfs://user/draxent/comment.bloomfilter"
"/user/draxent/comment.bloomfilter"
"comment.bloomfilter"
And none of them works.
I have tried looking at the cfeduke implementation, but I was not able to solve my problem.
Answer comments:
ravindra: URI files[0] contains the string element passed in main;
Manjunath Ballur: Yes, you are right. But since the file exists (you can see it with bin/hadoop fs -ls), the problem must be with the string path passed to FileInputStream. But I'm passing the string to it as always. I checked; the path value is comment.bloomfilter... so it has to be right.
The DistributedCache API has been deprecated.
You can get the same functionality using the new API. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
In the driver code:
Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());
In the mapper setup method:
Path[] localPaths = context.getLocalCacheFiles();
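For completeness, here is a hedged sketch of how the mapper's setup() from the question might read the cached file with the new API. It assumes Hadoop 2.x behavior, where files added via job.addCacheFile() are localized and symlinked into the task's working directory under their base name:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // getCacheFiles() returns the URIs that were passed to job.addCacheFile(...).
    URI[] cacheFiles = context.getCacheFiles();
    String baseName = new Path(cacheFiles[0].getPath()).getName();
    // The localized copy lives in the task's working directory, so a plain
    // FileInputStream on the base name is enough.
    DataInputStream strm = new DataInputStream(new FileInputStream(baseName));
    try {
        filter.readFields(strm);
    } finally {
        strm.close();
    }
}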
The following should work:
remove the line with URI uri = new URI(... and change the next line to:
DistributedCache.addCacheFile(new Path("/user/draxent/comment.bloomfilter").toUri(), job.getConfiguration());

Read PDVInputStream dicomObject information on onCStoreRQ association request

I am trying to read (and then store to a 3rd-party local db) certain DICOM object tags "during" an incoming association request.
For accepting association requests and storing my DICOM files locally, I have used a modified version of the dcmrcv() tool. More specifically, I have overridden the onCStoreRQ method like:
@Override
protected void onCStoreRQ(Association association, int pcid, DicomObject dcmReqObj,
        PDVInputStream dataStream, String transferSyntaxUID,
        DicomObject dcmRspObj)
        throws DicomServiceException, IOException {
    final String classUID = dcmReqObj.getString(Tag.AffectedSOPClassUID);
    final String instanceUID = dcmReqObj.getString(Tag.AffectedSOPInstanceUID);
    config = new GlobalConfig();
    final File associationDir = config.getAssocDirFile();
    final String prefixedFileName = instanceUID;
    final String dicomFileBaseName = prefixedFileName + DICOM_FILE_EXTENSION;
    File dicomFile = new File(associationDir, dicomFileBaseName);
    assert !dicomFile.exists();
    final BasicDicomObject fileMetaDcmObj = new BasicDicomObject();
    fileMetaDcmObj.initFileMetaInformation(classUID, instanceUID, transferSyntaxUID);
    final DicomOutputStream outStream = new DicomOutputStream(new BufferedOutputStream(new FileOutputStream(dicomFile), 600000));
    // I would like somewhere here to extract some tags from the incoming DICOM object.
    // Trying to do it through dataStream corrupts my DICOM files!
    //System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
    try {
        outStream.writeFileMetaInformation(fileMetaDcmObj);
        dataStream.copyTo(outStream);
    } finally {
        outStream.close();
    }
    dicomFile.renameTo(new File(associationDir, dicomFileBaseName));
    System.out.println("DICOM file name: " + dicomFile.getName());
}
@Override
public void associationAccepted(final AssociationAcceptEvent associationAcceptEvent) {
    ....
@Override
public void associationClosed(final AssociationCloseEvent associationCloseEvent) {
    ...
}
I would like somewhere in this code to intercept a method which will read dataStream, parse specific tags, and store them to a local database.
However, wherever I try to put a piece of code that manipulates (just reads, for a start) dataStream, my DICOM files get corrupted!
PDVInputStream implements java.io.InputStream ....
Even if I just put a:
System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
before copying dataStream to outStream, my DICOM files get corrupted (1KB in size) ...
How am I supposed to use dataStream in a C-STORE-RQ association request to extract some information?
I hope my question is clear ...
The PDVInputStream is probably a PDUDecoder class. You'll have to reset the position when using the input stream multiple times.
Maybe a better solution would be to store the DICOM object in memory and use that for both purposes. Something akin to:
DicomObject dcmobj = dataStream.readDataset();
String whatYouWant = dcmobj.getString( Tag.whatever );
dcmobj.initFileMetaInformation( transferSyntaxUID );
outStream.writeDicomFile( dcmobj );
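A hedged sketch of how that in-memory approach might slot into the onCStoreRQ above (dcm4che 2.x style, matching the question's code; storeToLocalDb is a hypothetical helper for the 3rd-party database, not part of dcm4che):
// Read the whole dataset once, instead of streaming it straight to disk.
DicomObject dcmobj = dataStream.readDataset();

// Extract whatever tags you need and hand them to your own persistence code.
String studyInstanceUID = dcmobj.getString(Tag.StudyInstanceUID);
storeToLocalDb(studyInstanceUID); // hypothetical helper

// Rebuild the file meta information, then write the object out as a DICOM file.
dcmobj.initFileMetaInformation(transferSyntaxUID);
DicomOutputStream outStream = new DicomOutputStream(
        new BufferedOutputStream(new FileOutputStream(dicomFile), 600000));
try {
    outStream.writeDicomFile(dcmobj);
} finally {
    outStream.close();
}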

Java: How do I change a configuration file value in Java easily?

I have a config file, named config.txt, that looks like this.
IP=192.168.1.145
PORT=10022
URL=http://www.stackoverflow.com
I want to change some values in the config file from Java, say the port to 10045. How can I achieve this easily?
IP=192.168.1.145
PORT=10045
URL=http://www.stackoverflow.com
In my attempt, I need to write lots of code to read every line, find the PORT, delete the original 10022, and then rewrite 10045. My code is clumsy and hard to read. Is there any convenient way in Java?
Thanks a lot!
If you want something short you can use this.
public static void changeProperty(String filename, String key, String value) throws IOException {
    Properties prop = new Properties();
    prop.load(new FileInputStream(filename));
    prop.setProperty(key, value);
    prop.store(new FileOutputStream(filename), null);
}
Unfortunately it doesn't preserve the order of the fields or any comments.
If you want to preserve order, reading a line at a time isn't so bad.
This untested code would keep comments, blank lines and order. It won't handle multi-line values.
public static void changeProperty(String filename, String key, String value) throws IOException {
    final File tmpFile = new File(filename + ".tmp");
    final File file = new File(filename);
    PrintWriter pw = new PrintWriter(tmpFile);
    BufferedReader br = new BufferedReader(new FileReader(file));
    boolean found = false;
    final String toAdd = key + '=' + value;
    for (String line; (line = br.readLine()) != null; ) {
        if (line.startsWith(key + '=')) {
            line = toAdd;
            found = true;
        }
        pw.println(line);
    }
    if (!found)
        pw.println(toAdd);
    br.close();
    pw.close();
    tmpFile.renameTo(file);
}
My suggestion would be to read the entire config file into memory (maybe into a list of (attribute:value) pair objects), do whatever processing you need to do (and consequently make any changes), then overwrite the original file with all the changes you have made.
For example, you could read the config file you have provided line by line, use String.split("=") to separate the attribute:value pairs (making sure to name each pair read accordingly), then make whatever changes you need and iterate over the pairs you have read in (and possibly modified), writing them back out to the file, as sketched below.
Of course, this approach would work best if you had a relatively small number of lines in your config file, that you can definitely know the format for.
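A minimal sketch of that read/modify/rewrite approach, assuming every meaningful line is a KEY=VALUE pair (the file name and key here are illustrative; line order, blank lines, and comment lines pass through untouched):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ConfigRewriter {
    public static void setValue(String filename, String key, String value) throws IOException {
        Path path = Paths.get(filename);
        List<String> out = new ArrayList<>();
        for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
            String[] pair = line.split("=", 2);       // attribute:value pair
            if (pair.length == 2 && pair[0].equals(key)) {
                out.add(key + "=" + value);           // replace the value in place
            } else {
                out.add(line);                        // keep everything else as-is
            }
        }
        Files.write(path, out, StandardCharsets.UTF_8);
    }
}

// Usage: ConfigRewriter.setValue("config.txt", "PORT", "10045");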
This code works for me.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;

public void setProperties(String key, String value) throws IOException {
    Properties prop = new Properties();
    FileInputStream ip;
    try {
        ip = new FileInputStream("config.txt");
        prop.load(ip);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    prop.setProperty(key, value);
    PrintWriter pw = new PrintWriter("config.txt");
    prop.store(pw, null);
}
Use the Properties class to load/save the configuration. Then simply set the value and save it again.
Properties p = new Properties();
p.load(...);
p.put("key", "value");
p.store(...);
It's easy and straightforward.
As an aside, if your application is a single application that does not need to scale across multiple computers, do not bother using a database to store config; it is utter overkill. However, if your application needs real-time config changes and needs to scale, Redis works pretty well for distributing config and handling the synchronization for you. I have used it for this purpose with great success.
Consider using java.util.Properties and its load() and store() methods.
But remember that this will not preserve comments or extra line breaks in the file.
Also, certain chars need to be escaped.
If you are open to using third-party libraries, explore http://commons.apache.org/configuration/. It supports configurations in multiple formats. Comments will be preserved as well. (Except for a minor bug -- apache-commons-config PropertiesConfiguration: comments after the last property are lost.)
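A hedged sketch using Commons Configuration 1.x's PropertiesConfiguration (the class the link above documents); unlike java.util.Properties, it keeps comments and field order when rewriting the file:
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

public class CommonsConfigExample {
    public static void main(String[] args) throws ConfigurationException {
        // Load, change one value in place, and save back to the same file.
        PropertiesConfiguration config = new PropertiesConfiguration("config.txt");
        config.setProperty("PORT", "10045");
        config.save();
    }
}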
