User Defined Function in Pig Latin


I am using Java to create a User Defined Function (UDF) for Pig Latin in a Hadoop environment. I want to create multiple output files. I have tried to write a Java method that outputs these CSV files, as below:
public String exec(Tuple input)
throws IOException {
if(input.equals("age")){
outputFile = new FileWriter("C:\\UDF\\output_age.csv");
}else{
outputFile = new FileWriter("C:\\UDF\\output_general.csv");
}
}
But this doesn't work. Is there any alternative method to do that, whether by Java or by Pig Latin itself?

When writing UDFs, you need to take care of the data types. Here the exec method takes a Tuple as input, so comparing the tuple itself to a string never matches. To read a field of the tuple, use the input.get(0) notation, i.e.:
public String exec(Tuple input)
        throws IOException {
    // Read the first field of the tuple rather than comparing the tuple itself.
    String inputAge = input.get(0).toString();
    if (inputAge.equals("age")) {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_age.csv");
    } else {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_general.csv");
    }
    return inputAge; // placeholder: exec is declared to return a String
}
See Writing Java UDF in Pig for reference.

Related

Writing Strings to a binary file java

I have a list of objects that has some simple String properties. I want to be able to save those strings to binary so that when you open the file outside the program, you only see 1's and 0's.
I have managed to use FileOutputStream and saved the strings; however, I can't get it to write binary. The file reads as clean, readable text. I have tried getBytes().
What would be the best approach for this? Keep in mind that I want to be able to read the file later and construct back the objects. Would it be better to use Serializable and save a list of objects?
Here is my FileWriter:
NB: The toString() is custom and returns a String with linebreaks for every property.
public class FileWriter {
public void write(String fileName, Savable objectToSave ) throws IOException {
File fileToSave = new File(fileName);
String stringToSave = objectToSave.toString();
byte[] bytesToSave = stringToSave.getBytes(StandardCharsets.UTF_8) ;
try (
OutputStream outputStream = new FileOutputStream(fileToSave);
) {
outputStream.write(bytesToSave);
} catch (IOException e) {
throw new IOException("error");
}
}
}

If your goal is simply serializing, implementing Serializable and writing them would work, but your string is still going to be readable. You can encrypt the stream, but anyone decompiling your code can still devise a way to read the values.
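As a rough illustration of the serialization route, here is a minimal sketch that saves a list of strings with ObjectOutputStream and reads it back. The class and method names (ObjectListFile, save, load) are hypothetical and not part of the question's code. As noted above, the text of each string still appears in the file in readable form.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class ObjectListFile {

    // Serializes the list; ArrayList and String both implement Serializable.
    static void save(Path file, List<String> values) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(values));
        }
    }

    // Reads the list back; the cast is unchecked because readObject returns Object.
    @SuppressWarnings("unchecked")
    static List<String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<String>) in.readObject();
        }
    }
}
```

The same pattern works for a list of your own objects, provided their class implements Serializable.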

how to write to a new line of CSV file

I am trying to benchmark sorting methods. My writeCSV(String) method writes over the first line every time I call it. Here is my code:
public static void main(String[] args) throws Exception{
writeCSV("data size (100 times),bubble,insertion,merge,quick");
sortRandomSet(20);
}
public static void sortRandomSet(int setSize) throws Exception
{
.
.
.
writeCSV(setSize+","+bTime+","+mTime+","+iTime+","+qTime);
}
/*******************************************************************************
* writeCSV(course[]) method
* Last edited by Steve Pesce 3/19/2014 for CSci 112
* Writes String to CSV
*
*/
public static void writeCSV(String line) throws Exception {
//create new File object
java.io.File courseCSV = new java.io.File("benchmark.csv");
//create PrintWriter object on new File object
java.io.PrintWriter outfile = new java.io.PrintWriter(courseCSV);
outfile.write(line + "\n");
outfile.close();
}//end writeCSV(String)
I want writeCSV to start on a new line every time it is called, is this possible to do?

You could open the file in append mode. This adds your input to the end of the file instead of overwriting it.

Use java.io.FileWriter instead:
java.io.FileWriter outfile = new java.io.FileWriter("benchmark.csv", true); //true = append
outfile.write(line+"\n");
outfile.close(); //flush and release the file
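For a complete method, a minimal sketch could look like the following. The class name CsvAppender and the fileName parameter are assumptions added for illustration; the question's writeCSV hard-codes "benchmark.csv".

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper for illustration only.
final class CsvAppender {

    // Appends one line to the file, creating the file if it does not exist yet.
    static void appendLine(String fileName, String line) throws IOException {
        // The second FileWriter argument (true) selects append mode.
        // try-with-resources flushes and closes the writer even if writing fails.
        try (PrintWriter out = new PrintWriter(new FileWriter(fileName, true))) {
            out.println(line);
        }
    }
}
```

Because the file is reopened in append mode on each call, every call adds a new line after the existing content.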

Yes. Have you tried opening the file in append mode? In Java that is the boolean append argument of FileWriter (or FileOutputStream), not an "a" mode string as in some other languages.
To write at a specific position rather than at the end of the file, try RandomAccessFile.

You should use append mode for this.

Insert a string in the middle of text file without replacing [duplicate]

Suppose I have a text file named Sample.txt.
I need advice on how to achieve this:
Sample.txt before running the program:
ABCD
While the program is running, the user inputs a string to be inserted at the middle,
for example: XXX
Sample.txt after running the program:
ABXXXCD

Basically you've got to rewrite the file, at least from the middle. This isn't a matter of Java - it's a matter of what file systems support.
Typically the way to do this is to open both the input file and an output file, then:
Copy the first part from the input file to the output file
Write the middle section to the output file
Copy the remainder of the input file to the output file
Optionally perform file renaming if you want the new file to have the same eventual name as the original file
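The steps above can be sketched as follows. This is a minimal sketch, assuming the insertion point is given as a byte offset and the temporary file lives in the same directory as the original; the names FileInserter and insertAt are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only.
final class FileInserter {

    // Inserts text at the given byte offset by streaming through a temporary file.
    static void insertAt(Path file, long offset, String text) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "insert", ".tmp");
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = Files.newOutputStream(temp)) {
            byte[] buffer = new byte[8192];
            long remaining = offset;
            // Step 1: copy the first part of the original file.
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read == -1) {
                    break; // the offset lies beyond the end of the file
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
            // Step 2: write the inserted section.
            out.write(text.getBytes(StandardCharsets.UTF_8));
            // Step 3: copy the remainder of the original file.
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        // Step 4: give the new file the original name.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For the example in the question, insertAt(path, 2, "XXX") on a file containing ABCD leaves ABXXXCD. Only one buffer of data is held in memory at a time, so this approach also suits large files.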

The basic idea is to read the file contents into memory, say at program start, manipulate the string as desired, then write the entire thing back to the file.
So you would open and read in Sample.txt. In memory you now have a string = "ABCD".
During program execution, accept the user input XXX and insert it into your string with your favorite string manipulation method. Now string = "ABXXXCD".
Finally, overwrite Sample.txt with your updated string and close it.
If you are worried about corruption, you might save the result to a secondary file, verify its contents, delete the original, and rename the new file to the original name.
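A minimal sketch of this read, modify, and write-back approach, assuming the file is small enough to fit in memory and is encoded as UTF-8. The names InMemoryInserter and insertInMiddle are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class InMemoryInserter {

    // Reads the whole file, inserts the text at the middle character, and overwrites the file.
    static void insertInMiddle(Path file, String text) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        // The midpoint is counted in chars, so it could split a surrogate pair
        // in text that contains characters outside the Basic Multilingual Plane.
        String updated = new StringBuilder(content)
                .insert(content.length() / 2, text)
                .toString();
        Files.write(file, updated.getBytes(StandardCharsets.UTF_8));
    }
}
```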

Actually, I have done something like what you want. Try this code; it's not complete (getFirstPart and writeNewFileContent are helper methods not shown here), but it should give you a clear idea:
public void addString(String fileContent, String insertData) throws IOException {
    String firstPart = getFirstPart(fileContent);
    // Quote the text so that regex metacharacters in it are matched literally.
    Pattern p = Pattern.compile(Pattern.quote(firstPart));
    Matcher matcher = p.matcher(fileContent);
    if (matcher.find()) {
        // Everything after the first part becomes the second part.
        String secondPart = fileContent.substring(matcher.end());
        StringBuilder newFileContent = new StringBuilder();
        newFileContent.append(firstPart);
        newFileContent.append(insertData);
        newFileContent.append(secondPart);
        writeNewFileContent(newFileContent.toString());
    }
}

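Normally a new file would be created, but the following probably suffices (for non-gigabyte files). Mind the explicit UTF-8 encoding, which you can omit to use the operating system's default encoding. The midpoint is a byte offset, so in a file with multi-byte characters it can fall inside a character.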
public static void insertInMidstOfFile(File file, String textToInsert)
throws IOException {
if (!file.exists()) {
throw new FileNotFoundException("File not found: " + file.getPath());
// Because file open mode "rw" would create it.
}
if (textToInsert.isEmpty()) {
return;
}
long fileLength = file.length();
long startPosition = fileLength / 2;
long remainingLength = fileLength - startPosition;
if (remainingLength > Integer.MAX_VALUE) {
throw new IllegalStateException("File too large");
}
byte[] bytesToInsert = textToInsert.getBytes(StandardCharsets.UTF_8);
try (RandomAccessFile fh = new RandomAccessFile(file, "rw")) {
fh.seek(startPosition);
byte[] remainder = new byte[(int)remainingLength];
fh.readFully(remainder);
fh.seek(startPosition);
fh.write(bytesToInsert);
fh.write(remainder);
}
}
This requires Java 7 or higher.

how to get input file name in hadoop cascading

In map-reduce I would extract the input file name as following
public void map(WritableComparable<Text> key, Text value, OutputCollector<Text,Text> output, Reporter reporter)
throws IOException {
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name "+filename);
System.out.println("Directory and File name"+fileSplit.getPath().toString());
process(key,value);
}
How can I do something similar with Cascading?
Pipe assembly = new Pipe(SomeFlowFactory.class.getSimpleName());
Function<Object> parseFunc = new SomeParseFunction();
assembly = new Each(assembly, new Fields(LINE), parseFunc);
...
public class SomeParseFunction extends BaseOperation<Object> implements Function<Object> {
...
@Override
public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
// How can I get the input file name here?
}
Thanks,

I don't use Cascading, but I think it should be enough to access the context instance via functionCall.getContext(). To obtain the filename you can use:
String filename= ((FileSplit)context.getInputSplit()).getPath().getName();
However, it seems that Cascading uses the old API; if the above doesn't work, try:
Object name = flowProcess.getProperty( "map.input.file" );

Thanks to Engineiro for sharing the answer. However, when invoking hfp.getReporter().getInputSplit(), I got a MultiInputSplit, which cannot be cast to FileSplit directly in Cascading 2.5.3. After digging into the related Cascading APIs, I found a way to retrieve the input file names. I would like to share it to supplement Engineiro's answer:
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
String fileName = fs.getPath().getName();

You would do this by getting the reporter within the buffer class, from the flowProcess argument provided to the buffer's operate call.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
FileSplit fileSplit = (FileSplit)hfp.getReporter().getInputSplit();
.
.//the rest of your code
.

saving random numbers in java

I'm doing an animation in Processing. I'm using random points and I need to execute the code twice for stereo vision.
I have lots of random variables in my code, so I should save it somewhere for the second run or re-generate the SAME string of "random" numbers any time I run the program. (as said here: http://www.coderanch.com/t/372076/java/java/save-random-numbers)
Is this approach possible? How? If I save the numbers in a txt file and then read it, will my program run slower? What's the best way to do this?
Thanks.

If you just need to be able to generate the same sequence for a limited time, seeding the random number generator with the same value to generate the same sequence is most likely the easiest and fastest way to go. Just make sure that any parallel threads always request their pseudo random numbers in the same sequence, or you'll be in trouble.
Note though that this guarantee depends on the generator: java.util.Random documents a fixed algorithm for a given seed, but other generators and other languages need not produce the same sequence. So if you want long-term storage for your sequence, or want to be able to use it outside of your Java program, you need to save it to a file.
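A minimal sketch of the seeding approach, using java.util.Random. The class name RepeatableRandom and the method generate are hypothetical; the seed value is whatever constant you choose for both runs.

```java
import java.util.Random;

// Hypothetical helper for illustration only.
final class RepeatableRandom {

    // Generates `count` pseudo-random doubles from a generator seeded with `seed`.
    // Calling this twice with the same arguments returns the same values.
    static double[] generate(long seed, int count) {
        Random random = new Random(seed);
        double[] values = new double[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextDouble();
        }
        return values;
    }
}
```

For the stereo case, both passes could call generate with the same seed, or each pass could create its own Random with that seed and request numbers in the same order.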

Here is an example:
public static void writeRandomDoublesToFile(String filePath, int numbersCount) throws IOException
{
    // try-with-resources flushes and closes the stream; without it,
    // buffered data may never reach the file.
    try (DataOutputStream dos = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(new File(filePath))))) {
        dos.writeInt(numbersCount);
        for (int i = 0; i < numbersCount; i++) dos.writeDouble(Math.random());
    }
}
public static double[] readRandomDoublesFromFile(String filePath) throws IOException
{
    try (DataInputStream dis = new DataInputStream(
            new BufferedInputStream(new FileInputStream(new File(filePath))))) {
        // The first value in the file is the number of doubles that follow.
        int numbersCount = dis.readInt();
        double[] result = new double[numbersCount];
        for (int i = 0; i < numbersCount; i++) result[i] = dis.readDouble();
        return result;
    }
}

Well, there are a couple of ways to approach this problem. One of them is to save the random values in a file and pass that file name as a parameter to your program.
You could do that in one of two ways, the first of which is to use the args[] parameter:
import java.io.*;
import java.util.*;
public class bla {
    public static void main(String[] args) throws FileNotFoundException {
        // You'd need to put some verification code here to make
        // sure that input was actually sent to the program.
        // args[0] is the first command-line argument.
        Scanner in = new Scanner(new File(args[0]));
        while (in.hasNextLine()) {
            System.out.println(in.nextLine());
        }
        in.close();
    }
}
Another way is to read from console input. The code is the same as above, except that you replace Scanner in = new Scanner(new File(args[0])); and the verification code before it with Scanner in = new Scanner(System.in);. That only covers loading the values.
The process of generating those points could be done in the following manner:
import java.util.*;
import java.io.*;
public class generator {
public static void main(String[] args) throws IOException {
// You'd get some user input (or not) here
// that would ask for the file to save to,
// and that can be done by either using the
// scanner class like the input example above,
// or by using args, but in this case we'll
// just say:
String fileName = "somefile.txt";
FileWriter fstream = new FileWriter(fileName);
BufferedWriter out = new BufferedWriter(fstream);
out.write("Stuff");
out.close();
}
}
Both of those solutions are simple ways to read and write to and from a file in Java. However, if you deploy either of those solutions, you're still left with some kind of parsing of the data.
If it were me, I'd go for object serialization, and store a binary copy of the data structure I've already generated to disk rather than having to parse and reparse that information in an inefficient way. (Using text files, usually, takes up more disk space.)
Here is how you would do that. (I'm going to reuse code that has already been written elsewhere and comment on it along the way.)
You declare some wrapper class that holds data (you don't always have to do this, by the way.)
public class Employee implements java.io.Serializable
{
public String name;
public String address;
public transient int SSN;
public int number;
public void mailCheck()
{
System.out.println("Mailing a check to " + name
+ " " + address);
}
}
And then, to serialize:
import java.io.*;
public class SerializeDemo
{
public static void main(String [] args)
{
Employee e = new Employee();
e.name = "Reyan Ali";
e.address = "Phokka Kuan, Ambehta Peer";
e.SSN = 11122333;
e.number = 101;
try
{
FileOutputStream fileOut =
new FileOutputStream("employee.ser");
ObjectOutputStream out =
new ObjectOutputStream(fileOut);
out.writeObject(e);
out.close();
fileOut.close();
}catch(IOException i)
{
i.printStackTrace();
}
}
}
And then, to deserialize:
import java.io.*;
public class DeserializeDemo
{
public static void main(String [] args)
{
Employee e = null;
try
{
FileInputStream fileIn =
new FileInputStream("employee.ser");
ObjectInputStream in = new ObjectInputStream(fileIn);
e = (Employee) in.readObject();
in.close();
fileIn.close();
}catch(IOException i)
{
i.printStackTrace();
return;
}catch(ClassNotFoundException c)
{
System.out.println("Employee class not found");
c.printStackTrace();
return;
}
System.out.println("Deserialized Employee...");
System.out.println("Name: " + e.name);
System.out.println("Address: " + e.address);
System.out.println("SSN: " + e.SSN);
System.out.println("Number: " + e.number);
}
}
Another alternative solution to your problem, that does not involve storing data, is to create a lazy generator for whatever function that provides you your random values, and provide the same seed each and every time. That way, you don't have to store any data at all.
However, that is probably still quite a bit slower (I think) than serializing the object to disk and loading it back up again. (Of course, that's a really subjective statement, but I'm not going to enumerate cases where it is not true.) The advantage is that it doesn't require any storage at all.
Another way, that you may have not possibly thought of, is to create a wrapper around your generator function that memoizes the output -- meaning that data that has already been generated before will be retrieved from memory and will not have to be generated again if the same inputs are true. You can see some resources on that here: Memoization source
The idea behind memoizing your function calls is that you save time without persisting to disk. This is ideal if the same values are generated over and over and over again. Of course, for a set of random points, this isn't going to work very well if every point is unique, but keep that in the back of your mind.
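A minimal sketch of such a memoizing wrapper, assuming a single-argument function and single-threaded use. The class name Memoizer and its constructor are hypothetical and are not taken from the memoization source mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper for illustration only; not thread-safe.
final class Memoizer<K, V> implements Function<K, V> {
    private final Function<K, V> delegate;
    private final Map<K, V> cache = new HashMap<>();

    Memoizer(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public V apply(K key) {
        // computeIfAbsent calls the delegate only for keys that are not cached yet.
        return cache.computeIfAbsent(key, delegate);
    }
}
```

Wrapping a generator function as new Memoizer<>(generator) means repeated calls with the same input return the cached value instead of recomputing it.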
The really interesting part comes when considering the ways that all the previous strategies I've described in this post can be combined together.
It'd be interesting to set up a Memoizer class, like described in the second page of 2, and then implement java.io.Serializable in that class. After that, you can add methods save(String fileName) and load(String fileName) in the memoizer class that make serialization and deserialization easier, so you can persist the cache used to memoize the function. Very useful.
Anyway, enough is enough. In short, just use the same seed value, and generate the same point pairs on the fly.
