Pig: Reparse Strings into Tuples in Java

I have a Pig script that ends by storing its contents in a text file.
STORE foo into 'outputLocation';
During a completely different job I want to read the lines of this file and parse them back into Tuples. The data in foo might contain chararrays with characters that Pig also uses when it saves bags and tuples, such as { } ( ) and commas. I can read the previously saved file using code like this:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FileStatus[] fileStatuses = fs.listStatus(new Path("outputLocation"));
for (FileStatus fileStatus : fileStatuses) {
if (fileStatus.getPath().getName().contains("part")) {
DataInputStream in = fs.open(fileStatus.getPath());
String line;
while ((line = in.readLine()) != null) {
// Do stuff
}
}
}
Now, where // Do stuff is, I'd like to parse my String back into a Tuple. Is this possible, and does Pig provide an API for it? The closest I could find is the textToTuple function of the StorageUtil class, but that just makes a Tuple containing one DataByteArray. I want a Tuple containing the nested bags, tuples, and chararrays it originally had, so I can fetch the original fields easily. I can change the StoreFunc I use to save the original file, if that helps.

This is the plain Pig solution without using JSON or UDF. I have found it the hard way.
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.newplan.logical.relational.LogicalSchema;
import org.apache.pig.impl.util.Utils;
Let's say your string to be parsed is this:
String tupleString = "(quick,123,{(brown,1.0),(fox,2.5)})";
First, parse your schema string. Note that you have an enclosing tuple.
LogicalSchema schema = Utils.parseSchema("a0:(a1:chararray, a2:long, a3:{(a4:chararray, a5:double)})");
Then parse your tuple with your schema.
Utf8StorageConverter converter = new Utf8StorageConverter();
ResourceFieldSchema fieldSchema = new ResourceFieldSchema(schema.getField("a0"));
Tuple tuple = converter.bytesToTuple(tupleString.getBytes("UTF-8"), fieldSchema);
Voila! Check your data.
assertEquals((String) tuple.get(0), "quick");
assertEquals(((DataBag) tuple.get(2)).size(), 2L);

I would just output the data in JSON format. Pig has native support for parsing JSON into tuples, which would save you from having to write a UDF.

Related

Reading json file and updating

I'm trying to understand the procedure to do what the title says.
I'm doing this in Java with the Gson dependency.
I am getting information from another service I use, in JSON format. So I want to get that info, put some additional info in there (like date/time) and use it afterwards for searching purposes.
The procedure is :
Get the JSON info (lets say "id") and add it to the JSON file you have
Add more info to that JSON file (lets say "Date and time of upload")
Finally, save that updated JSON file
So I read the file:
JsonReader reader = new JsonReader(new FileReader(filename));
Do I have now to convert it to string, and then update the string, so I can finally write it back to json?
If it doesn't exist, I create an empty file and then, can I update it with Json/Gson data? or do I have to create a Json File?
try {
File jsonFile = new File("C:\\uploads\\datasets");
if (jsonFile.createNewFile()){
System.out.println("File is created!");
}else{
System.out.println("File already exists.");
}
} catch (IOException e) {
e.printStackTrace();
}
Excuse any newbie/stupid mistakes I've probably made, I'm trying to understand JSON. Actually, the philosophy behind it.
JSON stands for JavaScript Object Notation and it's nothing more than a way to format data.
Taken from here:
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is
realized as an object, record, struct, dictionary, hash table, keyed
list, or associative array.
An ordered list of values. In most
languages, this is realized as an array, vector, list, or sequence.
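As a rough illustration of those two structures in Java terms, the sketch below builds the in-memory equivalent of [{"id": 42, "tags": ["a", "b"]}] using only JDK collections. Using LinkedHashMap for a JSON object and ArrayList for a JSON array is an assumption made for illustration, and JsonStructuresDemo is a hypothetical name; Gson has its own JsonObject and JsonArray types for the same idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: models JSON's two structures with JDK collections.
class JsonStructuresDemo {

    // Builds the equivalent of [{"id": 42, "tags": ["a", "b"]}].
    static List<Map<String, Object>> build() {
        // A JSON array of strings maps naturally to an ordered list.
        List<String> tags = new ArrayList<>();
        tags.add("a");
        tags.add("b");

        // A JSON object is a collection of name/value pairs.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42L);
        record.put("tags", tags);

        // The outer JSON array holds a single object.
        List<Map<String, Object>> array = new ArrayList<>();
        array.add(record);
        return array;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```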
To address your questions:
Get the JSON info (lets say "id") and add it to the JSON file you have
JsonReader reader = new JsonReader(new FileReader(inputFilename));
reader.beginArray();
reader.beginObject();
long id = -1;
while (reader.hasNext()) {
String value = reader.nextName();
if (value.equals("id")) {
id = reader.nextLong();
} else {
reader.skipValue();
}
} // end of while loop
reader.endObject();
reader.endArray();
reader.close();
Add more info to that JSON file (lets say "Date and time of upload")
This produces a timestamp in the form yyyyMMdd_HHmmss (for example, 20240131_235959):
String timestamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
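To see what that pattern string actually produces, here is a small sketch that formats a fixed instant instead of the current time. The time zone is pinned to UTC only to make the output reproducible, and TimestampDemo is an illustrative name, not part of the original answer.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical demo class showing the output shape of the pattern used above.
class TimestampDemo {

    // Formats the given date with the pattern "yyyyMMdd_HHmmss" in UTC.
    static String format(Date date) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd_HHmmss");
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }

    public static void main(String[] args) {
        // The Unix epoch (1970-01-01 00:00:00 UTC) prints as 19700101_000000.
        System.out.println(format(new Date(0L)));
    }
}
```

The letters in the pattern are case sensitive: MM is the month while mm is the minute, and HH is the hour on a 24-hour clock.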
Finally, save that updated JSON file
Create a JsonWriter.
JsonWriter writer = new JsonWriter(new FileWriter(outputFilename));
writer.beginArray();
writer.beginObject();
writer.name("id").value(id);
writer.name("timestamp").value(timestamp);
writer.endObject();
writer.endArray();
writer.close();
You can read more about JsonReader and JsonWriter here and here.

Best way to populate a user defined object using the values of string array

I am reading two different CSV files and populating two different kinds of objects from them. I split each line with a regex (a different one for each file) and populate the object from the resulting array, as shown below:
public static <T> List<T> readCsv(String filePath, String type) {
List<T> list = new ArrayList<T>();
try {
File file = new File(filePath);
FileInputStream fileInputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(fileInputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
list = bufferedReader.lines().skip(1).map(line -> {
T obj = null;
String[] data = null;
if (type.equalsIgnoreCase("Student")) {
data = line.split(",");
ABC abc = new ABC();
abc.setName(data[0]);
abc.setRollNo(data[1]);
abc.setMobileNo(data[2]);
obj = (T)abc;
} else if (type.equalsIgnoreCase("Employee")) {
data = line.split("\\|");
XYZ xyz = new XYZ();
xyz.setName(Integer.parseInt(data[0]));
xyz.setCity(data[1]);
xyz.setEmployer(data[2]);
xyz.setDesignation(data[3]);
obj = (T)xyz;
}
return obj;
}).collect(Collectors.toList());
} catch (Exception e) {
e.printStackTrace();
}
return list;
}
csv files are as below:
i. csv file to populate ABC object:
Name,rollNo,mobileNo
Test1,1000,8888888888
Test2,1001,9999999990
ii. csv file to populate XYZ object
Name|City|Employer|Designation
Test1|City1|Emp1|SSE
Test2|City2|Emp2|
The issue is that data can be missing for any of the above columns, as in the last row of the second CSV file. In that case, I get an ArrayIndexOutOfBoundsException.
Can anyone let me know what is the best way to populate the object using the data of the string array?
Thanks in advance.
In addition to the other mistakes that were pointed out to you in the comments, your actual problem is caused by line.split("\\|") calling line.split("\\|", 0), which discards trailing empty strings. Call line.split("\\|", -1) instead and it will work.
The problem appears to be that one or more of the last values on any given CSV line may be empty. In that case, you run into the fact that String.split(String) suppresses trailing empty strings.
Supposing that you can rely on all the fields in fact being present, even if empty, you can simply use the two-arg form of split():
data = line.split(",", -1);
You can find details in that method's API docs.
If you cannot be confident that the fields will be present at all, then you can force them to be by adding delimiters to the end of the input string:
data = (line + ",,").split(",", -1);
Since you only use the first few values, any extra trailing values introduced by the extra delimiters would be ignored.
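The difference between the one-argument and two-argument forms of split() can be checked with a small standalone sketch. The class SplitLimitDemo and its helper method are made up for illustration; the sample line is the second data row of the pipe-delimited file from the question.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the question's code.
class SplitLimitDemo {

    // Splits with limit -1 so that trailing empty fields are preserved.
    static String[] splitKeepingEmpty(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        String line = "Test2|City2|Emp2|";

        // One-argument form: trailing empty strings are dropped, leaving 3 fields.
        String[] dropped = line.split("\\|");
        System.out.println(dropped.length + " " + Arrays.toString(dropped));

        // Limit -1: the empty Designation field is kept as "", giving 4 fields.
        String[] kept = splitKeepingEmpty(line, "\\|");
        System.out.println(kept.length + " " + Arrays.toString(kept));

        // Padding with extra delimiters guarantees a minimum number of fields
        // even when the line itself is missing its trailing delimiters.
        String[] padded = splitKeepingEmpty("Test3" + "|||", "\\|");
        System.out.println(padded.length);
    }
}
```

With the limit set to -1, data[3] exists for the short row and is simply an empty string, so the setter calls no longer run past the end of the array.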

Write to separate files in Apache Spark (with Java)

I am reading my data as whole text files. My object is of type Article which I defined. Here's the reading and processing of the data:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
String content = fileNameContent._2();
Article a = new Article(content);
return a;
});
Now, once every file has been processed separately, I would like to write each result to HDFS as a separate file too, rather than using saveAsTextFile. I know that I probably have to do it with foreach, so:
processingFiles.foreach(a -> {
// Here is a pseudo code of how I want to do this
String fileName = here_is_full_file_name_to_write_to_hdfs;
writeToDisk(fileName, a); // This could be a simple text file
});
Any ideas how to do this in Java?

How to serialize the data to AVRO schema in Spark (with Java)?

I have defined an Avro schema and generated some classes for it with avro-tools. Now I want to serialize the data to disk. I found some answers about Scala for this, but not for Java. The class Article is generated with avro-tools from a schema I defined.
Here's a simplified version of the code of how I try to do it:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
// The name of the file
String fileName = fileNameContent._1();
// The content of the file
String fileContent = fileNameContent._2();
// An object from my avro schema
Article a = new Article(fileContent);
Processing processing = new Processing();
// .... some processing of the content here ... //
processing.serializeArticleToDisk(avroFileName);
return a;
});
where serializeArticleToDisk(avroFileName) is defined as follows:
public void serializeArticleToDisk(String filename) throws IOException{
// Serialize article to disk
DatumWriter<Article> articleDatumWriter = new SpecificDatumWriter<Article>(Article.class);
DataFileWriter<Article> dataFileWriter = new DataFileWriter<Article>(articleDatumWriter);
dataFileWriter.create(this.article.getSchema(), new File(filename));
dataFileWriter.append(this.article);
dataFileWriter.close();
}
where Article is my avro schema.
Now, the mapper throws me the error:
java.io.FileNotFoundException: hdfs:/...path.../avroFileName.avro (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.avro.file.SyncableFileOutputStream.<init>(SyncableFileOutputStream.java:60)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at sentences.ProcessXML.serializeArticleToDisk(ProcessXML.java:207)
. . . rest of the stacktrace ...
although the file path is correct.
I use a collect() method afterwards, so everything else within the map function works fine (except for the serialization part).
I am quite new with Spark, so I am not sure if this might be something trivial actually. I suspect that I need to use some writing functions, not to do the writing in the mapper (not sure if this is true, though). Any ideas how to tackle this?
EDIT:
The last line of the error stack-trace I showed, is actually on this part:
dataFileWriter.create(this.article.getSchema(), new File(filename));
This is the part that throws the actual error. I am assuming the dataFileWriter needs to be replaced with something else. Any ideas?
This solution is not using data-frames and is not throwing any errors:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.NullWritable;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
. . . . .
// Serializing to AVRO
JavaPairRDD<AvroKey<Article>, NullWritable> javaPairRDD = processingFiles.mapToPair(r -> {
return new Tuple2<AvroKey<Article>, NullWritable>(new AvroKey<Article>(r), NullWritable.get());
});
Job job = AvroUtils.getJobOutputKeyAvroSchema(Article.getClassSchema());
javaPairRDD.saveAsNewAPIHadoopFile(outputDataPath, AvroKey.class, NullWritable.class, AvroKeyOutputFormat.class,
job.getConfiguration());
where AvroUtils.getJobOutputKeyAvroSchema is:
public static Job getJobOutputKeyAvroSchema(Schema avroSchema) {
Job job;
try {
job = Job.getInstance();
} catch (IOException e) {
throw new RuntimeException(e);
}
AvroJob.setOutputKeySchema(job, avroSchema);
return job;
}
Similar things for Spark + Avro can be found here -> https://github.com/CeON/spark-utils.
It seems that you are using Spark incorrectly.
map is a transformation. Just calling map doesn't trigger computation of the RDD; you have to call an action such as foreach() or collect().
Also note that the lambda supplied to map is serialized on the driver and shipped to the worker nodes in the cluster.
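This kind of laziness is not specific to Spark, and it can be illustrated without a cluster. The sketch below uses java.util.stream from the JDK as a stand-in, not the Spark API: map only records the function, and nothing runs until a terminal operation (the rough analogue of a Spark action) is called. LazyMapDemo and its call counter are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class: JDK streams used as an analogy for lazy RDD transformations.
class LazyMapDemo {

    // Counts how many times the mapping function has actually been invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // Declares a pipeline with a map step but does not execute it.
    static Stream<String> buildPipeline() {
        return Stream.of("a", "b", "c").map(s -> {
            calls.incrementAndGet();
            return s.toUpperCase();
        });
    }

    public static void main(String[] args) {
        Stream<String> pipeline = buildPipeline();
        // The mapping function has not run yet: the counter is still 0.
        System.out.println("calls after map: " + calls.get());

        // The terminal operation forces evaluation of all three elements.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("calls after collect: " + calls.get() + " " + result);
    }
}
```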
ADDED
Try to use Spark SQL and Spark-Avro to save Spark DataFrame in Avro format:
// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = sc.textFile("/examples/people.txt")
.map(Person::parse);
// Apply a schema to an RDD
DataFrame peopleDF = sqlContext.createDataFrame(people, Person.class);
peopleDF.write()
.format("com.databricks.spark.avro")
.save("/output");

How to change some values in a .JSON file and then write it back while keeping the JSON formatting ? (Java)

The JSON example file consists of:
{
"1st_key": "value1",
"2nd_key": "value2",
"object_keys": {
"obj_1st": "value1",
"obj_2nd": "value2",
"obj_3rd": "value3",
}
}
I read the JSON file into a String with this StringBuilder method, in order to add the newlines into the string itself. So the String looks exactly like the JSON file above.
public String getJsonContent(String fileName) {
StringBuilder result = new StringBuilder("");
File file = new File(fileName);
try (Scanner scanner = new Scanner(file)) {
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
result.append(line).append("\n");
}
scanner.close();
} catch (IOException e) {
e.printStackTrace();
}
return result.toString();
}
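As an alternative sketch, java.nio.file.Files can read the whole file in one call and keeps the original line breaks exactly as they are on disk, so no manual appending of "\n" is needed. This assumes the file is UTF-8 encoded; ReadFileDemo and the roundTrip helper are hypothetical names used only to make the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; assumes the JSON file is encoded as UTF-8.
class ReadFileDemo {

    // Reads the entire file into a String, preserving line breaks verbatim.
    static String readAll(String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(fileName));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes content to a temporary file, reads it back, then deletes the file.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("example", ".json");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp.toString());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("{\n  \"1st_key\": \"value1\"\n}\n"));
    }
}
```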
Then I translate the JSON file into an Object using MongoDB API (with DBObject, BasicDBObject and util.JSON) and I call out the Object section I need to change, which is 'object_keys':
File jsonFile = new File("C:\\example.json");
String jsonString = getJsonContent(jsonFile.getAbsolutePath());
DBObject jsonObject = (DBObject)JSON.parse(jsonString);
BasicDBObject objectKeys = (BasicDBObject) jsonObject.get("object_keys");
Now I can write new values into the Object using the PUT method like this:
objectKeys.put("obj_1st","NEW_VALUE1");
objectKeys.put("obj_2nd","NEW_VALUE2");
objectKeys.put("obj_3rd","NEW_VALUE3");
Note: the following part is not needed; see my answer below.
After I have changed the object, I need to write it back into the json file, so I need to translate the Object into a String. There are two methods to do this, either one works.
String newJSON = jsonObject.toString();
or
String newJSON = JSON.serialize(jsonObject);
Then I write the content back into the file using PrintWriter
PrintWriter writer = new PrintWriter("C:\\example.json");
writer.print(newJSON);
writer.close();
The problem I am facing now is that the String is written on a single line with no formatting whatsoever. Somewhere along the way it lost all the newlines. So it basically looks like this:
{"1st_key": "value1","2nd_key": "value2","object_keys": { "obj_1st": "NEW_VALUE1","obj_2nd": "NEW_VALUE2","obj_3rd": "NEW_VALUE3", }}
I need to write the JSON file back in the same format as shown in the beginning, keeping all the tabulation, spaces etc.
Is this possible somehow ?
Output formatted the way you describe is usually called pretty-printed (or "beautified") JSON; for example, search for "output beautiful JSON". A quick Google search found what I believe solves your problem.
Solution
You're going to have to use a json parser of some sort. I personally prefer org.json and would recommend it if you are manipulating the json data, but you may also like json-io which is really good for json serialization with no external dependencies.
With json-io, it's as simple as
String formattedJson = JsonWriter.formatJson(jsonObject.toString());
With org.json, you simply pass an int (the number of spaces to indent by) to the toString method.
Thanks Saraiva, I found a surprisingly simple solution by Googling around with the words 'pretty printing JSON' and used the Google GSON library. I downloaded the .jar and added it to my project in Eclipse.
These are the new imports I needed:
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
Since I already had the JSON Object (jsonObject) readily available from my previous code, I only needed to add two new lines:
Gson gson = new GsonBuilder().setPrettyPrinting().create();
String newJSON = gson.toJson(jsonObject);
Now when I use writer.print(newJSON); it will write the JSON in the right format, beautifully formatted and indented.
