I am reading my data as whole text files. My objects are of type Article, a class I defined. Here's the reading and processing of the data:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
String content = fileNameContent._2();
Article a = new Article(content);
return a;
});
Now, once every file has been processed separately, I would like to write each result to HDFS as a separate file, not with saveAsTextFile. I know that I probably have to do it with foreach, so:
processingFiles.foreach(a -> {
// Here is a pseudo code of how I want to do this
String fileName = here_is_full_file_name_to_write_to_hdfs;
writeToDisk(fileName, a); // This could be a simple text file
});
Any ideas how to do this in Java?
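Not from the original post, but here is a minimal sketch of one way the foreach body could write each Article to HDFS using the Hadoop FileSystem API; the /output path and the Article accessors (getId(), getText()) are assumptions, not part of the question:
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

processingFiles.foreach(a -> {
    // Runs on the executors, so each task gets its own FileSystem handle.
    FileSystem fs = FileSystem.get(URI.create("hdfs:///"), new Configuration());
    // Hypothetical naming scheme -- derive the name from whatever Article actually exposes.
    Path out = new Path("/output/" + a.getId() + ".txt");
    try (FSDataOutputStream os = fs.create(out, true)) {
        os.write(a.getText().getBytes(StandardCharsets.UTF_8));
    }
});
foreachPartition would let one FileSystem handle be reused per partition instead of per record, and Article needs to be Serializable either way.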
Related
I want to perform some computation on each text file in a directory, and then use the results to compute another value.
To read the files from the directory I use:
JavaPairRDD<String, String> textFiles = sc.wholeTextFiles(PATH);
Next, for each file
textFiles.foreach(file -> processFile(file));
I want to make some magic like computing frequent words.
I have access to the path of the file and its content.
JavaRDD offers methods such as flatMap, mapToPair, and reduceByKey, which I need.
The question is: is there any way to convert the values of the JavaPairRDD into a JavaRDD?
textFiles.keys(); //Return an RDD with the keys of each tuple.
textFiles.values(); // Return an RDD with the values of each tuple.
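As a hedged sketch of the "frequent words" part, here is a minimal word count over the file contents, assuming whitespace tokenization; on Spark 1.x the flatMap lambda returns an Iterable, so drop the .iterator() call:
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> contents = textFiles.values();          // one element per file
JavaPairRDD<String, Integer> wordCounts = contents
        .flatMap(text -> Arrays.asList(text.split("\\s+")).iterator())  // Spark 2.x signature
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);
From wordCounts you can then sort or filter to pick the most frequent words.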
*** UPDATE:
As per your updated question, I think the below achieves what you need. I created two CSV files in a directory "tmp".
one.csv:
one,1
two,2
three,3
two.csv:
four,4
five,5
six,6
Then ran the following code LOCALLY:
String appName = UUID.randomUUID().toString();
SparkConf sc = new SparkConf().setAppName(appName).setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(sc);
JavaPairRDD<String, String> fS = jsc.wholeTextFiles("tmp");
System.out.println("File names:");
fS.keys().collect().forEach(System.out::println);
System.out.println("File content:");
fS.values().collect().forEach(System.out::println);
jsc.close();
It produces the following output (I removed all unnecessary Spark output and edited my directory paths):
File names:
file:/......[my dir here]/one.csv
file:/......[my dir here]/two.csv
File content:
one,1
two,2
three,3
four,4
five,5
six,6
Seems like this is what you were asking for...
I saw a few discussions on this but couldn't quite understand the right solution:
I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest().
withBucketName(...).
withPrefix(...));
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey())); // repeat while objectListing.isTruncated()
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
AmazonS3 s3Client = getAmazonS3Client(properties);
S3Object object = s3Client.getObject(new GetObjectRequest(...));
InputStream is = object.getObjectContent();
List<String> lines = new LinkedList<>();
String str;
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
if (is != null) {
while ((str = reader.readLine()) != null) {
lines.add(str);
}
} else {
...
}
} finally {
...
}
return lines.iterator();
}
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...), but I'm not sure which is the best-practice way.
The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a tree walk (as wildcard pattern matching of paths does).
The code in the post does the all-children listing, which delivers far better performance; it's essentially what ships with Hadoop 2.8 as s3a's listFiles(path, recursive), see HADOOP-13208.
After getting that listing, you've got strings to the object paths, which you can then map to s3a/s3n paths for Spark to handle as text-file inputs, and to which you can then apply work:
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
And as requested, here's the Java code used.
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix+summary.getKey()));
// repeat while objectListing truncated
JavaRDD<String> events = sc.textFile(String.join(",", keys));
Note that I switched s3n to s3a because, provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one which gets maintained and tested against Spark workloads by people (me). See The history of Hadoop's S3 connectors.
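Not the code from the post, but a hedged sketch of the listFiles(path, recursive) route mentioned above, assuming Hadoop 2.8+ with the s3a connector configured and your own bucket/prefix values:
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.spark.api.java.JavaRDD;

// One recursive listing call instead of a directory treewalk.
FileSystem s3 = FileSystem.get(URI.create("s3a://" + bucket), new Configuration());
RemoteIterator<LocatedFileStatus> it = s3.listFiles(new Path("s3a://" + bucket + "/" + prefix), true);
List<String> paths = new ArrayList<>();
while (it.hasNext()) {
    paths.add(it.next().getPath().toString());
}
JavaRDD<String> events = sc.textFile(String.join(",", paths));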
You may use sc.textFile to read multiple files.
You can pass multiple file URLs as its argument.
You can specify whole directories, use wildcards, and even a CSV of directories and wildcards.
Ex:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Reference: this answer.
I guess if you parallelize while reading, AWS will be utilizing the executors and that should definitely improve the performance:
val bucketName=xxx
val keyname=xxx
val df = sc.parallelize(new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "SecretKey")).listObjects(request).getObjectSummaries.map(_.getKey).toList)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines }
I'm working on a Java project to optimize existing code. Currently I'm using BufferedReader/FileInputStream to read the content of an XML file as a String in Java.
But my question is: is there any faster way to read XML content? Are SAX/DOM faster than BufferedReader/FileInputStream?
Need help regarding the above issue.
Thanks in advance.
I think that your code shown in the other question is faster than DOM-like parsers, which would definitely require more memory and likely some computation in order to reconstruct the document in full. You may want to profile the code, though.
I also think that your code could be prettified a bit for streaming processing if you used the javax.xml.stream XMLStreamReader, which I have found quite helpful for many tasks. That class "is designed to be the lowest level and most efficient way to read XML data", according to Oracle.
Here is the excerpt from my code where I parse StackOverflow users XML file distributed as a public data dump:
// the input file location
private static final String fileLocation = "/media/My Book/Stack/users.xml";
// the target elements
private static final String USERS_ELEMENT = "users";
private static final String ROW_ELEMENT = "row";
// get the XML file handler
//
FileInputStream fileInputStream = new FileInputStream(fileLocation);
XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(
fileInputStream);
// reading the data
//
while (xmlStreamReader.hasNext()) {
int eventCode = xmlStreamReader.next();
// this triggers _users records_ logic
//
if ((XMLStreamConstants.START_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
// read and parse the user data rows
//
while (xmlStreamReader.hasNext()) {
eventCode = xmlStreamReader.next();
// this breaks _users record_ reading logic
//
if ((XMLStreamConstants.END_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
break;
}
else {
if ((XMLStreamConstants.START_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(ROW_ELEMENT)) {
// extract the user data
//
User user = new User();
int attributesCount = xmlStreamReader.getAttributeCount();
for (int i = 0; i < attributesCount; i++) {
user.setAttribute(xmlStreamReader.getAttributeLocalName(i),
xmlStreamReader.getAttributeValue(i));
}
// all other user record-related logic
//
}
}
}
}
}
That users file format is quite simple and similar to your Bank.xml file:
<users>
<row Id="1567200" Reputation="1" CreationDate="2012-07-31T23:57:57.770" DisplayName="XXX" EmailHash="XXX" LastAccessDate="2012-08-01T00:55:12.953" Views="0" UpVotes="0" DownVotes="0" />
...
</users>
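The User class is not part of the excerpt; a minimal map-backed sketch of the shape it would need (purely an assumption) to satisfy the setAttribute calls above:
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the User type: just keeps the row attributes by name.
public class User {
    private final Map<String, String> attributes = new HashMap<>();

    public void setAttribute(String name, String value) {
        attributes.put(name, value);
    }

    public String getAttribute(String name) {
        return attributes.get(name);
    }
}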
There are different parser options available.
Consider using a streaming parser, i.e. either a push or a pull parser, because the DOM may become quite big.
It's not as if XML parsers are necessarily slow. Consider your web browser. It does XML parsing all the time, and tries really hard to be robust to syntax errors. Usually, memory is the bigger issue.
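For the push option, here is a hedged SAX sketch along the same lines as the users.xml example above (the file name and element/attribute names are assumptions):
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new File("users.xml"), new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The parser pushes events at the handler; no DOM tree is ever built.
        if ("row".equalsIgnoreCase(qName)) {
            String id = attrs.getValue("Id");
            // handle one record here
        }
    }
});
Either way, memory stays flat because only one element is in flight at a time.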
I am trying to read (and then store to a 3rd-party local DB) certain DICOM object tags "during" an incoming association request.
For accepting association requests and storing my DICOM files locally I have used a modified version of the dcmrcv() tool. More specifically, I have overridden the onCStoreRQ method like:
@Override
protected void onCStoreRQ(Association association, int pcid, DicomObject dcmReqObj,
PDVInputStream dataStream, String transferSyntaxUID,
DicomObject dcmRspObj)
throws DicomServiceException, IOException {
final String classUID = dcmReqObj.getString(Tag.AffectedSOPClassUID);
final String instanceUID = dcmReqObj.getString(Tag.AffectedSOPInstanceUID);
config = new GlobalConfig();
final File associationDir = config.getAssocDirFile();
final String prefixedFileName = instanceUID;
final String dicomFileBaseName = prefixedFileName + DICOM_FILE_EXTENSION;
File dicomFile = new File(associationDir, dicomFileBaseName);
assert !dicomFile.exists();
final BasicDicomObject fileMetaDcmObj = new BasicDicomObject();
fileMetaDcmObj.initFileMetaInformation(classUID, instanceUID, transferSyntaxUID);
final DicomOutputStream outStream = new DicomOutputStream(new BufferedOutputStream(new FileOutputStream(dicomFile), 600000));
// I would like somewhere here to extract some tags from the incoming DICOM object. When I try to do it
// using dataStream, my DICOM files get corrupted!
//System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
try {
outStream.writeFileMetaInformation(fileMetaDcmObj);
dataStream.copyTo(outStream);
} finally {
outStream.close();
}
dicomFile.renameTo(new File(associationDir, dicomFileBaseName));
System.out.println("DICOM file name: " + dicomFile.getName());
}
@Override
public void associationAccepted(final AssociationAcceptEvent associationAcceptEvent) {
....
@Override
public void associationClosed(final AssociationCloseEvent associationCloseEvent) {
...
}
I would like to insert, somewhere in this code, a method which reads dataStream, parses specific tags, and stores them to a local database.
However, wherever I try to put a piece of code that manipulates (just reads, for a start) dataStream, my DICOM files get corrupted!
PDVInputStream implements java.io.InputStream ...
Even if I just put a:
System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
before copying dataStream to outStream, my DICOM files get corrupted (1 KB in size) ...
How am I supposed to use dataStream in a C-STORE request to extract some information?
I hope my question is clear ...
The PDVInputStream is probably a PDUDecoder class. You'll have to reset the position when using the input stream multiple times.
Maybe a better solution would be to store the DICOM object in memory and use that for both purposes. Something akin to:
DicomObject dcmobj = dataStream.readDataset();
String whatYouWant = dcmobj.get( Tag.whatever );
dcmobj.initFileMetaInformation( transferSyntaxUID );
outStream.writeDicomFile( dcmobj );
I'll have a Pig script that ends with storing its contents in a text file:
STORE foo into 'outputLocation';
During a completely different job I want to read the lines of this file and parse them back into Tuples. The data in foo might contain chararrays with the characters used when you save Pig bags/tuples, like { } ( ) , etc. I can read the previously saved file using code like:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FileStatus[] fileStatuses = fs.listStatus(new Path("outputLocation"));
for (FileStatus fileStatus : fileStatuses) {
if (fileStatus.getPath().getName().contains("part")) {
DataInputStream in = fs.open(fileStatus.getPath());
String line;
while ((line = in.readLine()) != null) {
// Do stuff
}
}
}
Now, where // Do stuff is, I'd like to parse my String back into a Tuple. Is this possible / does Pig provide an API? The closest I could find is the StorageUtil class's textToTuple function, but that just makes a Tuple containing one DataByteArray. I want a tuple containing the bags, tuples, and chararrays it originally had, so I can fetch the original fields easily. I can change the StoreFunc I save the original file with, if that helps.
This is the plain Pig solution, without using JSON or a UDF. I found it out the hard way.
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.newplan.logical.relational.LogicalSchema;
import org.apache.pig.impl.util.Utils;
Let's say your string to be parsed is this:
String tupleString = "(quick,123,{(brown,1.0),(fox,2.5)})";
First, parse your schema string. Note that you have an enclosing tuple.
LogicalSchema schema = Utils.parseSchema("a0:(a1:chararray, a2:long, a3:{(a4:chararray, a5:double)})");
Then parse your tuple with your schema.
Utf8StorageConverter converter = new Utf8StorageConverter();
ResourceFieldSchema fieldSchema = new ResourceFieldSchema(schema.getField("a0"));
Tuple tuple = converter.bytesToTuple(tupleString.getBytes("UTF-8"), fieldSchema);
Voila! Check your data.
assertEquals((String) tuple.get(0), "quick");
assertEquals(((DataBag) tuple.get(2)).size(), 2L);
I would just output the data in JSON format. Pig has native support for parsing JSON into tuples. That would save you from having to write a UDF.