Convert String to JavaRDD<String> - java

I want to perform some computation on each text file in a directory, and then use the results to compute another value.
To read the files from the directory I use:
JavaPairRDD<String, String> textFiles = sc.wholeTextFiles(PATH);
Next, for each file
textFiles.foreach(file -> processFile(file));
I want to do some processing, such as computing frequent words.
I have access to the path of the file and its content.
JavaRDD offers methods such as flatMap, mapToPair and reduceByKey, which I need.
The question is: is there any way to convert the values of the JavaPairRDD to a JavaRDD?

The question is: is there any way to convert the values of the JavaPairRDD to a JavaRDD?
textFiles.keys(); //Return an RDD with the keys of each tuple.
textFiles.values(); // Return an RDD with the values of each tuple.
*** UPDATE:
As per your updated question, I think the below achieves what you need. I created two CSV files in a directory "tmp".
one.csv:
one,1
two,2
three,3
two.csv:
four,4
five,5
six,6
Then I ran the following code locally:
String appName = UUID.randomUUID().toString();
SparkConf sc = new SparkConf().setAppName(appName).setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(sc);

JavaPairRDD<String, String> fS = jsc.wholeTextFiles("tmp");

System.out.println("File names:");
fS.keys().collect().forEach(new Consumer<String>() {
    @Override
    public void accept(String t) {
        System.out.println(t);
    }
});

System.out.println("File content:");
fS.values().collect().forEach(new Consumer<String>() {
    @Override
    public void accept(String t) {
        System.out.println(t);
    }
});

jsc.close();
It produces the following output (I removed all the unnecessary Spark output and edited my directory paths):
File names:
file:/......[my dir here]/one.csv
file:/......[my dir here]/two.csv
File content:
one,1
two,2
three,3
four,4
five,5
six,6
Seems like this is what you were asking for...
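If the goal is the frequent-word computation mentioned in the question, here is a minimal sketch (not part of the original answer, assuming Spark 2.x where flatMap expects an Iterator, and using plain whitespace tokenization as a placeholder) that applies flatMap, mapToPair and reduceByKey to the values of the textFiles pair RDD:
import java.util.Arrays;
import scala.Tuple2;

JavaPairRDD<String, Integer> wordCounts = textFiles.values()
        .flatMap(content -> Arrays.asList(content.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

// print the counts on the driver
wordCounts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));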

Related

Spark Save as Text File grouped by Key

I would like to save an RDD to text files grouped by key. Currently I can't figure out how to split the output into multiple files; it seems that all the output for keys which share the same partition gets written to the same file. I would like to have a different file for each key. Here's my code snippet:
JavaPairRDD<String, Iterable<Customer>> groupedResults = customerCityPairRDD.groupByKey();
groupedResults.flatMap(x -> x._2().iterator())
.saveAsTextFile(outputPath + "/cityCounts");
This can be achieved by using foreachPartition to save each partition to a separate file.
You can develop your code as follows:
groupedResults.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<Customer>>>>() {
    @Override
    public void call(Iterator<Tuple2<String, Iterable<Customer>>> rec) throws Exception {
        FSDataOutputStream fsOutputStream = null;
        BufferedWriter writer = null;
        try {
            // note: "path1" must be unique per partition, otherwise partitions overwrite each other
            fsOutputStream = FileSystem.get(new Configuration()).create(new Path("path1"));
            writer = new BufferedWriter(new OutputStreamWriter(fsOutputStream));
            while (rec.hasNext()) {
                Tuple2<String, Iterable<Customer>> entry = rec.next();
                writer.write(entry.toString());
                writer.newLine();
            }
        } catch (Exception exp) {
            exp.printStackTrace();
            // Handle exception
        } finally {
            if (writer != null) {
                writer.close(); // close writer
            }
        }
    }
});
Hope this helps.
Ravi
So I figured out how to solve this: convert the RDD to a DataFrame and then just partition by key during the write.
Dataset<Row> dataFrame = spark.createDataFrame(customerRDD, Customer.class);
dataFrame.write()
.partitionBy("city")
.text("cityCounts"); // write as text file at file path cityCounts

How to convert JavaDStream into RDD? Or is there a way I can create a new RDD inside the map function of a JavaDStream?

The streaming data I am getting from Kafka is the path of an HDFS file, and I need to get the data of that file.
batchInputDStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> arg0) throws Exception {
        StringReader reader = new StringReader(arg0._2);
        JAXBContext jaxbContext = JAXBContext.newInstance(FreshBatchInput.class);
        Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
        FreshBatchInput input = (FreshBatchInput) jaxbUnmarshaller.unmarshal(reader);
        return input.getPath();
    }
});
Here input.getPath() is the HDFS path of the file.
There is no option to collect a JavaDStream object, otherwise I would have used that: first collect the data, then read the data from the files.
I am not able to create a new RDD inside the map function; it gives the error "Task not serializable".
Is there any other option?
You can use foreachRDD. It is executed on the driver, so RDD actions are allowed:
transformed.foreachRDD(rdd -> {
    String inputPath = doSomethingWithRDD(rdd);
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
    jsc.textFile(inputPath) ... // process the file's contents here
});
Remember that you cannot create an RDD inside transformations or actions; RDDs can be created only on the driver. A similar question with an example of foreachRDD is here. This means you cannot use the SparkContext inside map, filter or foreachPartition.
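For the concrete case in the question, where each record of the stream is an HDFS path, a minimal sketch of this pattern might look as follows; the method name processPaths and the variable pathStream are hypothetical, and it assumes the unmarshalled paths arrive as a JavaDStream<String>:
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaDStream;

static void processPaths(JavaDStream<String> pathStream) {
    pathStream.foreachRDD(rdd -> {
        // foreachRDD runs on the driver, so collecting and creating new RDDs is allowed here
        List<String> paths = rdd.collect();
        if (paths.isEmpty()) {
            return;
        }
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
        // read every file of this micro-batch as a single RDD of lines
        JavaRDD<String> lines = jsc.textFile(String.join(",", paths));
        lines.foreach(line -> System.out.println(line)); // placeholder action
    });
}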

Reading multiple files from S3 in parallel (Spark, Java)

I saw a few discussions on this but couldn't quite understand the right solution:
I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(...)
        .withPrefix(...));
List<String> keys = new LinkedList<>();
// repeat while objectListing.isTruncated()
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey()));
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
    AmazonS3 s3Client = getAmazonS3Client(properties);
    S3Object object = s3Client.getObject(new GetObjectRequest(...));
    InputStream is = object.getObjectContent();
    List<String> lines = new LinkedList<>();
    String str;
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        if (is != null) {
            while ((str = reader.readLine()) != null) {
                lines.add(str);
            }
        } else {
            ...
        }
    } finally {
        ...
    }
    return lines.iterator();
}
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...), but I'm not sure which is the best-practice way.
The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a treewalk (as wildcard pattern matching of paths does).
The code in the post is doing the all-children listing, which delivers way better performance; it's essentially what ships with Hadoop 2.8 as s3a listFiles(path, recursive), see HADOOP-13208.
After getting that listing, you've got strings of object paths which you can then map to s3a/s3n paths for Spark to handle as text file inputs, and to which you can then apply work:
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
And, as requested, here's the Java code used:
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
// repeat while objectListing is truncated
JavaRDD<String> events = sc.textFile(String.join(",", keys));
Note that I switched s3n to s3a because, provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one which gets maintained and tested against Spark workloads by people (me). See The history of Hadoop's S3 connectors.
You may use sc.textFile to read multiple files.
You can pass multiple file URLs as its argument.
You can specify whole directories, use wildcards, and even a CSV of directories and wildcards.
Example:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Reference: this answer.
I guess if you try to parallelize the reading, AWS will be utilizing the executors, which will definitely improve performance:
val bucketName = "xxx"
// list the keys on the driver (getObjectSummaries needs .asScala from scala.collection.JavaConverters)
val s3 = new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "secretKey"))
val keys = s3.listObjects(request).getObjectSummaries.asScala.map(_.getKey).toList
// read the objects in parallel; the client is created inside the task so it is not serialized
val lines = sc.parallelize(keys).flatMap { key =>
  val client = new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "secretKey"))
  Source.fromInputStream(client.getObject(bucketName, key).getObjectContent: InputStream).getLines
}

Write to separate files in Apache Spark (with Java)

I am reading my data as whole text files. My object is of type Article which I defined. Here's the reading and processing of the data:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
    String content = fileNameContent._2();
    Article a = new Article(content);
    return a;
});
Now, once every file has been processed separately, I would like to write each result to HDFS as a separate file, not with saveAsTextFile. I know that I probably have to do it with foreach, so:
processingFiles.foreach(a -> {
    // Here is a pseudo code of how I want to do this
    String fileName = here_is_full_file_name_to_write_to_hdfs;
    writeToDisk(fileName, a); // This could be a simple text file
});
Any ideas how to do this in Java?
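One possible sketch (there is no answer in this thread) writes each Article from inside foreach using the Hadoop FileSystem API; the getId() and getText() methods on Article are hypothetical placeholders for whatever uniquely names and serializes your object, and the output directory is an assumption:
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

processingFiles.foreach(a -> {
    // hypothetical: derive a unique output file name from the article itself
    Path outputPath = new Path("/output/articles/" + a.getId() + ".txt");
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(fs.create(outputPath), StandardCharsets.UTF_8))) {
        writer.write(a.getText()); // hypothetical serialization of the Article
    }
});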

commons io FileUtils.writeStringToFile

I have a program written in Java that reads two properties files, source.properties and destination.properties, and writes the key/value pair of each line from the source to the destination. I decided to use the FileUtils.writeStringToFile method from the Apache Commons IO API instead of PrintWriter or FileWriter from the Java standard API. What I found is that only the last line of the source file ends up in the destination file.
contents of the source.properties
username=a
host=abc
contents of the destination.properties
host=abc
static void writeToFile(Map<String, String> map, String pathToFile) throws IOException {
    Iterator<Map.Entry<String, String>> itr = map.entrySet().iterator();
    File path = new File(pathToFile);
    while (itr.hasNext()) {
        Map.Entry<String, String> pairs = itr.next();
        FileUtils.writeStringToFile(path, pairs.getKey() + "=" + pairs.getValue());
    }
}
map contains the key/value pairs from the source file. When I debugged the program, I saw the while loop run twice, the map contained all the correct data, and the FileUtils method was called twice, writing each line of data from the source file.
Can someone explain why I am getting the aforementioned output?
[update]
I was able to achieve what I wanted using PrintWriter.
You have to use FileUtils#writeStringToFile with the boolean argument set to true to tell the utility method that it should append the String to the end of the file and not overwrite it.
@Deprecated
public static void writeStringToFile(File file,
                                     String data,
                                     boolean append)
                              throws IOException
So your code should be as below:
static void writeToFile(Map<String, String> map, String pathToFile) throws IOException {
    Iterator<Map.Entry<String, String>> itr = map.entrySet().iterator();
    File path = new File(pathToFile);
    while (itr.hasNext()) {
        Map.Entry<String, String> pairs = itr.next();
        FileUtils.writeStringToFile(path,
                pairs.getKey() + "=" + pairs.getValue(),
                true); // append rather than overwrite
    }
}
Side note: this method is deprecated; you should use the overload with a Charset specified in the method signature.
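For reference, a minimal sketch using the non-deprecated Charset overload (assuming Commons IO 2.x, which provides writeStringToFile(File, String, Charset, boolean)):
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.commons.io.FileUtils;

static void writeToFile(Map<String, String> map, String pathToFile) throws IOException {
    File path = new File(pathToFile);
    for (Map.Entry<String, String> entry : map.entrySet()) {
        // 'true' appends rather than overwrites; the line separator keeps entries on separate lines
        FileUtils.writeStringToFile(path,
                entry.getKey() + "=" + entry.getValue() + System.lineSeparator(),
                StandardCharsets.UTF_8,
                true);
    }
}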
