Reading multiple files from S3 in parallel (Spark, Java)

I saw a few discussions on this but couldn't quite understand the right solution:
I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(...)
        .withPrefix(...));
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey())); // repeat while objectListing.isTruncated()
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
    AmazonS3 s3Client = getAmazonS3Client(properties);
    S3Object object = s3Client.getObject(new GetObjectRequest(...));
    InputStream is = object.getObjectContent();
    List<String> lines = new LinkedList<>();
    String str;
    try {
        if (is != null) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is));
            while ((str = reader.readLine()) != null) {
                lines.add(str);
            }
        } else {
            ...
        }
    } finally {
        ...
    }
    return lines.iterator();
}
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...), but I'm not sure which is the best-practice way.

The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a tree walk (as wildcard pattern matching of paths does).
The code in the post is doing the all-children listing, which delivers way better performance; it's essentially what ships with Hadoop 2.8 as s3a's listFiles(path, recursive). See HADOOP-13208.
After getting that listing, you've got strings to object paths which you can then map to s3a/s3n paths for Spark to handle as text file inputs, and to which you can then apply work:
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
And as requested, here's the Java code used.
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
// repeat while objectListing truncated
JavaRDD<String> events = sc.textFile(String.join(",", keys));
Note that I switched s3n to s3a, because, provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one which gets maintained and tested against Spark workloads by people (me). See The history of Hadoop's S3 connectors.
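For reference, here is a rough sketch of what the full listing loop hinted at by the "repeat while truncated" comments might look like with the AWS SDK v1 pagination call; s3, sc, bucket and sourcePrefix are placeholders for the client, context and names already used in the question:
List<String> keys = new LinkedList<>();
String prefix = "s3a://" + bucket + "/";
ObjectListing listing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(bucket)
        .withPrefix(sourcePrefix));
while (true) {
    // collect this page of keys as s3a paths
    listing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
    if (!listing.isTruncated()) {
        break;
    }
    listing = s3.listNextBatchOfObjects(listing); // fetch the next page
}
JavaRDD<String> events = sc.textFile(String.join(",", keys));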

You may use sc.textFile to read multiple files.
You can pass multiple file URLs as its argument.
You can specify whole directories, use wildcards and even CSV of directories and wildcards.
Ex:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Reference from this answer.

I guess if you parallelize the read, the S3 calls will be spread across the executors, which should definitely improve the performance:
val bucketName = xxx
val df = sc.parallelize(new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "secretKey"))
    .listObjects(request).getObjectSummaries.map(_.getKey).toList)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines }

Related

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
When I execute this operation once, it takes ~900 ms.
But when I do something like the following to execute the same operation multiple times in parallel, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}
One suspicion is that the directories contain many files, hence File.mkdirs does needlessly many checks.
The constructor of BufferedInputStream may take a custom buffer size. It has never helped much in my experience, but it might with your disk. With parallelism it could also help to reduce disk head movements.
You probably already tried Files.copy, but still, it might have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = Files.newInputStream(archivePath);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            // only touch the filesystem when the parent directory changes
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Throwing a RuntimeException and capturing the IOException/ArchiveException without rethrowing it (for instance as new IllegalStateException(e)) is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk. Small files might just be fine, though.
It seems better to parallelize reading the next file and then writing it in another thread.
Two threads might theoretically perform better than many threads with increased disk traffic. readAllBytes might then be appropriate, so the writing thread does not use is.
Since the tar entry probably keeps the file size too, that would allow checking whether readAllBytes is efficient enough for large files.
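A minimal sketch of that reader/writer split (not the poster's code; it assumes Java 16+ for records, Apache Commons Compress on the classpath, and that each entry fits in memory):
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class PipelinedUntar {
    private record QueuedFile(Path target, byte[] content) {}
    // poison pill telling the writer thread the archive is finished
    private static final QueuedFile DONE = new QueuedFile(null, null);

    public static void decompress(Path archive, Path destination) throws Exception {
        BlockingQueue<QueuedFile> queue = new ArrayBlockingQueue<>(64);

        // writer thread: drains the queue and writes files to disk
        Thread writer = new Thread(() -> {
            try {
                QueuedFile next;
                while ((next = queue.take()) != DONE) {
                    Files.createDirectories(next.target().getParent());
                    Files.write(next.target(), next.content());
                }
            } catch (Exception e) {
                // simplified error handling for the sketch
                throw new IllegalStateException(e);
            }
        });
        writer.start();

        // reader (current) thread: sequentially reads each entry into memory
        try (InputStream in = Files.newInputStream(archive);
             TarArchiveInputStream tar =
                     new TarArchiveInputStream(new BufferedInputStream(in, 1024 * 128))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                queue.put(new QueuedFile(destination.resolve(entry.getName()), tar.readAllBytes()));
            }
        } finally {
            queue.put(DONE);
            writer.join();
        }
    }
}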
Logging was mentioned in this question. It is known that it can consume much time, and with parallelism it becomes even more critical. But you seem to be aware of that: you wrote that you use your own logger. For a library, System.Logger is actually best. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies of the past year.
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what Logger are you using there? While other parts of your code don't seem to be shared among multiple threads, the static call to the Logger is something that is shared.
Also note: java.nio uses FileChannels, which provide synchronous I/O, so depending on how you create the channels you may get into similar situations (though I don't believe that applies here).

AWS: creating new files from an S3 object using Java, getting an error

I have a shape file and I need to read it from my Java code. I used the code below to read the shape file.
public class App {
    public static void main(String[] args) {
        File file = new File("C:\\Test\\sample.shp");
        Map<String, Object> map = new HashMap<>();
        try {
            map.put("url", URLs.fileToUrl(file));
            DataStore dataStore = DataStoreFinder.getDataStore(map);
            String typeName = dataStore.getTypeNames()[0];
            SimpleFeatureSource source = dataStore.getFeatureSource(typeName);
            SimpleFeatureCollection collection = source.getFeatures();
            try (FeatureIterator<SimpleFeature> features = collection.features()) {
                while (features.hasNext()) {
                    SimpleFeature feature = features.next();
                    SimpleFeatureType schema = feature.getFeatureType();
                    Class<?> geomType = schema.getGeometryDescriptor().getType().getBinding();
                    String type = "";
                    if (Polygon.class.isAssignableFrom(geomType) || MultiPolygon.class.isAssignableFrom(geomType)) {
                        MultiPolygon geom = (MultiPolygon) feature.getDefaultGeometry();
                        type = "Polygon";
                        if (geom.getNumGeometries() > 1) {
                            type = "MultiPolygon";
                        }
                    } else if (LineString.class.isAssignableFrom(geomType)
                            || MultiLineString.class.isAssignableFrom(geomType)) {
                    } else {
                    }
                    System.out.println(feature.getDefaultGeometryProperty().getValue().toString());
                }
            }
        } catch (Exception e) {
            // TODO: handle exception
        }
    }
}
I got the desired output. But my requirement is to write an AWS Lambda function to read the shape file. For this:
1. I created a Lambda Java project for an S3 event and wrote the same code inside handleRequest. I uploaded the Java project as a Lambda function and added a trigger, so when I upload a .shp file to the S3 bucket the Lambda function is automatically invoked. But I am getting an error like below:
java.lang.RuntimeException: java.io.FileNotFoundException: /sample.shp (No such file or directory)
I have the sample.shp file inside my S3 bucket. I went through the link below.
How to write an S3 object to a file?
I am getting the same error. I tried to change my code like below
S3Object object = s3.getObject(new GetObjectRequest(bucket, key));
InputStream objectData = object.getObjectContent();
map.put("url", objectData );
instead of
File file = new File("C:\\Test\\sample.shp");
map.put("url", URLs.fileToUrl(file));
:-( Now I am getting an error like below:
java.lang.NullPointerException
Also I tried the below code
DataStore dataStore = DataStoreFinder.getDataStore(objectData);
instead of
DataStore dataStore = DataStoreFinder.getDataStore(map);
the error was like below
java.lang.ClassCastException:
com.amazonaws.services.s3.model.S3ObjectInputStream cannot be cast to
java.util.Map
Also I tried to add the key directly to the map and also as a DataStore object. Everything went wrong. :-(
Is there anyone who can help me?
It will be very helpful if someone can do it for me...
The DataStoreFinder.getDataStore method in geotools requires you to provide a map containing a key/value pair with key "url". The value associated with that "url" key needs to be a file URL like "file://host/path/my.shp".
You're trying to insert a Java input stream into the map. That won't work, because it's not a file URL.
The geotools library does not accept http/https URLs (see the geotools code here and here), so you need a file:// URL. That means you will need to download the file from S3 to the local Lambda filesystem and then provide a file:// URL pointing to that local file. To do that, here's Java code that should work:
// get the shape file from S3 to the local filesystem
File localshp = new File("/tmp/download.shp");
s3.getObject(new GetObjectRequest(bucket, key), localshp);
// now store the file:// URL in the map
map.put("url", localshp.toURI().toURL().toString());
If the geotools library had accepted real URLs (not just file:// URLs) then you could have avoided the download and simply created a time-limited, pre-signed URL for the S3 object and put that URL into the map.
Here's an example of how to do that:
// get current time and add one hour
java.util.Date expiration = new java.util.Date();
long msec = expiration.getTime();
msec += 1000 * 60 * 60;
expiration.setTime(msec);
// request pre-signed URL that will allow bearer to GET the object
GeneratePresignedUrlRequest gpur = new GeneratePresignedUrlRequest(bucket, key);
gpur.setMethod(HttpMethod.GET);
gpur.setExpiration(expiration);
// get URL that will expire in one hour
URL url = s3.generatePresignedUrl(gpur);

Export and Import apacheds data into LDIF programmatically from java

I have created a server in Apache Directory Studio. I also created a partition and inserted some entries into that server from Java. Now I want to back up and restore this data in an LDIF file programmatically. I am new to LDAP, so please show me a detailed way to export and import entries programmatically from my server into an LDIF file using Java.
Current solution:
Now I am using this approach to backup:
EntryCursor cursor = connection.search(new Dn("o=partition"), "(ObjectClass=*)", SearchScope.SUBTREE, "*", "+");
Charset charset = Charset.forName("UTF-8");
Path filePath = Paths.get("src/main/resources", "backup.ldif");
BufferedWriter writer = Files.newBufferedWriter(filePath, charset);
String st = "";
while (cursor.next()) {
    Entry entry = cursor.get();
    String ss = LdifUtils.convertToLdif(entry);
    st += ss + "\n";
}
writer.write(st);
writer.close();
For restore I am using this:
InputStream is = new FileInputStream(filepath);
LdifReader entries = new LdifReader(is);
for (LdifEntry ldifEntry : entries) {
    Entry entry = ldifEntry.getEntry();
    AddRequest addRequest = new AddRequestImpl();
    addRequest.setEntry(entry);
    addRequest.addControl(new ManageDsaITImpl());
    AddResponse res = connection.add(addRequest);
}
But I am not sure whether this is the correct way.
Problem of this solution:
When I back up my database, it writes entries into the LDIF file in a random order, so restore does not work until I fix the order of entries manually. Is there any better way? Please, someone help me.
After a long search, I actually understood that the solution for restoring the entries is a simple recursion. With a recursive search the backup procedure does not print the entries in a random order; it maintains the tree order, so parents are added before their children on restore. Here is sample code which I use:
void findEntry(LdapConnection connection, Entry entry, StringBuilder sb)
        throws LdapException, CursorException {
    sb.append(LdifUtils.convertToLdif(entry));
    sb.append("\n");
    EntryCursor cursor = connection.search(entry.getDn(), "(ObjectClass=*)", SearchScope.ONELEVEL, "*", "+");
    while (cursor.next()) {
        findEntry(connection, cursor.get(), sb);
    }
}
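A possible way to drive that recursion, reusing the connection and the partition root "o=partition" from the question (a sketch, not tested against ApacheDS):
// look up the partition root and walk the tree from there
Entry root = connection.lookup(new Dn("o=partition"), "*", "+");
StringBuilder sb = new StringBuilder();
findEntry(connection, root, sb);

// write the ordered LDIF to the same backup file used above
Files.write(Paths.get("src/main/resources", "backup.ldif"),
        sb.toString().getBytes(StandardCharsets.UTF_8));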
Well, you tagged this as Java, so look at the UnboundID LDAP SDK; or, since you are using ApacheDS, why not look at the Apache LDAP API?
Either of them will work. I currently use the UnboundID LDAP SDK, which has LDIF-specific APIs. I assume the Apache LDAP API does also, but I have not used it.

How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files, which is working fine:
ArrayList<String[]> lines = new ArrayList<>();
...
for (FileStatus item : items) {
    // ignoring files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }
    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line;
    line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
Additional notes, based on research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading
I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3 GB. I have not run into any issues so far, but we did use file splitting to handle the large data set. Below is the code which you can use for reading and can test yourself.
public class HDFSClientTesting {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        try {
            //System.loadLibrary("libhadoop.so");
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            conf.addResource(new Path("core-site.xml"));

            String Filename = "/dir/00000027";
            long ByteOffset = 3185041;

            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(Filename), conf);
            Text key = new Text();
            Text value = new Text();

            rdr.seek(ByteOffset);
            rdr.next(key, value);

            // Plain text
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            File file = new File("test.gz");
            file.createNewFile();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
        }
    }
}
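For plain text files (rather than SequenceFiles) the same idea applies through FSDataInputStream, which is seekable; a minimal sketch, assuming you already know the byte offset you want to start from (a line offset would still require counting newlines or keeping an index):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetTextRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long byteOffset = 3185041L; // assumed known, e.g. recorded when the file was written

        try (FSDataInputStream in = fs.open(new Path("/dir/00000027"))) {
            in.seek(byteOffset); // jump straight to the offset instead of streaming the prefix
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            reader.readLine(); // discard the (possibly partial) line the offset landed in
            for (int i = 0; i < 20; i++) { // then return the next 20 full lines
                String line = reader.readLine();
                if (line == null) {
                    break;
                }
                System.out.println(line);
            }
        }
    }
}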

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continual InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP server. What I want to do is to take file[1..N], open up streams and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2 and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another file, but I would like to handle them as one continual stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in the JDK! Quoting the JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate an arbitrary number of InputStreams, while the two-stream constructor of SequenceInputStream accepts only two. But since SequenceInputStream is also an InputStream, you can apply it recursively (nest them):
new SequenceInputStream(
new SequenceInputStream(
new SequenceInputStream(file1, file2),
file3
),
file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)

  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted
    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows
val mergedStream = stream1++stream2++stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams and then create the sequence of input streams:
List<InputStream> iss = Files.list(Paths.get("/your/path"))
        .filter(Files::isRegularFile)
        .map(f -> {
            try {
                return new FileInputStream(f.toString());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).collect(Collectors.toList());
new SequenceInputStream(Collections.enumeration(iss));
Here is a more elegant solution using Vector. This is for Android specifically, but you can use a Vector in any Java:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
new SequenceInputStream(streams.asJavaEnumeration)
