ParquetFileReader not working with file stream - java

I'm trying to use ParquetFileReader to read files I'm receiving from S3, using a custom InputFile implementation, since the data isn't a local file and I can't create a local temp file either.
Here is my custom class, based on this answer:
class ParquetInputFile(stream: ByteArray) : InputFile {
    var data: ByteArray = stream

    private class SeekableByteArrayInputStream(buf: ByteArray?) : ByteArrayInputStream(buf) {
        var pos: Long = -1
    }

    override fun getLength(): Long {
        return data.size.toLong()
    }

    override fun newStream(): SeekableInputStream {
        return object : DelegatingSeekableInputStream(SeekableByteArrayInputStream(data)) {
            override fun seek(newPos: Long) {
                (stream as SeekableByteArrayInputStream).pos = newPos
            }

            override fun getPos(): Long {
                return (stream as SeekableByteArrayInputStream).pos
            }
        }
    }

    override fun toString(): String {
        return "com.test.ParquetInputFile[]"
    }
}
And here is my code getting the file from S3 and using the class above:
val bucketFile = s3Client.getObject(
    GetObjectRequest(
        "bucket-name",
        "test_file.parquet"
    )
)

val file = ParquetInputFile(bucketFile.objectContent.readAllBytes())
val reader = ParquetFileReader.open(file)
I'm getting an exception from the .open() call when the reader tries to read the file footer: it says it's not a Parquet file while checking the "magic" byte array.
As a quick test I read the same S3 file from local disk using a deprecated ParquetFileReader method, and it works:
val local = ParquetFileReader.open(Configuration(), Path("/Users/casky/Documents/pocs/resources/test_file.parquet"))
Debugging the same readFooter method, I saw that the fileMetadataLength it reads here differs between the local file and the S3 file. When readIntLittleEndian() executes its four read() calls here, the values returned in ch1, ch2, ch3 and ch4 are:
- Local file: 242, 7, 0, 0, returning 2034
- S3 file: 80, 65, 82, 49, returning 827474256
But, as you can see, the values read from the S3 file are exactly the bytes that the Parquet MAGIC array expects ("PAR1").
Now I'm not sure whether the custom class is messing something up, or whether the Path object does something with the file content when reading it from local disk.
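For reference, 80, 65, 82, 49 are the ASCII codes for "PAR1", the Parquet magic, which suggests the custom seek() never actually moves the read position: it writes a shadow pos field that ByteArrayInputStream ignores, so every read starts at offset 0. Below is a minimal sketch of an in-memory InputFile whose seek really repositions the cursor. It is written in Scala to keep a single language for the examples added to this page (the idea translates directly to Kotlin) and relies on ByteArrayInputStream keeping its cursor in the protected pos field.

import java.io.ByteArrayInputStream
import org.apache.parquet.io.{DelegatingSeekableInputStream, InputFile, SeekableInputStream}

// Minimal sketch, assuming the whole object fits in memory.
// ByteArrayInputStream keeps its read cursor in the protected `pos` field,
// so seeking can overwrite that field instead of a shadow variable.
class InMemoryInputFile(data: Array[Byte]) extends InputFile {

  private class SeekableByteArrayInputStream(buf: Array[Byte]) extends ByteArrayInputStream(buf) {
    def seekTo(newPos: Long): Unit = { pos = newPos.toInt } // moves the actual cursor
    def position: Long = pos.toLong
  }

  override def getLength: Long = data.length.toLong

  override def newStream(): SeekableInputStream = {
    val in = new SeekableByteArrayInputStream(data)
    new DelegatingSeekableInputStream(in) {
      override def seek(newPos: Long): Unit = in.seekTo(newPos)
      override def getPos: Long = in.position
    }
  }
}

With the cursor actually repositioned, readFooter should find the little-endian footer length where it expects it, instead of re-reading the leading magic bytes.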

Related

S3 / MinIO with Java / Scala: Saving byte buffer chunks of files to object storage

So, imagine that I have a Scala Vert.x Web REST API that receives file uploads via HTTP multipart requests. However, it doesn't receive the incoming file data as a single InputStream. Instead, each file is received as a series of byte buffers handed over via a few callback functions.
The callbacks basically look like this:
// the callback that receives byte buffers (chunks) of the file being uploaded
// it is called multiple times until the full file has been received
upload.handler { buffer =>
  // send chunk to backend
}

// the callback that gets called after the full file has been uploaded
// (i.e. after all chunks have been received)
upload.endHandler { _ =>
  // do something after the file has been uploaded
}

// callback called if an exception is raised while receiving the file
upload.exceptionHandler { e =>
  // do something to handle the exception
}
Now, I'd like to use these callbacks to save the file into a MinIO bucket (MinIO, if you're unfamiliar, is basically self-hosted S3, and its API is pretty much the same as the S3 Java API).
Since I don't have a file handle, I need to use putObject() to put an InputStream into MinIO.
The inefficient work-around I'm currently using with the MinIO Java API looks like this:
// this is all inside the context of handling a HTTP request
val out = new PipedOutputStream()
val in = new PipedInputStream()
var size = 0
in.connect(out)

upload.handler { buffer =>
  out.write(buffer.getBytes)
  size += buffer.length()
}

upload.endHandler { _ =>
  minioClient.putObject(
    PutObjectArgs.builder()
      .bucket("my-bucket")
      .object("my-filename")
      .stream(in, size, 50000000)
      .build())
}
Obviously, this isn't optimal. Since I'm using a simple java.io stream here, the entire file ends up getting loaded into memory.
I don't want to save the File to disk on the server before putting it into object storage. I'd like to put it straight into my object storage.
How could I accomplish this using the S3 API and a series of byte buffers given to me via the upload.handler callback?
EDIT
I should add that I am using MinIO because I cannot use a commercially-hosted cloud solution, like S3. However, as mentioned on MinIO's website, I can use Amazon's S3 Java SDK while using MinIO as my storage solution.
I attempted to follow this guide on Amazon's website for uploading objects to S3 in chunks.
The solution I attempted looks like this:
context.request.uploadHandler { upload =>
  println(s"Filename: ${upload.filename()}")

  val partETags = new util.ArrayList[PartETag]
  val initRequest = new InitiateMultipartUploadRequest("docs", "my-filekey")
  val initResponse = s3Client.initiateMultipartUpload(initRequest)

  upload.handler { buffer =>
    println("uploading part", buffer.length())
    try {
      val request = new UploadPartRequest()
        .withBucketName("docs")
        .withKey("my-filekey")
        .withPartSize(buffer.length())
        .withUploadId(initResponse.getUploadId)
        .withInputStream(new ByteArrayInputStream(buffer.getBytes()))
      val uploadResult = s3Client.uploadPart(request)
      partETags.add(uploadResult.getPartETag)
    } catch {
      case e: Exception => println("Exception raised: ", e)
    }
  }

  // this gets called for EACH uploaded file sequentially
  upload.endHandler { _ =>
    // upload successful
    println("done uploading")
    try {
      val compRequest = new CompleteMultipartUploadRequest("docs", "my-filekey", initResponse.getUploadId, partETags)
      s3Client.completeMultipartUpload(compRequest)
    } catch {
      case e: Exception => println("Exception raised: ", e)
    }
    context.response.setStatusCode(200).end("Uploaded")
  }

  upload.exceptionHandler { e =>
    // handle the exception
    println("exception thrown", e)
  }
}
This works for files that are small (my test small file was 11 bytes), but not for large files.
In the case of large files, the processes inside the upload.handler get progressively slower as the file continues to upload. Also, upload.endHandler is never called, and the file somehow continues uploading after 100% of the file has been uploaded.
However, as soon as I comment out the s3Client.uploadPart(request) portion inside upload.handler and the s3Client.completeMultipartUpload parts inside upload.endHandler (basically throwing away the file instead of saving it to object storage), the file upload progresses as normal and terminates correctly.
I figured out what I was doing wrong (when using the S3 client). I was not accumulating bytes inside my upload.handler. I need to accumulate bytes until the buffer size is big enough to upload a part, rather than upload each time I receive a few bytes.
Since neither Amazon's S3 client nor the MinIO client did what I want, I decided to dig into how putObject() was actually implemented and make my own. This is what I came up with.
This implementation is specific to Vert.x; however, it can easily be generalized to work with built-in java.io InputStreams via a while loop and a pair of Piped streams.
This implementation is also specific to MinIO, but it can easily be adapted to use the S3 client since, for the most part, the two APIs are the same.
In this example, Buffer is basically a container around a ByteArray and I'm not really doing anything special here. I replaced it with a byte array to ensure that it would still work, and it did.
package server
import com.google.common.collect.HashMultimap
import io.minio.MinioClient
import io.minio.messages.Part
import io.vertx.core.buffer.Buffer
import io.vertx.core.streams.ReadStream
import scala.collection.mutable.ListBuffer
class CustomMinioClient(client: MinioClient) extends MinioClient(client) {

  def putReadStream(bucket: String = "my-bucket",
                    objectName: String,
                    region: String = "us-east-1",
                    data: ReadStream[Buffer],
                    objectSize: Long,
                    contentType: String = "application/octet-stream"
                   ) = {
    val headers: HashMultimap[String, String] = HashMultimap.create()
    headers.put("Content-Type", contentType)
    var uploadId: String = null

    try {
      val parts = new ListBuffer[Part]()
      val createResponse = createMultipartUpload(bucket, region, objectName, headers, null)
      uploadId = createResponse.result.uploadId()

      var partNumber = 1
      var uploadedSize = 0

      // an array to use to accumulate bytes from the incoming stream until we have enough to make a `uploadPart` request
      var partBuffer = Buffer.buffer()

      // S3's minimum part size is 5mb, excepting the last part
      // you should probably implement your own logic for determining how big
      // to make each part based off the total object size to avoid unnecessary calls to S3 to upload small parts.
      val minPartSize = 5 * 1024 * 1024

      data.handler { buffer =>
        partBuffer.appendBuffer(buffer)

        val availableSize = objectSize - uploadedSize - partBuffer.length
        val isMinPartSize = partBuffer.length >= minPartSize
        val isLastPart = uploadedSize + partBuffer.length == objectSize

        if (isMinPartSize || isLastPart) {
          val partResponse = uploadPart(
            bucket,
            region,
            objectName,
            partBuffer.getBytes,
            partBuffer.length,
            uploadId,
            partNumber,
            null,
            null
          )
          parts.addOne(new Part(partNumber, partResponse.etag))
          uploadedSize += partBuffer.length
          partNumber += 1
          // empty the part buffer since we have already uploaded it
          partBuffer = Buffer.buffer()
        }
      }

      data.endHandler { _ =>
        completeMultipartUpload(bucket, region, objectName, uploadId, parts.toArray, null, null)
      }

      data.exceptionHandler { exception =>
        // should also probably abort the upload here
        println("Handler caught exception in custom putObject: " + exception)
      }
    } catch {
      // and abort it here as well...
      case e: Exception =>
        println("Exception thrown in custom `putObject`: " + e)
        abortMultipartUpload(
          bucket,
          region,
          objectName,
          uploadId,
          null,
          null
        )
    }
  }
}
This can all be used pretty easily.
First, set up the client:
private val _minioClient = MinioClient.builder()
  .endpoint("http://localhost:9000")
  .credentials("my-username", "my-password")
  .build()

private val myClient = new CustomMinioClient(_minioClient)
Then, where you receive the upload request:
context.request.uploadHandler { upload =>
  myClient.putReadStream(objectName = upload.filename(), data = upload, objectSize = myFileSize)
  context.response().setStatusCode(200).end("done")
}
The only catch with this implementation is that you need to know the file sizes in advance for the request.
However, this can easily be solved the way I did it, especially if you're using a web UI:
- Before attempting to upload the files, the client sends the server a request containing a map of file name to file size.
- That pre-request generates a unique ID for the upload.
- The server saves the group of filename -> filesize entries using the upload ID as an index, and sends the upload ID back to the client.
- The client sends the multipart upload request using the upload ID.
- The server pulls the list of files and their sizes and uses it to call .putReadStream().
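A rough sketch of that bookkeeping (every name here is hypothetical, not part of the client above):

import java.util.UUID
import scala.collection.concurrent.TrieMap

// Hypothetical bookkeeping for the pre-request flow described above.
// The client first sends a map of filename -> size; the server returns an
// upload ID and keeps the sizes until the multipart request arrives.
object UploadRegistry {
  private val pending = TrieMap.empty[String, Map[String, Long]]

  def register(fileSizes: Map[String, Long]): String = {
    val uploadId = UUID.randomUUID().toString
    pending.put(uploadId, fileSizes)
    uploadId
  }

  def sizeOf(uploadId: String, filename: String): Option[Long] =
    pending.get(uploadId).flatMap(_.get(filename))
}

The upload handler can then resolve myFileSize with UploadRegistry.sizeOf(uploadId, upload.filename()), with the upload ID carried in the request, for example as a path or query parameter.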

(Akka HTTP) When I send an .xlsx file to the user as Array[Byte], the user gets a folder

I'm new to akka-http and I've run into a problem: there is a route that returns a report file in a certain format. When an .xlsx file is requested, the user receives a folder; renaming that folder to report.xlsx yields a normal, working Excel file.
I send the .xlsx file as Array[Byte], using akka-http 10.0.5.
Maybe someone has faced this problem:
path("api" / "reports" / "downloadReport") {
get {
parameters('format, 'jobId.as[Long]) { (format, jobId) =>
uncacheReport(jobId) match {
case None => complete(ResourceNotFound)
case Some(report) if format == "excel" =>
encodeResponse {
val reportHtml: String = representerSupport.represent(report, HTMLFormat.Excel)
complete(HttpEntity(ContentType(MediaTypes.`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`), ByteString(reportwriter bytes reportHtml)))
}
}
}
}
}
Adding the "Content-Disposition" header solved the problem. Thanks @PJFanning:
encodeResponse {
  val reportHtml: String = representerSupport.represent(report, HTMLFormat.Excel)(user.settings.interfaceSettings)
  val header = RawHeader("Content-Disposition", "filename=report.xlsx")
  respondWithHeader(header) {
    complete(HttpEntity(ContentType(MediaTypes.`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`), reportwriter bytes reportHtml))
  }
}
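If you prefer Akka HTTP's modelled header over a RawHeader, a sketch along these lines should also work and additionally marks the response as an attachment (untested against 10.0.5):

import akka.http.scaladsl.model.headers.{`Content-Disposition`, ContentDispositionTypes}

// Modelled equivalent of the RawHeader above; `attachment` asks the browser
// to download the response under the given filename.
val header = `Content-Disposition`(ContentDispositionTypes.attachment, Map("filename" -> "report.xlsx"))
respondWithHeader(header) {
  complete(HttpEntity(ContentType(MediaTypes.`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`), reportwriter bytes reportHtml))
}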

Trying to convert a Java File to ByteString for use in a stream

I'm creating an API that receives files and uploads them to S3, and I'm using the akka-stream-alpakka-s3 library to do it with streams.
My issue is that in my controller I can convert the upload to a Java File:
def uploadToS3() = Action(parse.multipartFormData) { request =>
  request.body.file("file").map { filePart =>
    val filename = Paths.get(filePart.filename).getFileName
    val file = Paths.get(s"/tmp/$filename").toFile // this is a java.io.File
    saveToS3(file, filename)
  }
  ...
}
In my S3 service function I could only use a Scala File, since it has a toByteArray function and I need that for the source. It looks like this:
import scala.reflect.io.File

class S3Service @Inject()(mys3Client: S3Client,
                          configuration: Configuration,
                          implicit val executionContext: ExecutionContext,
                          implicit val actorSystem: ActorSystem) {

  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val bucket: String = configuration.get[String]("my.aws.s3.bucket")

  // func to save to s3
  def saveToS3(file: File, fileName: String): Future[AWSLocation] = {
    // create a UUID to use as a directory prefix, so files with the same name are possible
    val fileNameUUID: String = s"${UUID.randomUUID()}-$fileName"

    // this will be my sink for the stream
    val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = mys3Client.multipartUpload(s"$bucket/$fileNameUUID", fileName)

    // here is my issue: I need to transform the file to a ByteString to create the Source,
    // but the file I get from the controller is a java.io.File, and toByteArray is only
    // available on scala.reflect.io.File, so I had to use the Scala File in this function.
    Future.fromTry(Try {
      ByteString(file.toByteArray())
    }).flatMap { byteString =>
      Source.single(byteString).runWith(s3Sink) map { res =>
        AWSLocation(s"$bucket/$fileNameUUID", res.key)
      }
    }.recover {
      case ex: S3Exception =>
        logger.error("some message", ex)
        throw ex
      case ex: Throwable =>
        logger.error("some message", ex)
        throw ex
    }
  }
}
What would be the best way to align the file types so that I can pass a ByteString into my Source?
Take a look at FileIO.fromPath, which will give you a Source[ByteString, ...] from a java.nio.file.Path.
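For instance, something along these lines inside the S3Service from the question (a sketch, reusing its mys3Client, bucket, AWSLocation and implicit execution context; untested):

import java.io.File
import java.util.UUID
import scala.concurrent.Future
import akka.stream.scaladsl.FileIO

// Stream the file from disk in chunks instead of loading it all into memory with toByteArray.
def saveToS3(file: File, fileName: String): Future[AWSLocation] = {
  val fileNameUUID: String = s"${UUID.randomUUID()}-$fileName"
  val s3Sink = mys3Client.multipartUpload(s"$bucket/$fileNameUUID", fileName)

  FileIO.fromPath(file.toPath)   // Source[ByteString, Future[IOResult]]
    .runWith(s3Sink)
    .map(res => AWSLocation(s"$bucket/$fileNameUUID", res.key))
}

This also lets the controller pass the plain java.io.File it already has, so scala.reflect.io.File is no longer needed.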

Play Framework eating up disk space

I am successfully serving videos using the Play framework, but I'm experiencing an issue: each time a file is served, the Play framework creates a copy in C:\Users\user\AppData\Temp. I'm serving large files so this quickly creates a problem with disk space.
Is there any way to serve a file in Play without creating a copy? Or have Play automatically delete the temp file?
Code I'm using to serve is essentially:
public Result video() {
    return ok(new File("whatever"));
}
Use streaming.
I use the following method for video streaming. This code does not create temp copies of the media file.
Basically, this code responds to the Range queries sent by the browser. If the browser does not support Range queries, I fall back to sending the whole file with Ok.sendFile (internally Play also tries to stream the file), which might create temp files, but that fallback happens rarely, since most browsers do send Range requests.
GET /media controllers.MediaController.media
Put this code inside a Controller called MediaController
def media = Action { req =>
  val file = new File("/Users/something/Downloads/somefile.mp4")
  val rangeHeaderOpt = req.headers.get(RANGE)
  rangeHeaderOpt.map { range =>
    val strs = range.substring("bytes=".length).split("-")
    if (strs.length == 1) {
      val start = strs.head.toLong
      val length = file.length() - 1L
      partialContentHelper(file, start, length)
    } else {
      val start = strs.head.toLong
      val length = strs.tail.head.toLong
      partialContentHelper(file, start, length)
    }
  }.getOrElse {
    Ok.sendFile(file)
  }
}
def partialContentHelper(file: File, start: Long, length: Long) = {
  val fis = new FileInputStream(file)
  fis.skip(start)
  val byteStringEnumerator = Enumerator.fromStream(fis).&>(Enumeratee.map(ByteString.fromArray(_)))
  val mediaSource = Source.fromPublisher(Streams.enumeratorToPublisher(byteStringEnumerator))
  PartialContent.sendEntity(HttpEntity.Streamed(mediaSource, None, None)).withHeaders(
    CONTENT_TYPE -> MimeTypes.forExtension("mp4").get,
    CONTENT_LENGTH -> ((length - start) + 1).toString,
    CONTENT_RANGE -> s"bytes $start-$length/${file.length()}",
    ACCEPT_RANGES -> "bytes",
    CONNECTION -> "keep-alive"
  )
}
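If you'd rather avoid the Enumerator/Enumeratee bridge (the iteratee API is deprecated in newer Play versions), the same source can be built directly with Akka Streams. A sketch, assuming the same file and start values as in partialContentHelper:

import java.io.FileInputStream
import akka.stream.scaladsl.StreamConverters

// Build the ByteString source straight from the InputStream,
// replacing the Enumerator/Streams conversion above.
val mediaSource = StreamConverters.fromInputStream(() => {
  val fis = new FileInputStream(file)
  fis.skip(start) // begin at the requested byte offset
  fis
})

The PartialContent.sendEntity call and the headers stay exactly as above.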

Write to separate files in Apache Spark (with Java)

I am reading my data as whole text files. My object is of type Article which I defined. Here's the reading and processing of the data:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);

JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
    String content = fileNameContent._2();
    Article a = new Article(content);
    return a;
});
Now, once every file has been processed separately, I would like to write the result to HDFS as a separate file, not with saveAsTextFile. I know I probably have to do it with foreach, so:
processingFiles.foreach(a -> {
    // Here is a pseudo code of how I want to do this
    String fileName = here_is_full_file_name_to_write_to_hdfs;
    writeToDisk(fileName, a); // This could be a simple text file
});
Any ideas how to do this in Java?
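One common approach, sketched here in Scala to keep a single language for the examples added to this page (the same Hadoop FileSystem calls are available from Java inside foreachPartition's VoidFunction), is to write each record yourself with the HDFS client; getId and getContent are placeholders for whatever accessors Article actually exposes:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: write one HDFS file per Article.
// foreachPartition opens one FileSystem handle per partition instead of one per record.
def writeArticles(articles: RDD[Article], outputDir: String): Unit =
  articles.foreachPartition { iter =>
    val fs = FileSystem.get(new Configuration()) // uses the cluster's default HDFS config
    iter.foreach { article =>
      val out = fs.create(new Path(s"$outputDir/${article.getId}.txt"))
      try out.write(article.getContent.getBytes(StandardCharsets.UTF_8))
      finally out.close()
    }
  }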
