Upload a video file by chunks - java

Yes, it's a long question with a lot of detail... So, my question is: How can I stream an upload to Vimeo in segments?
For anyone wanting to copy and debug on their own machine: Here are the things you need:
My code here.
Include the Scribe library found here
Have a valid video file (mp4) which is at least greater than 10 MB and put it in the directory C:\test.mp4 or change that code to point wherever yours is.
That's it! Thanks for helping me out!
Big update: I've left a working API Key and Secret for Vimeo in the code here. So as long as you have a Vimeo account, all the code should work just fine for you once you've allowed the application and entered your token. Just copy the code from that link into a project on your favorite IDE and see if you can fix this with me. I'll give the bounty to whoever gives me the working code. Thanks! Oh, and don't expect to use this Key and Secret for long. Once this problem's resolved I'll delete it. :)
Overview of the problem: The problem is when I send the last chunk of bytes to Vimeo and then verify the upload, the response returns that the length of all the content is the length of only the last chunk, not all the chunks combined as it should be.
SSCCE Note: I have my entire SSCCE here. I put it somewhere else so it can be Compilable. It is NOT very Short (about 300 lines), but hopefully you find it to be Self-contained, and it's certainly an Example! I am, however, posting the relevant portions of my code in this post.
This is how it works: When you upload a video to Vimeo via the streaming method (see Upload API documentation here for setup to get to this point), you have to give a few headers: endpoint, content-length, and content-type. The documentation says it ignores any other headers. You also give it a payload of the byte information for the file you're uploading. And then sign and send it (I have a method which will do this using scribe).
My problem: Everything works great when I just send the video in one request. My problem is in cases when I'm uploading several bigger files, the computer I'm using doesn't have enough memory to load all of that byte information and put it in the HTTP PUT request, so I have to split it up into 1 MB segments. This is where things get tricky. The documentation mentions that it's possible to "resume" uploads, so I'm trying to do that with my code, but it's not working quite right. Below, you'll see the code for sending the video. Remember my SSCCE is here.
Things I've tried: I'm thinking it has something to do with the Content-Range header... So here are the things I've tried in changing what the Content-Range header says...
Not adding content range header to the first chunk
Adding a prefix to the content range header (each with a combination of the previous header):
"bytes"
"bytes " (throws connection error, see the very bottom for the error) --> It appears in the documentation that this is what they're looking for, but I'm pretty sure there are typos in the documentation because they have the content-range header on their "resume" example as: 1001-339108/339108 when it should be 1001-339107/339108. So... Yeah...
"bytes%20"
"bytes:"
"bytes: "
"bytes="
"bytes= "
Not adding anything as a prefix to the content range header
Here's the code:
/**
* Send the video data
*
* @return whether the video successfully sent
*/
private static boolean sendVideo(String endpoint, File file) throws FileNotFoundException, IOException {
// Setup File
long contentLength = file.length();
String contentLengthString = Long.toString(contentLength);
FileInputStream is = new FileInputStream(file);
int bufferSize = 10485760; // 10 MB = 10485760 bytes
byte[] bytesPortion = new byte[bufferSize];
int byteNumber = 0;
int maxAttempts = 1;
while (is.read(bytesPortion, 0, bufferSize) != -1) {
String contentRange = Integer.toString(byteNumber);
long bytesLeft = contentLength - byteNumber;
System.out.println(newline + newline + "Bytes Left: " + bytesLeft);
if (bytesLeft < bufferSize) {
//copy the bytesPortion array into a smaller array containing only the remaining bytes
bytesPortion = Arrays.copyOf(bytesPortion, (int) bytesLeft);
//This just makes it so it doesn't throw an IndexOutOfBounds exception on the next while iteration. It shouldn't get past another iteration
bufferSize = (int) bytesLeft;
}
byteNumber += bytesPortion.length;
contentRange += "-" + (byteNumber - 1) + "/" + contentLengthString;
int attempts = 0;
boolean success = false;
while (attempts < maxAttempts && !success) {
int bytesOnServer = sendVideoBytes("Test video", endpoint, contentLengthString, "video/mp4", contentRange, bytesPortion, first);
if (bytesOnServer == byteNumber) {
success = true;
} else {
System.out.println(bytesOnServer + " != " + byteNumber);
System.out.println("Success is not true!");
}
attempts++;
}
first = true;
if (!success) {
return false;
}
}
return true;
}
/**
* Sends the given bytes to the given endpoint
*
* @return the last byte on the server (from verifyUpload(endpoint))
*/
private static int sendVideoBytes(String videoTitle, String endpoint, String contentLength, String fileType, String contentRange, byte[] fileBytes, boolean addContentRange) throws FileNotFoundException, IOException {
OAuthRequest request = new OAuthRequest(Verb.PUT, endpoint);
request.addHeader("Content-Length", contentLength);
request.addHeader("Content-Type", fileType);
if (addContentRange) {
request.addHeader("Content-Range", contentRangeHeaderPrefix + contentRange);
}
request.addPayload(fileBytes);
Response response = signAndSendToVimeo(request, "sendVideo on " + videoTitle, false);
if (response.getCode() != 200 && !response.isSuccessful()) {
return -1;
}
return verifyUpload(endpoint);
}
/**
* Verifies the upload and returns whether it's successful
*
* @param endpoint to verify upload to
* @return the last byte on the server
*/
public static int verifyUpload(String endpoint) {
// Verify the upload
OAuthRequest request = new OAuthRequest(Verb.PUT, endpoint);
request.addHeader("Content-Length", "0");
request.addHeader("Content-Range", "bytes */*");
Response response = signAndSendToVimeo(request, "verifyUpload to " + endpoint, true);
if (response.getCode() != 308 || !response.isSuccessful()) {
return -1;
}
String range = response.getHeader("Range");
//range = "bytes=0-10485759"
return Integer.parseInt(range.substring(range.lastIndexOf("-") + 1)) + 1;
//The + 1 at the end is because Vimeo gives you 0-whatever byte where 0 = the first byte
}
Here's the signAndSendToVimeo method:
/**
* Signs the request and sends it. Returns the response.
*
* @param service
* @param accessToken
* @param request
* @return response
*/
public static Response signAndSendToVimeo(OAuthRequest request, String description, boolean printBody) throws org.scribe.exceptions.OAuthException {
System.out.println(newline + newline
+ "Signing " + description + " request:"
+ ((printBody && !request.getBodyContents().isEmpty()) ? newline + "\tBody Contents:" + request.getBodyContents() : "")
+ ((!request.getHeaders().isEmpty()) ? newline + "\tHeaders: " + request.getHeaders() : ""));
service.signRequest(accessToken, request);
printRequest(request, description);
Response response = request.send();
printResponse(response, description, printBody);
return response;
}
And here's some (an example... All of the output can be found here) of the output from the printRequest and printResponse methods: NOTE This output changes depending on what the contentRangeHeaderPrefix is set to and the first boolean is set to (which specifies whether or not to include the Content-Range header on the first chunk).
We're sending the video for upload!
Bytes Left: 15125120
Signing sendVideo on Test video request:
Headers: {Content-Length=15125120, Content-Type=video/mp4, Content-Range=bytes%200-10485759/15125120}
sendVideo on Test video >>> Request
Headers: {Authorization=OAuth oauth_signature="zUdkaaoJyvz%2Bt6zoMvAFvX0DRkc%3D", oauth_version="1.0", oauth_nonce="340477132", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="5cb447d1fc4c3308e2c6531e45bcadf1", oauth_token="460633205c55d3f1806bcab04174ae09", oauth_timestamp="1334336004", Content-Length=15125120, Content-Type=video/mp4, Content-Range=bytes: 0-10485759/15125120}
Verb: PUT
Complete URL: http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d
sendVideo on Test video >>> Response
Code: 200
Headers: {null=HTTP/1.1 200 OK, Content-Length=0, Connection=close, Content-Type=text/plain, Server=Vimeo/1.0}
Signing verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d request:
Headers: {Content-Length=0, Content-Range=bytes */*}
verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d >>> Request
Headers: {Authorization=OAuth oauth_signature="FQg8HJe84nrUTdyvMJGM37dpNpI%3D", oauth_version="1.0", oauth_nonce="298157825", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="5cb447d1fc4c3308e2c6531e45bcadf1", oauth_token="460633205c55d3f1806bcab04174ae09", oauth_timestamp="1334336015", Content-Length=0, Content-Range=bytes */*}
Verb: PUT
Complete URL: http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d
verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d >>> Response
Code: 308
Headers: {null=HTTP/1.1 308 Resume Incomplete, Range=bytes=0-10485759, Content-Length=0, Connection=close, Content-Type=text/plain, Server=Vimeo/1.0}
Body:
Bytes Left: 4639360
Signing sendVideo on Test video request:
Headers: {Content-Length=15125120, Content-Type=video/mp4, Content-Range=bytes: 10485760-15125119/15125120}
sendVideo on Test video >>> Request
Headers: {Authorization=OAuth oauth_signature="qspQBu42HVhQ7sDpzKGeu3%2Bn8tM%3D", oauth_version="1.0", oauth_nonce="183131870", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="5cb447d1fc4c3308e2c6531e45bcadf1", oauth_token="460633205c55d3f1806bcab04174ae09", oauth_timestamp="1334336015", Content-Length=15125120, Content-Type=video/mp4, Content-Range=bytes%2010485760-15125119/15125120}
Verb: PUT
Complete URL: http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d
sendVideo on Test video >>> Response
Code: 200
Headers: {null=HTTP/1.1 200 OK, Content-Length=0, Connection=close, Content-Type=text/plain, Server=Vimeo/1.0}
Signing verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d request:
Headers: {Content-Length=0, Content-Range=bytes */*}
verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d >>> Request
Headers: {Authorization=OAuth oauth_signature="IdhhhBryzCa5eYqSPKAQfnVFpIg%3D", oauth_version="1.0", oauth_nonce="442087608", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="5cb447d1fc4c3308e2c6531e45bcadf1", oauth_token="460633205c55d3f1806bcab04174ae09", oauth_timestamp="1334336020", Content-Length=0, Content-Range=bytes */*}
Verb: PUT
Complete URL: http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d
4639359 != 15125120
verifyUpload to http://174.129.125.96:8080/upload?ticket_id=5ea64d64547e38e5e3c121852b2d306d >>> Response
Success is not true!
Code: 308
Headers: {null=HTTP/1.1 308 Resume Incomplete, Range=bytes=0-4639359, Content-Length=0, Connection=close, Content-Type=text/plain, Server=Vimeo/1.0}
Body:
Then the code goes on to complete the upload and set video information (you can see that in my full code).
Edit 2: Tried removing the "%20" from the content-range and received this error making connection. I must use either "bytes%20" or not add "bytes" at all...
Exception in thread "main" org.scribe.exceptions.OAuthException: Problems while creating connection.
at org.scribe.model.Request.send(Request.java:70)
at org.scribe.model.OAuthRequest.send(OAuthRequest.java:12)
at autouploadermodel.VimeoTest.signAndSendToVimeo(VimeoTest.java:282)
at autouploadermodel.VimeoTest.sendVideoBytes(VimeoTest.java:130)
at autouploadermodel.VimeoTest.sendVideo(VimeoTest.java:105)
at autouploadermodel.VimeoTest.main(VimeoTest.java:62)
Caused by: java.io.IOException: Error writing to server
at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:622)
at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:634)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1317)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.scribe.model.Response.<init>(Response.java:28)
at org.scribe.model.Request.doSend(Request.java:110)
at org.scribe.model.Request.send(Request.java:62)
... 5 more
Java Result: 1
Edit 1: Updated the code and output. Still need help!

I think your problem could simply be the result of this line:
request.addHeader("Content-Range", "bytes%20" + contentRange);
Try and replace "bytes%20" by simply "bytes "
In your output you see the corresponding header has incorrect content:
Headers: {
Content-Length=15125120,
Content-Type=video/mp4,
Content-Range=bytes%200-10485759/15125120 <-- INCORRECT
}
On the topic of Content-Range...
You're right that an example final block of content should have a range like 14680064-15125119/15125120. That's part of the HTTP 1.1 spec.

Here
String contentRange = Integer.toString(byteNumber + 1);
you start from 1 and not from 0 at the first iteration.
Here
request.addHeader("Content-Length", contentLength);
you put the entire file content length and not the length of the current chunk.
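The two points above (ranges start at byte 0, and Content-Length describes the current chunk rather than the whole file) can be sketched as a small helper. This is only an illustration: the class and method names (ChunkHeaders, contentRange, contentLength, allRanges) are hypothetical and not part of the asker's code or of Scribe, and the header format follows HTTP/1.1; whether Vimeo's endpoint accepts it has not been verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original SSCCE.
final class ChunkHeaders {

    /** Content-Range value for a chunk that starts at a 0-based offset. */
    static String contentRange(long offset, long chunkLength, long totalLength) {
        if (offset < 0 || chunkLength <= 0 || offset + chunkLength > totalLength) {
            throw new IllegalArgumentException("chunk does not fit inside the file");
        }
        long lastByte = offset + chunkLength - 1; // the end of the range is inclusive
        return "bytes " + offset + "-" + lastByte + "/" + totalLength;
    }

    /** Content-Length for the same request: the size of this chunk only. */
    static String contentLength(long chunkLength) {
        return Long.toString(chunkLength);
    }

    /** All Content-Range values needed to cover a file of totalLength bytes. */
    static List<String> allRanges(long totalLength, long chunkSize) {
        List<String> ranges = new ArrayList<>();
        for (long offset = 0; offset < totalLength; offset += chunkSize) {
            long length = Math.min(chunkSize, totalLength - offset);
            ranges.add(contentRange(offset, length, totalLength));
        }
        return ranges;
    }
}
```

For the 15125120-byte file from the question with 10 MB chunks, allRanges(15125120L, 10485760L) yields "bytes 0-10485759/15125120" and "bytes 10485760-15125119/15125120"; the second request would then carry Content-Length 4639360. Using long for offsets also avoids overflow on files larger than 2 GB.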

The vimeo API page says:
"The final step is to call vimeo.videos.upload.complete to queue up the video for transcoding. This call will return the video_id, which you can then use in other calls (to set the title, description, privacy, etc.). If you do not call this method, the video will not be processed."
I added this bit of code to the end and got it to work:
request = new OAuthRequest(Verb.PUT, "http://vimeo.com/api/rest/v2");
request.addQuerystringParameter("method", "vimeo.videos.upload.complete");
request.addQuerystringParameter("filename", video.getName());
request.addQuerystringParameter("ticket_id", ticket);
service.signRequest(token, request);
response = request.send();

Check this :
String contentRange="bytes "+lastBytesSend+"-"+ ((totalSize - lastBytesSend)-1)+"/"+totalSize ;
request.addHeader("Content-Range",contentRange);

Related

Why do I get OAuthProblemException error='invalid_request' description='Handle could not be extracted' when doing Exact OAuth

When doing calls to Exact-on-line API to get authenticated we run into the problem that getting the first refresh-token fails. We're not sure why. This is what we get back from Exact:
Http code: 400
JSON Data:
{
error='invalid_request',
description='Handle could not be extracted',
uri='null',
state='null',
scope='null',
redirectUri='null',
responseStatus=400,
parameters={}
}
We use this Java code based on library org.apache.oltu.oauth2.client (1.0.2):
OAuthClientRequest oAuthRequest = OAuthClientRequest //
.tokenLocation(BASE_URL + "/api/oauth2/token") //
.setGrantType(GrantType.AUTHORIZATION_CODE) //
.setClientId(clientId) //
.setClientSecret(clientSecret) //
.setRedirectURI(REDIRECT_URI) //
.setCode(code) //
.buildBodyMessage();
OAuthClient client = new OAuthClient(new URLConnectionClient());
OAuthJSONAccessTokenResponse oauthResponse = client.accessToken(oAuthRequest, OAuth.HttpMethod.POST);
We did do the first step (getting the 'code' as used in setCode(...)) using a localhost-redirect as displayed in https://support.exactonline.com/community/s/knowledge-base#All-All-DNO-Content-gettingstarted There we copy the code from the address-bar of our browser and store it in a place the next computer-step can read it again.
This happens because the code was copied from your browser's address bar. There you will find a URL-encoded version of the code (often recognizable by a '%21'), which will make the subsequent calls fail when passed verbatim into setCode.
Suggestion: URL-decode the value, or set up a small temporary localhost HTTP server using Undertow or the like to catch the code that was sent to your localhost URL:
Undertow server = Undertow.builder() //
.addHttpListener(7891, "localhost") //
.setHandler(new HttpHandler() {
@Override
public void handleRequest(final HttpServerExchange exchange) throws Exception {
String code = exchange.getQueryParameters().get("code").getFirst();
LOG.info("Received code: {}.", code);
LOG.info("Store code");
storeCode(code);
LOG.info("Code stored");
exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "text/plain");
exchange.getResponseSender().send( //
"Thanks for getting me the code: " + code + "\n" //
+ "Will store it for you and get the first refreshToken..." //
+ "Please have a look at " + OAUTH_STATE_INI
+ " for the new code & refreshToken in a minute" //
);
done.add("done");
}
}).build();
server.start();
NB: Do make sure the redirect URL is correct in your Exact-app-settings
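The first suggestion (URL-decoding a code copied from the address bar) could look roughly like the sketch below. The class name AuthCodeDecoder is made up for illustration; only java.net.URLDecoder from the JDK is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Oltu library or the Exact API.
final class AuthCodeDecoder {

    /** Turns a percent-encoded code (e.g. containing "%21") back into its raw form. */
    static String decode(String codeFromAddressBar) {
        try {
            // Note: URLDecoder also converts '+' into a space, which is what
            // query-string decoding expects.
            return URLDecoder.decode(codeFromAddressBar, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

The decoded value is what would then be passed to setCode(...), for example setCode(AuthCodeDecoder.decode(code)).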

Android & NodeMCU, receiving response from server does not work properly?

I have written an Android application that sends simple requests (using Volley) to a server. The server runs on a NodeMCU (ESP8266) microcontroller and is written in Lua. The problem is that after sending the request, the application is not always able to print the response. If the address is e.g. "http://www.google.com", it correctly sends the request and receives and displays the response, but if it is the address from the code below, it correctly sends the request (the server reacts) but does not (?) receive the response (it displays "That didn't work!" instead). Do you have any ideas how I can fix it and print the response?
Android (part responsible for sending requests):
buttonSynchro.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View view) {
// Instantiate the RequestQueue.
String url = "http://192.168.1.12/";
// Request a string response from the provided URL.
StringRequest stringRequest = new StringRequest(Request.Method.GET, url,
new Response.Listener<String>() {
@Override
public void onResponse(String response) {
// Display the first 500 characters of the response string.
testTextView.setText("Response is: "+ response.substring(0,500));
}
}, new Response.ErrorListener() {
@Override
public void onErrorResponse(VolleyError error) {
testTextView.setText("That didn't work!");
}
});
// Add the request to the RequestQueue.
RequestQueue queue = Volley.newRequestQueue(SettingsActivity.this);
queue.add(stringRequest);
}
});
NodeMCU, Lua:
station_cfg={}
station_cfg.ssid="Dom"
station_cfg.pwd="lalala"
wifi.sta.config(station_cfg)
function receive(conn, request)
print(request)
print()
local buf = "";
buf = buf.."<!doctype html><html>";
buf = buf.."<h1> ESP8266 Web Server</h1>";
buf = buf.."</html>";
conn:send(buf);
conn:on("sent", function(sck) sck:close() end);
collectgarbage();
end
function connection(conn)
conn:on("receive", receive)
end
srv=net.createServer(net.TCP, 30)
srv:listen(80, connection)
The code by nPn works in some user agents (Chrome/Firefox/curl/wget on macOS) but not in others (Safari on macOS & iOS, Firefox Klar on iOS). That is likely due to missing HTTP headers.
I advise you stick to the example we have in our documentation at https://nodemcu.readthedocs.io/en/latest/en/modules/net/#netsocketsend.
srv = net.createServer(net.TCP)
function receiver(sck, data)
print(data)
print()
-- if you're sending back HTML over HTTP you'll want something like this instead
local response = {"HTTP/1.0 200 OK\r\nServer: NodeMCU on ESP8266\r\nContent-Type: text/html\r\n\r\n"}
response[#response + 1] = "<!doctype html><html>"
response[#response + 1] = "<h1> ESP8266 Web Server</h1>"
response[#response + 1] = "</html>"
-- sends and removes the first element from the 'response' table
local function send(localSocket)
if #response > 0 then
localSocket:send(table.remove(response, 1))
else
localSocket:close()
response = nil
end
end
-- triggers the send() function again once the first chunk of data was sent
sck:on("sent", send)
send(sck)
end
srv:listen(80, function(conn)
conn:on("receive", receiver)
end)
Also, your code (and nPn's for that matter) makes assumptions about WiFi being available where it shouldn't.
wifi.sta.config(station_cfg) (with auto-connect=true) and wifi.sta.connect are asynchronous and thus non-blocking, as are many other NodeMCU APIs. Hence, you should put the above code into a function and only call it once the device is connected to the AP and has got an IP. You do that by e.g. registering a callback for the STA_GOT_IP event with the WiFi event monitor. You'll find a very elaborate example of a boot sequence that listens to all WiFi events at https://nodemcu.readthedocs.io/en/latest/en/upload/#initlua. For starters you may want to trim this and only listen for got-IP.
Based on your comment above and the link you posted showing the traceback, your android app is crashing in the onResponse() method because you are asking for a substring longer than the actual string length.
You can fix this in a number of ways, but one would be to make the ending index be the minimum of the length of the response and 500 (which I assume is the max you can take in your TextView?). You can try changing
testTextView.setText("Response is: "+ response.substring(0,500));
to
testTextView.setText("Response is: "+ response.substring(0, Math.min(response.length(), 500)));
or whatever other way you think is more appropriate to limit the length of the response that does not cause the IndexOutOfBoundsException
See the substring method here
public String substring(int beginIndex,
int endIndex)
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.
Examples:
"hamburger".substring(4, 8) returns "urge"
"smiles".substring(1, 5) returns "mile"
Parameters:
beginIndex - the beginning index, inclusive.
endIndex - the ending index, exclusive.
Returns:
the specified substring.
Throws:
IndexOutOfBoundsException - if the beginIndex is negative, or endIndex is larger than the length of this String object, or beginIndex is larger than endIndex.
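As a minimal sketch of the fix described above, the length check can live in a small helper so that a short response never triggers the IndexOutOfBoundsException. The class and method names (ResponsePreview, preview) are invented for this example.

```java
// Hypothetical helper, not part of Volley or the asker's app.
final class ResponsePreview {

    /** Returns at most maxChars characters from the start of the response. */
    static String preview(String response, int maxChars) {
        if (response == null) {
            return "";
        }
        // Clamp the end index to the actual length of the string.
        return response.substring(0, Math.min(response.length(), maxChars));
    }
}
```

Inside onResponse this would be used as testTextView.setText("Response is: " + ResponsePreview.preview(response, 500));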
I am not a Lua expert, but I think you are registering your "sent" callback after you send the response.
I think you should move it into the connection function:
station_cfg={}
station_cfg.ssid="Dom"
station_cfg.pwd="lalala"
wifi.sta.config(station_cfg)
function receive(conn, request)
print(request)
print()
local buf = "";
buf = buf.."<!doctype html><html>";
buf = buf.."<h1> ESP8266 Web Server</h1>";
buf = buf.."</html>";
conn:send(buf);
collectgarbage();
end
function connection(conn)
conn:on("receive", receive)
conn:on("sent", function(sck) sck:close() end);
end
srv=net.createServer(net.TCP, 30)
srv:listen(80, connection)

partial range requests from chrome causing error

I have tried to implement range-request video playback on a system that has a web servlet and UI, which forwards the range request from Chrome (starting with Range: bytes=0-) to the backend data server. So far I have been sending the full stream, as I was under the impression that the Jetty server handles the response range. It works for the first and second requests but then fails, as the next request has a range start that is lower than the previous one.
(Request 1) Range:bytes=0-
(Response 1) Accept-Ranges:bytes
Content-Length:6748748
Content-Range:bytes 0-10005/6748748
(Request 2) Range:bytes=6717440-
(Response 2) Accept-Ranges:bytes
Content-Length:6748748
Content-Range:bytes 6717440-6718465/6748748
(Request 3) Range:bytes=3932160-
(Response 3) Accept-Ranges:bytes
Content-Length:6748748
Content-Range:bytes 3932160-3933185/6748748
(Request 4) Range:bytes=5701632-
(Response 4) Fails -
Can anyone make sense of this? With short videos this does not occur, so is there some timeout issue but then why is the chrome request with a smaller range? This is what I specify in the headers but again do not explicitly send the bytes as requested as I thought jetty handles it.
if(inputStream != null) {
if (parameters.containsKey("Range")) {
String range =parameters.get("Range").toString();
String[] ranges = range.split("=")[1].split("-");
final int from = Integer.parseInt(ranges[0]);
if(parameters.containsKey("Content-Length")) {
int sLength = (int) parameters.get("Content-Length");
int to = 10005 + from;
if (to >= sLength) {
to = (int) (sLength - 1);
}
if (ranges.length == 2) {
to = Integer.parseInt(ranges[1]);
}
final String responseRange = String.format("bytes %d-%d/%d", from, to, sLength);
parameters.put("Responserange", responseRange);
}
}
EDIT:
My logs in the dataserver side show the following consistently with each request being handled:
java.nio.channels.ClosedChannelException,
Added additional stack trace
java.nio.channels.ClosedChannelException
at org.eclipse.jetty.util.IteratingCallback.close(IteratingCallback.java:427)
at org.eclipse.jetty.server.HttpConnection.onClose(HttpConnection.java:489)
at org.eclipse.jetty.io.ssl.SslConnection.onClose(SslConnection.java:217)

java.io.IOException: Incomplete parts with embedded Jetty Server

I'm programming a little file server which gets documents via HTTP POST requests from another piece of software.
The requests are always of type "multipart/form-data", so I'd like to split them via .getParts();
Unfortunately I always get a "java.io.IOException: Incomplete parts", or the part is not found.
Is there something wrong with my code or is there a problem with the request?
I'm using an embedded Jetty server with Eclipse
public void create_document() {
String lv_path = gr_request.getParameter("contRep") + File.separator + gr_request.getParameter("docId");
Part lr_part = null;
try {
System.out.println(gr_request.getContentType());
//for testing
Part lr_test = gr_request.getPart("data");
System.out.println("1");
System.out.println(lr_test);
//the actual part
Collection<Part> lr_parts = gr_request.getParts();
for (Iterator<Part> i = lr_parts.iterator(); i.hasNext();) {
lr_part = ((Iterator<Part>) lr_parts).next();
//again for testing
System.out.println("content Type" + lr_part.getContentType());
System.out.println("name" + lr_part.getName());
System.out.println("content Type" + lr_part.getContentType());
String test = lv_path + ".jpg";
lr_part.write(test);
the log is
2017-11-28 11:07:47.941:INFO:oejs.Server:main: jetty-9.0.4.v20130625
2017-11-28 11:07:48.222:INFO:oejs.ServerConnector:main: Started ServerConnector@7165cbeb{HTTP/1.1}{0.0.0.0:1090}
Erkannte Aktion: CREATE_DOCUMENT
2017-11-28 11:07:54.469:WARN:oejs.Request:qtp424058530-15: java.io.IOException: Incomplete parts
multipart/form-data; boundary=KoZIhvcNAQcB
1
null
The MultiPartConfig was done by
MultipartConfigElement multipartConfigElement = new MultipartConfigElement((String)null);
ir_request.setAttribute(Request.__MULTIPART_CONFIG_ELEMENT, multipartConfigElement);
Beginning of the body of a transmitted PDF file:
--KoZIhvcNAQcB
Content-Disposition: form-data; filename="data"
X-compId: data
Content-Type: application/pdf
Content-Length: 182370
%PDF-1.7
%µµµµ
1 0 obj
...and so on...
182188
%%EOF
--KoZIhvcNAQcB--
It seems that there is a problem with the request.
I changed the "filename" tag to "name" while receiving the request.
Now it's running

MD5 calculation for multipart amazon s3 uploading. android/java [duplicate]

Files uploaded to Amazon S3 that are smaller than 5GB have an ETag that is simply the MD5 hash of the file, which makes it easy to check if your local files are the same as what you put on S3.
But if your file is larger than 5GB, then Amazon computes the ETag differently.
For example, I did a multipart upload of a 5,970,150,664 byte file in 380 parts. Now S3 shows it to have an ETag of 6bcf86bed8807b8e78f0fc6e0a53079d-380. My local file has an md5 hash of 702242d3703818ddefe6bf7da2bed757. I think the number after the dash is the number of parts in the multipart upload.
I also suspect that the new ETag (before the dash) is still an MD5 hash, but with some meta data included along the way from the multipart upload somehow.
Does anyone know how to compute the ETag using the same algorithm as Amazon S3?
Say you uploaded a 14MB file to a bucket without server-side encryption, and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. MD5 checksums are often printed as hex representations of binary data, so make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.
Here are the commands to do it on Mac OS X from the console:
$ dd bs=1m count=5 skip=0 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019611 secs (267345449 bytes/sec)
$ dd bs=1m count=5 skip=5 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019182 secs (273323380 bytes/sec)
$ dd bs=1m count=5 skip=10 if=someFile | md5 >>checksums.txt
2+1 records in
2+1 records out
2599812 bytes transferred in 0.011112 secs (233964895 bytes/sec)
At this point all the checksums are in checksums.txt. To concatenate them and decode the hex and get the MD5 checksum of the lot, just use
$ xxd -r -p checksums.txt | md5
And now append "-3" to get the ETag, since there were 3 parts.
Notes
If you uploaded with aws-cli via aws s3 cp then you most likely have an 8MB chunk size. According to the docs, that is the default.
If the bucket has server-side encryption (SSE) turned on, the ETag won't be the MD5 checksum (see the API documentation). But if you're just trying to verify that an uploaded part matches what you sent, you can use the Content-MD5 header and S3 will compare it for you.
md5 on macOS just writes out the checksum, but md5sum on Linux/brew also outputs the filename. You'll need to strip that, but I'm sure there's some option to only output the checksums. You don't need to worry about whitespace cause xxd will ignore it.
Code Links
A Gist I wrote with a working script for macOS.
The project at s3md5.
Based on answers here, I wrote a Python implementation which correctly calculates both multi-part and single-part file ETags.
def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
md5s = []
with open(file_path, 'rb') as fp:
while True:
data = fp.read(chunk_size)
if not data:
break
md5s.append(hashlib.md5(data))
if len(md5s) < 1:
return '"{}"'.format(hashlib.md5().hexdigest())
if len(md5s) == 1:
return '"{}"'.format(md5s[0].hexdigest())
digests = b''.join(m.digest() for m in md5s)
digests_md5 = hashlib.md5(digests)
return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))
The default chunk_size is 8 MB used by the official aws cli tool, and it does multipart upload for 2+ chunks. It should work under both Python 2 and 3.
bash implementation
python implementation
The algorithm literally is (copied from the readme in the python implementation) :
md5 the chunks
glob the md5 strings together
convert the glob to binary
md5 the binary of the globbed chunk md5s
append "-Number_of_chunks" to the end of the md5 string of the binary
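The steps listed above can also be sketched in Java using only the JDK. This is an illustration that mirrors the Python function shown earlier, not AWS's own code: the class name S3ETagSketch and its methods are hypothetical, the part size must match the one used for the upload, and unlike the Python version the result is returned without surrounding quotes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the AWS SDK.
final class S3ETagSketch {

    /** MD5 for a single part, or "md5-of-part-md5s" + "-" + partCount for several parts. */
    static String etag(InputStream in, int partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            ByteArrayOutputStream partDigests = new ByteArrayOutputStream();
            byte[] buffer = new byte[partSize];
            int parts = 0;
            int filled;
            while ((filled = fill(in, buffer)) > 0) {
                md5.update(buffer, 0, filled);
                partDigests.write(md5.digest()); // raw 16-byte digest; digest() resets md5
                parts++;
            }
            if (parts == 0) {
                return hex(md5.digest()); // MD5 of empty input
            }
            if (parts == 1) {
                return hex(partDigests.toByteArray()); // plain MD5 of the whole object
            }
            // MD5 over the concatenated binary digests, then "-<number of parts>".
            return hex(md5.digest(partDigests.toByteArray())) + "-" + parts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Reads until the buffer is full or the stream ends; returns the bytes read. */
    private static int fill(InputStream in, byte[] buffer) throws IOException {
        int filled = 0;
        while (filled < buffer.length) {
            int n = in.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        return filled;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

A call such as S3ETagSketch.etag(new FileInputStream(file), 8 * 1024 * 1024) would correspond to the 8 MB default mentioned for the aws cli; other tools may use other part sizes.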
Here's yet another piece in this crazy AWS challenge puzzle.
FWIW, this answer assumes you already have figured out how to calculate the "MD5 of MD5 parts" and can rebuild your AWS Multi-part ETag from all the other answers already provided here.
What this answer addresses is the annoyance of having to "guess" or otherwise "divine" the original upload part size.
We use several different tools for uploading to S3 and they all seem to have different upload part sizes, so "guessing" really wasn't an option. Also, we have a lot of files that were historically uploaded when part sizes seemed to be different. Also, the old trick of using an internal server copy to force the creation of an MD5-type ETag also no longer works as AWS has changed their internal server copies to also use multi-part (just with a fairly large part size).
So...
How can you figure out the object's part size?
Well, if you first make a head_object request and detect that the ETag is a multi-part type ETag (includes a '-<partcount>' at the end), then you can make another head_object request, but with an additional part_number attribute of 1 (the first part). This follow-on head_object request will then return the content_length of the first part. Voilà... Now you know the part size that was used, and you can use that size to re-create your local ETag, which should match the S3 ETag created when the object was uploaded.
Additionally, if you wanted to be exact (perhaps some multi-part uploads used variable part sizes), then you could continue to make head_object requests with each part_number specified and calculate each part's MD5 using the returned part's content_length.
Hope that helps...
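A minimal sketch of that two-request approach, assuming a client object that behaves like a boto3 S3 client (its `head_object` accepts `Bucket`, `Key` and an optional `PartNumber`). The function name `first_part_size` and the `FakeS3Client` stand-in are hypothetical; the fake only exists so the example runs without AWS credentials.

```python
def first_part_size(client, bucket, key):
    """Return the size in bytes of part 1, or None if the ETag is not multipart."""
    head = client.head_object(Bucket=bucket, Key=key)
    etag = head["ETag"].strip('"')
    if "-" not in etag:
        return None  # plain ETag: there is no part size to discover
    part = client.head_object(Bucket=bucket, Key=key, PartNumber=1)
    return part["ContentLength"]


class FakeS3Client:
    """Hypothetical stand-in for a real S3 client, returning canned responses."""

    def head_object(self, Bucket, Key, PartNumber=None):
        etag = '"3f92dffef0a11d175e60fb8b958b4e6e-2"'
        if PartNumber is None:
            # Whole object: 12 MiB in total
            return {"ETag": etag, "ContentLength": 12 * 1024 * 1024}
        # Any single part: 8 MiB (only part 1 is requested here)
        return {"ETag": etag, "ContentLength": 8 * 1024 * 1024}


print(first_part_size(FakeS3Client(), "my-bucket", "Foo.mpg.gpg"))  # 8388608
```

With a real client you would pass `boto3.client('s3')` instead of the fake and feed the returned size into whichever ETag calculation you are using.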
Not sure if it can help:
We're currently doing an ugly (but so far useful) hack to fix those wrong ETags in multipart-uploaded files, which consists of applying a change to the file in the bucket; that triggers an MD5 recalculation by Amazon that changes the ETag to match the actual MD5 signature.
In our case:
File: bucket/Foo.mpg.gpg
ETag obtained: "3f92dffef0a11d175e60fb8b958b4e6e-2"
Do something with the file (rename it, add a meta-data like a fake header, among others)
Etag obtained: "c1d903ca1bb6dc68778ef21e74cc15b0"
We don't know the algorithm, but since we can "fix" the ETag we don't need to worry about it either.
Same algorithm, Java version:
(BaseEncoding, Hasher, Hashing, etc. come from the Guava library.)
import java.util.List;
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import com.google.common.io.BaseEncoding;

/**
 * Generate the checksum for an object that came from a multipart upload.
 * <p>
 * AWS S3 spec: Entity tag that identifies the newly created object's data. Objects with different object data will have different entity tags. The entity tag is an opaque string. The entity tag may or may not be an MD5 digest of the object data. If the entity tag is not an MD5 digest of the object data, it will contain one or more nonhexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
 * <p>
 * Algorithm follows AWS S3 implementation: https://github.com/Teachnova/s3md5
 */
private static String calculateChecksumForMultipartUpload(List<String> md5s) {
    // Join the per-part MD5 hex strings into one long hex string
    StringBuilder stringBuilder = new StringBuilder();
    for (String md5 : md5s) {
        stringBuilder.append(md5);
    }
    String hex = stringBuilder.toString();
    // base16() expects upper-case hex digits
    byte[] raw = BaseEncoding.base16().decode(hex.toUpperCase());
    Hasher hasher = Hashing.md5().newHasher();
    hasher.putBytes(raw);
    String digest = hasher.hash().toString();
    return digest + "-" + md5s.size();
}
According to the AWS documentation the ETag isn't an MD5 hash for a multi-part upload nor for an encrypted object: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
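A small helper for telling these ETag shapes apart might look like the sketch below. The name `classify_etag` and its labels are invented for illustration. Note that the shape alone cannot confirm a 32-hex-digit ETag really is the MD5 of the data (for example with SSE-C or SSE-KMS, as quoted above), so the first label is deliberately "md5-shaped" rather than "md5".

```python
import re

_MD5_SHAPED = re.compile(r"[0-9a-f]{32}")
_MULTIPART = re.compile(r"[0-9a-f]{32}-(\d+)")


def classify_etag(etag):
    """Return (kind, part_count) where kind is 'md5-shaped', 'multipart' or 'opaque'."""
    value = etag.strip('"').lower()  # ETags usually arrive wrapped in double quotes
    if _MD5_SHAPED.fullmatch(value):
        return ("md5-shaped", None)
    match = _MULTIPART.fullmatch(value)
    if match:
        return ("multipart", int(match.group(1)))
    return ("opaque", None)


print(classify_etag('"3f92dffef0a11d175e60fb8b958b4e6e-2"'))  # ('multipart', 2)
print(classify_etag('"c1d903ca1bb6dc68778ef21e74cc15b0"'))    # ('md5-shaped', None)
```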
In an above answer, someone asked if there was a way to get the md5 for files larger than 5G.
An answer that I could give for getting the MD5 value (for files larger than 5G) would be to either add it manually to the metadata, or use a program to do your uploads which will add the information.
For example, I used s3cmd to upload a file, and it added the following metadata.
$ aws s3api head-object --bucket xxxxxxx --key noarch/epel-release-6-8.noarch.rpm
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Sat, 19 Sep 2015 03:27:25 GMT",
    "ContentLength": 14540,
    "ETag": "\"2cd0ae668a585a14e07c2ea4f264d79b\"",
    "Metadata": {
        "s3cmd-attrs": "uid:502/gname:staff/uname:xxxxxx/gid:20/mode:33188/mtime:1352129496/atime:1441758431/md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"
    }
}
It isn't a direct solution using the ETag, but it is a way to populate the metadata you want (MD5) in a way you can access it. It will still fail if someone uploads the file without metadata.
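Reading the MD5 back out of that metadata could be sketched as follows. The helper name `md5_from_s3cmd_attrs` is made up, and the parsing assumes the `s3cmd-attrs` value is a list of `name:value` pairs separated by `/`, which is only inferred from the sample output above.

```python
def md5_from_s3cmd_attrs(metadata):
    """Extract the md5 field from an object's metadata dict, or return None."""
    attrs = metadata.get("s3cmd-attrs")
    if not attrs:
        return None  # uploaded without s3cmd, or metadata stripped
    for field in attrs.split("/"):
        name, sep, value = field.partition(":")
        if sep and name == "md5":
            return value
    return None


sample_metadata = {
    "s3cmd-attrs": "uid:502/gname:staff/uname:xxxxxx/gid:20/mode:33188/"
                   "mtime:1352129496/atime:1441758431/"
                   "md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"
}
print(md5_from_s3cmd_attrs(sample_metadata))  # 2cd0ae668a585a14e07c2ea4f264d79b
```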
Here is the algorithm in ruby...
require 'digest'

# PART_SIZE should match the chosen part size of the multipart upload
# Set here as 10MB
PART_SIZE = 1024*1024*10

class File
  def each_part(part_size = PART_SIZE)
    yield read(part_size) until eof?
  end
end

file = File.new('<path_to_file>')
hashes = []
file.each_part do |part|
  hashes << Digest::MD5.hexdigest(part)
end
multipart_hash = Digest::MD5.hexdigest([hashes.join].pack('H*'))
multipart_etag = "#{multipart_hash}-#{hashes.count}"
Thanks to Shortest Hex2Bin in Ruby and Multipart Uploads to S3 ...
node.js implementation -
const fs = require('fs');
const crypto = require('crypto');

const chunk = 1024 * 1024 * 5; // 5MB

const md5 = data => crypto.createHash('md5').update(data).digest('hex');

const getEtagOfFile = (filePath) => {
  const stream = fs.readFileSync(filePath);
  if (stream.length <= chunk) {
    return md5(stream);
  }
  const md5Chunks = [];
  const chunksNumber = Math.ceil(stream.length / chunk);
  for (let i = 0; i < chunksNumber; i++) {
    const chunkStream = stream.slice(i * chunk, (i + 1) * chunk);
    md5Chunks.push(md5(chunkStream));
  }
  return `${md5(Buffer.from(md5Chunks.join(''), 'hex'))}-${chunksNumber}`;
};
And here is a PHP version of calculating the ETag:
function calculate_aws_etag($filename, $chunksize) {
/*
DESCRIPTION:
- calculate Amazon AWS ETag used on the S3 service
INPUT:
- $filename : path to file to check
- $chunksize : chunk size in Megabytes
OUTPUT:
- ETag (string)
*/
$chunkbytes = $chunksize*1024*1024;
if (filesize($filename) < $chunkbytes) {
return md5_file($filename);
} else {
$md5s = array();
$handle = fopen($filename, 'rb');
if ($handle === false) {
return false;
}
while (!feof($handle)) {
$buffer = fread($handle, $chunkbytes);
$md5s[] = md5($buffer);
unset($buffer);
}
fclose($handle);
$concat = '';
foreach ($md5s as $indx => $md5) {
$concat .= hex2bin($md5);
}
return md5($concat) .'-'. count($md5s);
}
}
$etag = calculate_aws_etag('path/to/myfile.ext', 8);
And here is an enhanced version that can verify against an expected ETag - and even guess the chunksize if you don't know it!
function calculate_etag($filename, $chunksize, $expected = false) {
/*
DESCRIPTION:
- calculate Amazon AWS ETag used on the S3 service
INPUT:
- $filename : path to file to check
- $chunksize : chunk size in Megabytes
- $expected : verify calculated etag against this specified etag and return true or false instead
- if you make chunksize negative (eg. -8 instead of 8) the function will guess the chunksize by checking all possible sizes given the number of parts mentioned in $expected
OUTPUT:
- ETag (string)
- or boolean true|false if $expected is set
*/
if ($chunksize < 0) {
$do_guess = true;
$chunksize = 0 - $chunksize;
} else {
$do_guess = false;
}
$chunkbytes = $chunksize*1024*1024;
$filesize = filesize($filename);
if ($filesize < $chunkbytes && (!$expected || !preg_match("/^\\w{32}-\\w+$/", $expected))) {
$return = md5_file($filename);
if ($expected) {
$expected = strtolower($expected);
return ($expected === $return ? true : false);
} else {
return $return;
}
} else {
$md5s = array();
$handle = fopen($filename, 'rb');
if ($handle === false) {
return false;
}
while (!feof($handle)) {
$buffer = fread($handle, $chunkbytes);
$md5s[] = md5($buffer);
unset($buffer);
}
fclose($handle);
$concat = '';
foreach ($md5s as $indx => $md5) {
$concat .= hex2bin($md5);
}
$return = md5($concat) .'-'. count($md5s);
if ($expected) {
$expected = strtolower($expected);
$matches = ($expected === $return ? true : false);
if ($matches || $do_guess == false || strlen($expected) == 32) {
return $matches;
} else {
// Guess the chunk size
preg_match("/-(\\d+)$/", $expected, $match);
$parts = $match[1];
$min_chunk = ceil($filesize / $parts /1024/1024);
$max_chunk = floor($filesize / ($parts-1) /1024/1024);
$found_match = false;
for ($i = $min_chunk; $i <= $max_chunk; $i++) {
if (calculate_aws_etag($filename, $i) === $expected) {
$found_match = true;
break;
}
}
return $found_match;
}
} else {
return $return;
}
}
}
The short answer is that you take the 128bit binary md5 digest of each part, concatenate them into a document, and hash that document. The algorithm presented in this answer is accurate.
Note: the multipart ETAG form with the hyphen will change to the form without the hyphen if you "touch" the blob (even without modifying the content). That is, if you copy, or do an in-place copy of your completed multipart-uploaded object (aka PUT-COPY), S3 will recompute the ETAG with the simple version of the algorithm. i.e. the destination object will have an etag without the hyphen.
You've probably considered this already, but if your files are less than 5GB, and you already know their MD5s, and upload parallelization provides little to no benefit (e.g. you are streaming the upload from a slow network, or uploading from a slow disk), then you may also consider using a simple PUT instead of a multipart PUT, and pass your known Content-MD5 in your request headers -- amazon will fail the upload if they don't match. Keep in mind that you get charged for each UploadPart.
Furthermore, in some clients, passing a known MD5 for the input of a PUT operation will save the client from recomputing the MD5 during the transfer. In boto3 (python), you would use the ContentMD5 parameter of the client.put_object() method, for instance. If you omit the parameter, and you already knew the MD5, then the client would be wasting cycles computing it again before the transfer.
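One detail that is easy to trip over: the Content-MD5 value is the base64 encoding of the 16-byte binary digest, not the familiar hex string. A minimal sketch of producing it, either from the data or from a hex MD5 you already know (both helper names are hypothetical):

```python
import base64
import hashlib


def content_md5_header(data: bytes) -> str:
    """Base64 of the binary MD5 digest of data, as used in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def content_md5_from_hex(hex_md5: str) -> str:
    """Convert an already known hex MD5 into the base64 form without rehashing."""
    return base64.b64encode(bytes.fromhex(hex_md5)).decode("ascii")


body = b"hello world"
known_hex = hashlib.md5(body).hexdigest()  # pretend this was computed earlier
print(content_md5_from_hex(known_hex) == content_md5_header(body))  # True
```

The resulting string is what you would pass as the ContentMD5 parameter mentioned above.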
Working algorithm implemented in Node.js (TypeScript).
/**
* Generate an S3 ETAG for multipart uploads in Node.js
* An implementation of this algorithm: https://stackoverflow.com/a/19896823/492325
* Author: Richard Willis <willis.rh@gmail.com>
*/
import fs from 'node:fs';
import crypto, { BinaryLike } from 'node:crypto';
const defaultPartSizeInBytes = 5 * 1024 * 1024; // 5MB
function md5(contents: string | BinaryLike): string {
return crypto.createHash('md5').update(contents).digest('hex');
}
export function getS3Etag(
filePath: string,
partSizeInBytes = defaultPartSizeInBytes
): string {
const { size: fileSizeInBytes } = fs.statSync(filePath);
let parts = Math.floor(fileSizeInBytes / partSizeInBytes);
if (fileSizeInBytes % partSizeInBytes > 0) {
parts += 1;
}
const fileDescriptor = fs.openSync(filePath, 'r');
let totalMd5 = '';
for (let part = 0; part < parts; part++) {
const skipBytes = partSizeInBytes * part;
const totalBytesLeft = fileSizeInBytes - skipBytes;
const bytesToRead = Math.min(totalBytesLeft, partSizeInBytes);
const buffer = Buffer.alloc(bytesToRead);
fs.readSync(fileDescriptor, buffer, 0, bytesToRead, skipBytes);
totalMd5 += md5(buffer);
}
const combinedHash = md5(Buffer.from(totalMd5, 'hex'));
const etag = `${combinedHash}-${parts}`;
return etag;
}
I've published this to npm
npm install s3-etag
import { generateETag } from 's3-etag';
const etag = generateETag(absoluteFilePath, partSizeInBytes);
View project here: https://github.com/badsyntax/s3-etag
A version in Rust:
use crypto::digest::Digest;
use crypto::md5::Md5;
use std::fs::File;
use std::io::prelude::*;
use std::io::Result;
use std::iter::repeat;
fn calculate_etag_from_read(f: &mut dyn Read, chunk_size: usize) -> Result<String> {
let mut md5 = Md5::new();
let mut concat_md5 = Md5::new();
let mut input_buffer = vec![0u8; chunk_size];
let mut chunk_count = 0;
let mut current_md5: Vec<u8> = repeat(0).take((md5.output_bits() + 7) / 8).collect();
let md5_result = loop {
let amount_read = f.read(&mut input_buffer)?;
if amount_read > 0 {
md5.reset();
md5.input(&input_buffer[0..amount_read]);
chunk_count += 1;
md5.result(&mut current_md5);
concat_md5.input(&current_md5);
} else {
if chunk_count > 1 {
break format!("{}-{}", concat_md5.result_str(), chunk_count);
} else {
break md5.result_str();
}
}
};
Ok(md5_result)
}
fn calculate_etag(file: &String, chunk_size: usize) -> Result<String> {
let mut f = File::open(file)?;
calculate_etag_from_read(&mut f, chunk_size)
}
See a repo with a simple implementation: https://github.com/bn3t/calculate-etag/tree/master
Regarding chunk size, I noticed that it seems to depend on the number of parts.
The maximum number of parts is 10000, as AWS documents.
So starting from a default of 8 MB and knowing the file size, the chunk size and number of parts can be calculated as follows:
import math
import os

chunk_size = 8 * 1024 * 1024
flsz = os.path.getsize(fl)  # fl is the path to the local file
# Double the chunk size until the file fits in at most 10000 parts
while flsz / chunk_size > 10000:
    chunk_size *= 2
parts = math.ceil(flsz / chunk_size)
The number of parts has to be rounded up.
Extending Timothy Gonzalez's answer:
Identical files will have different ETags when using multipart upload.
It's easy to test it with WinSCP, because it uses multipart upload.
When I upload multiple identical copies of the same file to S3 via WinSCP, each one has a different ETag. When I download them and calculate the MD5, they are still identical.
So from what I tested, different ETags don't mean that the files are different.
I see no alternative way to obtain any hash for S3 files without downloading them first.
This is true for multipart uploads. For not-multipart it should still be possible to calculate etag locally.
I have a solution for iOS and macOS without using external helpers like dd and xxd. I have just found it, so I report it as it is, planning to improve it at a later stage. For the moment, it relies on both Objective-C and Swift code. First of all, create this helper class in Objective-C:
AWS3MD5Hash.h
#import <Foundation/Foundation.h>
NS_ASSUME_NONNULL_BEGIN
@interface AWS3MD5Hash : NSObject
- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb;
- (NSData *)dataFromBigData:(NSData *)theData startingOnByte:(UInt64)startByte length:(UInt64)length;
- (NSData *)dataFromHexString:(NSString *)sourceString;
@end
NS_ASSUME_NONNULL_END
AWS3MD5Hash.m
#import "AWS3MD5Hash.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define SIZE 256
@implementation AWS3MD5Hash
- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb {
char *buffer = malloc(length);
NSURL *fileURL = [NSURL fileURLWithPath:path];
NSNumber *fileSizeValue = nil;
NSError *fileSizeError = nil;
[fileURL getResourceValue:&fileSizeValue
forKey:NSURLFileSizeKey
error:&fileSizeError];
NSInteger __unused result = fseek(theFile,startByte,SEEK_SET);
if (result != 0) {
free(buffer);
return nil;
}
NSInteger result2 = fread(buffer, length, 1, theFile);
NSUInteger difference = fileSizeValue.integerValue - startByte;
NSData *toReturn;
if (result2 == 0) {
toReturn = [NSData dataWithBytes:buffer length:difference];
} else {
toReturn = [NSData dataWithBytes:buffer length:result2 * length];
}
free(buffer);
return toReturn;
}
- (NSData *)dataFromBigData:(NSData *)theData startingOnByte: (UInt64)startByte length:(UInt64)length {
NSUInteger fileSizeValue = theData.length;
NSData *subData;
if (startByte + length > fileSizeValue) {
subData = [theData subdataWithRange:NSMakeRange(startByte, fileSizeValue - startByte)];
} else {
subData = [theData subdataWithRange:NSMakeRange(startByte, length)];
}
return subData;
}
- (NSData *)dataFromHexString:(NSString *)string {
string = [string lowercaseString];
NSMutableData *data= [NSMutableData new];
unsigned char whole_byte;
char byte_chars[3] = {'\0','\0','\0'};
NSInteger i = 0;
NSInteger length = string.length;
while (i < length-1) {
char c = [string characterAtIndex:i++];
if (c < '0' || (c > '9' && c < 'a') || c > 'f')
continue;
byte_chars[0] = c;
byte_chars[1] = [string characterAtIndex:i++];
whole_byte = strtol(byte_chars, NULL, 16);
[data appendBytes:&whole_byte length:1];
}
return data;
}
@end
Now create a plain swift file:
AWS Extensions.swift
import UIKit
import CommonCrypto
extension URL {
func calculateAWSS3MD5Hash(_ numberOfParts: UInt64) -> String? {
do {
var fileSize: UInt64!
var calculatedPartSize: UInt64!
let attr:NSDictionary? = try FileManager.default.attributesOfItem(atPath: self.path) as NSDictionary
if let _attr = attr {
fileSize = _attr.fileSize();
if numberOfParts != 0 {
let partSize = Double(fileSize / numberOfParts)
var partSizeInMegabytes = Double(partSize / (1024.0 * 1024.0))
partSizeInMegabytes = ceil(partSizeInMegabytes)
calculatedPartSize = UInt64(partSizeInMegabytes)
if calculatedPartSize % 2 != 0 {
calculatedPartSize += 1
}
if numberOfParts == 2 || numberOfParts == 3 { // Very important when there are 2 or 3 parts, in the majority of times
// the calculatedPartSize is already 8. In the remaining cases we force it.
calculatedPartSize = 8
}
if mainLogToggling {
print("The calculated part size is \(calculatedPartSize!) Megabytes")
}
}
}
if numberOfParts == 0 {
let string = self.memoryFriendlyMd5Hash()
return string
}
let hasher = AWS3MD5Hash.init()
let file = fopen(self.path, "r")
defer { let result = fclose(file)}
var index: UInt64 = 0
var bigString: String! = ""
var data: Data!
while autoreleasepool(invoking: {
if index == (numberOfParts-1) {
if mainLogToggling {
//print("We are at the last line.")
}
}
data = hasher.data(from: file!, startingOnByte: index * calculatedPartSize * 1024 * 1024, length: calculatedPartSize * 1024 * 1024, filePath: self.path, singlePartSize: UInt(calculatedPartSize))
bigString = bigString + MD5.get(data: data) + "\n"
index += 1
if index == numberOfParts {
return false
}
return true
}) {}
let final = MD5.get(data :hasher.data(fromHexString: bigString)) + "-\(numberOfParts)"
return final
} catch {
}
return nil
}
func memoryFriendlyMd5Hash() -> String? {
let bufferSize = 1024 * 1024
do {
// Open file for reading:
let file = try FileHandle(forReadingFrom: self)
defer {
file.closeFile()
}
// Create and initialize MD5 context:
var context = CC_MD5_CTX()
CC_MD5_Init(&context)
// Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
while autoreleasepool(invoking: {
let data = file.readData(ofLength: bufferSize)
if data.count > 0 {
data.withUnsafeBytes {
_ = CC_MD5_Update(&context, $0, numericCast(data.count))
}
return true // Continue
} else {
return false // End of file
}
}) { }
// Compute the MD5 digest:
var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
digest.withUnsafeMutableBytes {
_ = CC_MD5_Final($0, &context)
}
let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
return hexDigest
} catch {
print("Cannot open file:", error.localizedDescription)
return nil
}
}
struct MD5 {
static func get(data: Data) -> String {
var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
let _ = data.withUnsafeBytes { bytes in
CC_MD5(bytes, CC_LONG(data.count), &digest)
}
var digestHex = ""
for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
digestHex += String(format: "%02x", digest[index])
}
return digestHex
}
// The following is a memory friendly version
static func get2(data: Data) -> String {
var currentIndex = 0
let bufferSize = 1024 * 1024
//var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
// Create and initialize MD5 context:
var context = CC_MD5_CTX()
CC_MD5_Init(&context)
while autoreleasepool(invoking: {
var subData: Data!
if (currentIndex + bufferSize) < data.count {
subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, bufferSize))!)
currentIndex = currentIndex + bufferSize
} else {
subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, data.count - currentIndex))!)
currentIndex = currentIndex + (data.count - currentIndex)
}
if subData.count > 0 {
subData.withUnsafeBytes {
_ = CC_MD5_Update(&context, $0, numericCast(subData.count))
}
return true
} else {
return false
}
}) { }
// Compute the MD5 digest:
var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
digest.withUnsafeMutableBytes {
_ = CC_MD5_Final($0, &context)
}
var digestHex = ""
for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
digestHex += String(format: "%02x", digest[index])
}
return digestHex
}
}
Now add:
#import "AWS3MD5Hash.h"
to your Objective-C Bridging header. You should be ok with this setup.
Example usage
To test this setup, you could be calling the following method inside the object that is in charge of handling the AWS connections:
func getMd5HashForFile() {
let credentialProvider = AWSCognitoCredentialsProvider(regionType: AWSRegionType.USEast2, identityPoolId: "<INSERT_POOL_ID>")
let configuration = AWSServiceConfiguration(region: AWSRegionType.APSoutheast2, credentialsProvider: credentialProvider)
configuration?.timeoutIntervalForRequest = 3.0
configuration?.timeoutIntervalForResource = 3.0
AWSServiceManager.default().defaultServiceConfiguration = configuration
AWSS3.register(with: configuration!, forKey: "defaultKey")
let s3 = AWSS3.s3(forKey: "defaultKey")
let headObjectRequest = AWSS3HeadObjectRequest()!
headObjectRequest.bucket = "<NAME_OF_YOUR_BUCKET>"
headObjectRequest.key = self.latestMapOnServer.key
let _: AWSTask? = s3.headObject(headObjectRequest).continueOnSuccessWith { (awstask) -> Any? in
let headObjectOutput: AWSS3HeadObjectOutput? = awstask.result
var ETag = headObjectOutput?.eTag!
// Here you should parse the returned Etag and extract the number of parts to provide to the helper function. Etags end with a "-" followed by the number of parts. If you don't see this format, then pass 0 as the number of parts.
ETag = ETag!.replacingOccurrences(of: "\"", with: "")
print("headObjectOutput.ETag \(ETag!)")
let mapOnDiskUrl = self.getMapsDirectory().appendingPathComponent(self.latestMapOnDisk!)
let hash = mapOnDiskUrl.calculateAWSS3MD5Hash(<Take the number of parts from the ETag returned by the server>)
if hash == ETag {
print("They are the same.")
}
print ("\(hash!)")
return nil
}
}
If the ETag returned by the server does not have "-" at the end of the ETag, just pass 0 to calculateAWSS3MD5Hash. Please comment if you encounter any problems. I am working on a swift only solution, I will update this answer as soon as I finish. Thanks
I just saw that the AWS S3 Console 'upload' uses an unusual part (chunk) size of 17,179,870 - at least for larger files.
Using that part size gave me the correct ETag hash using the methods described earlier. Thanks to @TheStoryCoder for the PHP version.
Thanks to @hans for his idea to use head-object to see the actual sizes of each part.
I used the AWS S3 Console (on Nov28 2020) to upload about 50 files ranging in size from 190MB to 2.3GB and all of them had the same part size of 17,179,870.
I liked Emerson's leading answer above - especially the xxd part - but I was too lazy to use dd so I went with split, guessing at an 8M chunk size because I uploaded with aws s3 cp:
$ split -b 8M large.iso XXX
$ md5sum XXX* > checksums.txt
$ sed -i 's/ .*$//' checksums.txt
$ xxd -r -p checksums.txt | md5sum
99a090df013d375783f0f0be89288529 -
$ wc -l checksums.txt
80 checksums.txt
$
It was immediately obvious that both parts of my S3 etag matched my file's calculated etag.
UPDATE:
This has been working nicely:
$ ll large.iso
-rw-rw-r-- 1 user user 669134848 Apr 12 2021 large.iso
$
$ etag large.iso
99a090df013d375783f0f0be89288529-80
$
$ type etag
etag is a function
etag ()
{
split -b 8M --filter=md5sum $1 | cut -d' ' -f1 | pee "xxd -r -p | md5sum | cut -d' ' -f1" "wc -l" | paste -d'-' - -
}
$
All the other answers assume a standard and regular part size. But that assumption may not be true. Across the console and various SDKs there are different defaults. And the low-level API does allow a lot of variety.
Complications:
S3 multi-part uploads can have parts of any size (within a min and max for non-last parts).
Even the non-last parts can be different sizes.
When you upload they don't have to be consecutive part numbers.
If you do a multi-part upload with only 1 part, the etag is the more complicated version, not the simple MD5
etags tend to be wrapped in double-quotes. I don't know why. But that's just a thing that might trip you up.
So we need to find out how many parts there are, and how big they are.
You cannot reliably get the part count from boto3's Object.parts_count attribute. I don't know if the same is true of other SDKs.
The get_object_attributes API documentation claims that it returns a list of parts and sizes. But when I tested those fields were missing. Even for multi-part uploads that were not completed.
Even if you assume equal part sizes (except the last part), you cannot deduce part size from content length and part count. e.g. if a 90MB file has 3 parts, was that 30MBx3, or 40MB+40MB+10MB?
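To make that ambiguity concrete, here is a rough sketch that lists every part size consistent with a given content length and part count. It rests on two assumptions that are not guaranteed: all parts except the last are the same size, and the part size is a whole number of MiB. The function name `candidate_part_sizes_mib` is made up for the example.

```python
MIB = 1024 * 1024


def candidate_part_sizes_mib(file_size: int, num_parts: int) -> range:
    """Whole-MiB part sizes p for which ceil(file_size / (p * MIB)) == num_parts."""
    if num_parts < 2:
        raise ValueError("need at least 2 parts to bound the part size")
    # Smallest p that fits the file into num_parts parts (ceiling division)
    smallest = -(-file_size // (num_parts * MIB))
    # Largest p that still needs more than num_parts - 1 parts
    largest = (file_size - 1) // ((num_parts - 1) * MIB)
    return range(smallest, largest + 1)


# A 90 MiB object in 3 parts: anything from 30 to 44 MiB is possible
print(list(candidate_part_sizes_mib(90 * MIB, 3)))

# A 669134848-byte object in 80 parts leaves only one whole-MiB candidate
print(list(candidate_part_sizes_mib(669134848, 80)))  # [8]
```

When the range holds several candidates you would still have to try each one against the ETag, and a part size that is not a whole number of MiB will not show up at all, which is why asking S3 for the real part sizes is more reliable.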
Let's assume that you have a local file and you want to check whether it matches the content of the object in S3.
(And assume that you've already checked whether the lengths differ, because that's a faster check.)
Here's a python3 script to do that. (I chose python just because that's what I'm familiar with.)
We use head_object to get the e-tag. With the e-tag we can deduce whether it was a single-part upload or multi-part, and how many parts.
We use head_object passing in PartNumber, calling that for each part, to get the length of each part. You could use multiprocessing to speed that up. (Noting that boto3's client should not be passed between processes.)
import boto3
from hashlib import md5


def content_matches(local_path, bucket, key) -> bool:
    client = boto3.client('s3')
    resp = client.head_object(Bucket=bucket, Key=key)
    remote_e_tag = resp['ETag']
    total_length = resp['ContentLength']
    if '-' not in remote_e_tag:
        # it was a single-part upload
        # you could read from the file in chunks to avoid loading the whole thing into memory
        # the chunks would not have to match any SDK standard. It can be whatever you want.
        # (The MD5 library will act as if you hashed in one go)
        with open(local_path, 'rb') as f:
            local_etag = f'"{md5(f.read()).hexdigest()}"'
        return local_etag == remote_e_tag
    else:
        # multi-part upload
        # to find the number of parts, get it from the e-tag
        # e.g. 123-56 has 56 parts
        num_parts = int(remote_e_tag.strip('"').split('-')[-1])
        print(f"Assuming {num_parts=} from {remote_e_tag=}")
        md5s = []
        with open(local_path, 'rb') as f:
            sz_read = 0
            for part_num in range(1, num_parts + 1):
                resp = client.head_object(Bucket=bucket, Key=key, PartNumber=part_num)
                sz_read += resp['ContentLength']
                local_data_part = f.read(resp['ContentLength'])
                assert len(local_data_part) == resp['ContentLength']  # sanity check
                md5s.append(md5(local_data_part))
        assert sz_read == total_length, "Sum of part sizes doesn't equal total file size"
        digests = b''.join(m.digest() for m in md5s)
        digests_md5 = md5(digests)
        local_etag = f'"{digests_md5.hexdigest()}-{len(md5s)}"'
        return remote_e_tag == local_etag
And a script to test it with all those edge cases:
import boto3
from pprint import pprint
from hashlib import md5
from main import content_matches

MB = 2 ** 20
bucket = 'mybucket'
key = 'test-multi-part-upload'
local_path = 'test-data'

# first upload the object
s3 = boto3.resource('s3')
obj = s3.Object(bucket, key)
mpu = obj.initiate_multipart_upload()
parts = []
part_sizes = [6 * MB, 5 * MB, 5]  # deliberately non-standard and not consistent
upload_part_nums = [1, 3, 8]  # test non-consecutive part numbers for upload
with open(local_path, 'wb') as fw:
    with open('/dev/random', 'rb') as fr:
        for (part_num, part_size) in zip(upload_part_nums, part_sizes):
            part = mpu.Part(part_num)
            data = fr.read(part_size)
            print(f"Uploading part {part_num}")
            resp = part.upload(Body=data)
            parts.append({
                'ETag': resp['ETag'],
                'PartNumber': part_num
            })
            fw.write(data)
resp = mpu.complete(MultipartUpload={
    'Parts': parts
})
obj.reload()
assert content_matches(local_path, bucket, key)
"#wim Any idea how to calculate the ETag when SSE is enabled?"
In my testing with multipart + SSE-C, the ETag is valid.
It can be calculated from the individual ETags returned for each part,
and this is easy to prove.
Let's say we have a multipart upload with SSE-C, with 10 parts.
Take the 10 ETags, put them in a file, and run "xxd -r -p checksums.txt | md5sum"; the calculated value will match the value returned from AWS.
etag parts
-------------------------------
1330e1275b556ab6702bca9438f62c15 -
ae55d3ddf52e33d45140a5be6dacb925 -
16dc956e05962b84ad9cd74a05e86797 -
64be66992a5110c4b1151a8249258a1a -
4926df0200fe24499524176d6a85e347 -
2b6655c3506481eb1fae6b2e2e7c4b8b -
a02e9dbd49039eaf4d6de1fddc5e1a30 -
afb7bc1f6e0c1f23671cb7116f3b0c63 -
dddf3a1ab192f26bb483a3e2778bab13 -
adb8b2b761640418856853f3810ac45a -
-------------------------------
etag_from_aws = c68db040f8a36c164259bcca40c36410-10
etag_calculated = c68db040f8a36c164259bcca40c36410-10
No.
So far there is no solution that matches a normal file's ETag, a multipart file's ETag, and the MD5 of the local file.
