There's a web page with a search engine:
http://www.nukat.edu.pl/cgi-bin/gw_48_1_12/chameleon?sessionid=2010010122520520752&skin=default&lng=pl&inst=consortium&search=KEYWORD&function=SEARCHSCR&SourceScreen=NOFUNC&elementcount=1&pos=1&submit=TabData
I want to use its search engine from a Java application.
Currently I'm trying to send a very simple request - only one field filled and no logical operators.
This is my code:
try {
URL url = new URL( nukatSearchUrl );
URLConnection urlConn = url.openConnection();
urlConn.setDoInput( true );
urlConn.setDoOutput( true );
urlConn.setUseCaches( false );
urlConn.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded" );
BufferedWriter out = new BufferedWriter( new OutputStreamWriter( urlConn.getOutputStream() ) );
String content = "t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
out.write( content );
out.flush();
out.close();
BufferedReader in = new BufferedReader( new InputStreamReader( urlConn.getInputStream() ) );
String rcv = null;
while ( ( rcv = in.readLine() ) != null ) {
System.out.println( rcv );
}
in.close();
} catch ( Exception ex ) {
throw new SearchEngineException( "NukatSearchEngine.search() : " + ex.getMessage() );
}
Unfortunately, what I keep getting is the main site. It looks like this:
<cant post the link to the main site :/>
Not the search results I'm expecting.
What could be wrong here?
I wouldn't go any further with this after reading BalusC's answer. Here are a few pointers, however, if you aren't worried about being blacklisted:
Set the User-Agent header to pretend to be a browser, for example:
urlConn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB6");
You can also simulate a human user in Firefox using Selenium WebDriver.
The URL may be wrong, or your request is likely incomplete. You need to check the HTML source (right-click the page > View Source), use the same URL as defined in the <form action>, and gather all request parameters (including those from hidden input fields and the button which you intend to "press"!) for use in your query string.
That said, doing so is in most cases a policy violation and may result in your IP being blacklisted. Please check their robots.txt and the "Terms of use", if any (I don't understand Polish). Their robots.txt at least says that everyone is disallowed from accessing the entire website programmatically. Use it at your own risk. You've been warned. Better to contact them, ask whether they have a public web service, and use that instead.
You can always spoof the User-Agent request header with a real-looking string extracted from a real web browser to minimize the risk of being recognized as a bot, as pointed out by Bozho here, but you can still get caught based on visitor patterns/statistics.
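As a rough illustration of the robots.txt check, here is a minimal sketch that tests whether a path is disallowed for all user agents. The class and method names are made up for this example, and it is deliberately simplified: it only understands User-agent and Disallow lines with plain prefix matching (no Allow rules, no wildcards), so treat it as a first sanity check rather than a complete parser.

```java
import java.util.Locale;

final class RobotsCheck {

    /**
     * Returns true if the group for "User-agent: *" disallows the given path.
     * Simplified sketch: prefix matching only, Allow lines and wildcards are ignored.
     */
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;   // are we inside a group that applies to "*"?
        boolean lastWasAgent = false;  // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = (lastWasAgent && inStarGroup) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (inStarGroup && field.equals("disallow")
                        && !value.isEmpty() && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

You would fetch the site's /robots.txt yourself and pass its text in, together with the path you intend to request (for example the path of the form's action URL).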
An easy way to see all activity that you need to replicate is the Live HTTP Headers Firefox Extension. To see all form elements on the page, Firebug is useful. Finally, I often use a fake server that I control to see what the browser is sending, and compare to my application. I rolled my own, just a small Java server that prints out everything sent to it - inverse telnet, if you will.
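The "fake server" idea can be sketched roughly as follows. This is not the server I rolled myself, just a minimal stand-in under a few assumptions: the class name RequestDump, the demo URL, and the demo User-Agent value are all made up, and it only reads the request line and headers (a POST body would need extra handling).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class RequestDump {

    /** Accepts one connection, returns its request line and headers, and answers with an empty 200. */
    static String captureOne(ServerSocket server) throws IOException {
        try (Socket socket = server.accept()) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder head = new StringBuilder();
            String line;
            // The request head ends at the first empty line; a request body is not read here.
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                head.append(line).append('\n');
            }
            OutputStream out = socket.getOutputStream();
            out.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return head.toString();
        }
    }

    /** Demo: point an HttpURLConnection at the fake server and return what it sent. */
    static String demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            server.setSoTimeout(5000); // do not wait forever if the client never connects
            final int port = server.getLocalPort();
            Thread client = new Thread(() -> {
                try {
                    URL url = new URL("http://127.0.0.1:" + port + "/search?t1=Duma+Key");
                    HttpURLConnection con = (HttpURLConnection) url.openConnection();
                    con.setRequestProperty("User-Agent", "demo-client"); // hypothetical value
                    con.getResponseCode(); // triggers the actual request
                } catch (IOException ignored) {
                    // the demo only cares about what the server saw
                }
            });
            client.start();
            String head = captureOne(server);
            client.join();
            return head;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Point your browser at the same port once, then your application, and compare the two dumps line by line.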
Another note is that some sites deny access based on the User-Agent, i.e. you might need to get your application to pretend it's Firefox. This is very bad practice, and a little dishonest. As BalusC mentioned, check their usage policy and robots.txt! I would also recommend asking permission if you intend to spread your application around.
Finally, I happen to be working on something similar and you might find the following code useful (it writes a mapping of key -> lists of values to the correct POST format):
StringBuilder builder = new StringBuilder();
try {
boolean first = true; // no "&" before the very first pair
for(Entry<String,List<String>> entry : data.entrySet()) {
for(String value : entry.getValue()) {
if(first) {
first = false;
}
else {
builder.append("&");
}
builder.append(URLEncoder.encode(entry.getKey(), "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8"));
}
}
} catch (UnsupportedEncodingException e1) {
return false;
}
conn.setDoOutput(true);
try {
OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
wr.write(builder.toString());
wr.flush();
conn.connect();
} catch (IOException e) {
return(false);
}
Besides the User-Agent, the site could also be using cookies to check that the search is being sent from the search page.
HttpClient is good for automating form submission including handling any cookies and pretending to be a browser.
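If you would rather stay with plain URLConnection, the JDK's own java.net.CookieManager can keep cookies between requests. The sketch below is only an illustration: the class name CookieJarSketch and the example URL and cookie are made up. The second method simulates, without any network access, what the cookie jar would send back after the server has set a cookie.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CookieJarSketch {

    /** Installs a JVM-wide cookie jar; later HttpURLConnection requests consult it automatically. */
    static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Stores one Set-Cookie value for the URL and returns the Cookie header the jar would send back. */
    static String cookieHeaderFor(CookieManager manager, String url, String setCookie) {
        try {
            URI uri = URI.create(url);
            manager.put(uri, Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList(setCookie)));
            Map<String, List<String>> headers =
                    manager.get(uri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an application you would call CookieJarSketch.install() once at startup, request the search page first so the session cookie is stored, and only then submit the search.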
I have a problem when trying to scrape a price from dynamically updated web pages. The lion's share of the HTML is not returned when I fetch the page with UrlConnection, Jsoup, or HtmlUnit.
I don't know much about web scraping, but I guess the problem is that online shops like these:
Auchan,
Silpo
use JavaScript and AJAX to load the main product information. In my opinion, the problem is a redirect or a delay that prevents me from getting the fully loaded HTML file with all the needed data.
So, the question is: how do I scrape the price from the links above?
I have already tried several approaches:
UrlConnection
URL url;
try {
url = new URL("https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
try(FileWriter fileWriter = new FileWriter("output.html")){
while ((line = br.readLine()) != null) {
fileWriter.write(line+"\n");
}
}
} catch (IOException e) {
e.printStackTrace();
}
Runs fine, but returns HTML without the price data.
Jsoup
Document document = null;
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
try {
document = Jsoup.connect(link).get();
} catch (IOException e) {
e.printStackTrace();
}
if (document != null) {
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(document.toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns the same.
HtmlUnit
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
HtmlPage htmlPage = null;
try {
htmlPage = webClient.getPage(link);
webClient.waitForBackgroundJavaScript(5000);
} catch (IOException e) {
e.printStackTrace();
}
if (htmlPage!=null){
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(Jsoup.parse(htmlPage.asXml()).toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns a little bit more, including some JavaScript tags, but still nothing useful. Also, the code above throws so many exceptions that they don't even fit in the console.
I also tried setting the user agent like this:
java.net.URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
and this:
System.setProperty("http.agent", "");
You need to use Chrome's Dev tools to view the HTTP requests/responses
The page loads up tons of javascript. This in turn churns out a whole load of HTTP requests and waits for the responses: the first that looks interesting is:
https://auchan.ua/graphql which is a POST request with an important http header referer: https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/ - The response body for the request is: {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
Taking the product ID value and searching for it in the subsequent responses, I found several that contained it. The responses are all escaped Unicode characters, but if you open the URLs in a browser the content is rendered.
One particular URL, which starts with auchan.ua/graphql/?query=query%20getProductDetail..., looked promising, and sure enough its special_price matches what's displayed on the page. So you'd need to find a way of generating/extracting these URLs from the initial page source.
link to product details
You may also find this response I gave useful for processing JSON data.
The second shop you linked to requires a username/password but the process for getting the data will likely be very similar; use dev tools to view the http requests, work out where the price info is coming from (find the value in one of the responses) then try to recreate the same request from the initial URL and the response returned.
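To make the first step concrete, here is a small sketch that pulls the product ID out of the urlResolver response shown above. It assumes the response has exactly that shape; the class name is made up, and the regular expression is a shortcut to keep the example dependency-free. In real code a proper JSON parser would be the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlResolverResponse {

    // Matches the numeric "id" inside the "urlResolver" object (assumed response shape).
    private static final Pattern ID = Pattern.compile(
            "\"urlResolver\"\\s*:\\s*\\{[^}]*\"id\"\\s*:\\s*(\\d+)");

    /** Extracts the product ID, or throws if the response does not look as expected. */
    static long productId(String json) {
        Matcher m = ID.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no urlResolver id in response: " + json);
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        String body = "{\"data\":{\"urlResolver\":{\"type\":\"PRODUCT\",\"id\":297668}}}";
        System.out.println(productId(body));
    }
}
```

The ID can then be plugged into the getProductDetail request that you copy from dev tools, remembering to send the same referer header the browser sends.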
Good luck!
I'm trying to read out the code of a website.
But there is an issue if I want to receive the code of this site for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I tried a lot, but I still just receive the code of "https://www.amazon.de/gp/bestsellers/pet-supplies". So something isn't working the way I want: I'd like to receive places 21-40, not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS){
String s = null;
String qc = ""; // start empty; starting from null would prepend the text "null" to the result
try{
URL url = new URL(urlS);
URLConnection uc = url.openConnection();
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
while((s = reader.readLine()) != null){
qc += s;
}
reader.close();
} catch(IOException e) {
e.printStackTrace();
qc = "receiving qc failed";
}
return qc;
}
Thank you in advance for your effort :)
The URL you're fetching contains an anchor (the #2 at the end). An anchor is a client-side concept and was originally used to jump to a certain part of the page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, your browser or HTTP client library drops it before sending the request, so the web server answers as if you had requested https://www.amazon.de/gp/bestsellers/pet-supplies.
Bottom line is that you'll never get the second page this way... Good luck scraping Amazon though ;)
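You can see the split for yourself with java.net.URI. This is just a small illustration (the class and method names are made up): it separates the part of the URL that goes over the wire from the anchor that stays on the client.

```java
import java.net.URI;

final class FragmentDemo {

    /** The anchor (fragment) of the URL, or null if there is none. */
    static String fragment(String url) {
        return URI.create(url).getFragment();
    }

    /** The URL without its anchor, i.e. everything up to (not including) the '#'. */
    static String sentToServer(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String url = "https://www.amazon.de/gp/bestsellers/pet-supplies/#2";
        System.out.println("fragment:       " + fragment(url));
        System.out.println("sent to server: " + sentToServer(url));
    }
}
```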
I am trying to retrieve the html of a Google search query result using Java. That is, if I do a search in Google.com for a particular phrase, I would like to retrieve the html of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).
I tried doing this using the following code that I found in a related post:
import java.io.*;
import java.net.*;
import java.util.*;
public class Main {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
try {
url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
is = url.openStream(); // throws an IOException
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}
From: How do you Programmatically Download a Webpage in Java
The URL used in this code was obtained by doing a Google search query from the Google homepage. For some reason I do not understand, if I write the phrase that I want to search for in the URL bar of my web browser and then use the URL of the resulting search result page in the code I get a 403 error.
This code, however, did not return the html of the search query result page. Instead, it returned the source code of the Google homepage.
After doing further research I noticed that if you view the source code of a Google search query result (by right clicking on the background of the search result page and selecting “View page source”) and compare it with the source code of the Google homepage, they are both identical.
If instead of viewing the source code of the search result page I save the html of the search result page (by pressing ctrl+s), I can get the html that I am looking for.
Is there a way to retrieve the html of the search result page using Java?
Thank you!
Rather than parsing the resulting HTML page from a standard google search, perhaps you would be better off looking at the official Custom Search api to return results from Google in a more usable format. The API is definitely the way to go; otherwise your code could simply break if Google were to change some features of the google.com front-end's html. The API is designed to be used by developers and your code would be far less fragile.
To answer your question, though: We can't really help you just from the information you've provided. Your code seems to retrieve the html of stackoverflow; an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually trying to use to retrieve google search results?
I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP error 403 forbidden. A quick search of the problem says that this happens if I don't provide the User-Agent header in the web request, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you wish to receive specific help - though switching to the Custom Search API will likely solve your problem.
edit: new information provided in original question; can directly answer question now!
I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!
Here's the web request that Java was sending with your provided example URL:
GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.
As per the documentation of the Java URL class (and this is standard for all web pages): "A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL."
Let's take a look at your example URL...
https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951
Notice that "#" is the first character in the file path? Java simply ignores everything after the "#" because sharp signs are only used by the client / web browser. This leaves you with the URL https://www.google.com/. Hey, at least it was working as intended!
I can't tell you exactly what Google is doing, but the sharp-symbol url definitely means that Google is returning results of the query through some client-side (ajax / javascript) scripting. I'd be willing to bet that any query you send directly to the server (i.e- no "#" symbol) without the proper headers will return a 403 forbidden error - looks like they're encouraging you to use the API :)
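If all you have is one of these "#" URLs, one way forward is to pull the search term out of the fragment and put it into a regular query string, which the server does get to see. The sketch below assumes the /search?q=... form used elsewhere in this answer; the class and method names are invented for the example, and the request will still need a User-Agent header.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FragmentToQuery {

    /** Returns the decoded value of one parameter from a "#a=1&b=2" style fragment, or null. */
    static String fragmentParam(String url, String name) {
        int hash = url.indexOf('#');
        if (hash < 0) {
            return null;
        }
        try {
            for (String pair : url.substring(hash + 1).split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0 && pair.substring(0, eq).equals(name)) {
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                }
            }
            return null;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds a server-side search URL from the "q" parameter hidden in the fragment. */
    static String searchUrl(String fragmentUrl) {
        String q = fragmentParam(fragmentUrl, "q");
        if (q == null) {
            throw new IllegalArgumentException("no q parameter in fragment");
        }
        try {
            return "https://www.google.com/search?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```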
edit2: As per Tengji Zhang's answer to the question, here is working code that returns the result of the Google query for "test":
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
url = new URL("https://www.google.com/search?q=test");
c = url.openConnection();
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
c.connect();
is = c.getInputStream();
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
I suggest you try http://seleniumhq.org/
There is a good tutorial on searching Google:
http://code.google.com/p/selenium/wiki/GettingStarted
You don't set the User-Agent in your code:
URLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
Or you can read "http://www.google.com/robots.txt". This file tells you which URLs Google's servers allow crawlers to access.
The code below works:
package org.test.stackoverflow;
import java.io.*;
import java.net.*;
import java.util.*;
public class SearcherRetriver {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
c = url.openConnection();
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
c.connect();
is = c.getInputStream();
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}
I'm trying to GET a URL using HttpURLConnection, but I always get a 500 code, even though accessing the same URL from the browser or with curl works fine!
This is the code
try{
URL url = new URL("theurl");
HttpURLConnection httpcon = (HttpURLConnection) url.openConnection();
httpcon.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
httpcon.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:14.0) Gecko/20100101 Firefox/14.0.1");
System.out.println(httpcon.getHeaderFields());
}catch (Exception e) {
System.out.println("exception "+e);
}
When I print the headerfields, it shows the 500 code.. when I change the URL to something else like google.com , it works fine. But I don't understand why it doesn't work here but it works fine on the browser and with curl.
Any help would be highly appreciated..
Thank you,
This is most likely happening because of encoding.
If the request works in a browser but your program gets a 500 (internal server error), it is because browsers have highly sophisticated code for handling charsets and content types.
Here is my code; it works with ISO8859_1 as the charset and English as the language.
public void sendPost(String Url, String params) throws Exception {
String url=Url;
URL obj = new URL(url);
HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
con.setRequestProperty("Acceptcharset", "en-us");
con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
con.setRequestProperty("charset", "EN-US");
con.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
String urlParameters=params;
// Send post request
con.setDoOutput(true);
con.setDoInput(true);
con.connect();
//con.
DataOutputStream wr = new DataOutputStream(con.getOutputStream());
wr.writeBytes(urlParameters);
wr.flush();
wr.close();
int responseCode = con.getResponseCode();
System.out.println("\nSending 'POST' request to URL : " + url);
System.out.println("Post parameters : " + urlParameters);
System.out.println("Response Code : " + responseCode);
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
//print result
System.out.println(response.toString());
this.response=response.toString();
con.disconnect();
}
and in the main program, call it like this:
myclassname.sendPost("https://change.this2webaddress.desphilboy.com/websitealias/orwebpath/someaction","paramname="+URLEncoder.encode(urlparam,"ISO8859_1"))
The status code 500 suggests that the code on the web server has crashed. Use HttpURLConnection#getErrorStream() to get a better idea of the error. Refer to Http Status Code 500.
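A minimal sketch of reading the error stream might look like this. The readBody helper and the class name are made up for illustration; the demo method spins up a throwaway local server (using the JDK's built-in com.sun.net.httpserver) that always answers 500 with a body, purely so the example can be run without a real failing site.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class ErrorBodyReader {

    /** Returns "status: body", reading the error stream for 4xx/5xx responses. */
    static String readBody(HttpURLConnection con) throws IOException {
        int status = con.getResponseCode();
        // getInputStream() throws for 4xx/5xx; the body is on the error stream instead.
        InputStream in = status >= 400 ? con.getErrorStream() : con.getInputStream();
        if (in == null) {
            return status + ": ";
        }
        try (InputStream body = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = body.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return status + ": " + buf.toString("UTF-8");
        }
    }

    /** Demo against a local server that always answers 500 but still sends content. */
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] page = "page content despite 500".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(500, page.length);
                exchange.getResponseBody().write(page);
                exchange.close();
            });
            server.start();
            try {
                URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                return readBody((HttpURLConnection) url.openConnection());
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Against your real URL you would only need readBody; whatever the server prints in its error page is often enough to tell which header or parameter it is unhappy about.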
I ran into the problem of "URL works in browser, but when I do http-get in java I get a 500 Error".
In my case the problem was that the regular http-get ended up in an infinite redirect loop between /default.aspx and /login.aspx
URL oUrl = new URL(url);
HttpURLConnection con = (HttpURLConnection) oUrl.openConnection();
con.setRequestMethod("GET");
...
int responseCode = con.getResponseCode();
What was happening was: The server serves up a three-part cookie and con.getResponseCode() only used one of the parts. The cookie data in the header looked like this:
header.key = null
value = HTTP/1.1 302 Found
...
header.key = Location
value = /default.aspx
header.key = Set-Cookie
value = WebCom-lbal=qxmgueUmKZvx8zjxPftC/bHT/g/rUrJXyOoX3YKnYJxEHwILnR13ojZmkkocFI7ZzU0aX9pVtJ93yNg=; path=/
value = USE_RESPONSIVE_GUI=1; expires=Wed, 17-Apr-2115 18:22:11 GMT; path=/
value = ASP.NET_SessionId=bf0bxkfawdwfr10ipmvviq3d; path=/; HttpOnly
...
So the server when receiving only a third of the needed data got confused: You're logged in! No wait, you have to login. No, you're logged in, ...
To work around the infinite redirect-loop I had to manually look for re-directs and manually parse through the header for "Set-cookie" entries.
con = (HttpURLConnection) oUrl.openConnection();
con.setRequestMethod("GET");
...
log.debug("Disable auto-redirect. We have to look at each redirect manually");
con.setInstanceFollowRedirects(false);
....
int responseCode = con.getResponseCode();
This code parses the cookies when the response code indicates a redirect:
private String getNewCookiesIfAny(String origCookies, HttpURLConnection con) {
String result = null;
String key;
Set<Map.Entry<String, List<String>>> allHeaders = con.getHeaderFields().entrySet();
for (Map.Entry<String, List<String>> header : allHeaders) {
key = header.getKey();
if (key != null && key.equalsIgnoreCase(HttpHeaders.SET_COOKIE)) {
// get the cookie if need, for login
List<String> values = header.getValue();
for (String value : values) {
if (result == null || result.isEmpty()) {
result = value;
} else {
result = result + "; " + value;
}
}
}
}
if (result == null) {
log.debug("Reuse the original cookie");
result = origCookies;
}
return result;
}
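As a possible refinement (not what I used above), java.net.HttpCookie can parse each Set-Cookie value so that only the name=value pairs are sent back, without attributes such as path, expires, or HttpOnly. The class and method names here are made up for the sketch.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

final class CookieHeaderBuilder {

    /** Turns Set-Cookie header values into a single Cookie request header value. */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // parse() understands the attributes and gives back just the cookie itself
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        return String.join("; ", pairs);
    }
}
```

The result can be passed to con.setRequestProperty("Cookie", ...) on the follow-up request after a redirect.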
Make sure that your connection allows following redirects - this is one of the possible reasons for difference in behaviour between your connection and the browser (allows redirect by default).
It should be returning code 3xx, but there maybe something else somewhere that changes it to 500 for your connection.
I faced the same issue; in our case there was a special symbol in one of the parameter values. We fixed it by using URLEncoder.encode(String, String).
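For illustration, here is roughly what that looks like. The helper name and the sample values are made up; the point is only that characters such as spaces, "&" and "=" inside a value must be escaped before they go into the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryParam {

    /** Encodes one name=value pair for use in a query string or form body. */
    static String encode(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Without encoding, the "&" would be read as the start of a second parameter.
        System.out.println(encode("t1", "Duma Key & Co"));
    }
}
```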
In my case it turned out that the server always returns HTTP/1.1 500 (in Browser as in Java) for the page I wanted to access, but successfully delivers the webpage content nonetheless.
A human accessing that page in a browser just doesn't notice, since they see the page and no error message. In Java, I had to read the error stream instead of the input stream (thanks #Muse).
I have no idea why, though. Might be some obscure way to keep crawlers out.
This is an old question, but I had the same issue and solved it this way.
This might help others in the same situation.
In my case I was developing the system in a local environment, and everything worked fine when I checked my REST API from the browser, but my Android app kept getting HTTP error 500.
The problem is that on Android your app runs in a VM (virtual machine), which means your local computer's firewall might be preventing the virtual machine from accessing the local URL (IP) address.
You just need to allow that in your computer's firewall. The same applies if you are trying to access the system from outside your network.
Check the parameter
httpURLConnection.setDoOutput(false);
Set it to false for GET requests and to true only for POST. This saved me a lot of time!
Hey, I've tried researching how to POST data from Java, and nothing seems to do what I want to do. Basically, there's a form for uploading an image to a server, and what I want to do is post an image to the same server, but from Java. It also needs to have the right parameter name (whatever the form input's name is). I would also want to return the response from this method.
It baffles me as to why this is so difficult to find, since this seems like something so basic.
EDIT ---- Added code
Based on some of the stuff BalusC showed me, I created the following method. It still doesn't work, but it's the most successful thing I've gotten yet (it seems to post something to the other server and returns some kind of response, though I'm not sure I got the response correctly):
EDIT2 ---- added to code based on BalusC's feedback
EDIT3 ---- posting code that pretty much works, but seems to have an issue:
....
FileItemFactory factory = new DiskFileItemFactory();
// Create a new file upload handler
ServletFileUpload upload = new ServletFileUpload(factory);
// Parse the request
List<FileItem> items = upload.parseRequest(req);
// Process the uploaded items
for(FileItem item : items) {
if( ! item.isFormField()) {
String fieldName = item.getFieldName();
String fileName = item.getName();
String itemContentType = item.getContentType();
boolean isInMemory = item.isInMemory();
long sizeInBytes = item.getSize();
// POST the file to the cdn uploader
postDataRequestToUrl("<the host im uploading too>", "uploadedfile", fileName, item.get());
} else {
throw new RuntimeException("Not expecting any form fields");
}
}
....
// Post a request to specified URL. Get response as a string.
public static void postDataRequestToUrl(String url, String paramName, String fileName, byte[] requestFileData) throws IOException {
URLConnection connection=null;
try{
String boundary = Long.toHexString(System.currentTimeMillis()); // Just generate some unique random value.
String charset = "utf-8";
connection = new URL(url).openConnection();
connection.setDoOutput(true);
connection.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary);
PrintWriter writer = null;
OutputStream output = null;
try {
output = connection.getOutputStream();
writer = new PrintWriter(new OutputStreamWriter(output, charset), true);
String crlf = "\r\n"; // multipart/form-data requires CRLF line ends; println() would use the platform separator
// Send binary file.
writer.append("--" + boundary).append(crlf);
writer.append("Content-Disposition: form-data; name=\"" + paramName + "\"; filename=\"" + fileName + "\"").append(crlf);
writer.append("Content-Type: " + URLConnection.guessContentTypeFromName(fileName)).append(crlf);
writer.append("Content-Transfer-Encoding: binary").append(crlf);
writer.append(crlf).flush(); // flush the part headers before writing raw bytes to the underlying stream
output.write(requestFileData, 0, requestFileData.length);
output.flush(); // Important! Output cannot be closed. Close of writer will close output as well.
writer.append(crlf).flush(); // Important! Indicates end of binary boundary.
// End of multipart/form-data.
writer.append("--" + boundary + "--").append(crlf).flush();
} finally {
if (writer != null) writer.close();
if (output != null) output.close();
}
//* screw the response
int status = ((HttpURLConnection) connection).getResponseCode();
logger.info("Status: "+status);
for (Map.Entry<String, List<String>> header : connection.getHeaderFields().entrySet()) {
logger.info(header.getKey() + "=" + header.getValue());
}
} catch(Throwable e) {
logger.info("Problem",e);
}
}
I can see this code uploading the file, but only after I shutdown the tomcat. This leads me to believe that I'm leaving some sort of connection open.
This worked!
The core API you'd like to use is java.net.URLConnection. This is, however, pretty low level and verbose. You'd need to learn the HTTP specifics in detail and take them into account (headers, etcetera). You can find here a related question with a lot of examples.
A more convenient HTTP client API is the Apache Commons HttpComponents Client. You can find an example here.
Update: as per your update: you should read the response as a character stream, rather than reading it as a binary stream and attempting to cast each byte to a char. That isn't going to work. Head to the Gathering HTTP response information part in the linked question with examples. Here's what it should look like:
BufferedReader reader = null;
StringBuilder builder = new StringBuilder();
try {
reader = new BufferedReader(new InputStreamReader(response, charset));
for (String line; (line = reader.readLine()) != null;) {
builder.append(line);
}
} finally {
if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}
return builder.toString();
Update 2: as per your second update. Seeing the way you continue to attempt reading/writing streams, I think it's high time to learn the basics of Java IO :) Well, this part is also answered in the linked question. You would like to use Apache Commons FileUpload to parse a multipart/form-data request in a servlet. How to use it is also described/linked in the linked question. Look at the bottom of the Uploading files chapter. By the way, the content length header would return zero since you are not explicitly setting it (and you also cannot do so without buffering the entire request in memory).
Update 3:
I can see this code uploading the file, but only after I shutdown the tomcat. This leads me to believe that I'm leaving some sort of connection open.
You need to close the OutputStream with which you wrote the file to the disk. Once again, read the above linked basic Java IO tutorial.
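In case it helps, here is a minimal sketch of writing the uploaded bytes with try-with-resources, which closes the stream for you even when an exception is thrown. The class name, method names, and the temp-file demo are made up for illustration; your servlet would pass item.get() and the real target path.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class UploadSaver {

    /** Writes the bytes to target; the stream is closed (and flushed) when the try block ends. */
    static void save(byte[] data, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: saves to a temporary file and reads the content back. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("upload", ".bin");
            try {
                save(data, tmp);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```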
What have you tried? If you google for Http Post Java, dozens of pages appear - what's wrong with them? This one, http://www.devx.com/Java/Article/17679/1954 for example, appears pretty decent.