How to parse/query a static html page using java and httpclient - java

Here is an HTTP POST request using the HttpClient API. This code fetches the entire page content as raw HTML. My requirement is to send some parameters to this method so that it fetches only the required data (as with a database query) in a readable format. By 'readable' I mean without the HTML tags. I don't know whether JSON has to be brought in here.
I went through JSoup, but it seems to be doing the job of a scraper.
So how should I proceed with the html content I currently have?
private void sendPost() throws Exception {
String url = "https://www.elitmus.com/jobs";
HttpClient client = HttpClientBuilder.create().build();
HttpPost post = new HttpPost(url);
// add header
post.setHeader("User-Agent", USER_AGENT);
List<NameValuePair> urlParameters = new ArrayList<NameValuePair>();
urlParameters.add(new BasicNameValuePair("sn", "C02G8416DRJM"));
urlParameters.add(new BasicNameValuePair("cn", ""));
urlParameters.add(new BasicNameValuePair("locale", ""));
urlParameters.add(new BasicNameValuePair("caller", ""));
urlParameters.add(new BasicNameValuePair("num", "12345"));
post.setEntity(new UrlEncodedFormEntity(urlParameters));
HttpResponse response = client.execute(post);
System.out.println("\nSending 'POST' request to URL : " + url);
System.out.println("Post parameters : " + post.getEntity());
System.out.println("Response Code : " +
response.getStatusLine().getStatusCode());
BufferedReader rd = new BufferedReader(
new InputStreamReader(response.getEntity().getContent()));
StringBuffer result = new StringBuffer();
String line = "";
while ((line = rd.readLine()) != null) {
result.append(line);
}
System.out.println(result.toString());
}
}

Related

Why am I getting HTTP 400 bad request

I am using an HTTP client (code copied from http://www.mkyong.com/java/apache-httpclient-examples/) to send POST requests. I have been trying to use it with http://postcodes.io to look up a batch of postcodes, but it fails. According to http://postcodes.io I should send a POST request to http://api.postcodes.io/postcodes in the following JSON form: {"postcodes" : ["OX49 5NU", "M32 0JG", "NE30 1DP"]}, but I always get HTTP response code 400.
I have included my code below. Please tell me what I am doing wrong.
Thanks
private void sendPost() throws Exception {
String url = "http://api.postcodes.io/postcodes";
HttpClient client = HttpClientBuilder.create().build();
HttpPost post = new HttpPost(url);
List<NameValuePair> urlParameters = new ArrayList<NameValuePair>();
urlParameters.add(new BasicNameValuePair("postcodes", "[\"OX49 5NU\", \"M32 0JG\", \"NE30 1DP\"]"));
post.setEntity(new UrlEncodedFormEntity(urlParameters));
HttpResponse response = client.execute(post);
System.out.println("Response Code : "
+ response.getStatusLine().getStatusCode());
System.out.println("Reason : "
+ response.getStatusLine().getReasonPhrase());
BufferedReader br = new BufferedReader(
new InputStreamReader(response.getEntity().getContent()));
StringBuffer result = new StringBuffer();
String line = "";
while ((line = br.readLine()) != null) {
result.append(line);
}
br.close();
System.out.println(result.toString());
}
This works (note that HTTP.UTF_8 is deprecated):
String url = "http://api.postcodes.io/postcodes";
HttpClient client = HttpClientBuilder.create().build();
HttpPost post = new HttpPost(url);
StringEntity params =new StringEntity("{\"postcodes\" : [\"OX49 5NU\", \"M32 0JG\", \"NE30 1DP\"]}");
post.addHeader("Content-Type", "application/json");
post.setEntity(params);
Jon Skeet is right (as usual, I might add), you are basically sending a form and it defaults to form-url-encoding.
You could try something like this instead:
String jsonString = "{\"postcodes\" : [\"OX49 5NU\", \"M32 0JG\", \"NE30 1DP\"]}";
StringEntity entity = new StringEntity(jsonString, HTTP.UTF_8);
entity.setContentType("application/json");
post.setEntity(entity);
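To see why the server rejects the original request, it can help to print the two bodies side by side. The following is a minimal sketch using only the JDK (no Apache HttpClient); the class name BodyComparison and the helper formEncode are made up for illustration, and the encoding shown is what I would expect a form entity to produce, not a capture of the actual traffic.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class BodyComparison {
    // Hypothetical helper: encodes one name/value pair the way an HTML form submission would.
    static String formEncode(String name, String value) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String array = "[\"OX49 5NU\", \"M32 0JG\", \"NE30 1DP\"]";

        // Roughly what a form-encoded entity puts on the wire: a percent-encoded pair.
        String formBody = formEncode("postcodes", array);

        // What the API documentation asks for: a JSON document.
        String jsonBody = "{\"postcodes\" : " + array + "}";

        System.out.println("form: " + formBody);
        System.out.println("json: " + jsonBody);
    }
}
```

The form body comes out as a single percent-encoded name=value pair rather than a JSON object, which is presumably why the API answers with 400.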

setHeader ("content type", "") - confusion

I have a function with which I want to POST two variables to the PHP side. After these two variables are matched and the server processes the result, I want the result returned as JSON. As of now my header is set as follows:
httppost.setHeader("Content-type", "application/json");
But while reading on Wikipedia I found that the content type should be application/x-www-form-urlencoded, and that to accept JSON the request should carry Accept: application/json. I want more clarity on this: how do I modify my code to achieve my desired result? As of now I am using localhost, and my POST variables do not seem to be delivered on the PHP side. The following is my complete function:
public void parse(String last, String pwd){
String lastIndex = last;
DefaultHttpClient http = new DefaultHttpClient(new BasicHttpParams());
System.out.println("URL is: "+CONNECT_URL);
HttpPost httppost = new HttpPost(CONNECT_URL);
httppost.setHeader("Content-type", "application/json");
try{
List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>(2);
nameValuePairs.add(new BasicNameValuePair("key", password));
nameValuePairs.add(new BasicNameValuePair("last_index", lastIndex));
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs));
System.out.println("Post variables(Key): "+password+"");
System.out.println("Post variables(last index): "+lastIndex);
HttpResponse resp = http.execute(httppost);
HttpEntity entity = resp.getEntity();
ins = entity.getContent();
BufferedReader bufread = new BufferedReader(new InputStreamReader(ins, "UTF-8"), 8);
StringBuilder sb = new StringBuilder();
String line = null;
while((line = bufread.readLine()) != null){
sb.append(line +"\n");
}
result = sb.toString();
System.out.println("Result: "+result);
// readAndParseJSON(result);
}catch (Exception e){
System.out.println("Error: "+e);
}finally{
try{
if(ins != null){
ins.close();
}
}catch(Exception smash){
System.out.println("Squish: "+smash);
}
}
// return result;
}
You have a caps problem. Try "Content-Type" rather than "Content-type" (or use the const HTTP.CONTENT_TYPE).
It appears that your code is already doing what that article describes, except that the Content-Type header does not match the body you send. Replace the application/json header with these two:
// httppost.setHeader("Content-type", "application/json");
httppost.setHeader("Content-Type", "application/x-www-form-urlencoded");
httppost.setHeader("Accept", "application/json");
The x-www-form-urlencoded content itself is added here:
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs));
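For what it's worth, the split between Content-Type (what the request body is) and Accept (what you would like back) can be sketched with the JDK's own java.net.http API (Java 11+) in place of Apache HttpClient. The class name, the URL http://localhost/connect.php and the field values below are placeholders, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormPostSketch {
    // Joins name/value pairs into an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) a POST whose headers match its body.
    static HttpRequest buildRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                // Content-Type describes the body we send: form fields.
                .header("Content-Type", "application/x-www-form-urlencoded")
                // Accept describes what we hope the server sends back: JSON.
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("key", "secret");       // placeholder value
        fields.put("last_index", "42");    // placeholder value
        HttpRequest request = buildRequest("http://localhost/connect.php", fields);
        System.out.println(encodeForm(fields));
        System.out.println(request.headers().map());
    }
}
```

With a body in this shape, PHP should be able to populate $_POST; a body labelled application/json would instead have to be read and decoded by hand on the server.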

What is the Java equivalent for the following in curl?

curl https://view-api.box.com/1/documents \
-H "Authorization: Token YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf"}' \
-X POST
How do you accommodate the url?
This is what I tried so far.
final String url = "https://view-api.box.com/1/documents";
@SuppressWarnings("resource")
final HttpClient client = HttpClientBuilder.create().build();
final HttpPost post = new HttpPost(url);
post.setHeader("Authorization", "Token: TOKEN_ID");
post.setHeader("Content-Type", "application/json");
final List<NameValuePair> urlParameters = new ArrayList<NameValuePair>();
urlParameters.add(new BasicNameValuePair("url", "https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf"));
post.setEntity(new UrlEncodedFormEntity(urlParameters));
final HttpResponse response = client.execute(post);
System.out.println("Response Code : " + response.getStatusLine().getStatusCode());
final BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
final StringBuffer result = new StringBuffer();
String line = "";
while ((line = rd.readLine()) != null) {
result.append(line);
}
}
You have everything OK except the entity: what you're sending in curl is not the content of an HTML form but a JSON object.
First take off this part (don't send your data as if it were application/x-www-form-urlencoded):
// comment out / delete this from your code:
final List<NameValuePair> urlParameters = new ArrayList<NameValuePair>();
urlParameters.add(new BasicNameValuePair("url", "https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf"));
post.setEntity(new UrlEncodedFormEntity(urlParameters));
And then add the body in this way:
BasicHttpEntity entity = new BasicHttpEntity();
InputStream body = new ByteArrayInputStream(
"{\"url\": \"https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf\"}".getBytes());
entity.setContent(body);
post.setEntity(entity);
I'm assuming that your JSON string only has characters between 0x20 and 0x7F. If you use other characters (like Ñ), you need to convert your data to a byte array using the UTF-8 encoding (the standard encoding for JSON data), in this way:
BasicHttpEntity entity = new BasicHttpEntity();
String myData = "{\"url\": \"https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf\"}";
ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(rawBytes,
Charset.forName("UTF-8"));
writer.append(myData);
// flush so the encoded bytes actually reach rawBytes before it is read
writer.flush();
InputStream body = new ByteArrayInputStream(rawBytes.toByteArray());
entity.setContent(body);
post.setEntity(entity);
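A shorter route to the same bytes is String.getBytes with an explicit charset. The sketch below (the class name and sample strings are mine, JDK only) shows why the charset matters: a character outside ASCII takes more than one byte in UTF-8, so character count and byte count diverge.

```java
import java.nio.charset.StandardCharsets;

class Utf8BodySketch {
    // Encodes a JSON string to the bytes that would go on the wire.
    static byte[] toUtf8(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "{\"name\": \"N\"}";
        // \u00D1 is the letter Ñ, written as an escape so the source file encoding does not matter.
        String nonAscii = "{\"name\": \"\u00D1\"}";

        // prints "13 chars -> 13 bytes"
        System.out.println(ascii.length() + " chars -> " + toUtf8(ascii).length + " bytes");
        // prints "13 chars -> 14 bytes"
        System.out.println(nonAscii.length() + " chars -> " + toUtf8(nonAscii).length + " bytes");
    }
}
```

Calling getBytes() without an argument uses the platform default charset, which is why the explicit StandardCharsets.UTF_8 is the safer habit here.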
I would suggest the following, using StringEntity (in HttpClient 4.x it lives in org.apache.http.entity):
final String url = "https://view-api.box.com/1/documents";
@SuppressWarnings("resource")
final HttpClient client = HttpClientBuilder.create().build();
final HttpPost post = new HttpPost(url);
post.setHeader("Authorization", "Token TOKEN_ID"); // no colon after "Token", to match the curl command
post.setHeader("Content-Type", "application/json");
post.setEntity(new StringEntity("{\"url\": \"https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf\"}"));
final HttpResponse response = client.execute(post);
System.out.println("Response Code : " + response.getStatusLine().getStatusCode());
final BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
final StringBuffer result = new StringBuffer();
String line = "";
while ((line = rd.readLine()) != null) {
result.append(line);
}
}
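If you are on Java 11 or later, the same curl command can also be mapped onto the JDK's built-in java.net.http API without any Apache dependency. This is only a sketch under that assumption: the class and method names are invented, the JSON is built by plain concatenation (so it assumes the URL contains no characters that need JSON escaping), and the request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class JsonPostSketch {
    // Builds the JSON body from curl's -d option; assumes documentUrl needs no JSON escaping.
    static String jsonBody(String documentUrl) {
        return "{\"url\": \"" + documentUrl + "\"}";
    }

    // Mirrors the curl call: two -H headers, a -d body, and -X POST.
    static HttpRequest buildRequest(String apiKey, String documentUrl) {
        return HttpRequest.newBuilder(URI.create("https://view-api.box.com/1/documents"))
                .header("Authorization", "Token " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(documentUrl)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("YOUR_API_KEY",
                "https://cloud.box.com/shared/static/4qhegqxubg8ox0uj5ys8.pdf");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending it would then be a call along the lines of HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).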

Java Apache HttpClient Submitting Form

I am trying to submit a form on this website, and get back the resulting misspellings from the text area as a string (only the "Reverse letters" checkbox should be selected). I have the code below, adapted from here:
private static void sendPost() throws Exception {
String url = "http://tools.seobook.com/spelling/keywords-typos.cgi";
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(url);
post.setHeader("User-Agent", "Mozilla/5.0"); // add header
List<NameValuePair> urlParameters = new ArrayList<NameValuePair>();
//the input text area
urlParameters.add(new BasicNameValuePair("user_input", "tomato potato"));
//the checkbox
urlParameters.add(new BasicNameValuePair("reverse_letters", "reverse_letters"));
//the submit button (?)
urlParameters.add(new BasicNameValuePair("", "generate typos"));
post.setEntity(new UrlEncodedFormEntity(urlParameters));
HttpResponse response = client.execute(post);
System.out.println("\nSending 'POST' request to URL : " + url);
System.out.println("Post parameters : " + post.getEntity());
System.out.println("Response Code : " +
response.getStatusLine().getStatusCode());
BufferedReader rd = new BufferedReader(new InputStreamReader(
response.getEntity().getContent()));
StringBuffer result = new StringBuffer();
String line = "";
while ((line = rd.readLine()) != null) {
result.append(line + "\n");
}
System.out.println(result.toString());
}
If I copy and paste the lines from the console and search through them in an editor for the misspellings, I do in fact find the input text and the resulting text area text in the huge string. The string contains all of the HTML, however, and I would like only the misspellings as a string. How would I extract only the resulting misspellings from this site, perhaps with a method from the Apache HttpClient library, or am I taking the wrong approach?
Thanks, Dan
I think you are trying to put a square peg in a round hole; Selenium would probably be a better bet.
Apache HttpClient is best used for request and response header handling, not for processing the body of a response.
An overcomplicated way would be to split the "result" variable using regexes.
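For completeness, the regex route could look roughly like the sketch below. It assumes the misspellings sit between textarea tags in the returned HTML, which I have not verified against the site; the sample HTML and the class name are made up. A real HTML parser is more robust than a regex as soon as the markup gets less regular.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextareaExtractor {
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the first closing tag.
    private static final Pattern TEXTAREA = Pattern.compile(
            "<textarea[^>]*>(.*?)</textarea>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text content of every textarea, in document order.
    static List<String> textareas(String html) {
        List<String> contents = new ArrayList<>();
        Matcher m = TEXTAREA.matcher(html);
        while (m.find()) {
            contents.add(m.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        // Hypothetical page: first textarea is the input, second holds the generated typos.
        String html = "<html><body>"
                + "<textarea name=\"user_input\">tomato potato</textarea>"
                + "<p>Results</p>"
                + "<textarea>otmato\nptoato</textarea>"
                + "</body></html>";
        List<String> found = textareas(html);
        System.out.println(found.get(found.size() - 1)); // the last textarea, i.e. the results
    }
}
```

In sendPost() this would be applied to result.toString() once the response has been read.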

httpclient response

I am trying to auto-login to a webpage. I'm assuming that I pass the proper credentials.
entity.getContentLength() shows 20, but the response I see is not well formatted. It is not HTML. How should I proceed further? Below is my code.
String input_text = "https://www.abc.com";
HttpPost httpost = new HttpPost(input_text);
List <NameValuePair> nvps = new ArrayList <NameValuePair>();
nvps.add(new BasicNameValuePair("email", "abc@xyz.com"));
nvps.add(new BasicNameValuePair("passsword", "ttyyeri"));
nvps.add(new BasicNameValuePair("publicLoginToken",""));
httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
HttpResponse response = httpclient.execute(httpost);
entity = response.getEntity();
if (entity != null) {
BufferedReader br = new BufferedReader(new InputStreamReader(entity.getContent()));
String readLine;
while(((readLine = br.readLine()) != null)) {
System.err.println("br :"+readLine);
}
System.out.println("Response content length: " + entity.getContentLength());
}
System.out.println("HTML Content :::"+entity.getContent().toString());
Try the following. What is the output?
StatusLine l = response.getStatusLine();
System.out.println(l.getStatusCode() + " " + l.getReasonPhrase());
Sounds like you are getting an authorization request redirect. This may have already been covered here: Http Basic Authentication in Java using HttpClient?
Investigate the HttpResponse status line and headers: the response code and the content type will help you find the problem.
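As an illustration of what to look at, here is a self-contained sketch that uses the JDK's java.net.http client (not Apache HttpClient) against a throwaway local stub server standing in for the real login page. The /login path, the 302 status and the Location value are all invented for the demo; the real site may answer quite differently.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ResponseInspector {
    // Summarises the parts of a response worth checking before reading the body.
    static String describe(HttpResponse<String> response) {
        return response.statusCode()
                + " location=" + response.headers().firstValue("Location").orElse("(none)")
                + " content-type=" + response.headers().firstValue("Content-Type").orElse("(none)");
    }

    // Posts a form to a local stub that always redirects, without following the redirect.
    static HttpResponse<String> postToStubLogin() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/login", exchange -> {
                exchange.getResponseHeaders().add("Location", "/auth");
                exchange.sendResponseHeaders(302, -1); // -1 means no response body
                exchange.close();
            });
            server.start();
            try {
                HttpClient client = HttpClient.newBuilder()
                        .followRedirects(HttpClient.Redirect.NEVER) // keep the redirect visible
                        .build();
                HttpRequest request = HttpRequest.newBuilder(URI.create(
                                "http://127.0.0.1:" + server.getAddress().getPort() + "/login"))
                        .header("Content-Type", "application/x-www-form-urlencoded")
                        .POST(HttpRequest.BodyPublishers.ofString("email=abc%40xyz.com"))
                        .build();
                return client.send(request, HttpResponse.BodyHandlers.ofString());
            } finally {
                server.stop(0);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException("stub request failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(postToStubLogin()));
    }
}
```

A 3xx status with a Location header, or a 401 with a WWW-Authenticate header, would both explain a short non-HTML body like the 20 bytes reported in the question.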
