Java - handling foreign characters

So, I have some Java code that fetches the contents of an HTML page as follows:
BufferedReader bf;
String response = "";
HttpURLConnection connection;
try {
    connection = (HttpURLConnection) url.openConnection();
    connection.setInstanceFollowRedirects(false);
    connection.setUseCaches(false);
    connection.setRequestMethod("GET");
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24");
    connection.connect();
    bf = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String line;
    while ((line = bf.readLine()) != null) {
        response += line;
    }
    connection.disconnect();
} catch (Throwable ex) {
    response = "";
}
This works perfectly fine and will return the content to me as required. I then drill down to the area of code that I want to pull, which is as follows:
10€ de réduction chez Asos be!
Java seems to handle the € fine, since it arrives as an HTML entity. The word "réduction" is problematic, though. It renders as:
10€ de r�duction chez Asos be!
As you can see, it is struggling with the "é" character.
How do I go about solving this? I've been searching the internet and playing around with the code for the past few hours, but no luck whatsoever! I'm very new to Java, so it's all very difficult to get my head around.
Thanks in advance.

That code is OK, but you need to detect the character encoding of the response (it is usually announced in the Content-Type header) and pass it to the InputStreamReader that wraps the input stream; otherwise the reader falls back to the platform default encoding. If that doesn't help, the problem is not in reading the response but in what you do with the response string afterwards.
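For example, a minimal sketch of that approach (the class name EncodingAwareFetch and the UTF-8 fallback are my own choices, not from the answer):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingAwareFetch {
    public static String fetch(String address) throws Exception {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.connect();

        // Parse a "text/html; charset=ISO-8859-1" style header; default to UTF-8 if absent.
        Charset charset = StandardCharsets.UTF_8;
        String contentType = connection.getContentType();
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = Charset.forName(param.substring("charset=".length()).replace("\"", ""));
                }
            }
        }

        // Pass the detected charset to the Reader instead of using the platform default.
        StringBuilder response = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line);
            }
        }
        return response.toString();
    }
}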

Related

An issue with a URLConnection using Java

I'm trying to read out the source code of a website.
But there is an issue when I request the code of this page, for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I've tried a lot, but I still just receive the code of "https://www.amazon.de/gp/bestsellers/pet-supplies". So something is not working right, since I want to receive places 21-40 and not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS) {
    String s;
    String qc = "";  // start empty, not null, so the first += does not prepend "null"
    try {
        URL url = new URL(urlS);
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
        BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
        while ((s = reader.readLine()) != null) {
            qc += s;
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
        qc = "receiving qc failed";
    }
    return qc;
}
Thank you in advance for your effort :)
The URL you're fetching contains an anchor (the #2 at the end). An anchor is a client-side concept, originally used to jump to a certain part of a page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, the responding web server (or rather your browser/HTTP client library) just drops any anchor, as if you had actually requested https://www.amazon.de/gp/bestsellers/pet-supplies.
Bottom line is that you'll never get the second page this way... Good luck scraping Amazon, though ;)
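You can see this client-side split in java.net.URL itself; a tiny illustration (the class name FragmentDemo is hypothetical):

import java.net.URL;

public class FragmentDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.amazon.de/gp/bestsellers/pet-supplies/#2");
        // The path below is all the server ever sees; the fragment never leaves the client.
        System.out.println(url.getPath()); // /gp/bestsellers/pet-supplies/
        System.out.println(url.getRef());  // 2
    }
}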

Read XML from valid URL not returned. Header formatting issue?

I am trying to use the code below to read from a valid URL. I can copy and paste the URL into my browser and it works perfectly (displays the XML), but when I try to access it programmatically it returns nothing (no data and no error). I have already tried setting the user agent as described in this post: Can't read in HTML content from valid URL, but it didn't fix my problem. If it matters, I am trying to make a single Eve API call. I believe the problem is that my headers are not formatted correctly and the Eve site is rejecting the query. I can access the data fine using PHP, but I recently had to change languages.
public static void readFileToXML(String urlString, String fName) {
    try {
        java.net.URL url = new java.net.URL(urlString);
        System.out.println(url);
        URLConnection cnx = url.openConnection();
        cnx.setAllowUserInteraction(false);
        cnx.setDoOutput(true);
        cnx.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/3.0.183.1 Safari/531.0");
        System.out.println(cnx.getContentLengthLong()); // a change suggested in the comments; returns -1
        InputStream is = cnx.getInputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        File file = new File("C:\\Users\\xxx\\Desktop\\" + fName);
        BufferedWriter bw = new BufferedWriter(new FileWriter(file, false));
        String inputLine;
        while ((inputLine = br.readLine()) != null) {
            bw.write(inputLine);
            System.out.println(inputLine);
        }
        System.out.println("Finished read");
        bw.close();
        br.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }
}
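No answer was recorded for this question, but one plausible culprit (an assumption on my part, not a confirmed fix) is the cnx.setDoOutput(true) call: on an HTTP connection it silently turns the request into a POST, and an API endpoint expecting a GET may answer with an empty body. For a plain GET the line can simply be dropped:

URLConnection cnx = url.openConnection();
cnx.setAllowUserInteraction(false);
// cnx.setDoOutput(true); // assumption: removing this keeps the request a GET
cnx.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/3.0.183.1 Safari/531.0");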

Locate and extract only a particular tag in an HTML response in Java

I am trying to find the gender of a name by using the website "http://www.gpeters.com/names/baby-names.php". I was able to pass parameters using a GET request and get the HTML page as a response, like the following:
URL url = new URL("http://www.gpeters.com/names/baby-names.php?name=sarah");
HttpURLConnection connection = null;
try {
    // Create connection
    connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
    connection.setRequestProperty("Content-Language", "en-US");
    connection.setUseCaches(false);
    connection.setDoInput(true);
    connection.setDoOutput(true);
    connection.connect();
    // Get response
    InputStream is = connection.getInputStream();
    int status = connection.getResponseCode();
    //System.out.println(status);
    BufferedReader rd = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = rd.readLine()) != null) {
        System.out.println(line);
    }
    rd.close();
    // The program prints the whole HTML page as the response.
The HTML response has an element like "It's a girl!" where the required result is located. How do I extract only that string and print whether the input parameter is a boy or a girl? Example: sarah is a girl.
Add jtidy to your project and use it to convert the HTML to XML. After that, you can use standard XML tools like JDOM 2 or Jaxen to examine the data.
What you need to do is look at the HTML code and determine a unique path that allows you to identify the desired element. There are no simple solutions here, but some pointers (see also the sketch after this list):
Look for elements with id attributes, since they are unique.
Look for elements that are rare.
Look for unique texts.
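If a full HTML-to-XML pipeline is overkill here, a cruder alternative (a sketch of my own, assuming the result text appears literally in the page source) is a plain substring search on the response:

// Hypothetical helper: classify the fetched HTML by searching for the literal result text.
static String genderOf(String html, String name) {
    if (html.contains("It's a girl!")) {
        return name + " is a girl";
    }
    if (html.contains("It's a boy!")) {
        return name + " is a boy";
    }
    return name + ": unknown";
}

This breaks as soon as the site changes its wording, which is why the XML-based approaches above are more robust.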

HttpsURLConnection and POST

Some time ago I wrote this program in Python; it logged in to a website over HTTPS, took some info, and logged out.
The program was quite simple:
class Richiesta(object):
    def __init__(self, url, data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k) + "=" + str(v) + "&"
        if self.data == "":
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url, self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl + '/www/')
        self.req.add_header('Cookie', cookie)

    def leggi(self):
        while self.content == "":
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                # "Server error, retrying in 15 seconds"
                print("Errore del server, nuovo tentativo tra 15 secondi")
                time.sleep(15)
            except urllib2.URLError, e:
                # "Network problem, will try to reconnect in 20 seconds"
                print("Problema di rete, proverò a riconnettermi tra 20 secondi")
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')

def login(username, password):
    global cookie
    # "Starting the login procedure"
    print("Inizio la procedura di login")
    url = "https://example.com/auth/Authenticate"
    data = {"login": "1", "username": username, "password": password}
    f = Richiesta(url, data)
    f.leggi()
Now, for some reason, I have to translate it into Java. Until now, this is what I've written:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;

public class SafeReq {
    String baseurl = "http://www.example.com";
    String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
    String content = "";

    public SafeReq(String s, String sid, String data) throws MalformedURLException {
        try {
            URL url = new URL(s);
            HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
            request.setUseCaches(false);
            request.setDoOutput(true);
            request.setDoInput(true);
            request.setFollowRedirects(true);
            request.setInstanceFollowRedirects(true);
            request.setRequestProperty("User-Agent", useragent);
            request.setRequestProperty("Referer", "http://www.example.com/www/");
            request.setRequestProperty("Cookie", "sid=" + sid);
            request.setRequestProperty("Origin", "http://www.example.com");
            request.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            request.setRequestProperty("Content-length", String.valueOf(data.length()));
            request.setRequestMethod("POST");
            OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
            post.write(data);
            post.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content += inputLine;
            }
            post.close();
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String leggi() {
        return content;
    }
}
The problem is, the login doesn't work, and when I try to fetch a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem to do much the same thing, and I can't understand why I can't get the second one to work... any ideas?
Where do you get your sid from? From the symptoms, I would guess that your session cookie is not being passed correctly to the server.
See this question for a possible solution: Cookies turned off with Java URLConnection.
In general, I recommend using HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See the code examples (I guess the "Form based logon" example is appropriate in your case).
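If you stay with HttpsURLConnection, the usual fix from that linked question is to install a default cookie handler before the login request, so the JVM stores and resends session cookies automatically (a sketch, not code from the original answer):

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Install once, before the login request; cookies set by the server
// (e.g. the session id) are then attached to subsequent requests automatically.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));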
For anyone looking for this in the future: take a look at HtmlUnit.
This answer has a nice example.

URLConnection sometimes returns empty string response?

I'm making an HTTP GET request. It works in about 70% of my attempts. For some reason, I sometimes get no response string from a successful connection. I set up a button in my app which keeps firing the code below; one call might fail to return a string, and the next call works fine:
private void onButtonClick() {
    try {
        doit();
    } catch (Exception ex) {
        ...
    }
}

public void doit() throws Exception {
    URL url = new URL("http://www.example.com/service");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoInput(true);
    connection.setUseCaches(false);
    connection.setAllowUserInteraction(false);
    connection.setReadTimeout(30 * 1000);
    connection.setRequestProperty("Connection", "Keep-Alive");
    connection.setRequestProperty("Authorization",
            "Basic " + Base64.encode("username" + ":" + "password"));
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String line;
    StringBuilder sb = new StringBuilder();
    while ((line = in.readLine()) != null) {
        sb.append(line);
    }
    in.close();
    connection.disconnect();
    // Every so often this prints an empty string!
    System.out.println(sb.toString());
}
Am I doing something wrong here? It seems like maybe I'm not closing the connection properly from the previous call and the response gets mangled somehow? I am also calling doit() from multiple threads simultaneously; I thought the contents of the method were thread-safe, but the behavior is the same either way.
Thanks
That method looks fine. It's reentrant, so calls shouldn't interfere with each other. It's probably a server issue, either deliberate throttling or just a bug.
EDIT: You can check the status code with getResponseCode.
To check the response code:
BufferedReader responseStream;
if (((HttpURLConnection) connection).getResponseCode() == 200) {
    responseStream = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "UTF-8"));
} else {
    responseStream = new BufferedReader(
            new InputStreamReader(((HttpURLConnection) connection).getErrorStream(), "UTF-8"));
}
For empty content the response code is 204, so if an empty body is possible, just add one more "if" branch for code 204, as shown below.
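That extra branch could look like this (a sketch following the suggestion above; the caller has to handle the null reader for the empty-body case):

int code = ((HttpURLConnection) connection).getResponseCode();
if (code == 204) {
    // 204 No Content: a successful response that legitimately carries no body.
    responseStream = null;
} else if (code == 200) {
    responseStream = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "UTF-8"));
} else {
    responseStream = new BufferedReader(
            new InputStreamReader(((HttpURLConnection) connection).getErrorStream(), "UTF-8"));
}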
We also came across a similar scenario, and found the following solution for the issue: set a User-Agent string on the URLConnection object.
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)");
