Using Regex Pattern with Jsoup on a Weblink - java

I'm trying to write a parser to get product info from a website. I've built a similar tool with PHP and regex, and I'd like to do the same in Java. The goal is to fetch a parent page, build the child product links with a regex, and then get each product's info in a loop:
String curl = TextField1.getText();
URL url = new URL(curl);
URLConnection spoof = url.openConnection();
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream(), "UTF-8"));
// Compile the pattern once instead of on every line
Pattern pattern = Pattern.compile("style='color:#000000;font-weight:bold;'>(.*?)</a>");
String strLine;
while ((strLine = in.readLine()) != null) {
    strLine = strLine.replaceAll(" ", "_");
    strLine = strLine.replaceAll("d'", "d");
    Matcher m = pattern.matcher(strLine);
    while (m.find()) {
        String enfurl = "http://www.exemple.com/fr/" + m.group(1) + ".htm";
        System.out.println(enfurl);
    }
}
This code works, but someone told me that Jsoup is a better way to parse HTML. I'm reading the Jsoup documentation, but after establishing a connection I don't know which syntax to use. Could you help me?
EDIT: OK, with this code:
Elements links = doc.select("a[href][title*=Cliquer pour obtenir des détails]");
for (Element link : links) {
    System.out.println(link.attr("href"));
    String urlenf = link.attr("href");
    Document docenf = Jsoup.connect(urlenf).get();
    System.out.println(docenf.body().text());
}
I've got the links... but now I must open another Jsoup connection to get the product info, and this test doesn't work. How can I open another Jsoup connection inside the for loop? Thanks.

Try to get the URLs (and generally, the content) like this:
String url = "PAGE_URL_GOES_HERE";
InputStream is = new URL(url).openStream();
String encoding = "UTF-8";
Document doc = Jsoup.parse(is, encoding, url);
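Applied to your edit, a minimal sketch of the loop might look like this (assuming the child pages are UTF-8 as well; note attr("abs:href"), which resolves relative links against the base URL):
Elements links = doc.select("a[href][title*=Cliquer pour obtenir des détails]");
for (Element link : links) {
    String urlenf = link.attr("abs:href"); // absolute URL, in case href is relative
    InputStream is = new URL(urlenf).openStream();
    Document docenf = Jsoup.parse(is, "UTF-8", urlenf);
    System.out.println(docenf.body().text());
}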
Update
Are you sure the problem is with the encoding of the URL?
I tried the below code, and it works just fine.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://www.larousse.fr/dictionnaires/francais-anglais/écrémer/27576?q=écrémé";
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
                    .get();
            System.out.println(doc.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Update 2
In any case, try this one too: Jsoup.connect(new String(url.getBytes("UTF-8")))
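If the accented characters in the URL turn out to be the problem, another option is to let java.net.URI percent-encode the path before connecting (a sketch, not tested against your site):
URL u = new URL(url);
URI uri = new URI(u.getProtocol(), u.getHost(), u.getPath(), u.getQuery(), null);
Document doc = Jsoup.connect(uri.toASCIIString()).get();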

There are plenty of examples of jsoup usage on the net.
Document document = Jsoup.connect(targetUrl).get(); // get the HTML page
Elements descElements = document
        .select("table#searchResult td:nth-child(2) font.detDesc"); // find elements by CSS selector
for (int i = 0; i < descElements.size(); i++) {
    String torrentDesc = descElements.get(i).html(); // get tag content
}
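The same loop is usually written with an enhanced for; text() gives the tag content with the markup stripped, where html() keeps it:
for (Element descElement : descElements) {
    String torrentDesc = descElement.text(); // tag content without inner HTML
    System.out.println(torrentDesc);
}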

Related

Google custom image search returns bad images

I want to get images from the Google Custom Search API. My problem is that I'm getting very weird images, no matter what I change in the settings.
keywords: empty
edition: free, with ads
image search: on
safe search: off
speech input: off
language: english
sites to search: -
restrictions: empty
search entire web: on
(Sorry if something is translated wrong; my UI is in German.)
Another user also had this problem, but his solution didn't help me: Google custom search - poor image results
So no matter what I change in the settings, I get the same images.
If I search "apfel" (English: apple), I get this image link:
https://scontent-atl3-1.cdninstagram.com/v/t51.2885-19/s150x150/31514744_140795226776868_4684314220345425920_n.jpg?_nc_ht=scontent-atl3-1.cdninstagram.com&_nc_ohc=FdhVBUbROnkAX9AJdVR&oh=ea552d4c8b23acd0a3c82d83632e0895&oe=5ECA7F0E
But when I search it in the UI, I get a different image.
It should not be the issue, but here is the code:
public static void main(String[] args) throws Exception {
    String key = "";
    String cx = "";
    String keyword = "apfel";
    URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key + "&cx=" + cx + "&q=" + keyword);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String output;
    System.out.println("Output from Server .... \n");
    while ((output = br.readLine()) != null) {
        if (output.contains("\"src\": \"")) {
            System.out.println(output); // will print the image links
        }
    }
    conn.disconnect();
}

Google custom image search - only returns website images

I want to fetch some Google Images results with the Google Custom Search API. But instead of the actual images, I'm getting the links of the website thumbnail images.
Maybe someone can tell me how to do that!
The Code:
public static void main(String[] args) throws Exception {
    String key = "";
    String cx = "";
    String keyword = "coke";
    URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key + "&cx=" + cx + "&q=" + keyword);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String output;
    System.out.println("Output from Server .... \n");
    while ((output = br.readLine()) != null) {
        if ((output.contains("jpg") || output.contains("png")) && output.contains("src")) {
            System.out.println(output); // will print the image links
        }
    }
    conn.disconnect();
}
Thanks a lot!
You aren't specifying that you want image search from Google. You are just searching for possible images in normal results. You'll need to add searchType=image.
Check this question and learn more about querying here.
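For example, with your snippet it would just be one more query parameter (key and cx as before):
URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key
        + "&cx=" + cx + "&q=" + keyword + "&searchType=image");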

Unable to parse img and name from amazon or flipkart pages using Jsoup

I am unable to get the main image and name for products at Amazon or Flipkart using Jsoup.
My Java/Jsoup code is:
// For Amazon
Connection connection = Jsoup.connect(url).timeout(5000).maxBodySize(1024*1024*10);
Document doc = connection.get();
Elements imgs = doc.select("img#landingImage");
Elements names = doc.select("span#productTitle");

// For Flipkart
Connection connection = Jsoup.connect(url).timeout(5000).maxBodySize(1024*1024*10);
Document doc = connection.get();
Elements imgs = doc.select("img.productImage.current");
Elements names = doc.select("h1.title");
Can someone please point out what I am missing here?
URLs I have used are:
http://www.flipkart.com/lenovo-yoga-2-tablet-android-10-inch/p/itmeyqkznqa2zjf5?pid=TABEYQKXWAXMSGER&srno=b_2&offer=ExchangeOffer_LenovoYoga.&ref=9ea008ab-ae95-4f52-8ef7-3ef1a54947ae
and
http://www.amazon.com/gp/product/B00LZGBU3Y/ref=s9_psimh_gw_p504_d0_i5?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=0ESK1KNE31TBRVC8115Q&pf_rd_t=36701&pf_rd_p=1970559082&pf_rd_i=desktop
Also, I would like to do this parsing on the front end, if possible, using JavaScript and jQuery.
Is there a way to do the same?
Found out the issue.
Jsoup on GAE works when we use the URL Fetch service via java.net.URL, like this:
private String read(String url) throws IOException {
    URL urlObj = new URL(url);
    BufferedReader reader = new BufferedReader(new InputStreamReader(urlObj.openStream()));
    String line;
    StringBuffer sbuf = new StringBuffer();
    while ((line = reader.readLine()) != null) {
        if (line.trim().length() > 0)
            sbuf.append(line).append("\n");
    }
    reader.close();
    return sbuf.toString();
}
And then you use regular Jsoup as:
String html = read(url);
Document doc = Jsoup.parse(html);
Doing the above works very well.
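For example, combining the read helper with the Amazon selectors from the question (a sketch; the selectors may break whenever the site's markup changes):
String html = read(url); // url is the product page, e.g. the Amazon link above
Document doc = Jsoup.parse(html);
String name = doc.select("span#productTitle").text();
String img = doc.select("img#landingImage").attr("src");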

HttpsURLConnection and POST

Some time ago I wrote this program in Python; it logged in to a website over HTTPS, took some info, and logged out.
The program was quite simple:
class Richiesta(object):
    def __init__(self, url, data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k) + "=" + str(v) + "&"
        if self.data == "":
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url, self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl + '/www/')
        self.req.add_header('Cookie', cookie)

    def leggi(self):
        while self.content == "":
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                print("Errore del server, nuovo tentativo tra 15 secondi")  # "Server error, retrying in 15 seconds"
                time.sleep(15)
            except urllib2.URLError, e:
                print("Problema di rete, proverò a riconnettermi tra 20 secondi")  # "Network problem, reconnecting in 20 seconds"
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')

def login(username, password):
    global cookie
    print("Inizio la procedura di login")  # "Starting the login procedure"
    url = "https://example.com/auth/Authenticate"
    data = {"login": "1", "username": username, "password": password}
    f = Richiesta(url, data)
    f.leggi()
Now, for some reason, I have to translate it into Java. Until now, this is what I've written:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;

public class SafeReq {
    String baseurl = "http://www.example.com";
    String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
    String content = "";

    public SafeReq(String s, String sid, String data) throws MalformedURLException {
        try {
            URL url = new URL(s);
            HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
            request.setUseCaches(false);
            request.setDoOutput(true);
            request.setDoInput(true);
            request.setFollowRedirects(true);
            request.setInstanceFollowRedirects(true);
            request.setRequestProperty("User-Agent", useragent);
            request.setRequestProperty("Referer", "http://www.example.com/www/");
            request.setRequestProperty("Cookie", "sid=" + sid);
            request.setRequestProperty("Origin", "http://www.example.com");
            request.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            request.setRequestProperty("Content-length", String.valueOf(data.length()));
            request.setRequestMethod("POST");
            OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
            post.write(data);
            post.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content += inputLine;
            }
            post.close();
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String leggi() {
        return content;
    }
}
The problem is, the login doesn't work, and when I try to get a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem pretty much the same, and I can't understand why I can't make the second one work... any idea?
Where do you get your sid from? From the symptoms, I would guess that your session cookie is not passed correctly to the server.
See this question for possible solution: Cookies turned off with Java URLConnection.
In general, I recommend using HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See the code examples (I guess the "Form based logon" example is appropriate in your case).
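A form-based login with Apache HttpClient 4.x might look roughly like this (a sketch; the URL and field names are taken from your Python code, everything else is an assumption):
CloseableHttpClient client = HttpClients.createDefault(); // keeps session cookies across requests
HttpPost post = new HttpPost("https://example.com/auth/Authenticate");
List<NameValuePair> form = new ArrayList<NameValuePair>();
form.add(new BasicNameValuePair("login", "1"));
form.add(new BasicNameValuePair("username", username));
form.add(new BasicNameValuePair("password", password));
post.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));
CloseableHttpResponse response = client.execute(post);
try {
    String body = EntityUtils.toString(response.getEntity(), "UTF-8");
} finally {
    response.close();
}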
Anyone looking for this in the future, take a look at HtmlUnit.
This answer has a nice example.
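A minimal HtmlUnit sketch for the same login (the form index and input names are assumptions; adjust them to the actual page):
WebClient webClient = new WebClient(); // handles cookies and redirects for you
HtmlPage page = webClient.getPage("https://example.com/auth/Authenticate");
HtmlForm form = page.getForms().get(0); // assuming the login form is the first one
form.getInputByName("username").setValueAttribute(username);
form.getInputByName("password").setValueAttribute(password);
HtmlPage result = form.getInputByName("login").click(); // assuming "login" is the submit input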

Google Language detection api replying error code 406

I am trying to use the Google language detection API. Right now I am using the sample available in the Google documentation, as follows:
public static String googleLangDetection(String str) throws IOException, JSONException {
    String urlStr = "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=";
    // String urlStr = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=Paris%20Hilton";
    URL url = new URL(urlStr + str);
    URLConnection connection = url.openConnection();
    // connection.addRequestProperty("Referer", "http://www.hpeprint.com");
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    while ((line = reader.readLine()) != null) {
        builder.append(line);
    }
    JSONObject json = new JSONObject(builder.toString());
    for (Iterator iterator = json.keys(); iterator.hasNext();) {
        String type = (String) iterator.next();
        System.out.println(type);
    }
    return json.getString("language");
}
But I am getting HTTP error code 406.
I am unable to understand what the problem is, as the Google search query (commented out) below it works fine.
The resulting language detection URL itself works fine when I open it in Firefox or IE, but it fails in my Java code.
Is there something I am doing wrong?
Thanks in advance,
Ashish
As a guess, whatever is being passed in on str has characters that are invalid in a URL, as error code 406 is Not Acceptable, which looks to be returned when there is a content encoding issue.
After a quick Google, it looks like you need to run your str through the java.net.URLEncoder class before appending it to the URL.
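A sketch of that approach, assuming str holds the raw query text:
String encoded = URLEncoder.encode(str, "UTF-8"); // percent-encodes characters that are not valid in a URL
URL url = new URL(urlStr + encoded);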
Found the answer at the following link:
HTTP URL Address Encoding in Java
Had to modify the code as follows:
URI uri = new URI("http","ajax.googleapis.com","/ajax/services/language/detect","v=1.0&q="+str,null);
URL url = uri.toURL();
