boolean keywordExist = false;
String url = bingSearchUrl.replace("keyword", "intitle:\"" + keyword + "\"");
try {
    Connection.Response res = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15")
            .referrer("http://www.bing.com")
            .method(Connection.Method.GET)
            .execute();
    Document doc = res.parse();
    System.out.println(url);
    Elements elements = doc.select("li[class^=b_algo]");
    System.out.println(doc.html());
    System.out.println(elements.html());
    // Bing bolds the matched term in each result snippet
    if (elements.html().contains("<strong>" + keyword + "</strong>")) {
        keywordExist = true;
        System.out.println("keyword exists");
    }
} catch (IOException e) {
    e.printStackTrace();
}
I'm trying to use Jsoup to check a list of keywords against Bing Search, but whenever I run my program, Jsoup connects to Bing's captcha page. Is there any way I can avoid this? I thought adding a user agent and referrer would fix it, but it doesn't seem to have any effect.
I used code similar to yours and got all the results. However, here are two points I noticed:
I think you should slow down between two searches, for example by adding a random pause of 3000 to 5000 ms (see the sketch after the sample code).
Don't forget to escape the query parameters.
SAMPLE CODE
String bingSearchUrl = "http://www.bing.com/search?q=keyword";
String keyword = "stackoverflow jsoup";
String uaString = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15";
// URL-encode the query so the quotes and spaces in intitle:"..." are escaped
String url = bingSearchUrl.replace("keyword", URLEncoder.encode("intitle:\"" + keyword + "\"", "UTF-8"));
Document doc = Jsoup.connect(url).userAgent(uaString).get();
System.out.println(doc.select("li h2"));
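For the pause between two searches, something like this is enough (a minimal sketch of the suggestion above; uses java.util.concurrent.ThreadLocalRandom):
// Sleep for a random 3000-5000 ms between searches to look less like a bot
long pause = 3000 + ThreadLocalRandom.current().nextLong(2001);
Thread.sleep(pause); // InterruptedException must be caught or declared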
Related
I have a Discord bot that I made in Java, and one of its purposes is to send an embedded link (I don't own the site) every time someone leaves the server. It worked the first 2-3 times, but every time after that I get the following exception:
java.io.IOException: Server returned HTTP response code: 403 for URL: ...
Example link:
https://signature.hzgaming.net/sig.php?name=Juntao_Lubu&style=1
I tried numerous solutions I've found online (with User-Agents and all that fancy stuff), but none of them seem to work for me.
Is there any other workaround for this?
Code:
String link = "https://signature.hzgaming.net/sig.php?name=" + allMembers.get(mEvent.getUser().getDiscriminatedName()).replace(" ", "_") + "&style=1";
URLConnection urlCon = new URL(link).openConnection();
urlCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.29 Safari/537.36");
// Read the whole response body into a string
StringBuilder textBuilder = new StringBuilder();
try (Reader reader = new BufferedReader(new InputStreamReader(urlCon.getInputStream(), StandardCharsets.UTF_8))) {
    int c;
    while ((c = reader.read()) != -1) {
        textBuilder.append((char) c);
    }
}
// Strip the HTML tags so the "no such player" message can be compared as plain text
String result = textBuilder.toString().replaceAll("<[^>]*>", "");
if (!result.equalsIgnoreCase("Non-Existant Player") && !result.equalsIgnoreCase("Non-ExistantPlayer")) {
    new MessageBuilder().append(link).send((TextChannel) server.getChannelById(973242211623895080L).get());
}
Thanks in advance.
I am trying to scrape this web page. I need to find all names under "Rank name"
Website: https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1
But I am having a major issue: only the first match is found (Lynx Titan). log(m.group(1).split("\"")[0]); matches Lynx Titan, but if I change the [0] to [1] or [2] it does not work, even though it should display the next name. Any help is appreciated.
public void getRsn() throws IOException {
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream()));
String s = serverResponse.lines().collect(Collectors.joining());
Matcher m = Pattern.compile("user1=(.*)\">").matcher(s);
while(m.find()) {
log(m.group(1).split("\"")[0]);
}
serverResponse.close();
}
Your problem is your regex: you're only matching one user. Also, each iteration of the while loop will find another user, not each element of the split (which you actually don't need).
Use a regex that matches every user, and only the user's name:
Matcher m = Pattern.compile("(?<=user1=)[^\"]+").matcher(s);
while(m.find()) {
System.out.println(m.group()); // the entire match is the user name
}
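You can sanity-check the pattern without a network call; here is a minimal sketch using a made-up HTML fragment shaped like the hiscores links:
// Hypothetical fragment, not the real page markup
String html = "<a href=\"overall?user1=Lynx+Titan\"> ... <a href=\"overall?user1=Hey+Jase\">";
Matcher m = Pattern.compile("(?<=user1=)[^\"]+").matcher(html);
while (m.find()) {
    System.out.println(m.group()); // prints "Lynx+Titan", then "Hey+Jase"
}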
Try it this way:
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
String s = serverResponse.lines().collect(Collectors.joining());
serverResponse.close();
Matcher m = Pattern.compile("(user1=)([a-zA-Z0-9\\s\\u00A0]+)(\">)").matcher(s);
while(m.find()) {
    System.out.println(m.group(2).replaceAll("\\u00A0", " "));
}
}
My output was,
Lynx Titan
Hey Jase
ShawnBay
senZe
Tomdabom
Karma
Harmony
DedWilson
GodTormentor
Vinny
borsi
Brundeen
Aziz
Eeli
baile y
gaslighter73
Dan Gleesac
blind idiot
he box jonge
Gustav
Randalicious
Oskar
Killzone
moksi
Capt King
Update
The charset is actually ISO_8859_1, not UTF_8, so the answer can be improved:
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream(), StandardCharsets.ISO_8859_1));
String s = serverResponse.lines().collect(Collectors.joining());
serverResponse.close();
Matcher m = Pattern.compile("(user1=)([a-zA-Z0-9 \\u00A0]+)(\">)").matcher(s);
while(m.find()) {
    System.out.println(m.group(2));
}
}
Note: there are two different space characters in the document, a regular space and a non-breaking space (U+00A0).
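If you would rather not deal with both characters in the regex, you can normalize the non-breaking spaces up front (a small sketch of the same idea):
// Replace non-breaking spaces (U+00A0) with regular spaces before matching
s = s.replace('\u00A0', ' ');
Matcher m = Pattern.compile("(user1=)([a-zA-Z0-9 ]+)(\">)").matcher(s);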
I've read on this page that this has something to do with the user agent used, but I can't find a way to get the one used by Google.
I'm trying to fetch the HTML contents of, say, https://www.kayak.fr/flights/TLS-ATH/2019-10-04/2019-10-07?sort=price_a. When I click on "View page source" in Google Chrome, I see the prices (what I need), but I can't access them with my Java code.
Do I have to find the user agent of my Google Chrome? I found this page, but I'm getting the exact same result as before when using Java.
Any ideas?
Here's my code:
try {
    URL url = new URL("https://www.kayak.fr/flights/TLS-ATH/2019-10-04/2019-10-07?sort=price_a");
    URLConnection con = url.openConnection();
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36");
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            System.out.println(line);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
The setRequestProperty value is fairly arbitrary in this code because I'm still testing.
I am trying to scrape Google News content from a specific time range, 1/1/2016 to 31/12/2016.
This code worked originally, but after running it several times it now fails with the HTTP error below.
I don't know whether I have set the user agent incorrectly or whether I am being blocked by Google.
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://ipv4.google.com/sorry/index?continue=http://www.google.com/search%253Fq%253Dstackoverflow%2526tbm%253Dnws%2526tbs%253Dcdr%2525253A1%2525252Ccd_min%2525253A5%2525252F30%2525252F2016%2525252Ccd_max%2525253A6%2525252F30%2525252F2016%2526start%253D0&q=EgTKLTckGKH5hsQFIhkA8aeDS-3IYZmr41q-m4rIMh7Uw7vC3wdLMgNyY24
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:676)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:249)
at javaapplication3.JavaApplication3.main(JavaApplication3.java:36)
Code here:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String news="&tbm=nws";
String string = google + URLEncoder.encode(search , charset) + news+"&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016";
String userAgent ="Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
int numberOfResultpages = 10; // grabs the first 10 pages of search results
for (int i = 0; i < numberOfResultpages; i++) {
Document document = Jsoup.connect(string).userAgent(userAgent).data("start", "" + i).get();
Elements links = document.select( ".r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:
for (String locale : langList){
sitemapPath = sitemapDomain+"/"+locale+"/"+sitemapName;
try {
Document doc = Jsoup.connect(sitemapPath)
.userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
.timeout(10000)
.get();
Elements element = doc.select("loc");
for (Element urls : element) {
System.out.println(urls.text());
}
} catch (IOException e) {
System.out.println(e);
}
}
Everything works perfectly most of the time. However, there are a few things I want to be able to do.
First off, sometimes a 404 or 500 status will be returned, or maybe a 301. With my code above, it will just print the error and move on to the next URL. What I would like is to report the URL status for every link: if the page connects, print 200; if not, print the relevant status code.
Secondly, I sometimes catch the error "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try to connect three times and, upon failing the third time, add the URL to a "failed" array so I can retry the failed connections in the future.
Can someone with more knowledge than me help me out?
The above threw an IOException for me rather than execute() returning the correct status code.
Using Jsoup 1.6.1, I had to change the above code to use ignoreHttpErrors(true).
Now the code returns the response rather than throwing an exception, and you can check the error codes/messages:
Connection.Response response = null;
try {
    response = Jsoup.connect(bad_url)
            .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
            .timeout(100000)
            .ignoreHttpErrors(true)
            .execute();
} catch (IOException e) {
    System.out.println("io - " + e);
}
// response stays null if execute() threw, e.g. when the host is unreachable
if (response != null) {
    System.out.println("Status code = " + response.statusCode());
    System.out.println("Status msg = " + response.statusMessage());
}
Output:
Status code = 404
Status msg = Not Found
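Since the response is available even on an error status, you can also still parse the body of the error page if you need it; as far as I know, Response.parse() works fine here:
Document errorDoc = response.parse(); // the body of the 404 page itself
System.out.println(errorDoc.title());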
For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:
Connection.Response response = Jsoup.connect(sitemapPath)
.userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
.timeout(10000)
.execute();
int statusCode = response.statusCode();
if(statusCode == 200) {
Document doc = response.parse();
Elements element = doc.select("loc");
for (Element urls : element) {
System.out.println(urls.text());
}
}
else {
System.out.println("received error code : " + statusCode);
}
Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, and so on, so you'll need to handle that. However, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if you've asked Jsoup to follow redirects, you won't see 30x response codes, because Jsoup will set the status code from the final page fetched.
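If you do want to see the 30x codes yourself, you can turn redirect following off; followRedirects(false) is part of Jsoup's Connection API:
Connection.Response response = Jsoup.connect(sitemapPath)
        .followRedirects(false) // report the 301/302 itself instead of the final page
        .execute();
System.out.println(response.statusCode()); // e.g. 301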
As for your second question, all you need is a loop around the code sample I just gave you, wrapped in a try/catch block for SocketTimeoutException. When you catch the exception the loop should continue; if you're able to get data, then return or break. A sketch is below. Shout if you need more help with it!
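Here is a minimal sketch of that retry loop; failedUrls and MAX_ATTEMPTS are illustrative names, not from your code (imports: java.net.SocketTimeoutException, java.util.ArrayList, java.util.List):
List<String> failedUrls = new ArrayList<>();
final int MAX_ATTEMPTS = 3;
for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
        Connection.Response response = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        System.out.println(sitemapPath + " -> " + response.statusCode());
        break; // got an answer from the server, stop retrying
    } catch (SocketTimeoutException e) {
        if (attempt == MAX_ATTEMPTS) {
            failedUrls.add(sitemapPath); // give up for now, retry this URL later
        }
    } catch (IOException e) {
        System.out.println(e); // some other I/O failure: report and move on
        break;
    }
}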