Why do views not increase when Java opens the pages? - java

I have code that uses Tor to get a new IP address each time, then opens a blog page, but the blog's view counter still does not increase. Why?
import java.io.InputStream;
import java.net.*;

public class test {
    public static void main(String[] args) throws Exception {
        System.out.println(test.getData("http://checkip.amazonaws.com"));
        System.out.println(test.getData("***BLOG URL***"));
    }

    public static String getData(String ur) throws Exception {
        // Route all connections through the local Tor SOCKS proxy
        String TOR_IP = "127.0.0.1", TOR_PORT = "9050";
        System.setProperty("java.net.preferIPv4Stack", "true");
        System.setProperty("socksProxyHost", TOR_IP);
        System.setProperty("socksProxyPort", TOR_PORT);

        URL url = new URL(ur);
        URLConnection c = url.openConnection();
        c.connect();

        // Read the response one byte at a time into a string
        InputStream i = c.getInputStream();
        StringBuilder s = new StringBuilder();
        int j;
        while ((j = i.read()) != -1) {
            s.append((char) j);
        }
        return s.toString();
    }
}
I wrote this just to understand what checks such a small automated script would have to get past.

This is an evolving field; blog platforms try to detect and thwart cheating. WordPress in particular excludes (https://en.support.wordpress.com/stats/):
visits from browsers that do not execute javascript or load images
In other words, just hitting the page doesn't count. You need to fetch all the resources and possibly execute the JavaScript as well.

Related

Search strings on google through Java and submit

I'm trying to make a program that submits a search query to Google and then opens the browser with the results.
I have managed to connect to Google but I'm stuck because I don't know how to insert the search query into the URL and submit it.
I have tried to use HtmlUnit but it doesn't seem to work.
This is the code so far:
URL url = new URL("http://google.com");
HttpURLConnection hr = (HttpURLConnection) url.openConnection();
System.out.println(hr.getResponseCode());
String str = "search from java!";
You can use the java.net package together with java.awt.Desktop to open the results in a browser. I used an additional method to build the Google search query, replacing spaces with %20 so the URL is valid:
public static void main(String[] args) {
    URI uri = null;
    String googleUrl = "https://www.google.com/search?q=";
    String searchQuery = createQuery("search from Java!");
    String query = googleUrl + searchQuery;
    try {
        uri = new URI(query);
        Desktop.getDesktop().browse(uri);
    } catch (IOException | URISyntaxException e) {
        e.printStackTrace();
    }
}

private static String createQuery(String query) {
    query = query.replaceAll(" ", "%20");
    return query;
}
The imports used are all core Java:
import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
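As a side note, if the query can contain characters other than spaces, java.net.URLEncoder will handle the encoding for you. A small sketch (the Charset overload shown here needs Java 10 or newer); note that it encodes spaces as "+", which is also valid in a query string:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Encodes spaces, '!' and other reserved characters for use in a query string.
String searchQuery = URLEncoder.encode("search from Java!", StandardCharsets.UTF_8);
// searchQuery is now "search+from+Java%21"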

Best way to get Amazon page and product information

I want to get Amazon page and product information from their website so I can work on a future project. I have no experience with APIs, but I also saw that I would need to pay in order to use Amazon's. My current plan was to use a WebRequest class that basically pulls down the page's raw text and then parses through it to get what I need. It pulls down HTML from all the websites I have tried except Amazon. When I try to use it for Amazon I get text like this...
??èv~-1?½d!Yä90û?¡òk6??ªó?l}L??A?{í??j?ì??ñF Oü?ª[D ú7W¢!?É?L?]â  v??ÇJ???t?ñ?j?^,Y£>O?|?I`OöN??Q?»bÇJPy1·¬Ç??RtâU??Q%vB??^íè|??ª?
Can someone explain to me why this happens? Or even better if you could point me towards a better way of doing this? Any help is appreciated.
This is the class I mentioned...
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

public class WebRequest {
    protected String url;
    protected ArrayList<String> pageText;

    public WebRequest() {
        url = "";
        pageText = new ArrayList<String>();
    }

    public WebRequest(String url) {
        this.url = url;
        pageText = new ArrayList<String>();
        load();
    }

    public boolean load() {
        boolean returnValue = true;
        try {
            URL thisURL = new URL(url);
            BufferedReader reader = new BufferedReader(new InputStreamReader(thisURL.openStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                pageText.add(line);
            }
            reader.close();
        } catch (Exception e) {
            returnValue = false;
            System.out.println("Failed to load " + url + ": " + e.getMessage());
        }
        return returnValue;
    }

    public boolean load(String url) {
        this.url = url;
        return load();
    }

    public String toString() {
        String returnString = "";
        for (String s : pageText) {
            returnString += s + "\n";
        }
        return returnString;
    }
}
It could be that the page is returned using a different character encoding than your platform default. If that's the case, you should specify the appropriate encoding, e.g.:
new InputStreamReader(thisURL.openStream(), "UTF-8")
But that data doesn't look like character data at all to me. It's too random. It looks like binary data. Are you sure you're not downloading an image by mistake?
If you want to make more sophisticated HTTP requests, there are quite a few Java libraries, e.g. OkHttp and AsyncHttpClient.
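For instance, a minimal OkHttp fetch looks roughly like this (a sketch, assuming the okhttp dependency is on the classpath; the browser-like User-Agent header is just an illustration, since some sites respond differently to non-browser clients). OkHttp also transparently decompresses gzip responses, which can help with garbled output like the one shown above:

import java.io.IOException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url("https://www.amazon.com/")          // page to fetch
                .header("User-Agent", "Mozilla/5.0")     // illustrative browser-like User-Agent
                .build();
        // execute() returns a Response that should be closed; try-with-resources handles that
        try (Response response = client.newCall(request).execute()) {
            String body = response.body().string();      // decoded, decompressed page text
            System.out.println(body.substring(0, Math.min(500, body.length())));
        }
    }
}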
But it's worth bearing in mind that Amazon probably doesn't like people scraping its site, and will have built in detection of malicious or unwanted activity. It might be sending you gibberish on purpose to deter you from continuing. You should be careful because some big sites may block your IP temporarily or permanently.
My advice would be to learn how to use the Amazon APIs. They're pretty powerful—and you won't get yourself banned.

How do I fetch a different url from a page in java?

I am working on a program to download the first 100 comics from the XKCD website; however, the page URL on XKCD differs from the image URL. For the sake of ease, I was wondering if there is a simple way to grab the URL of the image after going to the XKCD page URL. Here is my code:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;

public class XKCD {
    public static void saveImage(String imageUrl, int i) throws IOException {
        URL url = new URL(imageUrl);
        String fileName = url.getFile();
        String destName = i + fileName.substring(fileName.lastIndexOf("/"));
        System.out.println(destName);

        // Copy the stream to a local file in 2 KB chunks
        InputStream is = url.openStream();
        OutputStream os = new FileOutputStream(destName);
        byte[] b = new byte[2048];
        int length;
        while ((length = is.read(b)) != -1) {
            os.write(b, 0, length);
        }
        is.close();
        os.close();
    }

    public static void main(String[] args) throws MalformedURLException, IOException {
        for (int i = 1; i <= 100; i++) {
            saveImage("https://xkcd.com/" + i + "/", i);
        }
    }
}
XKCD has a JSON API: https://xkcd.com/about/
Is there an interface for automated systems to access comics and metadata?
Yes. You can get comics through the JSON interface, at URLs like http://xkcd.com/info.0.json (current comic) and http://xkcd.com/614/info.0.json (comic #614).
Here is a good java JSON library: https://github.com/stleary/JSON-java
It's REALLY easy to use; I have used it a lot.
So if you have the text from xkcd.com/info.0.json in a String txt, you can say:
import org.json.*;

JSONObject obj = new JSONObject(txt);
String url = obj.getString("img");
String titleText = obj.getString("alt");
int year = Integer.parseInt(obj.getString("year"));
int num = obj.getInt("num");   // "num" is a JSON number, not a string
int month = Integer.parseInt(obj.getString("month"));
int day = Integer.parseInt(obj.getString("day"));
String title = obj.getString("title");
Image img = downloadImageOrWhateverYouDoWithTheImageURL(url);
This should work.
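For completeness, here is one way to get that JSON text into txt for a given comic number, using plain java.net (just a sketch; fetchJson is a hypothetical helper name):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads https://xkcd.com/<n>/info.0.json and returns it as a String.
public static String fetchJson(int comicNumber) throws Exception {
    URL url = new URL("https://xkcd.com/" + comicNumber + "/info.0.json");
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
        }
    }
    return sb.toString();
}

// Usage: String txt = fetchJson(614);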
I suggest using jsoup for that. It can produce an absolute URL from a relative link.
You can import the library into your project using:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>
And you can get the absolute URL of the image with code as simple as this:
public static void main(String[] args) throws IOException {
    Document document = Jsoup.connect("https://xkcd.com/").get();
    Elements links = document.select("img");
    links.stream()
         .map(link -> link.absUrl("src"))
         .filter(str -> str.contains("/comics"))
         .forEach(System.out::println);
}
If you run this code you will see the image URL printed out on the console:
https://imgs.xkcd.com/comics/river_border.png
The issue here is that you're calling the saveImage method with the page URL, not the image URL.
Fetch the page itself, then use a regex to pull the image URL out of a line like this example string:
"Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/barrel_cropped_(1).jpg"

How to check and record URL redirection?

I am writing a web spider. I want to crawl a bunch of pages from the web, and I have succeeded in part of my goal: I have hundreds of URLs on hand. But those links are not the final links. That is, when you put such a URL into a web browser like Google Chrome, it is automatically redirected to another page, which is the one I want. But that only works in a web browser; when I write code to crawl from that URL, the redirection does not happen.
Some example:
given (URL_1):
http://weixin.sogou.com/websearch/art.jsp?sg=CBf80b2xkgZ8cxz1-SgG-dBH_4QL8uVunUQKxf0syVWvynE5nPZm2TPqNuEF6MO2xv0MclVANfsVYUGr5-1b3ls29YYxgU27ra8qaaU15iv7KVkBsZp5Td27Cb2A24cIwEuw__0ZHdPeivmW-kcfnw..&url=p0OVDH8R4SHyUySb8E88hkJm8GF_McJfBfynRTbN8wjVuWMLA31KxFCrZAW0lIGG1EpZGR0F1jdIzWnvINEMaGQ3JxMQ33742MRcPWmNX2CMTFYIzOo-v8LrDlfP2AnF54peD-GxvCNYy-5x5In7jJFmExjqCxhpkyjFvwP6PuGcQ64lGQ2ZDMuqxplQrsbk
put this link in a browser and it is automatically redirected to (URL_2):
http://mp.weixin.qq.com/s?__biz=MzA4OTIxOTA4Nw==&mid=404672464&idx=1&sn=bdfff50b8e9ac28739cf8f8a51976b03&3rd=MzA3MDU4NTYzMw==&scene=6#rd
which is a different link.
But with Python code like:
response=urllib2.urlopen(URL_1)
print response.read()
that auto-redirection doesn't happen!
In short, my question is: given a URL, how do I get the URL it redirects to?
Somebody gave me some Java code that works in some other situations, but it doesn't help in mine:
import java.net.HttpURLConnection;
import java.net.URL;

public class Main {
    public void test() throws Exception {
        String expectedURL = "http://www.zhihu.com/question/20583607/answer/16597802";
        String url = "http://www.baidu.com/link?url=ByBJLpHsj5nXx6DESXbmMjIrU5W4Eh0yg5wCQpe3kCQMlJK_RJBmdEYGm0DDTCoTDGaz7rH80gxjvtvoqJuYxK";
        String redirtURL = getRedirectURL(url);
        if (redirtURL.equals(expectedURL)) {
            System.out.println("Equal");
        } else {
            System.out.println(url);
            System.out.println(redirtURL);
        }
    }

    public String getRedirectURL(String path) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(path).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(5000);
        return conn.getHeaderField("Location");
    }

    public static void main(String[] args) throws Exception {
        Main obj = new Main();
        obj.test();
    }
}
It prints Equal in this case, which means we can get expectedURL from url. But this does not work in the former case. (I don't know why, but looking carefully at URL_1 above and at the url in the Java code, I notice an interesting difference: the url in the Java code contains a snippet like .../link?url=..., which probably indicates a redirect, while URL_1 contains .../art.jsp?sg=... instead.)
Look for the allow_redirects option. In Python, you can do it e.g. with requests (redirects are followed by default for GET):
import requests
response = requests.get('http://example.com', allow_redirects=True)
print response.url
# history contains list of responses for redirects
print response.history
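If you need this on the Java side, one possible approach (a sketch, not a drop-in fix; resolveRedirects is a name I chose) is to follow Location headers in a loop until no more HTTP redirects occur. Note that this only catches HTTP-level (3xx) redirects; pages that redirect via JavaScript or a meta refresh will still look like final pages to this code:

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: follow HTTP 3xx redirects until there are none left (capped to avoid loops).
public static String resolveRedirects(String start) throws Exception {
    String current = start;
    for (int hops = 0; hops < 10; hops++) {
        HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(5000);
        int code = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        if (code / 100 != 3 || location == null) {
            return current;                                  // no further HTTP redirect
        }
        current = new URL(new URL(current), location).toString(); // resolve relative Location
    }
    return current;
}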

Determining parameters on crawler4j

I am trying to use crawler4j as shown in this example, but no matter how I define the number of crawlers or change the root folder, I keep getting this error from the code:
"Needed parameters:
rootFolder (it will contain intermediate crawl data)
numberOfCralwers (number of concurrent threads)"
The main code is below:
public class Controller {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Needed parameters: ");
            System.out.println("\t rootFolder (it will contain intermediate crawl data)");
            System.out.println("\t numberOfCralwers (number of concurrent threads)");
            return;
        }

        /*
         * crawlStorageFolder is a folder where intermediate crawl data is
         * stored.
         */
        String crawlStorageFolder = args[0];

        /*
         * numberOfCrawlers shows the number of concurrent threads that should
         * be initiated for crawling.
         */
        int numberOfCrawlers = Integer.parseInt(args[1]);
There was a similar question asking exactly what I want to know here, but I didn't quite understand the solution, for example where I was supposed to type java BasicCrawler Controller "arg1" "arg2". I am running this code in Eclipse and I am still fairly new to programming. I would really appreciate it if someone helped me understand this problem.
If you aren't passing any arguments when you run the file, you will get that error.
Comment out the following in your code, or delete it:
if (args.length != 2) {
    System.out.println("Needed parameters: ");
    System.out.println("\t rootFolder (it will contain intermediate crawl data)");
    System.out.println("\t numberOfCralwers (number of concurrent threads)");
    return;
}
After that, set your root folder to the one where you want to store the intermediate crawl data.
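In other words, replace the argument parsing with hard-coded values, something like this (the folder path is just an example):

// Hard-coded replacements for args[0] and args[1] (example values):
String crawlStorageFolder = "C:/crawler/data";  // rootFolder: where intermediate crawl data is stored
int numberOfCrawlers = 2;                       // number of concurrent crawler threads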
To use crawler4j in your project you must create two classes. One of them is the CrawlController (which starts the crawler according to the parameters) and the other is the Crawler.
Just run the main method in the Controller class and watch the crawled page URLs being printed.
Here is Controller.java file:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        RobotstxtConfig robotstxtConfig2 = new RobotstxtConfig();
        System.out.println(robotstxtConfig2.getCacheSize());
        System.out.println(robotstxtConfig2.getUserAgentName());

        String crawlStorageFolder = "/crawler/testdata";
        int numberOfCrawlers = 4;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        System.out.println(robotstxtConfig.getCacheSize());
        System.out.println(robotstxtConfig.getUserAgentName());
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://cyesilkaya.wordpress.com/");
        controller.start(Crawler.class, numberOfCrawlers);
    }
}
Here is Crawler.java file:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {
    @Override
    public boolean shouldVisit(WebURL url) {
        // You can write your own filter here to decide whether to crawl the incoming URL or not.
        return true;
    }

    @Override
    public void visit(Page page) {
        // Called for every page that has been fetched; just print its URL.
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
In Eclipse:
-> Click on Run
-> Click on Run Configurations...
In the pop-up window:
First, in the left column, make sure your application is selected under Java Application; otherwise create a new one (click New).
Then, in the central window, go to the "Arguments" tab.
Write your arguments under "Program arguments". Once you have written your first argument, press Enter for the second argument, and so on (one argument per line, because args is an array).
Then click Apply,
and click Run.
