web crawler in java. downloading web page issue

I am trying to develop a small web crawler that downloads web pages and searches for links in a specific section. But when I run this code, the links in the "href" attributes come out shortened. For example:
original link: "/kids-toys-action-figures-accessories/b/ref=toys_hp_catblock_actnfigs?ie=UTF8&node=165993011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-4&pf_rd_r=267646F4BB25430BAD0D&pf_rd_t=101&pf_rd_p=1582921042&pf_rd_i=165793011"
turned into: "/kids-toys-action-figures-accessories/b?ie=UTF8&node=165993011"
Can anybody help me, please? Below is my code:
package test;
import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
public class myFirstWebCrawler {
public static void main(String[] args) {
String strTemp = "";
String dir="d:/files/";
String filename="hello.txt";
String fullname=dir+filename;
try {
URL my_url = new URL("http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433");
BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream(),"utf-8"));
createdir(dir);
while(null != (strTemp = br.readLine())){
writetofile(fullname,strTemp);
System.out.println(strTemp);
}
System.out.println("index of feature category : " + readfromfile(fullname,"Featured Categories"));
} catch (Exception ex) {
ex.printStackTrace();
}
}
public static void createdir(String dirname)
{ File d= new File(dirname);
d.mkdirs();
}
public static void writetofile(String path, String bbyte)
{
try
{
FileWriter filewriter = new FileWriter(path,true);
BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
bufferedWriter.write(bbyte);
bufferedWriter.newLine();
bufferedWriter.close();
}
catch(IOException e)
{System.out.println("Error");}
}
public static int readfromfile(String path, String key)
{
String dir="d:/files/";
String filename="hello1.txt";
String fullname=dir+filename;
linksAndAt[] linksat=new linksAndAt[10];
BufferedReader bf = null;
try {
bf = new BufferedReader(new FileReader(path));
} catch (FileNotFoundException e1) {
e1.printStackTrace();
}
String currentLine;
int index =-1;
try{
Runtime.getRuntime().exec("cls");
while((currentLine = bf.readLine()) != null)
{
index=currentLine.indexOf(key);
if(index>0)
{
writetofile(fullname,currentLine);
int count=0;
int lastIndex=0;
while(lastIndex != -1)
{
lastIndex=currentLine.indexOf("href=\"",lastIndex);
if(lastIndex != -1)
{
lastIndex+="href=\"".length();
StringBuilder sb = new StringBuilder();
while(currentLine.charAt(lastIndex) != '\"')
{
sb.append(Character.toString(currentLine.charAt(lastIndex)));
lastIndex++;
}
count++;
System.out.println(sb);
}
}
System.out.println("\n count : " + count);
return index;
}
}
}
catch(FileNotFoundException f)
{
f.printStackTrace();
System.out.println("Error");
}
catch(IOException e)
{try {
bf.close();
} catch (IOException e1) {
e1.printStackTrace();
}}
return index;}
}

This looks to me like a situation where the server is responding differently to requests from your desktop browser and from your Java-based crawler. That could be because your browser sends cookies (such as session cookies) that your crawler does not, because your browser sends a different User-Agent header, or because other request headers differ between the two clients.
When writing crawlers, this is one of the biggest issues one runs into: it's easy to forget that the same URL requested by different clients won't always return the same content. I'm not sure that's what's happening to you here, but it's very common.
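One way to test the header theory is to make the crawler send browser-like request headers and compare the HTML it gets back. The following is only a minimal sketch: the class name BrowserLikeFetch, the example address, and the header values are assumptions for illustration, so copy the real values from your browser's network inspector.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class BrowserLikeFetch {

    // Hypothetical header values; replace them with what your own browser sends.
    static Map<String, String> browserLikeHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    // Prepares a request carrying the given headers; nothing is sent
    // until the response is actually read.
    static HttpURLConnection open(String address, Map<String, String> headers) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            conn.setRequestProperty(header.getKey(), header.getValue());
        }
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass the page you are crawling as the first argument.
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        HttpURLConnection conn = open(address, browserLikeHeaders());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

If the links still differ with matching headers, cookies are the next thing to compare; they can be sent the same way through a "Cookie" request property.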

Related

Selenium: Opening a Downloaded File in Java

I need to open and read a downloaded file using Selenium, and I'm not quite sure how to do it. I see answers that suggest downloading the file to a selected location. Does my code really need to start by downloading the file to a selected location, or can it start directly after the download?
After opening the file I must also read it. Can anyone give me an idea of how to do this? Thank you!

You can read the file using the following code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
public class ReadFileExample1 {
private static final String FILENAME = "E:\\test\\filename.txt";
public static void main(String[] args) {
BufferedReader br = null;
FileReader fr = null;
try {
fr = new FileReader(FILENAME);
br = new BufferedReader(fr);
String sCurrentLine;
while ((sCurrentLine = br.readLine()) != null) {
System.out.println(sCurrentLine);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null)
br.close();
if (fr != null)
fr.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}
Hope it will help you.

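You can use this method to wait for a file download to complete in Chrome or Firefox. It assumes a logger field and FilenameUtils from Apache Commons IO (org.apache.commons.io.FilenameUtils).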
public static File waitForDownloadToComplete(File downloadPath, String fileName) throws Exception {
boolean isFileFound = false;
int waitCounter = 0;
while (!isFileFound) {
logger.info("Waiting For Download To Complete....");
for (File tempFile : downloadPath.listFiles()) {
if (tempFile.getName().contains(fileName)) {
String tempEx = FilenameUtils.getExtension(tempFile.getName());
// crdownload - For Chrome, part - For Firefox
if (tempEx.equalsIgnoreCase("crdownload") || tempEx.equalsIgnoreCase("part")) {
Thread.sleep(1000);
} else {
isFileFound = true;
logger.info("Download To Completed....");
return tempFile;
}
}
}
Thread.sleep(1000);
waitCounter++;
if (waitCounter > 25) {
isFileFound = true;
}
}
throw new Exception("File Not Downloaded");
}

Reading txt file online

Consider following
Code
private String url = "https://celestrak.org/NORAD/elements/resource.txt";
@Override
public Boolean crawl() {
try {
// Timeout is set to 20s
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT).timeout(20 * 1000);
Document htmlDocument = connection.get();
// 200 is the HTTP OK status code
if (connection.response().statusCode() == 200) {
System.out.println("\n**Visiting** Received web page at " + url);
} else {
System.out.println("\n**Failure** Web page not recieved at " + url);
return Boolean.FALSE;
}
if (!connection.response().contentType().contains("text/plain")) {
System.out.println("**Failure** Retrieved something other than plain text");
return Boolean.FALSE;
}
System.out.println(htmlDocument.text()); // Here it print whole text file in one line
} catch (IOException ioe) {
// We were not successful in our HTTP request
System.err.println(ioe);
return Boolean.FALSE;
}
return Boolean.TRUE;
}
Output
SCD 1 1 22490U 93009B 16329.83043855 .00000228 00000-0 12801-4 0 9993 2 22490 24.9691 122.2579 0043025 337.9285 169.5838 14.44465946256021 TECHSAT 1B (GO-32) 1 25397U ....
I am trying to read an online text file (from https://celestrak.org/NORAD/elements/resource.txt). The problem is that when I print or save the body's text, the whole file comes out on one line. I want to read it split on \n so that I can process it line by line. Am I making a mistake in how I read the file?
I am using Jsoup.

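You can do it without using Jsoup. Jsoup parses the response as an HTML document, and text() normalizes whitespace, which is why the line breaks disappear. Reading the raw response avoids that: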
public static void main(String[] args) {
String data;
try {
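// Pass the charset explicitly (java.nio.charset.StandardCharsets); the one-argument overload is deprecated
data = IOUtils.toString(new URL("https://celestrak.com/NORAD/elements/resource.txt"), StandardCharsets.UTF_8);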
for (String line : data.split("\n")) {
System.out.println(line);
}
} catch (IOException e1) {
e1.printStackTrace();
}
}
The above code uses org.apache.commons.io.IOUtils.
If adding the Commons library is an issue, you can use the code below:
public static void main(String[] args) {
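Reader reader;
try {
// The JDK has no URLReader class; wrap the URL's input stream in an InputStreamReader instead
reader = new InputStreamReader(new URL("https://celestrak.com/NORAD/elements/resource.txt").openStream(), "UTF-8");
BufferedReader bufferedReader = new BufferedReader(reader);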
String sCurrentLine;
while ((sCurrentLine = bufferedReader.readLine()) != null) {
System.out.println(sCurrentLine);
}
bufferedReader.close();
} catch (MalformedURLException e1) {
e1.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

Since the file is already delimited by line separators, we can simply take the input stream from the URL and read its contents line by line:
String url = "https://celestrak.com/NORAD/elements/resource.txt";
List<String> text = new BufferedReader(new InputStreamReader(new URL(url).openStream())).lines().collect(Collectors.toList());
To convert to a String
String content = new BufferedReader(new InputStreamReader(new URL(url).openStream())).lines()
.collect(Collectors.joining(System.getProperty("line.separator")));
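The one-liners above never close the stream they open. A minimal sketch of the same idea with try-with-resources might look like the following. The class ReadLines and the helper readLines are hypothetical names, UTF-8 is an assumed encoding, and the sample text in main is made up so that the demonstration runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class ReadLines {

    // Reads every line from the source and closes it afterwards,
    // even if reading fails part-way through.
    static List<String> readLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration with an in-memory reader (made-up sample text).
        System.out.println(readLines(new StringReader("first line\nsecond line\nthird line")));

        // With a URL as the first argument, the remote file is read the same way.
        if (args.length > 0) {
            Reader remote = new InputStreamReader(
                    new URL(args[0]).openStream(), StandardCharsets.UTF_8);
            readLines(remote).forEach(System.out::println);
        }
    }
}
```

Taking a Reader rather than a URL keeps the helper usable for local files and strings as well.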

Get html code from an internet page opened in my browser

What I want to do is open a web page in my browser (Chrome) and get the HTML source of the page just opened from my Java application.
I don't want to fetch the source of a URL; I want a program that connects to the browser and gets the HTML of the page that is currently open.
For example, if I open YouTube in my browser, I want my application to get the current page's HTML (in this case, YouTube's). Sorry if my English is not very good.

You can do this:
import java.io.*;
import java.net.*;
import java.util.*;
public static void main(String[] args) {
Scanner input = new Scanner(System.in);
URL url;
InputStream is = null;
BufferedReader br;
String line;
try {
String urlInput = input.nextLine();
url = new URL(urlInput);
is = url.openStream(); // throws an IOException
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
if (is != null) is.close();
} catch (IOException ioe) {
// nothing to see here
}
}
}
I got this from here: How do you Programmatically Download a Webpage in Java

Try this out:
Pass the URL as the first command-line argument and the program prints the page's HTML:
public static void main(String[] args) throws IOException {
URL u;
try {
u = new URL(args[0]);
} catch (MalformedURLException e) {
// Without a valid URL there is nothing to fetch, so report it and stop
e.printStackTrace();
return;
}
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(u.openStream()))) {
String line;
while ((line = in.readLine()) != null) {
System.out.println(line); // println keeps the page's line breaks
}
}
}

Best way to read a text file line by line, taking each line and putting it into the code

First of all, I am but a lowly web-programmer so have very little experience with actual programming.
I have been given a list of 30,000 urls and I am not going to waste my time clicking each one to check if they are valid - is there a way to read through the text file that they are in and have a program check each line?
The code I currently have is in java as really that's all I know so if there's a better language again, please let me know.
Here is what I have so far:
public class UrlCheck {
public static void main(String[] args) throws IOException {
URL url = new URL("http://www.google.com");
//Need to change this to make it read from text file
try {
InputStream inp = null;
try {
inp = url.openStream();
} catch (UnknownHostException ex) {
System.out.println("Invalid");
}
if (inp != null) {
System.out.println("Valid");
}
} catch (MalformedURLException exc) {
exc.printStackTrace();
}
}
}

First, read the file line by line using a BufferedReader and check each line. The code below should work. It is up to you to decide what to do when you encounter an invalid URL: you could just print it, as shown here, or write it to another file.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
public class UrlCheck {
public static void main(String[] args) throws IOException {
BufferedReader br = new BufferedReader(new FileReader("_filename"));
String line;
while ((line = br.readLine()) != null) {
if(checkUrl(line)) {
System.out.println("URL " + line + " was OK");
} else {
System.out.println("URL " + line + " was not VALID"); //handle error as you like
}
}
br.close();
}
private static boolean checkUrl(String pUrl) throws IOException {
try {
// Construct the URL inside the try so that a malformed line is reported as invalid instead of aborting the run
URL url = new URL(pUrl);
InputStream inp = null;
try {
inp = url.openStream();
} catch (UnknownHostException ex) {
System.out.println("Invalid");
return false;
}
if (inp != null) {
System.out.println("Valid");
return true;
}
} catch (MalformedURLException exc) {
exc.printStackTrace();
return false;
}
return true;
}
}
The checkUrl method can also be simplified as follows:
private static boolean checkUrl(String pUrl) {
URL url = null;
InputStream inp = null;
try {
url = new URL(pUrl);
inp = url.openStream();
return inp != null;
} catch (IOException e) {
e.printStackTrace();
return false;
} finally {
try {
if (inp != null) {
inp.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}

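You could just use HttpURLConnection. If the URL is malformed or the host cannot be reached, an IOException is thrown; otherwise you get an HTTP status code to inspect.
HttpURLConnection connection = null;
try {
URL myurl = new URL("http://www.myURL.com");
connection = (HttpURLConnection) myurl.openConnection();
// Use a HEAD request so that only the headers are transferred, which reduces load
connection.setRequestMethod("HEAD");
int code = connection.getResponseCode();
System.out.println("" + code);
} catch (IOException e) {
// Handle invalid or unreachable URL (MalformedURLException is a subclass of IOException)
} finally {
if (connection != null) {
connection.disconnect();
}
}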

I am unsure of your experience, but a multi-threaded solution is possible here. As you read through the text file, store the URLs in a thread-safe structure and let a number of threads open the connections. This makes for a more efficient solution, since testing 30,000 URLs one after another may take a while.
Check out a producer-consumer example if you are unsure:
http://www.journaldev.com/1034/java-blockingqueue-example-implementing-producer-consumer-problem
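As a rough sketch of the multi-threaded idea, the following uses a fixed thread pool from java.util.concurrent rather than the hand-written BlockingQueue producer-consumer from the linked article. The class name ParallelUrlCheck, the pool size of 16, the 5-second timeouts, and the rule that any status below 400 counts as valid are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ParallelUrlCheck {

    // Runs the checker on every URL using a fixed pool of worker threads.
    // Results keep the input order; a duplicate URL appears only once.
    static Map<String, Boolean> checkAll(List<String> urls, Predicate<String> checker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<Boolean> task = () -> checker.test(url);
                pending.add(pool.submit(task));
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                boolean ok;
                try {
                    ok = pending.get(i).get(); // waits for this check to finish
                } catch (ExecutionException e) {
                    ok = false; // the checker itself threw an exception
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    ok = false;
                }
                results.put(urls.get(i), ok);
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Assumed validity rule: a HEAD request that answers with a status below 400.
    static boolean isReachable(String address) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() < 400;
        } catch (IOException | ClassCastException e) {
            return false; // malformed, unreachable, or not an HTTP(S) URL
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ParallelUrlCheck <file-with-one-url-per-line>");
            return;
        }
        List<String> urls = Files.readAllLines(Paths.get(args[0]));
        checkAll(urls, ParallelUrlCheck::isReachable, 16)
                .forEach((url, ok) -> System.out.println((ok ? "OK        " : "NOT VALID ") + url));
    }
}
```

Passing the check in as a Predicate keeps the threading separate from the network code, so the pool logic can be tried out with a fake checker before it is pointed at 30,000 real URLs.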

public class UrlCheck {
public static void main(String[] args) {
try {
URL url = new URL("http://www.google.com");
//Open the Http connection
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
//Get the http response code
int responseCode = connection.getResponseCode();
if (responseCode == HttpURLConnection.HTTP_OK) //if the http response code is 200 OK so the url is valid
{
System.out.println("Valid");
} else //Else the url is not valid
{
System.out.println("Invalid");
}
} catch (MalformedURLException ex) {
System.out.println("Invalid");
} catch (IOException ex) {
System.out.println("Invalid");
}
}
}

Why does the integer written to a file get read as a different value?

I've got a program where I need to generate an integer, write it to a text file and read it back the next time the program runs. After some anomalous behavior, I've stripped it down to setting an integer value, writing it to a file and reading it back for debugging.
totScore is set to 25, and when I print to the console prior to writing to the file, I see a value of 25. However, when I read the file and print to the console, I get three values: 25, 13, and 10. Viewing the text file in Notepad shows a character that is not on the keyboard, so I suspect that the file is being stored as something other than an int.
Why do I get different results from my write and read steps?
Is it not being written as an int? How are these values being stored in the file? Do I need to cast the read value as something else and convert it to an integer?
Consider:
import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.nio.file.*;
import java.nio.file.StandardOpenOption.*;
//
public class HedgeScore {
public static void main(String[] args) {
int totScore = 25;
OutputStream outStream = null; ///write
try {
System.out.println("totscore="+totScore);
BufferedWriter bw = new BufferedWriter(new FileWriter(new File("hedgescore.txt")));
bw.write(totScore);
bw.write(System.getProperty("line.separator"));
bw.flush();
bw.close();
}
catch(IOException f) {
System.out.println(f.getMessage());
}
try {
InputStream input = new FileInputStream("hedgescore.txt");
int data = input.read();
while(data != -1) {
System.out.println("data being read from file :"+ data);
data = input.read();
int prevScore = data;
}
input.close();
}
catch(IOException f) {
System.out.println(f.getMessage());
}
}
}

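You're reading/writing Strings and raw data, but not being consistent. bw.write(totScore) calls Writer.write(int), which writes the single character with code 25 rather than the text "25"; the 13 and 10 you read back are the carriage return and line feed of the Windows line separator. Why not instead read in Strings (using a Reader of some sort) and then convert to int by parsing the String? Either that, or write out your data as bytes and read it in as bytes, although that can get quite tricky if the file must deal with different types of data.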
So either:
import java.io.*;
public class HedgeScore {
private static final String FILE_PATH = "hedgescore.txt";
public static void main(String[] args) {
int totScore = 25;
BufferedWriter bw = null;
try {
System.out.println("totscore=" + totScore);
bw = new BufferedWriter(new FileWriter(new File(
FILE_PATH)));
bw.write(totScore);
bw.write(System.getProperty("line.separator"));
bw.flush();
} catch (IOException f) {
System.out.println(f.getMessage());
} finally {
if (bw != null) {
try {
bw.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
InputStream input = null;
try {
input = new FileInputStream(FILE_PATH);
int data = 0;
while ((data = input.read()) != -1) {
System.out.println("data being read from file :" + data);
}
input.close();
} catch (IOException f) {
System.out.println(f.getMessage());
} finally {
if (input != null) {
try {
input.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
}
or:
import java.io.*;
public class HedgeScore2 {
private static final String FILE_PATH = "hedgescore.txt";
public static void main(String[] args) {
int totScore = 25;
PrintWriter pw = null;
try {
System.out.println("totscore=" + totScore);
pw = new PrintWriter(new FileWriter(new File(FILE_PATH)));
pw.write(String.valueOf(totScore));
pw.write(System.getProperty("line.separator"));
pw.flush();
} catch (IOException f) {
System.out.println(f.getMessage());
} finally {
if (pw != null) {
pw.close();
}
}
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(FILE_PATH));
String line = null;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
} catch (IOException f) {
System.out.println(f.getMessage());
} finally {
if (reader != null) {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
}
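To see why the original program read back 25 instead of the digits 2 and 5, here is a small sketch that compares the two write calls using an in-memory StringWriter. The class name WriteIntDemo and its helper methods are hypothetical and exist only for this illustration.

```java
import java.io.StringWriter;
import java.util.Arrays;

final class WriteIntDemo {

    // Writer.write(int) emits ONE character whose code is the given value.
    static int[] codesWrittenAsChar(int value) {
        StringWriter out = new StringWriter();
        out.write(value);
        return out.toString().chars().toArray();
    }

    // Writing String.valueOf(value) emits the decimal digits as text.
    static int[] codesWrittenAsText(int value) {
        StringWriter out = new StringWriter();
        out.write(String.valueOf(value));
        return out.toString().chars().toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(codesWrittenAsChar(25))); // prints [25]
        System.out.println(Arrays.toString(codesWrittenAsText(25))); // prints [50, 53], the codes of '2' and '5'
        // Text written this way can be turned back into an int by parsing it.
        System.out.println(Integer.parseInt("25"));                  // prints 25
    }
}
```

When the score is written as text, the reading side should read a line and call Integer.parseInt on it rather than reading single bytes.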
