Web Crawling Using Java Swing


I'm developing a Java-based web crawler with a Swing GUI (a JFrame). The crawler itself runs successfully and visits the links it finds. However, I want to append each crawled link to a JTextArea as it is found, and that does not work: when I try, the program freezes. Printing the visited URLs to the console works fine.
My GUI looks like this:
[image]
My code looks like this:
Document html = null;
try {
html = Jsoup.connect(url).get();
Elements links = html.select("a");
for(Element link: links) {
String tmp = link.attr("abs:href");
jTextArea2.append(tmp + "\n");
if(!this.visitedUrl.contains(tmp)) {
this.foundedUrl.add(tmp);
System.out.println(tmp);
}
}
while(this.foundedUrl.size() > 0) {
String tmp = this.foundedUrl.get(this.foundedUrl.size() - 1);
this.foundedUrl.remove(this.foundedUrl.size() - 1);
if(!this.visitedUrl.contains(tmp)) {
this.linkTracker(tmp);
}
}
How can I show the visited URLs in the JTextArea dynamically?

Try this:
new Thread((Runnable)() ->
{
Document html = null;
try {
html = Jsoup.connect(url).get();
Elements links = html.select("a");
for(Element link: links) {
String tmp = link.attr("abs:href");
EventQueue.invokeLater(() -> {
jTextArea2.append(tmp + "\n");
});
if(!this.visitedUrl.contains(tmp)) {
this.foundedUrl.add(tmp);
System.out.println(tmp);
}
}
while(this.foundedUrl.size() > 0) {
String tmp = this.foundedUrl.get(this.foundedUrl.size() - 1);
this.foundedUrl.remove(this.foundedUrl.size() - 1);
if(!this.visitedUrl.contains(tmp)) {
this.linkTracker(tmp);
}
}
} catch (Exception e) {
// Don't swallow failures silently; at least log them.
e.printStackTrace();
}
}).start();
Your GUI freezes because you are blocking the GUI thread (the event dispatch thread). Start the crawl on a different thread by creating a new Thread and running the work from there.
To push updates back to the GUI, call EventQueue.invokeLater.
It tells the GUI thread to append the text to the JTextArea.
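The same pattern can be tried in isolation, without jsoup or a visible window. The following is only a minimal sketch: the class name CrawlLogDemo, the helpers appendFromWorker and runAndCollect, and the example.com URLs are made up for illustration, and the "crawl" is simulated by a fixed list of strings.

```java
import java.awt.EventQueue;
import java.lang.reflect.InvocationTargetException;
import java.util.List;
import javax.swing.JTextArea;

public class CrawlLogDemo {

    // Hypothetical helper: does the (simulated) crawl on a worker thread and
    // hands each URL to the event dispatch thread (EDT) for display.
    static Thread appendFromWorker(JTextArea area, List<String> urls) {
        Thread worker = new Thread(() -> {
            for (String url : urls) {
                // Slow work (network I/O, HTML parsing) would happen here, off the EDT.
                EventQueue.invokeLater(() -> area.append(url + "\n"));
            }
        });
        worker.start();
        return worker;
    }

    // Hypothetical helper: waits for the worker and for the queued appends,
    // then returns what ended up in the text area.
    static String runAndCollect(List<String> urls) {
        // Constructed here for brevity; a real GUI would build its components on the EDT.
        JTextArea area = new JTextArea();
        try {
            appendFromWorker(area, urls).join();
            // A no-op queued behind the pending appends; once it has run, they have too.
            EventQueue.invokeAndWait(() -> { });
        } catch (InterruptedException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
        return area.getText();
    }

    public static void main(String[] args) {
        System.out.print(runAndCollect(List.of("http://example.com/a", "http://example.com/b")));
    }
}
```

Because all appends are posted from a single worker thread, they reach the text area in the order they were queued. For longer-running jobs, javax.swing.SwingWorker packages the same idea (background work plus updates published to the EDT) and may be worth a look.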

Related

Java XML Getting nodes from node list crashes program

Hello there.
As the title suggests, I currently have an issue in
my program. In the animation loader, I have a method that should
load an animation from a collada file. It gets an Element as an input.
The first thing I do is to collect the animation data. I do this by getting a node list with
NodeList sources = element.getElementsByTagName("source");
And then I iterate through that node list:
for(int i = 0; i < sources.getLength(); i++)
{
// Problem occurs here:
Element sourceElement = (Element) (sources.item(i));
String id = sourceElement.getAttribute("id");
if(id.equals(inputId))
inputSource = FloatArraySource.loadFromElement(sourceElement);
else if(id.equals(outputId))
outputSource = Matrix4fSource.loadFromElement(sourceElement);
else if(id.equals(interpolationId))
interpolationSource = StringArraySource.loadFromElement(sourceElement);
}
The problem occurs on the commented line, and it crashes (only sometimes) with the following exception:
Cannot invoke "com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.changes()" because the return value of "com.sun.org.apache.xerces.internal.dom.NodeImpl.ownerDocument()" is null
I can start the application three times in a row, and it crashes roughly one of four times.
The strangest thing is the fact that it runs perfectly fine in debug mode.
So, I'd be very happy if you could help me out with this issue.
-Budschie
Edit: Some people asked me to post the full stack trace, so here it is:
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.changes()" because the return value of "com.sun.org.apache.xerces.internal.dom.NodeImpl.ownerDocument()" is null
at java.xml/com.sun.org.apache.xerces.internal.dom.NodeImpl.changes(NodeImpl.java:1887)
at java.xml/com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl.item(DeepNodeListImpl.java:125)
at java.xml/com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl.getLength(DeepNodeListImpl.java:116)
at de.budschie.engine.assets_management.newcollada.AnimationLoader.loadTransformAnimation(AnimationLoader.java:77)
at de.budschie.engine.assets_management.newcollada.AnimationLoader.loadAnimation(AnimationLoader.java:31)
at de.budschie.engine.assets_management.newcollada.ColladaLoader.loadCollada(ColladaLoader.java:60)
at de.budschie.engine.assets_management.DefaultResourceLoader.loadAll(DefaultResourceLoader.java:75)
at de.budschie.engine.main.MainWindow.gameLoop(MainWindow.java:192)
at de.budschie.engine.main.MainWindow.main(MainWindow.java:81)
Another edit:
Here's the way I load my collada files:
Element colladaTag = null;
try
{
colladaTag = getColladaTag(colladaFile);
} catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
Element libraryAnimations = (Element) colladaTag.getElementsByTagName("library_animations").item(0);
Element libraryControllers = (Element) colladaTag.getElementsByTagName("library_controllers").item(0);
Element libraryGeometries = (Element) colladaTag.getElementsByTagName("library_geometries").item(0);
NodeList meshesList = null, controllersList = null;
if(libraryGeometries != null)
{
meshesList = libraryGeometries.getElementsByTagName("geometry");
}
if(libraryControllers != null)
{
controllersList = libraryControllers.getElementsByTagName("controller");
}
if(libraryAnimations != null)
{
AnimationLoader.loadAnimation(colladaResult, libraryAnimations);
}
And here's what "getColladaTag()" looks like:
private static Element getColladaTag(String path) throws Exception
{
File file = new File(path);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try
{
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(file);
return doc.getDocumentElement();
}
catch(IOException | SAXException ex)
{
System.out.println("There is a problem with the file that couldn't be fixed.");
ex.printStackTrace();
}
return null;
}
Another small thing I noticed is that the JVM itself sometimes crashes because of an access violation in the string builder...
Very important edit: While debugging, I found out that I can't import com.sun.org.apache.xerces.internal.dom.NodeImpl.
My program doesn't throw a ClassNotFoundException, though.
So, could that be the reason why the GC is so confused?

You could change
for(int i = 0; i < sources.getLength(); i++)
{
// Problem occurs here:
Element sourceElement = (Element) (sources.item(i));
to
// NodeList does not implement Iterable, so copy its nodes into a List<Element> first.
// sourceElements is an assumed name for that copy.
for ( Element sourceElement : sourceElements )
{
which removes the sources.item(i) call from the loop body. Alternatively, keep the indexed loop and put System.out.println("Index: " + i); just above this line, which would give an indication of how far you get.
It looks like something is modifying the sources container while you are processing it.
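One way to iterate without touching the live NodeList inside the loop body is to copy it into a plain list first. This is only a sketch: NodeListSnapshot, toElementList and sourceIds are hypothetical names, the XML string is invented test data, and whether a snapshot avoids the crash described in the question is not established here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeListSnapshot {

    // Copies the element nodes of a (live) NodeList into an ordinary list,
    // reading the length once instead of on every loop iteration.
    static List<Element> toElementList(NodeList nodes) {
        int length = nodes.getLength();
        List<Element> elements = new ArrayList<>(length);
        for (int i = 0; i < length; i++) {
            Node node = nodes.item(i);
            if (node instanceof Element) {
                elements.add((Element) node);
            }
        }
        return elements;
    }

    // Hypothetical demo helper: parses an XML string and returns the id
    // attribute of every <source> element, in document order.
    static List<String> sourceIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> ids = new ArrayList<>();
            // The enhanced for loop works here because we iterate the copy.
            for (Element source : toElementList(doc.getElementsByTagName("source"))) {
                ids.add(source.getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<animation><source id=\"in\"/><source id=\"out\"/></animation>";
        System.out.println(sourceIds(xml)); // prints [in, out]
    }
}
```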

private static <T extends YourColladaDataFormat> T loadColladaFile(String pathToXml) throws Exception {
// loads the XML Document, walks through it and returns your workable data model.
}
And then work with T.

Is there a way to use assertions in a loop to find all broken images on a page

I am using Selenium WebDriver + TestNG. Please help me solve the following issue, if possible:
I want to find all broken images on a page and report them in the console (using an assertion) after the test fails.
The following test fails as soon as the first broken image is found. I need the test to check all images and report the result when it fails:
public class BrokenImagesTest3_ {
@Test
public static void links() throws IOException, StaleElementReferenceException {
System.setProperty("webdriver.chrome.driver", "/C: ...");
WebDriver driver = new ChromeDriver();
driver.manage().window().maximize();
driver.get("https://some url");
driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
//Find total Number of links on page and print In console.
List<WebElement> total_images = driver.findElements(By.tagName("img"));
System.out.println("Total Number of images found on page = " + total_images .size());
//for loop to open all links one by one to check response code.
boolean isValid = false;
for (int i = 0; i < total_images .size(); i++) {
String image = total_images .get(i).getAttribute("src");
if (image != null) {
//Call getResponseCode function for each URL to check response code.
isValid = getResponseCode(image);
//Print message based on value of isValid which Is returned by getResponseCode function.
if (isValid) {
System.out.println("Valid image:" + image);
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
} else {
System.out.println("Broken image ------> " + image);
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
}
} else {
//If the <img> tag has no src attribute, print this message
System.out.println("String null");
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
continue;
}
}
driver.close();
}
//Function to get response code of link URL.
//Link URL Is valid If found response code = 200.
//Link URL Is Invalid If found response code = 404 or 505.
public static boolean getResponseCode(String chkurl) {
boolean validResponse = false;
try {
//Get response code of image
HttpClient client = HttpClientBuilder.create().build();
HttpGet request = new HttpGet(chkurl);
HttpResponse response = client.execute(request);
int resp_Code = response.getStatusLine().getStatusCode();
System.out.println(resp_Code);
Assert.assertEquals(resp_Code, 200);
if (resp_Code != 200) {
validResponse = false;
} else {
validResponse = true;
}
} catch (Exception e) {
}
return validResponse;
}
}

Your code stops at the first failure because you assert that resp_Code equals 200. TestNG stops executing the test method at the first failed assert.
I would do this a little differently. You can use a CSS selector to find only images that contain a src attribute using "img[src]" so you don't have to deal with that case. When I look for broken images, I use the naturalWidth attribute. It will be 0 if the image is broken. Using these two pieces, the code would look like...
List<WebElement> images = driver.findElements(By.cssSelector("img[src]"));
System.out.println("Total Number of images found on page = " + images.size());
int brokenImagesCount = 0;
for (WebElement image : images)
{
if (isImageBroken(image))
{
brokenImagesCount++;
System.out.println(image.getAttribute("outerHTML"));
}
}
System.out.println("Count of broken images: " + brokenImagesCount);
Assert.assertEquals(brokenImagesCount, 0, "Count of broken images is 0");
then add this function
public boolean isImageBroken(WebElement image)
{
// A naturalWidth of 0 means the image failed to load.
return image.getAttribute("naturalWidth").equals("0");
}
I'm only writing out the images that are broken; I prefer this since it keeps the log cleaner. Printing the WebElement itself writes gibberish that isn't useful, so I changed that to print the outerHTML, which is the HTML of the IMG tag.

assertEquals() throws an AssertionError, not an Exception. If the codes are not equal, it throws an AssertionError, which your catch (Exception e) block does not catch, so the test stops and is marked as failed.
If you catch Error instead of Exception in your catch block, it should work as you expect.
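The difference can be seen with a few lines of plain Java. In this sketch, AssertionErrorDemo and whichCatch are made-up names, and the thrown AssertionError stands in for a failed TestNG assertEquals; no Selenium or TestNG is involved.

```java
public class AssertionErrorDemo {

    // Reports which catch block, if any, handled a failed status check.
    static String whichCatch(int actualStatus) {
        try {
            if (actualStatus != 200) {
                // Stand-in for a failed assertEquals: it throws AssertionError.
                throw new AssertionError("expected 200 but was " + actualStatus);
            }
            return "no failure";
        } catch (Exception e) {
            // Never reached for AssertionError: it extends Error, not Exception.
            return "caught as Exception";
        } catch (AssertionError e) {
            return "caught as AssertionError";
        }
    }

    public static void main(String[] args) {
        System.out.println(whichCatch(200)); // prints "no failure"
        System.out.println(whichCatch(404)); // prints "caught as AssertionError"
    }
}
```

Catching the error keeps the loop going, but the failure is then lost unless it is recorded somewhere and asserted on after the loop.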

As an addendum to JeffC's answer: I prefer to collect the src attributes of the broken images and report them in the failure message rather than logging them separately, something like:
List<WebElement> images = driver.findElements(By.cssSelector("img[src]"));
System.out.println("Total Number of images found on page = " + images.size());
StringBuilder brokenImages = new StringBuilder();
for (WebElement image : images)
if (isImageBroken(image))
brokenImages.append(image.getAttribute("src")).append(";");
Assert.assertEquals(brokenImages.length(), 0,
"The following images failed to load: " + brokenImages);
(only an answer as it's easier to explain with code than in a comment)

HTML body returns empty (most of it) when calling from Jsoup [duplicate]

I have a problem using jsoup. I am trying to fetch a document from a URL that redirects to another URL via a meta refresh, and the redirect is not followed. For example, http://www.amerisourcebergendrug.com automatically redirects to http://www.amerisourcebergendrug.com/abcdrug/ through its meta refresh URL, but jsoup stays on http://www.amerisourcebergendrug.com instead of fetching http://www.amerisourcebergendrug.com/abcdrug/.
Document doc = Jsoup.connect("http://www.amerisourcebergendrug.com").get();
I have also tried using,
Document doc = Jsoup.connect("http://www.amerisourcebergendrug.com").followRedirects(true).get();
Neither works.
Is there a workaround for this?
Update:
The page may use a meta refresh redirect.

Update (case insensitive and pretty fault tolerant)
The content parsed (almost) according to spec
The first successfully parsed content meta data should be used
public static void main(String[] args) throws Exception {
URI uri = URI.create("http://www.amerisourcebergendrug.com");
Document d = Jsoup.connect(uri.toString()).get();
for (Element refresh : d.select("html head meta[http-equiv=refresh]")) {
Matcher m = Pattern.compile("(?si)\\d+;\\s*url=(.+)|\\d+")
.matcher(refresh.attr("content"));
// find the first one that is valid
if (m.matches()) {
if (m.group(1) != null)
d = Jsoup.connect(uri.resolve(m.group(1)).toString()).get();
break;
}
}
}
Outputs correctly:
http://www.amerisourcebergendrug.com/abcdrug/
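The parsing step can also be exercised on its own, without any network access. The sketch below is not the answer's exact code: MetaRefresh and target are hypothetical names, the regular expression is a slightly stricter variant written for this example, and example.com is a placeholder host. It does not handle quoted URLs inside the content attribute.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {

    // "<seconds>; url=<target>", case-insensitive, tolerant of extra whitespace.
    private static final Pattern REFRESH =
            Pattern.compile("(?si)\\s*\\d+\\s*;\\s*url\\s*=\\s*(.+?)\\s*");

    // Returns the redirect target resolved against the page URI, or empty if the
    // content attribute carries no URL (for example a plain "30" reload).
    static Optional<URI> target(URI base, String content) {
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        // resolve() handles both absolute targets and paths relative to the page;
        // it throws IllegalArgumentException if the target is not a valid URI.
        return Optional.of(base.resolve(m.group(1)));
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/start");
        System.out.println(target(page, "0; URL=/abcdrug/").orElse(null));
        System.out.println(target(page, "5;url=http://example.org/other").orElse(null));
        System.out.println(target(page, "30").isPresent());
    }
}
```

With jsoup, the content string would come from refresh.attr("content") and the resolved URI would be passed to a second Jsoup.connect(...) call, as in the answer above.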
Old answer:
Are you sure that it isn't working? For me:
System.out.println(Jsoup.connect("http://www.ibm.com").get().baseUri());
.. outputs http://www.ibm.com/us/en/ correctly..

For better error handling, and to avoid the case-sensitivity problem:
try
{
Document doc = Jsoup.connect("http://www.ibm.com").get();
Elements meta = doc.select("html head meta");
if (meta != null)
{
String lvHttpEquiv = meta.attr("http-equiv");
if (lvHttpEquiv != null && lvHttpEquiv.toLowerCase().contains("refresh"))
{
String lvContent = meta.attr("content");
if (lvContent != null)
{
// Split on the first '=' only, so a target URL containing '=' stays intact.
String[] lvContentArray = lvContent.split("=", 2);
if (lvContentArray.length > 1)
doc = Jsoup.connect(lvContentArray[1]).get();
}
}
}
// get page title
return doc.title();
}
catch (IOException e)
{
e.printStackTrace();
}

Paste image from clipboard

I'm trying to paste an image from the clipboard into my website (like copy and paste). I would appreciate it if anyone could advise on this. Can I achieve this using HTML5, an applet, or some other way? Any advice or reference link is highly appreciated.

Managed to do it with JavaScript.
JavaScript
if (!window.Clipboard) {
var pasteCatcher = document.createElement("apDiv1");
pasteCatcher.setAttribute("contenteditable", "");
pasteCatcher.style.opacity = 0;
document.body.appendChild(pasteCatcher);
pasteCatcher.focus();
document.addEventListener("click", function() { pasteCatcher.focus(); });
}
window.addEventListener("paste", onPasteHandler);
function onPasteHandler(e)
{
if(e.clipboardData) {
var items = e.clipboardData.items;
if(!items){
alert("Image Not found");
}
for (var i = 0; i < items.length; ++i) {
if (items[i].kind === 'file' && items[i].type === 'image/png') {
var blob = items[i].getAsFile(),
source = (window.URL || window.webkitURL).createObjectURL(blob);
pastedImage = new Image();
// Draw only once the image has finished loading.
pastedImage.onload = pasteData;
pastedImage.src = source;
}
}
}
}
function pasteData()
{
drawCanvas = document.getElementById('drawCanvas1');
ctx = drawCanvas.getContext( '2d' );
ctx.clearRect(0, 0, 640,480);
ctx.drawImage(pastedImage, 0, 0);
}
DIV
<div id="apDiv1" contenteditable='true'>Paste Test</div>

Even if the applet is not signed, the JNLP API is available.
ClipboardService cs = (ClipboardService)ServiceManager.lookup("javax.jnlp.ClipboardService");
Image c = (Image)cs.getContents().getTransferData(DataFlavor.imageFlavor);

First, set up a file (image) server.
Then use JavaScript to listen for the paste event.
Search keywords for the code:
addEventListener 'paste' clipboard image
Then upload the image to the file server with Ajax; the Ajax response contains the URL.
Finally, create an img tag from that URL.
Applets are out of date; ignore them.

Htmlunit getByXPath not returning image tags

I am trying to search all image tags on a specific page. An example page would be www.chapitre.com
I am using the following code to search for all images on the page:
HtmlPage page = HTMLParser.parseHtml(webResponse, webClient.openWindow(null,"testwindow"));
List<?> imageList = page.getByXPath("//img");
ListIterator li = imageList.listIterator();
while (li.hasNext() ) {
HtmlImage image = (HtmlImage)li.next();
URL url = new URL(image.getSrcAttribute());
//For now, only load 1X1 pixels
if (image.getHeightAttribute().equals("1") && image.getWidthAttribute().equals("1")) {
System.out.println("This is an image: " + url + " from page " + webRequest.getUrl() );
}
}
This doesn't return all the image tags on the page. For example, an image tag with the attributes src="http://ace-lb.advertising.com/site=703223/mnum=1516/bins=1/rich=0/logs=0/betr=A2099=[+]LP2" width="1" height="1" should be captured, but it's not. Am I doing something wrong here?
Any help is really appreciated.
Cheers!

That's because
URL url = new URL(image.getSrcAttribute());
is throwing an exception :)
Try this code:
public Main() throws Exception {
WebClient webClient = new WebClient();
webClient.setJavaScriptEnabled(false);
HtmlPage page = webClient.getPage("http://www.chapitre.com");
List<HtmlImage> imageList = (List<HtmlImage>) page.getByXPath("//img");
for (HtmlImage image : imageList) {
try {
new URL(image.getSrcAttribute());
if (image.getHeightAttribute().equals("1") && image.getWidthAttribute().equals("1")) {
System.out.println(image.getSrcAttribute());
}
} catch (Exception e) {
System.out.println("You didn't see this coming :)");
}
}
}
You can even get those 1x1 pixel images by xpath.
Hope this helps.
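The effect of moving the try/catch inside the loop can be shown without HtmlUnit. In this sketch, SrcFilter and parseable are hypothetical names and the src strings are invented; it only illustrates that one unparseable value no longer aborts the whole iteration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class SrcFilter {

    // Keeps the src values that parse as absolute URLs and skips the rest,
    // so a single bad value does not end the loop.
    // new URL(String) is deprecated in recent JDKs; it is used here only to mirror the answer.
    @SuppressWarnings("deprecation")
    static List<URL> parseable(List<String> srcValues) {
        List<URL> urls = new ArrayList<>();
        for (String src : srcValues) {
            try {
                urls.add(new URL(src));
            } catch (MalformedURLException e) {
                // Relative paths and empty strings end up here ("no protocol").
                System.err.println("Skipping src '" + src + "': " + e.getMessage());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> srcs = List.of("http://example.com/pixel.gif", "/images/logo.png", "");
        System.out.println(parseable(srcs).size()); // prints 1
    }
}
```

If relative src values should be kept rather than skipped, they would need to be resolved against the page URL first; that is a design choice outside this sketch.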
