Why does the crawler4j example give an error? [closed]

I'm trying to use the Basic crawler example in crawler4j. I took the code from the crawler4j website here.
package edu.crawler;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
public class MyCrawler extends WebCrawler {
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/**
* You should implement this function to specify whether the given url
* should be crawled or not (based on your crawling logic).
*/
@Override
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
}
/**
* This function is called when a page is fetched and ready to be processed
* by your program.
*/
@Override
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
String anchor = page.getWebURL().getAnchor();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
System.out.println("Anchor text: " + anchor);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
Header[] responseHeaders = page.getFetchResponseHeaders();
if (responseHeaders != null) {
System.out.println("Response headers:");
for (Header header : responseHeaders) {
System.out.println("\t" + header.getName() + ": " + header.getValue());
}
}
System.out.println("=============");
}
}
Above is the code for the crawler class from the example.
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
public class Controller {
public static void main(String[] args) throws Exception {
String crawlStorageFolder = "../data/";
int numberOfCrawlers = 7;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}
}
Above is the controller class for the web crawler.
When I try to run the Controller class from my IDE (IntelliJ), I get the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0
Is there something about the Maven config found here that I should know? Do I have to use a different version or something?

The problem wasn't with crawler4j. The problem was that the version of Java I was using was older than the version of Java crawler4j was built with. I switched to the crawler4j version released right before they updated to Java 7, and everything worked fine. I'm guessing that upgrading my Java to version 7 would have had the same effect.
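A quick way to confirm this kind of mismatch is to read the class-file header directly. The following is a minimal diagnostic sketch (my addition, not from the thread); you point it at a .class file extracted from the crawler4j jar. Major version 51 means the class was compiled for Java 7, 50 for Java 6:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ClassVersionCheck {
    public static void main(String[] args) throws IOException {
        // args[0]: path to a .class file, e.g. one extracted from the crawler4j jar
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            int magic = in.readInt();           // 0xCAFEBABE for a valid class file
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort(); // 51 = Java 7, 50 = Java 6
            System.out.printf("magic=%08x major.minor=%d.%d%n", magic, major, minor);
        }
        // The running JRE must be at least as new as the class-file version:
        System.out.println("Running JRE: " + System.getProperty("java.version"));
    }
}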

Related

Connect to websocket to consuming amazon chime API in real time

I want to expand my software, written in JavaFX, with the Amazon Chime API to consume its messaging. I know there's a JS SDK that establishes a messaging WebSocket session with no problems, but the Java SDK has no related classes. So I want to use a STOMP library to consume the WebSocket endpoint.
At the moment I am struggling with making a correct request, namely with signing the AWS request (calculating the X-AMZ-Signature).
According to the post, I'm trying to calculate the correct X-AMZ-Signature request parameter. Here's the class:
@Slf4j
@Service
public class Aws4Signer {
private final static String REQUEST_CONTENT_TYPE = "application/json";
private final static String AUTH_ALGORITHM = "AWS4-HMAC-SHA256";
private final static String REQUEST_METHOD = "GET";
@Data
class AuthenticationData {
@NonNull
String timestamp;
@NonNull
String date;
@NonNull
String authorizationHeader;
}
private AppConfig appConfig = new AppConfig();
/**
* Gets the timestamp in yyyyMMdd'T'HHmmss'Z' format, which is the required
* format for AWS4 signing request headers and the credential string
*
* @param dateTime
* an OffsetDateTime object representing the UTC time of the current
* signing request
* @return the formatted timestamp string
*
* @see <a href=
* "https://docs.aws.amazon.com/general/latest/gr/sigv4-signed-request-examples.html">
* Examples of the Complete Version 4 Signing Process (Python)</a>
*/
public String getTimeStamp(OffsetDateTime dateTime) {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'");
String formatDateTime = dateTime.format(formatter);
return formatDateTime;
}
/**
* Gets the date string in yyyyMMdd format, which is required to build the
* credential scope string
*
* @param dateTime
* an OffsetDateTime object representing the UTC time of the current
* signing request
* @return the formatted date string
*/
public String getDate(OffsetDateTime dateTime) {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMMdd");
String formatDateTime = dateTime.format(formatter);
return formatDateTime;
}
public byte[] generateAws4SigningKey(String timestamp) {
String secretKey = appConfig.getAwsAuthConfig().getSecretKey();
String regionName = appConfig.getAwsAuthConfig().getServiceRegion();
String serviceName = appConfig.getAwsAuthConfig().getServiceName();
byte[] signatureKey = null;
try {
signatureKey = Aws4SignatureKeyGenerator.generateSignatureKey(secretKey, timestamp, regionName,
serviceName);
} catch (Exception e) {
log.error("An error has ocurred when generate signature key: " + e, e);
}
return signatureKey;
}
/**
* Builds an {@link AuthenticationData} object containing the timestamp, date,
* payload hash and the AWS4 signature
* <p>
*
* The signing logic was translated from the Python implementation, see this
* link for more details: <a href=
* "https://docs.aws.amazon.com/general/latest/gr/sigv4-signed-request-examples.html">Examples
* of the Complete Version 4 Signing Process (Python)</a>
*
* @param target
* @param requestBody
*
* @return
* @throws NoSuchAlgorithmException
* @throws UnsupportedEncodingException
* @throws InvalidKeyException
* @throws SignatureException
* @throws IllegalStateException
*
*/
public AuthenticationData buildAuthorizationData(String target, String requestBody)
throws NoSuchAlgorithmException, UnsupportedEncodingException, InvalidKeyException,
SignatureException, IllegalStateException {
log.info("predict - start");
// Starting building the lengthy signing data
AwsAuthConfig awsAuthConfig = appConfig.getAwsAuthConfig();
String payloadHash = Hmac.getSha256Hash(requestBody);
OffsetDateTime now = OffsetDateTime.now(ZoneOffset.UTC);
String timestamp = getTimeStamp(now);
String date = getDate(now);
// Step 1 is to define the verb (GET, POST, etc.) -- already done by defining
// constant REQUEST_METHOD
// Step 2: Create canonical URI--the part of the URI from domain to query
// string (use '/' if no path)
String canonical_uri = "/connect";
// Step 3: Create the canonical query string. In this example, request
// parameters are passed in the body of the request and the query string
// is blank.
String canonical_querystring = buildCanonicalQueryString();
// Step 4: Create the canonical headers. Header names must be trimmed
// and lowercase, and sorted in code point order from low to high.
// Note that there is a trailing \n.
String canonical_headers = "content-type:" + REQUEST_CONTENT_TYPE + "\n"
+ "host:" + awsAuthConfig.getServiceHost() + "\n"
+ "x-amz-date:" + timestamp + "\n";
String signed_headers = "content-type;host;x-amz-date";
log.debug("canonical_headers : {}", canonical_headers);
String canonical_request = REQUEST_METHOD + "\n" + canonical_uri + "\n" + canonical_querystring + "\n"
+ canonical_headers + "\n" + signed_headers;
log.debug("canonical_request : {}", canonical_request);
String credential_scope = date + "/" + awsAuthConfig.getServiceRegion() + "/" + awsAuthConfig.getServiceName()
+ "/" + "aws4_request";
String canonical_request_hash = Hmac.getSha256Hash(canonical_request);
log.debug("canonical_request_hash : {}", canonical_request_hash);
String string_to_sign = AUTH_ALGORITHM + "\n" + timestamp + "\n" + credential_scope + "\n"
+ canonical_request_hash;
log.debug("string_to_sign : {}", string_to_sign);
byte[] sigKey = generateAws4SigningKey(date);
String signature = Hmac.calculateHMAC(string_to_sign, sigKey, Hmac.HMAC_SHA256);
String authorization_header = AUTH_ALGORITHM + " " + "Credential=" + awsAuthConfig.getAccessKey() + "/"
+ credential_scope + ", " + "SignedHeaders=" + signed_headers + ", " + "Signature=" + signature;
log.debug("authorization_header : {}", authorization_header);
return new AuthenticationData(timestamp, date, authorization_header);
}
private String buildCanonicalQueryString() {
String canonicalRequest = REQUEST_METHOD + "\n" +
"/connect" + "\n" +
"X-Amz-Algorithm=AWS4-HMAC-SHA256\n" +
"&X-Amz-Credential=MYACCESKEY%2F"+ getDate(OffsetDateTime.now()) + "%2Fus-east-1%2Fchime%2Faws4_request\n" +
"&X-Amz-Date=" + getTimeStamp(OffsetDateTime.now()) +"\n" +
"&X-Amz-Expires=10\n" +
"&X-Amz-SignedHeaders=host\n" +
"&sessionId=" + UUID.randomUUID() +"\n" +
"&userArn=" + "MYUSERARN";
return canonicalRequest;
}
}
Provided information:
host: node001.ue1.ws-messaging.chime.aws
service name: chime
region: us-east-1
It produces the signature, and I'm trying to use it via Postman, but Postman can't connect to the endpoint node001.ue1.ws-messaging.chime.aws/connect; it just says 'connect ETIMEDOUT 54.162.103.101:80'.
I'm new to Amazon, so this is kind of hard for me. Can you tell me where I'm going wrong?
Any help appreciated!
I wrote fully working code for signing the URL for connecting to the Chime WebSocket. Hope this helps somebody!
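The poster's final code isn't reproduced here; below is a hedged, self-contained sketch of what SigV4 query-string signing for GET /connect can look like, using the host/service/region given above. The empty-body hash, the 10-second expiry, and the URLEncoder-based encoding (close to, but not exactly, RFC 3986) are assumptions to verify against the AWS and Chime docs:
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ChimePresignSketch {
    public static void main(String[] args) throws Exception {
        String accessKey = "MYACCESSKEY", secret = "MYSECRET"; // placeholders
        String host = "node001.ue1.ws-messaging.chime.aws";
        String region = "us-east-1", service = "chime";
        String amzDate = DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'")
                .format(ZonedDateTime.now(ZoneOffset.UTC));
        String date = amzDate.substring(0, 8);
        String scope = date + "/" + region + "/" + service + "/aws4_request";

        // SigV4 requires query parameters sorted in code-point order; TreeMap does that.
        Map<String, String> q = new TreeMap<>();
        q.put("X-Amz-Algorithm", "AWS4-HMAC-SHA256");
        q.put("X-Amz-Credential", accessKey + "/" + scope);
        q.put("X-Amz-Date", amzDate);
        q.put("X-Amz-Expires", "10");
        q.put("X-Amz-SignedHeaders", "host");
        q.put("sessionId", java.util.UUID.randomUUID().toString());
        q.put("userArn", "MYUSERARN");
        StringBuilder qs = new StringBuilder();
        for (Map.Entry<String, String> e : q.entrySet()) {
            if (qs.length() > 0) qs.append('&');
            qs.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }

        // Canonical request: method, path, query, headers, signed headers, payload hash.
        String canonicalRequest = "GET\n/connect\n" + qs + "\n"
                + "host:" + host + "\n\nhost\n" + hex(sha256(new byte[0]));
        String stringToSign = "AWS4-HMAC-SHA256\n" + amzDate + "\n" + scope + "\n"
                + hex(sha256(canonicalRequest.getBytes(StandardCharsets.UTF_8)));
        byte[] k = hmac(("AWS4" + secret).getBytes(StandardCharsets.UTF_8), date);
        k = hmac(k, region);
        k = hmac(k, service);
        k = hmac(k, "aws4_request");
        String signature = hex(hmac(k, stringToSign));
        // Note the wss:// scheme; plain http on port 80 times out, as seen in Postman.
        System.out.println("wss://" + host + "/connect?" + qs + "&X-Amz-Signature=" + signature);
    }

    static String enc(String s) throws Exception {
        // URLEncoder differs from RFC 3986 for '~' and '*'; good enough for a sketch.
        return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
    }

    static byte[] sha256(byte[] b) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(b);
    }

    static byte[] hmac(byte[] key, String msg) throws Exception {
        Mac m = Mac.getInstance("HmacSHA256");
        m.init(new SecretKeySpec(key, "HmacSHA256"));
        return m.doFinal(msg.getBytes(StandardCharsets.UTF_8));
    }

    static String hex(byte[] a) {
        StringBuilder sb = new StringBuilder(a.length * 2);
        for (byte b : a) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}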

How to make Amazon AWS API call from Java?

What are my options if I want to make a call to the Amazon AWS REST API from Java?
When implementing my own request, generating the AWS4-HMAC-SHA256 Authorization header would be the hardest part.
Essentially, this is the header I need to generate:
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJTOUYS27JPVRDUYQ/20200602/us-east-1/route53/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=ba85affa19fa4a8735ce952e50d41c8c93406a11d22b88cc98b109b529bcc15e
Not saying that this is a complete list, but I would consider using established libraries like:
Official AWS SDK v1 or v2 - current and comprehensive, but depends on netty.io and many other jars.
Apache jclouds - depends on JAXB, which is no longer part of the JDK but is now available from Maven Central separately.
But sometimes all you want is to make a simple call, and you don't want to bring many dependencies into your application; you may want to implement the REST call yourself. Generating the right AWS Authorization header is the hardest bit to implement.
Here is the code to do that in pure Java (OpenJDK), with no external dependencies.
It implements the Amazon AWS API Signature Version 4 signing process.
AmazonRequestSignatureV4Utils.java
package com.frusal.amazonsig4;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
public class AmazonRequestSignatureV4Utils {
/**
* Generates signing headers for an HTTP request in accordance with the Amazon AWS API Signature Version 4 process.
* <p>
* Following the steps outlined here: docs.aws.amazon.com
* <p>
* A simple usage example is here: {@link AmazonRequestSignatureV4Example}
* <p>
* This method takes many arguments as read-only, but adds the necessary headers to the {@code headers} argument, which is a map.
* The caller should make sure those parameters are copied to the actual request object.
* <p>
* The ISO8601 date parameter can be created by making a call to:<br>
* - {@code java.time.format.DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'").format(ZonedDateTime.now(ZoneOffset.UTC))}<br>
* or, if you prefer joda:<br>
* - {@code org.joda.time.format.ISODateTimeFormat.basicDateTimeNoMillis().print(DateTime.now().withZone(DateTimeZone.UTC))}
*
* @param method - HTTP request method, (GET|POST|DELETE|PUT|...), e.g., {@link java.net.HttpURLConnection#getRequestMethod()}
* @param host - URL host, e.g., {@link java.net.URL#getHost()}.
* @param path - URL path, e.g., {@link java.net.URL#getPath()}.
* @param query - URL query (parameters in sorted order, see the AWS spec), e.g., {@link java.net.URL#getQuery()}.
* @param headers - HTTP request header map. This map is going to have entries added to it by this method. Initially populated with
* headers to be included in the signature, like the often-compulsory 'Host' header. e.g., {@link java.net.HttpURLConnection#getRequestProperties()}.
* @param body - The binary request body, for requests like POST.
* @param isoDateTime - The time and date of the request in ISO8601 basic format, see comment above.
* @param awsIdentity - AWS Identity, e.g., "AKIAJTOUYS27JPVRDUYQ"
* @param awsSecret - AWS Secret Key, e.g., "I8Q2hY819e+7KzBnkXj66n1GI9piV+0p3dHglAzQ"
* @param awsRegion - AWS Region, e.g., "us-east-1"
* @param awsService - AWS Service, e.g., "route53"
*/
public static void calculateAuthorizationHeaders(
String method, String host, String path, String query, Map<String, String> headers,
byte[] body,
String isoDateTime,
String awsIdentity, String awsSecret, String awsRegion, String awsService
) {
try {
String bodySha256 = hex(sha256(body));
String isoJustDate = isoDateTime.substring(0, 8); // Cut the date portion of a string like '20150830T123600Z';
headers.put("Host", host);
headers.put("X-Amz-Content-Sha256", bodySha256);
headers.put("X-Amz-Date", isoDateTime);
// (1) https://docs.aws.amazon.com/general/latest/gr/sigv4-create-canonical-request.html
List<String> canonicalRequestLines = new ArrayList<>();
canonicalRequestLines.add(method);
canonicalRequestLines.add(path);
canonicalRequestLines.add(query);
List<String> hashedHeaders = new ArrayList<>();
for (Entry<String, String> e : headers.entrySet()) {
hashedHeaders.add(e.getKey().toLowerCase());
canonicalRequestLines.add(e.getKey().toLowerCase() + ":" + normalizeSpaces(e.getValue().toString()));
}
canonicalRequestLines.add(null); // new line required after headers
String signedHeaders = hashedHeaders.stream().collect(Collectors.joining(";"));
canonicalRequestLines.add(signedHeaders);
canonicalRequestLines.add(bodySha256);
String canonicalRequestBody = canonicalRequestLines.stream().map(line -> line == null ? "" : line).collect(Collectors.joining("\n"));
String canonicalRequestHash = hex(sha256(canonicalRequestBody.getBytes(StandardCharsets.UTF_8)));
// (2) https://docs.aws.amazon.com/general/latest/gr/sigv4-create-string-to-sign.html
List<String> stringToSignLines = new ArrayList<>();
stringToSignLines.add("AWS4-HMAC-SHA256");
stringToSignLines.add(isoDateTime);
String credentialScope = isoJustDate + "/" + awsRegion + "/" + awsService + "/aws4_request";
stringToSignLines.add(credentialScope);
stringToSignLines.add(canonicalRequestHash);
String stringToSign = stringToSignLines.stream().collect(Collectors.joining("\n"));
// (3) https://docs.aws.amazon.com/general/latest/gr/sigv4-calculate-signature.html
byte[] kDate = hmac(("AWS4" + awsSecret).getBytes(StandardCharsets.UTF_8), isoJustDate);
byte[] kRegion = hmac(kDate, awsRegion);
byte[] kService = hmac(kRegion, awsService);
byte[] kSigning = hmac(kService, "aws4_request");
String signature = hex(hmac(kSigning, stringToSign));
String authParameter = "AWS4-HMAC-SHA256 Credential=" + awsIdentity + "/" + credentialScope + ", SignedHeaders=" + signedHeaders + ", Signature=" + signature;
headers.put("Authorization", authParameter);
} catch (Exception e) {
if (e instanceof RuntimeException) {
throw (RuntimeException) e;
} else {
throw new IllegalStateException(e);
}
}
}
private static String normalizeSpaces(String value) {
return value.replaceAll("\\s+", " ").trim();
}
public static String hex(byte[] a) {
StringBuilder sb = new StringBuilder(a.length * 2);
for(byte b: a) {
sb.append(String.format("%02x", b));
}
return sb.toString();
}
private static byte[] sha256(byte[] bytes) throws Exception {
MessageDigest digest = MessageDigest.getInstance("SHA-256");
digest.update(bytes);
return digest.digest();
}
public static byte[] hmac(byte[] key, String msg) throws Exception {
Mac mac = Mac.getInstance("HmacSHA256");
mac.init(new SecretKeySpec(key, "HmacSHA256"));
return mac.doFinal(msg.getBytes(StandardCharsets.UTF_8));
}
}
And the usage example:
AmazonRequestSignatureV4Example.java
package com.frusal.amazonsig4;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;
public class AmazonRequestSignatureV4Example {
public static void main(String[] args) throws Exception {
String route53HostedZoneId = "Z08118721NNU878C4PBNA";
String awsIdentity = "AKIAJTOUYS27JPVRDUYQ";
String awsSecret = "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";
String awsRegion = "us-east-1";
String awsService = "route53";
URL url = new URL("https://route53.amazonaws.com/2013-04-01/hostedzone/" + route53HostedZoneId + "/rrset");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("POST");
System.out.println(connection.getRequestMethod() + " " + url);
String body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<ChangeResourceRecordSetsRequest xmlns=\"https://route53.amazonaws.com/doc/2013-04-01/\">\n" +
"<ChangeBatch>\n" +
// " <Comment>optional comment about the changes in this change batch request</Comment>\n" +
" <Changes>\n" +
" <Change>\n" +
" <Action>UPSERT</Action>\n" +
" <ResourceRecordSet>\n" +
" <Name>c001cxxx.frusal.com.</Name>\n" +
" <Type>A</Type>\n" +
" <TTL>300</TTL>\n" +
" <ResourceRecords>\n" +
" <ResourceRecord>\n" +
" <Value>157.245.232.185</Value>\n" +
" </ResourceRecord>\n" +
" </ResourceRecords>\n" +
// " <HealthCheckId>optional ID of a Route 53 health check</HealthCheckId>\n" +
" </ResourceRecordSet>\n" +
" </Change>\n" +
" </Changes>\n" +
"</ChangeBatch>\n" +
"</ChangeResourceRecordSetsRequest>";
byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
Map<String, String> headers = new LinkedHashMap<>();
String isoDate = DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'").format(ZonedDateTime.now(ZoneOffset.UTC));
AmazonRequestSignatureV4Utils.calculateAuthorizationHeaders(
connection.getRequestMethod(),
connection.getURL().getHost(),
connection.getURL().getPath(),
connection.getURL().getQuery(),
headers,
bodyBytes,
isoDate,
awsIdentity,
awsSecret,
awsRegion,
awsService);
// Unsigned headers
headers.put("Content-Type", "text/xml; charset=utf-8"); // I guess it get modified somewhere on the way... Let's just leave it out of the signature.
// Log headers and body
System.out.println(headers.entrySet().stream().map(e -> e.getKey() + ": " + e.getValue()).collect(Collectors.joining("\n")));
System.out.println(body);
// Send
headers.forEach((key, val) -> connection.setRequestProperty(key, val));
connection.setDoOutput(true);
connection.getOutputStream().write(bodyBytes);
connection.getOutputStream().flush();
int responseCode = connection.getResponseCode();
System.out.println("connection.getResponseCode()=" + responseCode);
String responseContentType = connection.getHeaderField("Content-Type");
System.out.println("responseContentType=" + responseContentType);
System.out.println("Response BODY:");
if (connection.getErrorStream() != null) {
System.out.println(new String(connection.getErrorStream().readAllBytes(), StandardCharsets.UTF_8));
} else {
System.out.println(new String(connection.getInputStream().readAllBytes(), StandardCharsets.UTF_8));
}
}
}
And the trace it would generate:
POST https://route53.amazonaws.com/2013-04-01/hostedzone/Z08118721NNU878C4PBNA/rrset
Host: route53.amazonaws.com
X-Amz-Content-Sha256: 46c7521da55bcf9e99fa6e12ec83997fab53128b5df0fb12018a6b76fb2bf891
X-Amz-Date: 20200602T035618Z
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJTOUYS27JPVRDUYQ/20200602/us-east-1/route53/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=6a59090f837cf71fa228d2650e9b82e9769e0ec13e9864e40bd2f81c682ef8cb
Content-Type: text/xml; charset=utf-8
<?xml version="1.0" encoding="UTF-8"?>
<ChangeResourceRecordSetsRequest xmlns="https://route53.amazonaws.com/doc/2013-04-01/">
<ChangeBatch>
<Changes>
<Change>
<Action>UPSERT</Action>
<ResourceRecordSet>
<Name>c001cxxx.frusal.com.</Name>
<Type>A</Type>
<TTL>300</TTL>
<ResourceRecords>
<ResourceRecord>
<Value>157.245.232.185</Value>
</ResourceRecord>
</ResourceRecords>
</ResourceRecordSet>
</Change>
</Changes>
</ChangeBatch>
</ChangeResourceRecordSetsRequest>
connection.getResponseCode()=200
responseContentType=text/xml
Response BODY:
<?xml version="1.0"?>
<ChangeResourceRecordSetsResponse xmlns="https://route53.amazonaws.com/doc/2013-04-01/"><ChangeInfo><Id>/change/C011827119UYGF04GVIP6</Id><Status>PENDING</Status><SubmittedAt>2020-06-02T03:56:25.822Z</SubmittedAt></ChangeInfo></ChangeResourceRecordSetsResponse>
For the latest version of this code, please see java-amazon-request-signature-v4 repository at GitHub.

Directing the search depths in Crawler4j Solr

I am trying to make the crawler "abort" searching a certain subdomain whenever it doesn't find a relevant page after 3 consecutive tries. After extracting the title and the text of the page, I start looking for the correct pages to submit to my Solr collection. (I do not want to add pages that don't match this query.)
public void visit(Page page)
{
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
String anchor = page.getWebURL().getAnchor();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
System.out.println("Anchor text: " + anchor);
System.out.println("ContentType: " + page.getContentType());
if(page.getParseData() instanceof HtmlParseData) {
String title, text;
HtmlParseData theHtmlParseData = (HtmlParseData) page.getParseData();
title = theHtmlParseData.getTitle();
text = theHtmlParseData.getText();
if ( (title.toLowerCase().contains(" word1 ") && title.toLowerCase().contains(" word2 ")) || (text.toLowerCase().contains(" word1 ") && text.toLowerCase().contains(" word2 ")) ) {
//
// submit to SOLR server
//
submit(page);
Header[] responseHeaders = page.getFetchResponseHeaders();
if (responseHeaders != null) {
System.out.println("Response headers:");
for (Header header : responseHeaders) {
System.out.println("\t" + header.getName() + ": " + header.getValue());
}
}
failedcounter = 0;// we start counting for 3 consecutive pages
} else {
failedcounter++;
}
if (failedcounter == 3) {
failedcounter = 0; // we start counting for 3 consecutive pages
int parent = page.getWebURL().getParentDocid();
parent....HtmlParseData.setOutgoingUrls(null);
}
}
}
My question is: how do I edit the last line of this code so that I can retrieve the parent page object and delete its outgoing URLs, so that the crawl moves on to the rest of the subdomains?
Currently I cannot find a function that can get me from the parent docid to the page data in order to delete the URLs.
The visit(...) method is called as one of the last statements of processPage(...) (line 523 in WebCrawler). By that point the outgoing links are already added to the crawler's frontier (and might be processed by other crawler processes as soon as they are added), so removing them from the parent page afterwards has no effect.
You could implement the described behaviour by adjusting shouldVisit(...) or (depending on the exact use case) shouldFollowLinksIn(...) in the crawler, as sketched below.
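A hedged sketch of that suggestion, assuming a crawler4j version that exposes shouldFollowLinksIn(...); the per-subdomain miss counter and the recordHit/recordMiss helpers are illustrative additions, to be called from your visit(...) where you currently update failedcounter:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class SelectiveCrawler extends WebCrawler {
    // Shared across all crawler threads, hence a concurrent map.
    private static final Map<String, Integer> misses = new ConcurrentHashMap<>();

    @Override
    protected boolean shouldFollowLinksIn(WebURL url) {
        // Stop following links from pages of a subdomain after 3 consecutive misses.
        return misses.getOrDefault(url.getSubDomain(), 0) < 3;
    }

    // Call from visit(...) when a page matches the query and is submitted to Solr.
    static void recordHit(WebURL url) {
        misses.put(url.getSubDomain(), 0);
    }

    // Call from visit(...) when a page does not match.
    static void recordMiss(WebURL url) {
        misses.merge(url.getSubDomain(), 1, Integer::sum);
    }
}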

Crawling a URL in order to extract all the other URLs in that page

I am trying to crawl URLs in order to extract the other URLs inside each page. To do so, I read the HTML code of the page, read it line by line, match each line against a pattern, and then extract the needed part, as shown below:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SimpleCrawler {
static String pattern="https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
static Pattern UrlPattern = Pattern.compile (pattern);
static Matcher UrlMatcher;
public static void main(String[] args) {
try {
URL url = new URL("https://stackoverflow.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while ((line = br.readLine()) != null) {
UrlMatcher= UrlPattern.matcher(line);
if(UrlMatcher.find())
{
String extractedPath = UrlMatcher.group(1);
String extractedPath2 = UrlMatcher.group(2);
System.out.println("http://www."+extractedPath+".com"+extractedPath2);
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
However, there are some issues with it that I would like to address:
How is it possible to make http and www, or even both of them, optional? I have encountered many cases where links lack either or both parts, so the regex will not match them.
According to my code, I make two groups: one from http up to the domain extension, and the second is whatever comes after it. This, however, causes two sub-problems:
2.1 Since this is HTML code, the rest of the HTML tags that may come after the URL will be extracted too.
2.2 In the System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure it shows the right URL (regardless of the previous issues), because I do not know which domain extension it was matched with.
Last but not least, I wonder how to match both http and https as well.
How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+#)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
With a library: I used HtmlCleaner, and it does the job. You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested) with jsoup - rather readable; see the jsoup sketch after the HtmlCleaner code below:
http://jsoup.org/cookbook/extracting-data/example-list-links
You can enhance the code below: choose <a> tags or others, href, etc.,
or be more precise with case (HreF, HRef, ...) as an exercise:
import java.util.Map;
import java.util.Vector;
import org.htmlcleaner.*;
public static Vector<String> HTML2URLS(String _source)
{
Vector<String> result=new Vector<String>();
HtmlCleaner cleaner = new HtmlCleaner();
// Principal Node
TagNode node = cleaner.clean(_source);
// All nodes
TagNode[] myNodes =node.getAllElements(true);
int s=myNodes.length;
for (int pos=0;pos<s;pos++)
{
TagNode tn=myNodes[pos];
// all attributes
Map<String,String> mss=tn.getAttributes();
// Name of tag
String name=tn.getName();
// Is there href ?
String href="";
if (mss.containsKey("href")) href=mss.get("href");
if (mss.containsKey("HREF")) href=mss.get("HREF");
if (name.equals("a")) result.add(href);
if (name.equals("A")) result.add(href);
}
return result;
}
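And since jsoup was mentioned above without code, here is a minimal sketch (assuming the jsoup dependency); unlike the regex approach, the selector handles case variants such as HREF automatically, and abs:href resolves relative links:
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    public static List<String> html2urls(String html, String baseUri) {
        List<String> result = new ArrayList<>();
        Document doc = Jsoup.parse(html, baseUri);   // baseUri resolves relative hrefs
        for (Element link : doc.select("a[href]")) { // matches <a>, <A>, href, HREF, ...
            result.add(link.attr("abs:href"));       // absolute URL, resolved against baseUri
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href='/questions'>Q</a><A HREF='https://example.com/'>E</A>";
        System.out.println(html2urls(html, "https://stackoverflow.com/"));
    }
}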

Java - Updating static variables

I have two classes in Java that need to run at the same time: a Crawler class (which basically implements a web crawler and keeps printing out URLs as it encounters them), and an Indexer class, which as of now is supposed to simply print the crawled URLs.
For this, my Indexer class has a Queue:
public static Queue<String> urls = new LinkedList<>();
And in the visit() function of my Crawler class, I have the following:
Indexer.urls.add( url ); // where url is a String
The Crawler is working totally fine, since it prints out all the URLs it has encountered, but for some reason these URLs do not get added to the Queue in my Indexer class. Any idea why this may be the case?
The visit() method from Crawler.java is as follows:
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
Indexer.urls.add( url );
System.out.println("=============");
}
Code from my Indexer class :
public static Queue<String> urls = new LinkedList<>();
public static void main( String[] args ) throws InterruptedException
{
while( urls.isEmpty() )
{
//System.out.println("Empty send queue");
Thread.sleep(sleepTime);
}
System.out.println( urls.poll() );
}
Okay, so I solved my problem by doing as BigMike suggested. I implemented the Runnable interface in my two classes, and then ran those two classes as threads within the main method of a new third class, along the lines of the sketch below.
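A minimal sketch of that arrangement (class names and the example URL are illustrative, not from the original code); a BlockingQueue also avoids sharing an unsynchronized LinkedList between threads:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlAndIndex {
    public static void main(String[] args) {
        BlockingQueue<String> urls = new LinkedBlockingQueue<>();

        Runnable crawler = () -> {
            // Stand-in for the real crawl; the real code would call urls.add(url)
            // from visit(Page page) for every URL it encounters.
            urls.add("http://example.com/");
        };

        Runnable indexer = () -> {
            try {
                while (true) {
                    System.out.println(urls.take()); // blocks until a URL arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        new Thread(crawler, "crawler").start();
        new Thread(indexer, "indexer").start();
    }
}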
Thanks everyone for all your help! :)
