Directing the search depths in Crawler4j Solr

I am trying to make the crawler "abort" searching a certain subdomain whenever it fails to find a relevant page on 3 consecutive tries. After extracting the title and the text of the page, I look for the correct pages to submit to my Solr collection. (I do not want to add pages that don't match this query.)
public void visit(Page page)
{
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
String anchor = page.getWebURL().getAnchor();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
System.out.println("Anchor text: " + anchor);
System.out.println("ContentType: " + page.getContentType());
if(page.getParseData() instanceof HtmlParseData) {
String title, text;
HtmlParseData theHtmlParseData = (HtmlParseData) page.getParseData();
title = theHtmlParseData.getTitle();
text = theHtmlParseData.getText();
if ( (title.toLowerCase().contains(" word1 ") && title.toLowerCase().contains(" word2 ")) || (text.toLowerCase().contains(" word1 ") && text.toLowerCase().contains(" word2 ")) ) {
//
// submit to SOLR server
//
submit(page);
Header[] responseHeaders = page.getFetchResponseHeaders();
if (responseHeaders != null) {
System.out.println("Response headers:");
for (Header header : responseHeaders) {
System.out.println("\t" + header.getName() + ": " + header.getValue());
}
}
failedcounter = 0;// we start counting for 3 consecutive pages
} else {
failedcounter++;
}
if (failedcounter == 3) {
failedcounter = 0; // we start counting for 3 consecutive pages
int parent = page.getWebURL().getParentDocid();
parent....HtmlParseData.setOutgoingUrls(null);
}
My question is: how do I edit the last line of this code so that I can retrieve the parent "page object" and delete its outgoing URLs, so that the crawl moves on to the rest of the subdomains?
Currently I cannot find a function that gets me from the parent ID to the page data so that I can delete the URLs.

The visit(...) method is called as one of the last statements of processPage(...) (line 523 in WebCrawler).
By that point the outgoing links have already been added to the crawler's frontier (and might be processed by other crawler processes as soon as they are added), so clearing them on the parent page afterwards has no effect.
You could implement the behaviour you describe by overriding shouldVisit(...) or (depending on the exact use case) shouldFollowLinksIn(...) in your crawler.
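As a rough sketch of that idea, the counter can be kept per subdomain in a small helper that both visit(...) and shouldVisit(...) consult. The class name SubdomainFailureTracker and its methods are hypothetical (they are not part of crawler4j), and the helper is assumed to be shared between crawler instances, for example through a static field. As far as I understand, shouldVisit(...) is consulted when links are discovered, so URLs that are already queued may still be fetched.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not a crawler4j class): counts consecutive non-matching pages per subdomain. */
class SubdomainFailureTracker {
    private final int maxConsecutiveMisses;
    private final Map<String, Integer> misses = new ConcurrentHashMap<>();
    private final Set<String> blocked = ConcurrentHashMap.newKeySet();

    SubdomainFailureTracker(int maxConsecutiveMisses) {
        this.maxConsecutiveMisses = maxConsecutiveMisses;
    }

    /** Intended to be called from visit(...) when the page matched the query: resets the streak. */
    void recordHit(String subDomain) {
        misses.remove(subDomain);
    }

    /** Intended to be called from visit(...) when the page did not match: extends the streak. */
    void recordMiss(String subDomain) {
        int count = misses.merge(subDomain, 1, Integer::sum);
        if (count >= maxConsecutiveMisses) {
            blocked.add(subDomain);
        }
    }

    /** Intended to be called from shouldVisit(...): true means new links on this subdomain are skipped. */
    boolean isBlocked(String subDomain) {
        return blocked.contains(subDomain);
    }
}
```

In the crawler, visit(...) would call recordHit or recordMiss with page.getWebURL().getSubDomain(), and the shouldVisit(...) override would return false when isBlocked(url.getSubDomain()) is true. Thread-safe collections are used because several crawler threads may report results at the same time.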

Related

Randomly changing the JSON Values for every "Post" Request Body using Java

This could be a duplicate question, but I couldn't find my solution anywhere. Hence, posting it.
I am trying to simply POST a request for a Student account Creation Scenario. I do have a JSON file which comprises all the "Keys:Values", required for Student account creation.
This is what the file student_Profile.json looks like:
{
"FirstName":"APi1-Stud-FN",
"MiddleInitial":"Q",
"LastName":"APi1-Stud-LN",
"UserAlternateEmail":"",
"SecretQuestionId":12,
"SecretQuestionAnswer":"Scot",
"UserName":"APi1-stud@xyz.com",
"VerifyUserName":"APi1-stud@xyz.com",
"Password":"A123456",
"VerifyPassword":"A123456",
"YKey":"123xyz",
"YId":6,
"Status":false,
"KeyCode":"",
"SsoUserName":"APi1-stud@xyz.com",
"SsoPassword":"",
"BirthYear":2001
}
Posting the request with Rest Assured works fine. I just want to update a few values in the JSON body above using Java, so that I create a new Student profile every time I run my function and don't have to change the body manually.
For every POST Student Account Creation scenario, I need to update the values of the following keys so that a new test student account can be created:
FirstName
LastName
UserName // "VerifyUserName" and "SsoUserName" must stay the same as the user name

I modified the answer to generate random values and pass them into the JSON body. The random value generation was taken from the accepted answer of this question.
public void testMethod() {
List<String> randomValueList = new ArrayList<>();
for (int i = 0; i < 3; i++) {
String SALTCHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
StringBuilder salt = new StringBuilder();
Random rnd = new Random();
while (salt.length() < 18) { // length of the random string.
int index = (int) (rnd.nextFloat() * SALTCHARS.length());
salt.append(SALTCHARS.charAt(index));
}
randomValueList.add(salt.toString());
}
// Take each value once; VerifyUserName and SsoUserName must match UserName.
String firstName = randomValueList.remove(0);
String lastName = randomValueList.remove(0);
String userName = randomValueList.remove(0);
String jsonBody = "{\n" +
" \"FirstName\":\"" + firstName + "\",\n" +
" \"MiddleInitial\":\"Q\",\n" +
" \"LastName\":\"" + lastName + "\",\n" +
" \"UserAlternateEmail\":\"\",\n" +
" \"SecretQuestionId\":12,\n" +
" \"SecretQuestionAnswer\":\"Scot\",\n" +
" \"UserName\":\"" + userName + "\",\n" +
" \"VerifyUserName\":\"" + userName + "\",\n" +
" \"Password\":\"A123456\",\n" +
" \"VerifyPassword\":\"A123456\",\n" +
" \"YKey\":\"123xyz\",\n" +
" \"YId\":6,\n" +
" \"Status\":false,\n" +
" \"KeyCode\":\"\",\n" +
" \"SsoUserName\":\"" + userName + "\",\n" +
" \"SsoPassword\":\"\",\n" +
" \"BirthYear\":2001\n" +
"}";
Response response = RestAssured
.given()
.body(jsonBody)
.when()
.post("api_url")
.then()
.extract()
.response();
// Do what you need to do with the response body
}

A POJO-based approach makes this kind of thing easy. No matter how complex the payload is, serialization and deserialization is the cleanest answer. I have created a framework template for API automation that can be used by putting the required POJOs on the path:
https://github.com/tanuj-vishnoi/pojo_api_automation
To create the POJOs, I also have a ready-made generator:
https://github.com/tanuj-vishnoi/pojo_generator_using_jsonschema2pojo
For the problem above, you can use the JsonPath library (https://github.com/json-path/JsonPath) with this code:
String mypayload = "{\n" +
" \"FirstName\":\"APi1-Stud-FN\",\n" +
" \"MiddleInitial\":\"Q\",\n" +
" \"LastName\":\"APi1-Stud-LN\"}";
Map map = JsonPath.parse(mypayload).read("$",Map.class);
System.out.println(map);
Once the payload is converted into a map, you can change only the values you need.
To generate random strings, you can use org.apache.commons.lang3.RandomStringUtils:
public static String generateUniqueString(int lengthOfString){
return
RandomStringUtils.randomAlphabetic(lengthOfString).toLowerCase();
}
I recommend storing the payload in a separate file and loading it at runtime.
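If you would rather not pull in extra libraries, the same idea can be sketched with plain Java: keep the payload as a template and substitute the three changing values on each run. The class name StudentPayloadBuilder and the ${firstName}, ${lastName} and ${userName} placeholders are assumptions made for this illustration; the template file would need to contain them in place of the fixed values.

```java
import java.util.Random;

/** Hypothetical helper: fills a JSON template with freshly generated values. */
class StudentPayloadBuilder {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

    /** Builds a random string of the given length from ALPHABET. */
    static String randomString(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rnd.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /**
     * Replaces every occurrence of each placeholder, so one userName value
     * fills UserName, VerifyUserName and SsoUserName alike.
     */
    static String fill(String template, String firstName, String lastName, String userName) {
        return template
                .replace("${firstName}", firstName)
                .replace("${lastName}", lastName)
                .replace("${userName}", userName);
    }
}
```

Because the user name is generated once and substituted everywhere, the three user-name fields cannot drift apart, which is the requirement stated in the question.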

Simple CoreNLP : ClassNotFoundException

My simple coreNLP code is working with main method as shown in code below.
package com.books.servlet;
import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;
public class SimpleCoreNLPDemo {
public static void main(String[] args) {
// Create a document. No computation is done yet.
Document doc = new Document("add your text here! It can contain multiple sentences.");
for (Sentence sent : doc.sentences()) {
// Will iterate over two sentences
// We're only asking for words -- no need to load any models yet
System.out.println("The second word of the sentence '" + sent + "' is " + sent.word(1));
// When we ask for the lemma, it will load and run the part of speech tagger
System.out.println("The third lemma of the sentence '" + sent + "' is " + sent.lemma(2));
// When we ask for the parse, it will load and run the parser
System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
}
}
}
Then I used the same code in my web application, as shown below. When I execute it, I get a ClassNotFoundException.
My web app code:
public void me(){
Document doc = new Document("add your text here! It can contain multiple sentences.");
for (Sentence sent : doc.sentences()) {
// Will iterate over two sentences
// We're only asking for words -- no need to load any models yet
System.out.println("The second word of the sentence '" + sent + "' is " + sent.word(1));
// When we ask for the lemma, it will load and run the part of speech tagger
System.out.println("The third lemma of the sentence '" + sent + "' is " + sent.lemma(2));
// When we ask for the parse, it will load and run the parser
System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
} }
I have downloaded all the jar files and added them to the build path. The code works fine with the main method.

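I just resolved the problem.
I copied all my Stanford Simple NLP jar files to the directory
/WEB-INF/lib
and now my code works fine. A web application loads its libraries from WEB-INF/lib at runtime, so jars that are only on the build path are not visible to the deployed app. Below is my simple method and its output for your information.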
public String s = "I like java and python";
public static Set<String> nounPhrases = new HashSet<>();
public void me() {
Document doc = new Document(" " + s);
for (Sentence sent : doc.sentences()) {
System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
}
}
Output:
The parse of the sentence 'I like java and python' is (ROOT (S (NP (PRP I)) (VP (VBP like) (NP (NN java) (CC and) (NN python)))))
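To confirm that the jars are really visible to the running web app, a quick check like the following can help. This is only a sketch: ClasspathCheck is a hypothetical helper name, and the class name to test (for example edu.stanford.nlp.simple.Document, taken from the imports above) is passed in by the caller.

```java
/** Hypothetical diagnostic helper: reports whether a class can be found by the current class loader. */
class ClasspathCheck {
    static boolean isClassAvailable(String className) {
        // In a servlet container the context class loader is the web app's loader,
        // which is the one that sees WEB-INF/lib.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling ClasspathCheck.isClassAvailable("edu.stanford.nlp.simple.Document") from inside the servlet and logging the result shows whether the deployment, rather than the code, is the problem.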

POST data to other URI with URL change in REST (POST and Redirect)

I need to POST data and at the same time redirect to that URL in REST environment. I can do this for normal strings, but the requirement is to POST specific Object.
The way I do it for normal string is -
public Response homePage(@FormParam("username") String username,
@FormParam("passwordhash") String password) {
return Response.ok(PreparePOSTForm(username)).build();
}
private static String PreparePOSTForm(String username)
{
//Set a name for the form
String formID = "PostForm";
String url = "home";
//Build the form using the specified data to be posted.
StringBuilder strForm = new StringBuilder();
strForm.append("<form id=\"" + formID + "\" name=\"" +
formID + "\" action=\"" + url +
"\" method=\"POST\">");
strForm.append("<input type=\"hidden\" name=\"" + "username" +
"\" value=\"" + username + "\">");
strForm.append("</form>");
//Build the JavaScript which will do the Posting operation.
StringBuilder strScript = new StringBuilder();
strScript.append("<script language=\"javascript\">");
strScript.append("var v" + formID + " = document." +
formID + ";");
strScript.append("v" + formID + ".submit();");
strScript.append("</script>");
//Return the form and the script concatenated.
//(The order is important, Form then JavaScript)
return strForm.toString() + strScript.toString();
}
But this method is not sending Objects. I need a work around to send complex Objects. Please help me with this issue.
Thanks in advance.

Crawling a URL in order to extract all the other URLs in that page

I am trying to crawl URLs in order to extract the other URLs inside each page. To do so, I read the HTML of the page line by line, match each line against a pattern, and then extract the needed parts, as shown below:
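import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SimpleCrawler {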
static String pattern="https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
static Pattern UrlPattern = Pattern.compile (pattern);
static Matcher UrlMatcher;
public static void main(String[] args) {
try {
URL url = new URL("https://stackoverflow.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String line; // a variable cannot be declared inside the loop condition
while((line = br.readLine())!=null){
UrlMatcher= UrlPattern.matcher(line);
if(UrlMatcher.find())
{
String extractedPath = UrlMatcher.group(1);
String extractedPath2 = UrlMatcher.group(2);
System.out.println("http://www."+extractedPath+".com"+extractedPath2);
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
However, there are some issues with it that I would like to address:
1. How can I make http, www, or both optional? I have encountered many links without one or both parts, and the regex does not match them.
2. In my code I make two groups: one between http and the domain extension, and a second for whatever comes after it. This causes two sub-problems:
2.1 Since the input is HTML, the HTML tags that come after the URL are extracted too.
2.2 In System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure the right URL is shown (regardless of the previous issues), because I do not know which domain extension was matched.
3. Last but not least, how can I match both http and https?

How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+@)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}

Use a library. I used HtmlCleaner, and it does the job. You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested), with jsoup:
http://jsoup.org/cookbook/extracting-data/example-list-links
It is rather readable.
You can enhance the code below: select <a> tags or others, the href attribute, and so on. Handling mixed case (HreF, HRef, ...) more precisely is left as an exercise.
import java.util.Map;
import java.util.Vector;
import org.htmlcleaner.*;
public static Vector<String> HTML2URLS(String _source)
{
Vector<String> result=new Vector<String>();
HtmlCleaner cleaner = new HtmlCleaner();
// Principal Node
TagNode node = cleaner.clean(_source);
// All nodes
TagNode[] myNodes =node.getAllElements(true);
int s=myNodes.length;
for (int pos=0;pos<s;pos++)
{
TagNode tn=myNodes[pos];
// all attributes
Map<String,String> mss=tn.getAttributes();
// Name of tag
String name=tn.getName();
// Is there href ?
String href="";
if (mss.containsKey("href")) href=mss.get("href");
if (mss.containsKey("HREF")) href=mss.get("HREF");
if (name.equals("a")) result.add(href);
if (name.equals("A")) result.add(href);
}
return result;
}
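If you want to stay with a regex for a quick experiment, the points raised in the question (optional scheme, optional www, http and https, knowing which extension matched, and not swallowing trailing HTML) can be sketched as follows. The class name UrlExtractor and the fixed list of extensions (com, net, org) are assumptions carried over from the question; this is not a general URL parser, and an HTML parser as suggested above is usually more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls URL-like strings out of a line of HTML. */
class UrlExtractor {
    // group 1: optional scheme (http:// or https://)
    // group 2: optional "www."
    // group 3: host name without the extension
    // group 4: the extension that actually matched
    // group 5: optional path, stopping at whitespace, quotes, or angle brackets
    private static final Pattern URL = Pattern.compile(
            "(https?://)?(www\\.)?([a-z0-9-]+(?:\\.[a-z0-9-]+)*)\\.(com|net|org)(?![a-z0-9-])(/[^\\s\"'<>]*)?",
            Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {          // a loop finds every URL on the line, not just the first
            result.add(m.group());  // the full match; m.group(4) would give the extension
        }
        return result;
    }
}
```

Using the whole match (or group 4 for the extension) avoids rebuilding the URL with a hard-coded ".com", and excluding quotes and angle brackets from the path keeps the rest of the HTML tag out of the result.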

Java script add element without reload

I have a div to which I add more divs with JavaScript.
However, every time I add a new div to it, the whole container is refreshed, so, for example, a YouTube video in one of those divs stops playing.
Can I add these divs without reloading the container?
My current code for adding content is
m += "new thing <a> and other stuff </a>"
I need it to add the new content without a reload.
I currently add messages using href="javascript: addMessage('current time', 'user', 'message')"
The addMessage code:
function addMessage(time, user, msg) {
if (msg == "") {
return false;
}
var m = document.getElementById('message-panel');
m.innerHTML += "<div class='sentMessage'><span class='time'>" + time + "</span><span class='name'><a>" + user + "</a></span><span class='message'>" + msg + "</span></div>";
pageScroll();
if (user != "SERVER") {
if (user != "ERROR") {
playAudio('new-message-sound');
}
}
return false;
}
The only way I can add new messages is with href="javascript:addMessage()". I cannot use onclick="" because I'm using Java to control the JavaScript.
My Java code for adding the messages:
public void addMessage(String user, String msg) {
try {
getAppletContext().showDocument(new URL("javascript:addMessage(\"" + Time.now("HH:mm") + "\", \"" + user + "\", \"" + msg + "\")"));
}
catch (MalformedURLException me) {}
}
Thanks in advance, enji

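Create the element with document.createElement('tag name here'); then insert it with m.appendChild(newelement);. Assigning to innerHTML with += makes the browser re-parse and rebuild every child of the container, which is why the video restarts; appendChild leaves the existing elements untouched.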
