I have two classes in Java that need to run at the same time: a Crawler class (which implements a web crawler and keeps printing out URLs as it encounters them), and an Indexer class, which for now is only supposed to print the URLs that were crawled.
For this, my Indexer class has a Queue:
public static Queue<String> urls = new LinkedList();
And in the visit() function of my Crawler class, I have the following:
Indexer.urls.add(url); // where url is a String
The Crawler is working totally fine, since it prints out all the URLs it has encountered, but for some reason these URLs do not get added to the Queue in my Indexer class. Any idea why this may be the case?
The visit() method from Crawler.java is as follows:
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
Indexer.urls.add( url );
System.out.println("=============");
}
Code from my Indexer class:
public static Queue<String> urls = new LinkedList();
public static void main(String[] args) throws InterruptedException
{
while( urls.isEmpty() )
{
//System.out.println("Empty send queue");
Thread.sleep(sleepTime);
}
System.out.println( urls.poll() );
}
Okay, so I solved my problem by doing as suggested by BigMike. I implemented the Runnable interface in my two classes, and then ran those two classes as threads from the main method of a new, third class (a rough sketch of that setup is below).
Thanks everyone for all your help! :)
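For anyone who lands here later, here is a minimal sketch of that setup, assuming Crawler and Indexer both implement Runnable (the stand-in Runnables below are illustrative, not my actual classes). Note that a plain LinkedList is not safe to share between threads, so the sketch uses a ConcurrentLinkedQueue for the shared queue:
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class Main {
    // Shared, thread-safe queue (a plain LinkedList is not safe across threads).
    static final Queue<String> urls = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Stand-ins for the real Crawler and Indexer Runnables.
        Runnable crawler = () -> urls.add("http://example.com");
        Runnable indexer = () -> {
            String url;
            while ((url = urls.poll()) == null) {
                try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            }
            System.out.println(url);
        };
        Thread crawlerThread = new Thread(crawler);
        Thread indexerThread = new Thread(indexer);
        crawlerThread.start();
        indexerThread.start();
        crawlerThread.join();
        indexerThread.join();
    }
}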
I was trying to reload the config.yml file on command in my Bukkit plugin, but I don't know how to do it.
I searched on Google and found one answer, but when I used it, config.yml was not being generated. Here's my code:
BlockChanger
Please help
First you need to remove the final modifier from your config variables; otherwise they cannot be refreshed from the config file.
Then you need a method that reloads the config and sets the config variables again. An example based on your code:
@Override
public void onEnable() {
loadConfig(this);
}
private final String prefix = ChatColor.AQUA + "[";
private String prefixTrue;
private String prefixFalse;
public void loadConfig(Plugin plugin) {
File file = new File(plugin.getDataFolder().getAbsolutePath() + "/config.yml");
FileConfiguration cfg = YamlConfiguration.loadConfiguration(file);
prefixTrue = prefix + cfg.getString("prefix") + "]" + ChatColor.GREEN + " ";
prefixFalse = prefix + cfg.getString("prefix") + "]" + ChatColor.RED + " ";
}
Make sure that you call the loadConfig method in onEnable and every time you want to reload the config, for example from a reload command (see the sketch below).
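As a hedged illustration (the command name and message below are made up, not from the original plugin, and the command would also need to be registered in plugin.yml), a reload command inside the same plugin class could simply call loadConfig again:
@Override
public boolean onCommand(CommandSender sender, Command command, String label, String[] args) {
    if (command.getName().equalsIgnoreCase("myreload")) {
        loadConfig(this); // re-read config.yml and rebuild the prefix strings
        sender.sendMessage(prefixTrue + "Config reloaded.");
        return true;
    }
    return false;
}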
I am using Java 8, and I would like to check whether a URL is valid or not based on a pattern.
If it is valid, then I should get the attributes bookId, authorId, category and mediaId.
Pattern: <basepath>/books/<bookId>/author/<authorId>/<isbn>/<category>/mediaId/<filename>
And this is the sample URL
URL => https:/<baseurl>/v1/files/library/books/1234-4567/author/56784589/32475622347586/media/324785643257567/507f1f77bcf86cd799439011_400.png
Here Basepath is /v1/files/library.
I have seen some pattern-matching examples, but I couldn't relate them to my use case; I am probably not very good at regex. I am also using Apache Commons utilities, but I am not sure how to achieve it that way either.
Any help or hint would be really appreciated.
Try this solution (it uses named capture groups in the regex; you need java.util.regex.Pattern and java.util.regex.Matcher):
public static void main(String[] args)
{
Pattern p = Pattern.compile("http[s]?:.+/books/(?<bookId>[^/]+)/author/(?<authorId>[^/]+)/(?<isbn>[^/]+)/media/(?<mediaId>[^/]+)/(?<filename>.+)");
Matcher m = p.matcher("https:/<baseurl>/v1/files/library/books/1234-4567/author/56784589/32475622347586/media/324785643257567/507f1f77bcf86cd799439011_400.png");
if (m.matches())
{
System.out.println("bookId = " + m.group("bookId"));
System.out.println("authorId = " + m.group("authorId"));
System.out.println("isbn = " + m.group("isbn"));
System.out.println("mediaId = " + m.group("mediaId"));
System.out.println("filename = " + m.group("filename"));
}
}
prints:
bookId = 1234-4567
authorId = 56784589
isbn = 32475622347586
mediaId = 324785643257567
filename = 507f1f77bcf86cd799439011_400.png
I need some help with regex in the following case.
I'm reading a folder with multiple files such as A.AAA2000.XYZ or B.BBB2000.AY.
I have to search every file for a line (or lines) matching a pattern like this:
CALL (or CALL-PROC or ENTER) $XX.whatever,whatever1,whatever2 and so on.
The XX.whatever part can be another file in my folder, or it might not exist at all. What I need to do is find which files contain that pattern and, for each match, check whether the XX.whatever part is an existing file, then output the ones that don't exist. The problem is that I have to stop at the first occurrence of ",", otherwise I get false results, and I can't seem to get it to work properly. I did everything except getting rid of that ",". I attached some code and an example below; please help if you can:
Example (as intended to work):
Searching file A.AAA2000.XYZ
Found procedure(s): $XX.B.BBB.2000.AY,LALA,LALA1,LALA2
Searching file B.BBB.2000.AY
Found procedure(s): $XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Searching file C.CCC.2000.XYZ
ERROR: File doesn't exist or no procedures called
Procedures found:
B.BBB.2000.AY
Procedures not found:
C.CCC.2000.XYZ
Example 2 (how it's working right now):
Searching file A.AAA2000.XYZ
Found procedure(s): #XX.B.BBB.2000.AY,LALA,LALA1,LALA2
Searching file B.BBB.2000.AY
Found procedure(s): #XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Searching file C.CCC.2000.XYZ
ERROR: File doesn't exist or no procedures called
...........................
...........................
...........................
Procedures found:
XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALALA
Procedures not found:
B.BBB.2000.AY,LALA,LALA1,LALA2
C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Parts of code:
private static final String[] _keyWords = {"CALL-PROC", "CALL", "ENTER"};
private static final String _procedureRegex = ".* \\$PR\\..*";
private static final String _lineSkipper = "/REMARK";
private static final String _procedureNameFormat = "\\$PR\\..+";
private static boolean CallsProcedure(String givenLine) {
    for (String keyWord : _keyWords) {
        if (givenLine.contains(keyWord) && !givenLine.contains(_lineSkipper)) {
            Pattern procedurePattern = Pattern.compile(_procedureRegex);
            Matcher procedureMatcher = procedurePattern.matcher(givenLine);
            return procedureMatcher.find();
        }
    }
    return false; // no keyword found, or the line is a remark
}
READING:
private void ReadContent(File givenFile,
HashMap<String, HashSet<String>> whereToAddProcedures,
HashMap<String, HashSet<String>> whereToAddFiles) throws IOException {
System.out.println("Processing file " + givenFile.getAbsolutePath());
BufferedReader fileReader = new BufferedReader(new FileReader(givenFile));
String currentLine;
while ((currentLine = fileReader.readLine()) != null) {
if (CallsProcedure(currentLine)) {
String CProc = currentLine.split("\\$PR\\.")[1];
if (whereToAddProcedures.containsKey(CProc)) {
System.out.println("Procedure " + CProc + " already exists, adding more paths.");
whereToAddProcedures.get(CProc).add(givenFile.getAbsolutePath());
} else {
System.out.println("Adding Procedure " + CProc);
whereToAddProcedures.put(CProc,
new HashSet<>(Collections.singletonList(givenFile.getAbsolutePath())));
}
if (givenFile.getName().matches(_procedureNameFormat)) {
if (whereToAddFiles.containsKey(givenFile.getAbsolutePath())) {
System.out.println("File " + givenFile.getName()
+ " already has procedure calls, adding " + CProc);
whereToAddProcedures.get(givenFile.getName()).add(CProc);
} else {
System.out.println("Adding Procedure Call for " + CProc + " to "
+ givenFile.getName());
whereToAddProcedures.put(givenFile.getName(),
new HashSet<>(Collections.singletonList(CProc)));
}
}
}
}
fileReader.close();
}
If the comma marks the end of the pattern, you can exclude it from the last part of the regex, stopping the match when a comma appears. Like this:
_procedureRegex = ".* \\$PR\\.[^,]*";
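As a small, hedged sketch of the same idea (the sample line below is made up), you can also put a capture group around the name part so you get just the procedure name, without splitting on "$PR." afterwards:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProcedureNameDemo {
    public static void main(String[] args) {
        // Capture everything after "$PR." up to, but not including, the first comma.
        Pattern p = Pattern.compile("\\$PR\\.([^,]+)");
        Matcher m = p.matcher("CALL $PR.B.BBB.2000.AY,LALA,LALA1,LALA2");
        if (m.find()) {
            System.out.println(m.group(1)); // prints B.BBB.2000.AY
        }
    }
}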
I am trying to crawl URLs in order to extract other URLs inside each URL. To do so, I read the HTML code of the page line by line, match each line against a pattern, and then extract the needed part as shown below:
public class SimpleCrawler {
static String pattern="https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
static Pattern UrlPattern = Pattern.compile (pattern);
static Matcher UrlMatcher;
public static void main(String[] args) {
try {
URL url = new URL("https://stackoverflow.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while((line = br.readLine()) != null){
UrlMatcher= UrlPattern.matcher(line);
if(UrlMatcher.find())
{
String extractedPath = UrlMatcher.group(1);
String extractedPath2 = UrlMatcher.group(2);
System.out.println("http://www."+extractedPath+".com"+extractedPath2);
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
However, there are some issues with it which I would like to address:
How is it possible to make either http or www, or even both of them, optional? I have encountered many cases where links lack either or both of these parts, so the regex will not match them.
In my code, I capture two groups: one from http up to the domain extension, and the second is whatever comes after it. This, however, causes two sub-problems:
2.1 Since this is HTML code, the rest of the HTML tags that may come after the URL get extracted too.
2.2 In the System.out.println("http://www."+extractedPath+".com"+extractedPath2); line, I cannot be sure that it shows the right URL (regardless of the previous issues), because I do not know which domain extension it was matched with.
Last but not least, how can I match both http and https as well?
How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+#)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
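As a follow-up, note that matches() requires the whole string to match, so it is suited to validating a single URL rather than pulling links out of a line of HTML. A simplified sketch of my own (not the pattern above) that makes the scheme and the www. prefix optional and extracts host and path with find() could look like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorSketch {
    public static void main(String[] args) {
        // Optional http/https scheme, optional www., then host and an optional path.
        Pattern p = Pattern.compile("(?:https?://)?(?:www\\.)?([a-zA-Z0-9.-]+\\.[a-z]{2,})(/[^\\s\"'<>]*)?");
        Matcher m = p.matcher("<a href=\"https://stackoverflow.com/questions\">link</a> and www.example.org/page");
        while (m.find()) {
            String host = m.group(1);
            String path = m.group(2) == null ? "" : m.group(2);
            System.out.println(host + path);
        }
    }
}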
Another option is to use a library. I used HtmlCleaner; it does the job.
You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested) uses jsoup:
http://jsoup.org/cookbook/extracting-data/example-list-links
It is rather readable; a short sketch of the jsoup approach follows the code below.
You can enhance it: select <A> tags or others, HREF, etc.,
or be more precise with case handling (HreF, HRef, ...) as an exercise.
import org.htmlcleaner.*;
import java.util.Map;
import java.util.Vector;
public static Vector<String> HTML2URLS(String _source)
{
Vector<String> result=new Vector<String>();
HtmlCleaner cleaner = new HtmlCleaner();
// Principal Node
TagNode node = cleaner.clean(_source);
// All nodes
TagNode[] myNodes =node.getAllElements(true);
int s=myNodes.length;
for (int pos=0;pos<s;pos++)
{
TagNode tn=myNodes[pos];
// all attributes
Map<String,String> mss=tn.getAttributes();
// Name of tag
String name=tn.getName();
// Is there href ?
String href="";
if (mss.containsKey("href")) href=mss.get("href");
if (mss.containsKey("HREF")) href=mss.get("HREF");
if (name.equals("a")) result.add(href);
if (name.equals("A")) result.add(href);
}
return result;
}
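For comparison, here is a minimal sketch of the jsoup approach linked above (also not tested against a live site; the URL parameter is just an example):
import java.io.IOException;
import java.util.Vector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkSketch {
    public static Vector<String> html2urls(String pageUrl) throws IOException {
        Vector<String> result = new Vector<String>();
        Document doc = Jsoup.connect(pageUrl).get();   // fetch and parse the page
        for (Element link : doc.select("a[href]")) {   // every <a> tag with an href
            result.add(link.attr("abs:href"));         // absolute URL of the link
        }
        return result;
    }
}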
I'm trying to use the Basic crawler example in crawler4j. I took the code from the crawler4j website here.
package edu.crawler;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
public class MyCrawler extends WebCrawler {
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/**
* You should implement this function to specify whether the given url
* should be crawled or not (based on your crawling logic).
*/
@Override
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
}
/**
* This function is called when a page is fetched and ready to be processed
* by your program.
*/
@Override
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
String anchor = page.getWebURL().getAnchor();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
System.out.println("Anchor text: " + anchor);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
Header[] responseHeaders = page.getFetchResponseHeaders();
if (responseHeaders != null) {
System.out.println("Response headers:");
for (Header header : responseHeaders) {
System.out.println("\t" + header.getName() + ": " + header.getValue());
}
}
System.out.println("=============");
}
}
Above is the code for the crawler class from the example.
public class Controller {
public static void main(String[] args) throws Exception {
String crawlStorageFolder = "../data/";
int numberOfCrawlers = 7;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}
}
Above is the class for the controller class for the web crawler.
When I try to run the Controller class from my IDE (IntelliJ), I get the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0
Is there something about the Maven config that is found here that I should know? Do I have to use a different version or something?
The problem wasn't with crawler4j. The problem was that the version of Java I was running was older than the version of Java that crawler4j was built with: class file major version 51.0 corresponds to Java 7, so my older JVM could not load the library's classes. I switched to the crawler4j version released right before they updated to Java 7, and everything worked fine. I'm guessing that upgrading my own Java to 7 would have had the same effect.
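If anyone hits the same error, a quick way to confirm which runtime and class file format your IDE is actually using is to print the relevant system properties (a minimal check, not specific to crawler4j):
public class JavaVersionCheck {
    public static void main(String[] args) {
        // "Unsupported major.minor version 51.0" means the classes were compiled for Java 7 (class format 51).
        System.out.println("java.version       = " + System.getProperty("java.version"));
        System.out.println("java.class.version = " + System.getProperty("java.class.version"));
    }
}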