Simple CoreNLP : ClassNotFoundException - java

My simple CoreNLP code works when run from a main method, as shown below.
package com.books.servlet;

import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;

public class SimpleCoreNLPDemo {
    public static void main(String[] args) {
        // Create a document. No computation is done yet.
        Document doc = new Document("add your text here! It can contain multiple sentences.");
        for (Sentence sent : doc.sentences()) { // will iterate over two sentences
            // We're only asking for words -- no need to load any models yet
            System.out.println("The second word of the sentence '" + sent + "' is " + sent.word(1));
            // When we ask for the lemma, it will load and run the part-of-speech tagger
            System.out.println("The third lemma of the sentence '" + sent + "' is " + sent.lemma(2));
            // When we ask for the parse, it will load and run the parser
            System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
        }
    }
}
Then I used the same code in my web application, as below. When I execute it, I get the ClassNotFoundException from the title.
My web app code:
public void me() {
    Document doc = new Document("add your text here! It can contain multiple sentences.");
    for (Sentence sent : doc.sentences()) { // will iterate over two sentences
        // We're only asking for words -- no need to load any models yet
        System.out.println("The second word of the sentence '" + sent + "' is " + sent.word(1));
        // When we ask for the lemma, it will load and run the part-of-speech tagger
        System.out.println("The third lemma of the sentence '" + sent + "' is " + sent.lemma(2));
        // When we ask for the parse, it will load and run the parser
        System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
    }
}
I have downloaded all the jar files and added them to the build path. Everything works fine when run via the main method.

I just resolved the problem: I copied all my Stanford Simple NLP jar files to the /WEB-INF/lib directory, and now my code works fine. Below are my simple method and its output, for your information.
public String s = "I like java and python";
public static Set<String> nounPhrases = new HashSet<>();

public void me() {
    Document doc = new Document(s);
    for (Sentence sent : doc.sentences()) {
        System.out.println("The parse of the sentence '" + sent + "' is " + sent.parse());
    }
}
Output:
The parse of the sentence 'I like java and python' is (ROOT (S (NP (PRP I)) (VP (VBP like) (NP (NN java) (CC and) (NN python)))))

Related

Directing the search depths in Crawler4j Solr

I am trying to make the crawler "abort" searching a certain subdomain whenever it fails to find a relevant page three times in a row. After extracting the title and the text of the page, I start looking for the correct pages to submit to my Solr collection. (I do not want to add pages that don't match this query.)
public void visit(Page page) {
    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String domain = page.getWebURL().getDomain();
    String path = page.getWebURL().getPath();
    String subDomain = page.getWebURL().getSubDomain();
    String parentUrl = page.getWebURL().getParentUrl();
    String anchor = page.getWebURL().getAnchor();
    System.out.println("Docid: " + docid);
    System.out.println("URL: " + url);
    System.out.println("Domain: '" + domain + "'");
    System.out.println("Sub-domain: '" + subDomain + "'");
    System.out.println("Path: '" + path + "'");
    System.out.println("Parent page: " + parentUrl);
    System.out.println("Anchor text: " + anchor);
    System.out.println("ContentType: " + page.getContentType());
    if (page.getParseData() instanceof HtmlParseData) {
        String title, text;
        HtmlParseData theHtmlParseData = (HtmlParseData) page.getParseData();
        title = theHtmlParseData.getTitle();
        text = theHtmlParseData.getText();
        if ((title.toLowerCase().contains(" word1 ") && title.toLowerCase().contains(" word2 "))
                || (text.toLowerCase().contains(" word1 ") && text.toLowerCase().contains(" word2 "))) {
            // submit to SOLR server
            submit(page);
            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                System.out.println("Response headers:");
                for (Header header : responseHeaders) {
                    System.out.println("\t" + header.getName() + ": " + header.getValue());
                }
            }
            failedcounter = 0; // reset: we start counting 3 consecutive misses again
        } else {
            failedcounter++;
        }
        if (failedcounter == 3) {
            failedcounter = 0; // reset: we start counting 3 consecutive misses again
            int parent = page.getWebURL().getParentDocid();
            parent....HtmlParseData.setOutgoingUrls(null); // pseudocode -- this is the part in question
        }
    }
}
My question is: how do I edit the last line of this code so that I can retrieve the parent page object and delete its outgoing URLs, so that the crawl moves on to the rest of the subdomains?
Currently I cannot find a function that gets me from the parent docid to the page data in order to delete the URLs.
The visit(...) method is called as one of the last statements of processPage(...) (line 523 in WebCrawler). By that point the outgoing links have already been added to the crawler's frontier (and might be processed by other crawler processes as soon as they are added), so clearing them from the parent page would come too late.
You could instead implement the behaviour you describe by adjusting shouldVisit(...) or (depending on the exact use case) shouldFollowLinksIn(...) on the crawler.
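Since the links are already in the frontier, the "three consecutive misses" bookkeeping has to happen on your side before links are followed. As a minimal stdlib sketch (the class and method names here are hypothetical, not crawler4j API), the counter logic you would consult from a shouldVisit-style check could look like this:

```java
import java.util.HashMap;
import java.util.Map;

public class FailureTracker {
    private static final int MAX_FAILURES = 3;
    // consecutive-miss count per subdomain
    private final Map<String, Integer> failures = new HashMap<>();

    // Record the result of visiting a page on the given subdomain.
    // Returns true when the subdomain has just hit the abort threshold.
    public boolean recordVisit(String subDomain, boolean relevant) {
        if (relevant) {
            failures.put(subDomain, 0); // a relevant page resets the count
            return false;
        }
        int count = failures.merge(subDomain, 1, Integer::sum);
        return count >= MAX_FAILURES;
    }

    // Called from a shouldVisit-style predicate: skip abandoned subdomains.
    public boolean isBlocked(String subDomain) {
        return failures.getOrDefault(subDomain, 0) >= MAX_FAILURES;
    }
}
```

You would call recordVisit from your visit(...) in place of the failedcounter field, and consult isBlocked in your shouldVisit(...) override.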

Convert HOCON string into Java object

One of my web services returns the string below:
[
    {
        id=5d93532e77490b00013d8862,
        app=null,
        manufacturer=pearsonEducation,
        bookUid=bookIsbn,
        model=2019,
        firmware=[1.0],
        bookName=devotional,
        accountLinking=mandatory
    }
]
I have an equivalent Java class for the above string, and I would like to convert the string into an instance of that class. I couldn't type-cast it, since it's a String, not an object. So I tried converting the string to a JSON string first, so that I could then map it onto the Java object, but no luck: I get an invalid character "=" exception.
Can you change the web service to return JSON?
That's not possible. They are not changing their contracts. It would be super easy if they returned JSON.
The format your web service returns has its own name: HOCON (Human-Optimized Config Object Notation). (You can read more about it here.)
You do not need a custom parser; do not try to reinvent the wheel.
Use an existing one instead.
Add this maven dependency to your project:
<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.0</version>
</dependency>
Then parse the response as follows:
Config config = ConfigFactory.parseString(text);
String id = config.getString("id");
Long model = config.getLong("model");
There is also an option to parse the whole string into a POJO:
MyResponsePojo response = ConfigBeanFactory.create(config, MyResponsePojo.class);
Unfortunately this parser does not allow null values. So you'll need to handle exceptions of type com.typesafe.config.ConfigException.Null.
Another option is to convert the HOCON string into JSON:
String hoconString = "...";
String jsonString = ConfigFactory.parseString(hoconString)
        .root()
        .render(ConfigRenderOptions.concise());
Then you can use any JSON-to-POJO mapper.
Well, this is definitely not the best answer to be given here, but it is possible, at least. Manipulate the String in small steps like this in order to get a Map<String, String> which can then be processed. See this very basic example:
public static void main(String[] args) {
    String data = "[\r\n"
            + " {\r\n"
            + " id=5d93532e77490b00013d8862, \r\n"
            + " app=null,\r\n"
            + " manufacturer=pearsonEducation, \r\n"
            + " bookUid=bookIsbn, \r\n"
            + " model=2019,\r\n"
            + " firmware=[1.0], \r\n"
            + " bookName=devotional, \r\n"
            + " accountLinking=mandatory\r\n"
            + " }\r\n"
            + "]";
    // manipulate the String in order to have
    String[] splitData = data
            // no leading and trailing [ ] - cut the first and last char
            .substring(1, data.length() - 1)
            // no line breaks
            .replace("\n", "")
            // no Windows line breaks
            .replace("\r", "")
            // no opening curly brackets
            .replace("{", "")
            // and no closing curly brackets.
            .replace("}", "")
            // then split it by comma
            .split(",");
    // create a map to store the keys and values
    Map<String, String> dataMap = new HashMap<>();
    // iterate the key-value pairs connected with '='
    for (String s : splitData) {
        // split them by the equality symbol
        String[] keyVal = s.trim().split("=");
        // then take the key
        String key = keyVal[0];
        // and the value
        String val = keyVal[1];
        // and store them in the map -> could be done directly, of course
        dataMap.put(key, val);
    }
    // print the map content
    dataMap.forEach((key, value) -> System.out.println(key + " -> " + value));
}
Please note that I just copied your example String, which may have caused the line breaks, and that I think it is not smart to simply replace() all square brackets, because the value of firmware seems to include them as content.
In my opinion, we should split the parsing into two steps:
1. Format the output data to JSON.
2. Parse the text with a JSON utility.
In this demo code I chose a regex as the formatting method and fastjson as the JSON tool; you can choose Jackson or Gson instead. Furthermore, I removed the [ ]; you can put it back and then parse the result into an array.
import com.alibaba.fastjson.JSON;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SerializedObject {
    private String id;
    private String app;

    static Pattern compile = Pattern.compile("([a-zA-Z0-9.]+)");

    public static void main(String[] args) {
        String str =
                " {\n" +
                "   id=5d93532e77490b00013d8862, \n" +
                "   app=null,\n" +
                "   manufacturer=pearsonEducation, \n" +
                "   bookUid=bookIsbn, \n" +
                "   model=2019,\n" +
                "   firmware=[1.0], \n" +
                "   bookName=devotional, \n" +
                "   accountLinking=mandatory\n" +
                " }\n";
        // turn '=' into ':' so the structure looks like JSON
        String s1 = str.replaceAll("=", ":");
        // quote every bare token so keys and values become JSON strings
        StringBuffer sb = new StringBuffer();
        Matcher matcher = compile.matcher(s1);
        while (matcher.find()) {
            matcher.appendReplacement(sb, "\"" + matcher.group(1) + "\"");
        }
        matcher.appendTail(sb);
        System.out.println(sb.toString());
        SerializedObject serializedObject = JSON.parseObject(sb.toString(), SerializedObject.class);
        System.out.println(serializedObject);
    }
}

Confusion with File instance function in Java: How to import an arbitrary .csv file into mysql instead of a specific one?

I currently have code that imports a specific .csv file (directory provided) into MySQL.
I'm trying to tweak it and play around with the File creation shown in the Java tutorials, i.e. File file = new File("d:\\myproject\\java\\Hello.java");, and modified the code as follows:
import java.sql.Connection;
import java.sql.Statement;
import java.io.*;

public class ImportCsv {
    public static void main(String[] args) {
        ImportCsv.readCsvUsingLoad();
    }

    public static void readCsvUsingLoad() {
        try (Connection connection = DBConnection.getConnection()) {
            File file = new File("C:/Users/User/Desktop/Test/upload2.csv");
            String loadQuery = "LOAD DATA LOCAL INFILE '" + "file" + "' INTO TABLE txn_tbl FIELDS TERMINATED BY ','"
                    + " LINES TERMINATED BY '\n' " + "IGNORE 1 LINES(txn_amount, card_number, terminal_id)";
            System.out.println(loadQuery);
            Statement stmt = connection.createStatement();
            stmt.execute(loadQuery);
            System.out.println("Data import success");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
However, IntelliJ keeps showing a FileNotFoundException.
Am I misunderstanding the usage of File instance creation here?
That's because you used the file name as the literal String "file"; you have to remove the two quotes. And you don't need to create a File at all, you just need the path (a String):
String file = "C:/Users/User/Desktop/Test/upload2.csv";
String loadQuery = "LOAD DATA LOCAL INFILE '" + file + "' "
//---------------------------------------------^^
        + "INTO TABLE txn_tbl FIELDS TERMINATED BY ',' "
        + "LINES TERMINATED BY '\n' "
        + "IGNORE 1 LINES(txn_amount, card_number, terminal_id)";
Another way is to use getAbsolutePath() like this:
File file = new File("C:/Users/User/Desktop/Test/upload2.csv");
String path = file.getAbsolutePath();
String loadQuery = "LOAD DATA LOCAL INFILE '" + path + "' "
//---------------------------------------------^^
        + "INTO TABLE txn_tbl FIELDS TERMINATED BY ',' "
        + "LINES TERMINATED BY '\n' "
        + "IGNORE 1 LINES(txn_amount, card_number, terminal_id)";
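One more pitfall worth guarding against: on Windows, getAbsolutePath() returns backslashes, and MySQL treats a backslash inside the quoted literal as an escape character. A small sketch (reusing the table and column names from the question; the helper class is mine, not part of the original code) that normalizes the path to forward slashes, which MySQL accepts on every platform:

```java
import java.io.File;

public class CsvPathHelper {
    // Build the LOAD DATA statement with forward slashes in the path,
    // so Windows-style backslashes are not misread as escape sequences.
    public static String buildLoadQuery(File csvFile) {
        String path = csvFile.getAbsolutePath().replace('\\', '/');
        return "LOAD DATA LOCAL INFILE '" + path + "' "
                + "INTO TABLE txn_tbl FIELDS TERMINATED BY ',' "
                + "LINES TERMINATED BY '\\n' "
                + "IGNORE 1 LINES(txn_amount, card_number, terminal_id)";
    }
}
```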

Crawling a URL in order to extract all the other URLs in that page

I am trying to crawl URLs in order to extract the other URLs found on each page. To do so, I read the HTML of the page line by line, match each line against a pattern, and extract the needed part, as shown below:
public class SimpleCrawler {
    static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
    static Pattern UrlPattern = Pattern.compile(pattern);
    static Matcher UrlMatcher;

    public static void main(String[] args) {
        try {
            URL url = new URL("https://stackoverflow.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line; // declared outside the loop condition; a declaration inside it won't compile
            while ((line = br.readLine()) != null) {
                UrlMatcher = UrlPattern.matcher(line);
                if (UrlMatcher.find()) {
                    String extractedPath = UrlMatcher.group(1);
                    String extractedPath2 = UrlMatcher.group(2);
                    System.out.println("http://www." + extractedPath + ".com/" + extractedPath2);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
However, there are some issues with it which I would like to address:
1. How is it possible to make http and www, or even both of them, optional? I have encountered many links missing either or both parts, so the regex will not match them.
2. According to my code, I capture two groups: one from after the scheme up to the domain extension, and the second being whatever comes after it. This, however, causes two sub-problems:
2.1 Since this is HTML source, the rest of the HTML tags that may come after the URL get extracted too.
2.2 In System.out.println("http://www." + extractedPath + ".com" + extractedPath2); I cannot be sure it prints the right URL (regardless of the previous issues), because I do not know which domain extension was actually matched.
Last but not least, I wonder how to match both http and https as well.
How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+#)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
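At the other end of the spectrum, if the full RFC-style pattern above is more than you need, here is a deliberately small sketch that just addresses making the scheme and the www. prefix optional (and matching both http and https). It is a rough heuristic, not a complete URL grammar; the class name is mine:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalSchemeDemo {
    // Scheme (http or https) and "www." are both optional; group 1
    // captures the bare host, so you can see which domain matched.
    static final Pattern URL = Pattern.compile(
            "(?:https?://)?(?:www\\.)?([a-z0-9-]+(?:\\.[a-z0-9-]+)+)(/\\S*)?",
            Pattern.CASE_INSENSITIVE);

    // Returns the host of the first URL-like token in the text, or null.
    public static String host(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Capturing the host as a whole also sidesteps the "which domain extension was matched" problem from the question, since the extension is part of group 1.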
With one library: I used HtmlCleaner, and it does the job. You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested) with jsoup, rather readable:
http://jsoup.org/cookbook/extracting-data/example-list-links
You can enhance it to select <a> tags or others, href attributes, etc., or be more precise with case (HreF, HRef, ...) as an exercise:
import org.htmlcleaner.*;

public static Vector<String> HTML2URLS(String _source) {
    Vector<String> result = new Vector<String>();
    HtmlCleaner cleaner = new HtmlCleaner();
    // root node
    TagNode node = cleaner.clean(_source);
    // all nodes
    TagNode[] myNodes = node.getAllElements(true);
    for (TagNode tn : myNodes) {
        // all attributes of this tag
        Map<String, String> mss = tn.getAttributes();
        // name of the tag
        String name = tn.getName();
        // is there an href?
        String href = "";
        if (mss.containsKey("href")) href = mss.get("href");
        if (mss.containsKey("HREF")) href = mss.get("HREF");
        if (name.equals("a")) result.add(href);
        if (name.equals("A")) result.add(href);
    }
    return result;
}

Regular Expression issue, deleting whole lines

I have been trying for the last couple of hours to create a regular expression that deletes lines of text starting with particular wording, after first extracting a rating.
Below is what I'm trying to delete. I'm also trying to pull the rating (it's Pass or Fail) out of the paragraph.
Review Master: text here
1111111111 text here
Rating: Fail text here
Review Master Page text here
I am trying to delete all lines that start with the following. So far I have:
^Review Master:
^[0-9]{10}
^Rating:
^Review Master Page
Again, I am struggling with the replacement (deletion) and with finding only the rating.
If you want to find those exact lines in your file, then this will work:
Review Master:\n\\d++\nRating:\\s*+(\\w++)\nReview Master Page
Here is an example using your input as a test string:
public static void main(String[] args) throws Exception {
    final String in = "Review Master:\n"
            + "1111111111\n"
            + "Rating: Fail\n"
            + "Review Master Page";
    final Matcher m = Pattern.compile(""
            + "Review Master:\n"
            + "\\d++\n"
            + "Rating:\\s*+(\\w++)\n"
            + "Review Master Page").matcher(in);
    while (m.find()) {
        System.out.println(m.group(1));
    }
}
Output:
Fail
If you want to delete those lines, then you need to replace the pattern in the file, which you have as a String:
public static void main(String[] args) throws Exception {
    final String in = "Some other text\n"
            + "Review Master:\n"
            + "1111111111\n"
            + "Rating: Fail\n"
            + "Review Master Page\n"
            + "Some final text";
    final Matcher m = Pattern.compile(""
            + "\n?"
            + "Review Master:\n"
            + "\\d++\n"
            + "Rating:\\s*+(\\w++)\n"
            + "Review Master Page").matcher(in);
    final StringBuffer output = new StringBuffer();
    while (m.find()) {
        System.out.println(m.group(1));
        m.appendReplacement(output, "");
    }
    m.appendTail(output);
    System.out.println("Result: \"" + output.toString() + "\"");
}
Output:
Fail
Result: "Some other text
Some final text"
That is, we use the Matcher to yank the pass/fail value out of the input and, at the same time, build the output, replacing the matched block of text with nothing.
You have not made clear which parts of the patterns are variable, though.
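If the blocks are not always contiguous and you simply want to drop any line that starts with one of the four prefixes, a multiline-mode sketch (assuming exactly the prefixes listed in the question; the class is mine for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineDeleter {
    // (?m) makes ^ match at every line start; \R? also consumes the
    // trailing line break so the whole line disappears.
    static final Pattern DOOMED = Pattern.compile(
            "(?m)^(Review Master:|[0-9]{10}|Rating:|Review Master Page).*\\R?");
    static final Pattern RATING = Pattern.compile("(?m)^Rating:\\s*(\\w+)");

    // Pull the first rating (e.g. "Pass" or "Fail") out of the text.
    public static String rating(String text) {
        Matcher m = RATING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Delete every line starting with one of the four prefixes.
    public static String deleteLines(String text) {
        return DOOMED.matcher(text).replaceAll("");
    }
}
```

Extract the rating first, then delete, since deleting removes the Rating: line as well.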
