I'm extracting named entities from news articles using the Stanford NER CRFClassifier. To implement active learning, I would like to know the confidence score of each class for every labelled entity.
Example of the output I want:
LOCATION(0.20) PERSON(0.10) ORGANIZATION(0.60) MISC(0.10)
Here is my code for extracting named entities from a text:
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(classifier_path);
String annotatedText = classifier.classifyWithInlineXML(text);
Is there a workaround to get those values along with the annotations?
I figured it out myself. The CRFClassifier documentation says:
Probabilities assigned by the CRF can be interrogated using either the
printProbsDocument() or getCliqueTrees() methods.
The first method is not useful to me because it only prints the values to the console, and I need to access the data programmatically. So I read how that method is implemented and copied part of its behaviour, like this:
List<CoreLabel> classifiedLabels = classifier.classify(sentences);
CRFCliqueTree<String> cliqueTree = classifier.getCliqueTree(classifiedLabels);
for (int i = 0; i < cliqueTree.length(); i++) {
CoreLabel wi = classifiedLabels.get(i);
for (Iterator<String> iter = classifier.classIndex.iterator(); iter.hasNext();) {
String label = iter.next();
int index = classifier.classIndex.indexOf(label);
double prob = cliqueTree.prob(i, index);
System.out.println("\t" + label + "(" + prob + ")");
}
String tag = StringUtils.getNotNullString(wi.get(CoreAnnotations.AnswerAnnotation.class));
System.out.println("Class : " + tag);
}
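For the active learning part, the per-label probabilities still have to be turned into a single uncertainty score per token. Below is a minimal sketch of two common choices (least confidence and entropy). It assumes you have already collected the probabilities into a map from label to probability; the class and method names are made up for illustration and are not part of Stanford NER.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Stanford NER.
final class UncertaintyScores {

    // Least confidence: 1 minus the probability of the most likely label.
    static double leastConfidence(Map<String, Double> labelProbs) {
        double best = 0.0;
        for (double p : labelProbs.values()) {
            best = Math.max(best, p);
        }
        return 1.0 - best;
    }

    // Shannon entropy (natural log) of the label distribution.
    static double entropy(Map<String, Double> labelProbs) {
        double h = 0.0;
        for (double p : labelProbs.values()) {
            if (p > 0.0) {
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // The example distribution from the question.
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("LOCATION", 0.20);
        probs.put("PERSON", 0.10);
        probs.put("ORGANIZATION", 0.60);
        probs.put("MISC", 0.10);
        System.out.println("least confidence: " + leastConfidence(probs));
        System.out.println("entropy: " + entropy(probs));
    }
}
```

Tokens (or sentences) with the highest scores would be the first candidates to send to a human annotator.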
I have a List of Strings containing names and surnames, and I have a free text.
List<String> names; // contains: "jon", "snow", "arya", "stark", ...
String text = "jon snow and stark arya";
I have to find all the name and surname pairs, possibly with a Java regex (so using Pattern and Matcher objects). So I want something like:
List<String> foundNames; // contains: "jon snow", "stark arya"
I have implemented this in two ways, both without regex. The methods are not static because they belong to a NameFinder class whose "names" field holds the list of all names.
public List<String> findNamePairs(String text) {
List<String> foundNamePairs = new ArrayList<String>();
List<String> names = this.names;
text = text.toLowerCase();
for (String name : names) {
String nameToSearch = name + " ";
int index = text.indexOf(nameToSearch);
if (index != -1) {
String textSubstring = text.substring(index + nameToSearch.length());
for (String nameInner : names) {
if (!name.equals(nameInner) && textSubstring.startsWith(nameInner)) { // compare string contents, not references
foundNamePairs.add(name + " " + nameInner);
}
}
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}
or in a worse (very bad) way (creating all the possible pairs):
public List<String> findNamePairsInTextNotOpt(String text) {
List<String> foundNamePairs = new ArrayList<String>();
text = text.toLowerCase();
List<String> pairs = getNamePairs(this.names);
for (String name : pairs) {
if (text.contains(name)) {
foundNamePairs.add(name);
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}
You can create a regex using the list of names and then use find to find the names. To ensure you don't have duplicates, you can check if the name is already in the list of found names. The code would look like this.
List<String> names = Arrays.asList("jon", "snow", "stark", "arya");
String text = "jon snow and Stark arya and again Jon Snow";
StringBuilder regexBuilder = new StringBuilder();
for (int i = 0; i < names.size(); i += 2) {
regexBuilder.append("(")
.append(names.get(i))
.append(" ")
.append(names.get(i + 1))
.append(")");
if (i != names.size() - 2) regexBuilder.append("|");
}
System.out.println(regexBuilder.toString());
Pattern compile = Pattern.compile(regexBuilder.toString(), Pattern.CASE_INSENSITIVE);
Matcher matcher = compile.matcher(text);
List<String> found = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
String match = matcher.group().toLowerCase();
if (!found.contains(match)) found.add(match);
start = matcher.end();
}
for (String s : found) System.out.println("found: " + s);
If you want to be case sensitive just remove the flag in Pattern.compile(). If all matches have the same capitalization you can omit the toLowerCase() in the while loop as well.
Make sure the list contains an even number of elements (name and surname pairs); otherwise the for loop will throw an IndexOutOfBoundsException. The order also matters in my code: it only finds name pairs in the order in which they occur in the list. If you want to match both orders, change the regex generation accordingly.
Edit: Since it is unknown whether an entry is a name or a surname, and which entries form a name/surname pair, the regex has to be generated differently.
StringBuilder regexBuilder = new StringBuilder("(");
for (int i = 0; i < names.size(); i++) {
regexBuilder.append("(")
.append(names.get(i))
.append(")");
if (i != names.size() - 1) regexBuilder.append("|");
}
regexBuilder.append(") ");
regexBuilder.append(regexBuilder);
regexBuilder.setLength(regexBuilder.length() - 1);
System.out.println(regexBuilder.toString());
This regex will match any of the given names followed by a space and then again any of the names.
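For a reusable version of the same idea, here is a minimal sketch. The class and method names are made up, and it makes two changes that are not in the code above: each name is escaped with Pattern.quote, and word boundaries (\b) are added so that a name is not matched inside a longer word. Like the regex above, it would also accept the same name twice in a row (e.g. "jon jon").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class NamePairFinder {

    static List<String> findNamePairs(List<String> names, String text) {
        // Build "jon|snow|..." with every name escaped as a literal.
        StringJoiner alternatives = new StringJoiner("|");
        for (String name : names) {
            alternatives.add(Pattern.quote(name));
        }
        String oneName = "(?:" + alternatives + ")";

        // Any name, one space, any name, delimited by word boundaries.
        Pattern pattern = Pattern.compile(
                "\\b" + oneName + " " + oneName + "\\b",
                Pattern.CASE_INSENSITIVE);

        // LinkedHashSet removes duplicates and keeps the order of first occurrence.
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("jon", "snow", "stark", "arya");
        String text = "jon snow and Stark arya and again Jon Snow";
        System.out.println(findNamePairs(names, text)); // prints [jon snow, stark arya]
    }
}
```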
I am stuck on this date range query. I need to extract data from particular Facebook pages for a specified date range. I can do this with either the since or the until field on its own, but how do I use the two fields together?
Here is my code:
public static String getFacebookPostes(Facebook facebook, String searchPost)
throws FacebookException {
String searchResult = "Item : " + searchPost + "\n";
StringBuffer searchMessage = new StringBuffer();
ResponseList<Post> results = facebook.searchPosts(searchPost, new Reading().since("2014-04-02"));
String userId="";
for (Post post : results) {
System.out.println(post.getMessage());
searchMessage.append(post.getMessage() + "\n");
for (int j = 0; j < post.getComments().size(); j++) {
searchMessage.append(post.getComments().get(j).getFrom()
.getName()
+ ", ");
searchMessage.append(post.getComments().get(j).getMessage()
+ ", ");
searchMessage.append(post.getComments().get(j).getCreatedTime()
+ ", ");
searchMessage.append(post.getComments().get(j).getLikeCount()
+ "\n");
userId=post.getComments().get(j).getFrom().getId();
User user = facebook.getUser(userId);
//System.out.println("ROCK");
System.out.println(user);
}
}
Any guidance is appreciated. Thanks in advance.
PS: I am using facebook4j-core-2.0.2.jar and Eclipse Kepler.
According to http://facebook4j.org/en/code-examples.html you can use all date formats described in http://www.php.net/manual/en/datetime.formats.date.php
From my understanding the code would then look like this:
ResponseList<Post> results = facebook.searchPosts(searchPost, new Reading().since("2014/04/02").until("2014/04/08"));
JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
Problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}
Try something like:
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove the first closing word tag
This should work, though you may have to adjust it to your input.
var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or as #ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But you have to keep in mind: .replace(" ",foo) only replaces the space once. For multiple replaces you have to use .replace(/\s+/g,foo)
And as #ajax333221 said, the second way will include commas, dots and other symbols inside the tags, so the first way is the better solution.
JSFiddle example: http://jsfiddle.net/c6ftq/4/
inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
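To see what this one-liner does, here is a small self-contained sketch that applies the same regex to a sample string. The class and method names are made up for illustration. The lookbehind skips a word that directly follows "<" or "</", and the lookahead skips a word that is directly followed by ">", so plain tag names are left alone. One caveat: anything else inside a tag, such as attribute names and values in <a href="x">, is still treated as a word and gets wrapped, so this only suits simple markup.

```java
// Hypothetical helper for illustration.
final class WordWrapper {

    static String wrapWords(String html) {
        // (?<!</?)   : not directly preceded by "<" or "</" (start of a tag name)
        // \w++       : one whole word, matched possessively (no backtracking into it)
        // (?!\s*>)   : not directly followed by ">" (end of a tag name)
        return html.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<p>Hello every one, cheers</p>"));
    }
}
```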
You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}
I have a text file with Tag - Value format data. I want to parse this file to form a Trie. What will be the best approach?
Sample of File: (String inside "" is a tag and '#' is used to comment the line.)
#Hi, this is a sample file.
"abcd" = 12;
"abcde" = 16;
"http" = 32;
"sip" = 21;
This is basically a properties file. I would remove the " around the tags, then use the Properties class (http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)) to load the file.
Read that in using Properties and trim the excess parts (", ; and whitespace). Short example:
Properties props = new Properties();
// load() is an instance method, so it needs an existing Properties object
props.load(this.getClass().getResourceAsStream("path/to.file"));
Map<String, String> cleanedProps = new HashMap<String, String>();
for (Map.Entry<Object, Object> pair : props.entrySet()) {
    cleanedProps.put(cleanKey((String) pair.getKey()),
        cleanValue((String) pair.getValue()));
}
Note that in the solution above you only need to implement cleanKey() and cleanValue() yourself. You may want to change the data types if necessary; I used Strings just as an example.
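As a rough idea of what cleanKey() and cleanValue() could look like, here is a minimal sketch. It assumes every value is an integer followed by ";" as in the sample file, so cleanValue() returns an int rather than a String. The class name and the load() helper are made up, and the file content is read from a String only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration.
final class TagValueLoader {

    // Strips whitespace and the surrounding double quotes from a raw key.
    static String cleanKey(String rawKey) {
        return rawKey.trim().replaceAll("^\"|\"$", "");
    }

    // Strips whitespace and a trailing ';', then parses the number.
    static int cleanValue(String rawValue) {
        String value = rawValue.trim();
        if (value.endsWith(";")) {
            value = value.substring(0, value.length() - 1);
        }
        return Integer.parseInt(value.trim());
    }

    static Map<String, Integer> load(String content) {
        Properties props = new Properties();
        try {
            // Lines starting with '#' are treated as comments by Properties.
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Integer> cleaned = new TreeMap<>();
        for (String rawKey : props.stringPropertyNames()) {
            cleaned.put(cleanKey(rawKey), cleanValue(props.getProperty(rawKey)));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String content = "#Hi, this is a sample file.\n"
                + "\"abcd\" = 12;\n"
                + "\"abcde\" = 16;\n"
                + "\"http\" = 32;\n"
                + "\"sip\" = 21;\n";
        System.out.println(load(content)); // prints {abcd=12, abcde=16, http=32, sip=21}
    }
}
```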
There are many ways to do this; others have mentioned that java.util.Properties gets most of the job done, and is probably the most robust solution.
One other option is to use a java.util.Scanner.
Use the Scanner(File) constructor to scan a file
You can useDelimiter appropriate for this format
nextInt() can be used to extract the numbers
Perhaps you can put the key/value pairs into a SortedMap<String,Integer>
Here's an example that scans a String for simplicity:
String text =
"#Hi, this is a sample file.\n" +
"\n" +
"\"abcd\" = 12; \r\n" +
"\"abcde\"=16;\n" +
" # \"ignore\" = 13;\n" +
"\"http\" = 32; # Comment here \r" +
"\"zzz\" = 666; # Out of order! \r" +
" \"sip\" = 21 ;";
System.out.println(text);
System.out.println("----------");
SortedMap<String,Integer> map = new TreeMap<String,Integer>();
Scanner sc = new Scanner(text).useDelimiter("[\"=; ]+");
while (sc.hasNextLine()) {
if (sc.hasNext("[a-z]+")) {
map.put(sc.next(), sc.nextInt());
}
sc.nextLine();
}
System.out.println(map);
This prints (as seen on ideone.com):
#Hi, this is a sample file.
"abcd" = 12;
"abcde"=16;
# "ignore" = 13;
"http" = 32; # Comment here
"zzz" = 666; # Out of order!
"sip" = 21 ;
----------
{abcd=12, abcde=16, http=32, sip=21, zzz=666}
Related questions
Validating input using java.util.Scanner
Iterate Over Map
See also
regular-expressions.info/Tutorial
The most natural way is probably this:
void doParse() {
String text =
"#Hi, this is a sample file.\n"
+ "\"abcd\" = 12;\n"
+ "\"abcde\" = 16;\n"
+ "#More comment\n"
+ "\"http\" = 32;\n"
+ "\"sip\" = 21;";
Matcher matcher = Pattern.compile("\"(.+)\" = ([0-9]+)").matcher(text);
while (matcher.find()) {
String txt = matcher.group(1);
int val = Integer.parseInt(matcher.group(2));
System.out.format("parsed: %s , %d%n", txt, val);
}
}
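The snippets so far only parse the file; none of them builds the trie the question asks for. Once you have the tag/value pairs, a trie is not much extra work. Here is a minimal sketch, assuming each tag maps to a single integer. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie for illustration.
final class TagTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Integer value; // null unless a tag ends at this node
    }

    private final Node root = new Node();

    // Walks the tag character by character, creating nodes as needed.
    void put(String tag, int value) {
        Node node = root;
        for (char c : tag.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.value = value;
    }

    // Returns the value stored for the tag, or null if the tag is absent.
    Integer get(String tag) {
        Node node = root;
        for (char c : tag.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node.value;
    }

    public static void main(String[] args) {
        TagTrie trie = new TagTrie();
        trie.put("abcd", 12);
        trie.put("abcde", 16);
        System.out.println(trie.get("abcd"));  // prints 12
        System.out.println(trie.get("abc"));   // prints null (only a prefix)
    }
}
```

In the parsing loop you would call put(txt, val) instead of printing each pair.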
I'm using SAX to parse some XML. In my handler's startElement() method I'm trying to read the value of an attribute named xsi:type with something like:
String type = attributes.getValue("xsi:type");
However, it always returns null. This works fine for everything else so I'm assuming that it's due to the namespace prefix. How can I get this value?
This may help; try experimenting with it. It prints the name and value of every attribute found, which should show you the name to use in your query.
if (attributes.getLength() > 0) {
for (int i = 0; i < attributes.getLength(); i++) {
System.out.print("name: " + attributes.getQName(i));
System.out.println(" value: " + attributes.getValue(i));
}
}
Also take a look at the getURI method of Attributes, which returns the namespace URI of an attribute.
Try asking SAX what it thinks the attribute's qName is:
for (int i=0; i < attributes.getLength(); i++) {
String qName = attributes.getQName(i);
System.out.println("qName for position " + i + ": " + qName);
}
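If the parser turns out to be namespace-aware, another option is to look the attribute up by namespace URI and local name instead of by prefix. Here is a minimal self-contained sketch, assuming the xsi prefix is bound to the standard XMLSchema-instance namespace. The class and method names and the sample XML in main are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class XsiTypeReader {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Collects the xsi:type value of every element that has one.
    static List<String> readXsiTypes(String xml) {
        final List<String> types = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for URI/local-name lookups
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Look up by namespace URI + local name, independent of the prefix.
                    String type = attributes.getValue(XSI_NS, "type");
                    if (type != null) {
                        types.add(type);
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return types;
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:xsi=\"" + XSI_NS + "\">"
                + "<item xsi:type=\"Foo\"/><item/></root>";
        System.out.println(readXsiTypes(xml)); // prints [Foo]
    }
}
```

Looking up by URI also keeps working if the document binds that namespace to a prefix other than xsi.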