Accessing annotations in UIMA

Is there a way in UIMA to access the annotations from the tokens, the same way the CAS debugger GUI does? You can of course access all the annotations from the index repository, but I want to loop over the tokens and get all the annotations associated with each token.
The reason is simple: I want to check some annotations and discard the others, and this way it is much easier. Any help is appreciated :)

I'm a uimaFIT developer.
If you want to find all annotations within the boundaries of another annotation, you may prefer the shorter and faster variant
JCasUtil.selectCovered(type, referenceAnnotation); // type is a Class<T>, where T extends Annotation
Mind that it is not a good idea to create a "dummy" annotation with the desired offsets and then search within its boundaries, because this immediately allocates memory in the CAS, which is not garbage-collected until the complete CAS is collected.

After searching and asking the developers of cTAKES (Apache clinical Text Analysis and Knowledge Extraction System), I found that you can use the "uimafit" library, which can be found at http://code.google.com/p/uimafit/ . The following code can be used:
List<T> list = JCasUtil.selectCovered(jcas, type, startIndex, endIndex); // type is a Class<T>, where T extends Annotation
This will return all the annotations of that type between the two indices.
Hope that will help
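To make the idea of "covered" annotations concrete, here is a library-free sketch of the containment test that both answers above rely on: an annotation counts as covered when its begin and end offsets both fall inside the reference span. The Span class and the coveredBy helper are hypothetical stand-ins for illustration only. They are not the uimaFIT implementation, and in real code you would call JCasUtil.selectCovered instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a UIMA annotation: a type name plus character offsets. */
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class CoveredSketch {
    /** Returns every span, other than the reference itself, that lies within the reference's offsets. */
    static List<Span> coveredBy(Span reference, List<Span> all) {
        List<Span> covered = new ArrayList<>();
        for (Span s : all) {
            boolean inside = s.begin >= reference.begin && s.end <= reference.end;
            if (s != reference && inside) {
                covered.add(s);
            }
        }
        return covered;
    }
}
```

Looping over the tokens and calling such a check once per token gives the per-token view the question asks for; annotations that are wider than the token, such as a sentence, are not returned.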

If you don't want to use uimaFIT, you can create a filtered iterator to loop through the annotations of interest.
See the UIMA reference documentation for details.
I recently used this approach in some code to find a sentence annotation which encompassed a regex annotation. This approach was acceptable for our project because all regular expression matches were shorter than the sentences in the document, and there was only one regex match per sentence. Obviously, depending on the indexing rules, your mileage may vary. If you are afraid of running into another shorterAnnotationType, put the inner code into a while loop:
static ArrayList<annotationsPair> process(Annotation shorterAnnotationType,
        Annotation longerAnnotationType, JCas aJCas){
    ArrayList<annotationsPair> annotationsList = new ArrayList<annotationsPair>();
    FSIterator it = aJCas.getAnnotationIndex().iterator();
    FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
    constraint.add(shorterAnnotationType.getType());
    constraint.add(longerAnnotationType.getType());
    it = aJCas.createFilteredIterator(it, constraint);
    Annotation a = null;
    int shorterBegin = -1;
    int shorterEnd = -1;
    it.moveTo(shorterAnnotationType);
    while (it.isValid()) {
        a = (Annotation) it.get();
        if (a.getClass() == shorterAnnotationType.getClass()){
            shorterBegin = a.getBegin();
            shorterEnd = a.getEnd();
            System.out.println("Target annotation from " + shorterBegin
                    + " to " + shorterEnd);
            //because assume that sentence type is longer than other type,
            //the sentence gets indexed prior
            it.moveToPrevious();
            if(it.isValid()){
                Annotation prevAnnotation = (Annotation) it.get();
                if (prevAnnotation.getClass() == longerAnnotationType.getClass()){
                    int sentBegin = prevAnnotation.getBegin();
                    int sentEnd = prevAnnotation.getEnd();
                    System.out.println("found annotation [" + prevAnnotation.getCoveredText()
                            + "] location: " + sentBegin + ", " + sentEnd);
                    annotationsPair pair = new annotationsPair(a, prevAnnotation);
                    annotationsList.add(pair);
                }
                //return to where you started
                it.moveToNext(); //will not invalidate iter because just came from next
            }
        }
        it.moveToNext();
    }
    return annotationsList;
}
Hope this helps!
Disclaimer: I am new to UIMA.

Related

Is there a standard function to determine if a String is a valid variable/function name in Kotlin/Java?

In Kotlin/Java, is there a standard function to determine if a String is a valid variable/function name without having to wrap it in back-ticks?
As in
functionIAmLookingFor("do-something") shouldBe false
functionIAmLookingFor("doSomething") shouldBe true
Edit: I do not want to enclose everything in backticks.
The reason why we need this: we have a tool that serializes instances into compilable Kotlin. And we have encountered the following edge case:
enum class AssetType { TRANSFERRABLE, `NON-TRANSFERRABLE` }
so as we reflect an instance with a field NON-TRANSFERRABLE, we need to wrap it in back-ticks:
val actual = MyAsset(type = `NON-TRANSFERRABLE`)
This is why I'm asking this. Currently we are just saying in README that we do not support any names that require back-ticks at this time.

You could do it manually:
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isJavaIdentifierPart(char)
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isJavaIdentifierStart(char)
Something like this:
boolean isJavaIdentifier(String s) {
    if (s == null || s.isEmpty()) return false;
    // The first character has stricter rules than the rest (for example, no digits)
    if (!Character.isJavaIdentifierStart(s.charAt(0))) {
        return false;
    }
    for (int i = 1, n = s.length(); i < n; ++i) {
        if (!Character.isJavaIdentifierPart(s.charAt(i))) {
            return false;
        }
    }
    return true;
}
I don't know about Kotlin, but judging by 770grappenmaker's answer, I don't think there is much difference.
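For the Java side, the JDK also ships a ready-made check in javax.lang.model.SourceVersion, which saves writing the loop by hand. The sketch below combines isIdentifier (syntax) with isKeyword (reserved words and the literals true, false and null). The wrapper class JavaNames is a hypothetical name, and the check applies Java's rules only, so for Kotlin it is at best an approximation.

```java
import javax.lang.model.SourceVersion;

final class JavaNames {
    /** True if s is a syntactically valid Java identifier that is not a keyword or literal. */
    static boolean isValidJavaName(String s) {
        return s != null
                && SourceVersion.isIdentifier(s)   // letters, digits, '_' and '$' in legal positions
                && !SourceVersion.isKeyword(s);    // rejects "class", "true", "null", ...
    }
}
```

With this, "doSomething" is accepted while "do-something" and "class" are rejected.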

I took a quick look at the kotlin compiler lexer. It has some predefined variables, here is an excerpt:
LETTER = [:letter:]|_
IDENTIFIER_PART=[:digit:]|{LETTER}
PLAIN_IDENTIFIER={LETTER} {IDENTIFIER_PART}*
ESCAPED_IDENTIFIER = `[^`\n]+`
IDENTIFIER = {PLAIN_IDENTIFIER}|{ESCAPED_IDENTIFIER}
FIELD_IDENTIFIER = \${IDENTIFIER}
Source: https://github.com/JetBrains/kotlin/blob/master/compiler/psi/src/org/jetbrains/kotlin/lexer/Kotlin.flex
These seem to be some kind of regexes; you could combine them to suit your needs and just match against them. As far as I can tell, this is how the compiler validates identifier names.
Edit: of course, this is code of the lexer, which means that if it finds any other token, the name is invalid. All tokens and how to identify them are defined in that file and in the KtTokens file. You could use this information as a reference to find illegal tokens. For Java, use NoDataFound's answer.
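A rough sketch of how the PLAIN_IDENTIFIER rule from the excerpt could be turned into a check, written in Java so it can sit next to the other answer. It rests on two assumptions: that the lexer's [:letter:] and [:digit:] classes correspond to Character.isLetter and Character.isDigit, and that the hard-keyword list, hand-copied from the Kotlin documentation, is complete. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class KotlinNames {
    // Assumption: hard keywords as listed in the Kotlin docs; verify against KtTokens before relying on it.
    private static final Set<String> HARD_KEYWORDS = new HashSet<>(Arrays.asList(
            "as", "break", "class", "continue", "do", "else", "false", "for", "fun",
            "if", "in", "interface", "is", "null", "object", "package", "return",
            "super", "this", "throw", "true", "try", "typealias", "typeof", "val",
            "var", "when", "while"));

    /** True if name matches PLAIN_IDENTIFIER and is not a hard keyword, so no backticks are needed. */
    static boolean needsNoBackticks(String name) {
        if (name == null || name.isEmpty() || HARD_KEYWORDS.contains(name)) {
            return false;
        }
        if (!isLetter(name.charAt(0))) {
            return false;                       // must start with LETTER
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!isLetter(c) && !Character.isDigit(c)) {
                return false;                   // IDENTIFIER_PART is a digit or LETTER
            }
        }
        return true;
    }

    // LETTER = [:letter:] | _
    private static boolean isLetter(char c) {
        return c == '_' || Character.isLetter(c);
    }
}
```

For the enum in the question, TRANSFERRABLE passes and NON-TRANSFERRABLE fails, which is the signal to wrap the name in backticks when generating code.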

Leave entities as-is when parsing XML with Woodstox

I'm using Woodstox to process an XML that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it's something like this:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:
<parent>nbsp; &lt; nbsp; > &amp; " ' nbsp;</parent>
(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &...)
I'm reading the XML with an XMLEventReader created with
XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));
after configuring the WstxInputFactory.
Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?

First of all, you need to include actual code, since "output is always something like this" makes no sense without explaining exactly how you are outputting the parsed content: you may be printing events, using some library, or perhaps using a Woodstox stream or event writer.
Second: in XML there is a difference between the small number of pre-defined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, like what nbsp would be here. The former you can use as-is, since they are already defined; the latter only exist if you define them in a DTD.
Handling of the two groups is different, too: the former will always be expanded no matter what, as required by the XML specification. The latter will be resolved (unless resolution is disabled) and then expanded, or, if not defined, an exception will be thrown.
You can also specify a custom resolver, as mentioned in the other answer, but this will only be used for custom entities (here, &nbsp;).
In the end it is also good to explain not so much what you are doing as what you are trying to achieve. That will help suggest better things than specific questions of "how do I do X", which may not be the right way to go about it.
And as to configuration of Woodstox, maybe this blog entry:
https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173
will help (as well as 2 others in the series) -- it covers existing configuration settings.
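The point about pre-defined entities can be checked with a small experiment. The sketch below uses the StAX API as bundled with the JDK (not Woodstox-specific configuration), turns IS_REPLACING_ENTITY_REFERENCES off, and collects the character data the parser reports. The class and method names are hypothetical, and the claim is only about the five pre-defined entities.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PredefinedEntityDemo {
    /** Concatenates all character data in the document, exactly as the parser reports it. */
    static String textOf(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser NOT to replace entity references; pre-defined ones are expanded anyway.
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling textOf on a parent element containing &lt; &gt; &amp; &quot; &apos; returns the five literal characters, so the original spelling of those references is not available from the event stream.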

The basic five XML entities (quot, amp, apos, lt, gt) will be always processed. As far as I know there is no way to get the source of them with Sax.
For the other entities you can process them manually. You can capture the events until the end of the element and concatenate the values:
XMLInputFactory factory = WstxInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
XMLEventReader xmlr = factory.createXMLEventReader(
        this.getClass().getResourceAsStream(xmlFileName));
String value = "";
while (xmlr.hasNext()) {
    XMLEvent event = xmlr.nextEvent();
    if (event.isCharacters()) {
        value += event.asCharacters().getData();
    }
    if (event.isEntityReference()) {
        value += "&" + ((EntityReference) event).getName() + ";";
    }
    if (event.isEndElement()) {
        // Assign it to the right variable
        System.out.println(value);
        value = "";
    }
}
For your example input:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
The output will be:
&nbsp; < &nbsp; > & " ' &nbsp;
Otherwise, if you want to convert all the entities, maybe you could use a custom XMLResolver for undeclared entities:
public class NaiveHtmlEntityResolver implements XMLResolver {
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("nbsp", "\u00A0"); // non-breaking space
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // and so on
    }
    @Override
    public Object resolveEntity(String publicID,
            String systemID,
            String baseURI,
            String namespace) throws XMLStreamException {
        // For undeclared entities the entity name arrives in the last argument
        if (publicID == null && systemID == null) {
            return ENTITIES.get(namespace);
        }
        return null;
    }
}
And then tell Woodstox to use it for the undeclared entities:
factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());

Test if object was properly created

I'm putting more attention into unit tests these days and I got in a situation for which I'm not sure how to make a good test.
I have a function which creates and returns an object of class X. This X class is part of the framework, so I'm not very familiar with its implementation and I don't have the freedom I have with my "regular collaborator classes" (the ones which I have written). Also, when I pass some arguments I cannot check if object X is set to the right parameters, and I'm not able to pass a mock in some cases.
My question is - how to check if this object was properly created, that is, to check which parameters were passed to its constructor? And how to avoid problem when constructor throws an exception when I pass a mock?
Maybe I'm not clear enough, here is a snippet:
public class InputSplitCreator {
    Table table;
    Scan scan;
    RegionLocator regionLocator;

    public InputSplitCreator(Table table, Scan scan, RegionLocator regionLocator) {
        this.table = table;
        this.scan = scan;
        this.regionLocator = regionLocator;
    }

    public InputSplit getInputSplit(String scanStart, String scanStop, Pair<byte[][], byte[][]> startEndKeys, int i) {
        String start = Bytes.toString(startEndKeys.getFirst()[i]);
        String end = Bytes.toString(startEndKeys.getSecond()[i]);
        String startSalt;
        if (start.length() == 0)
            startSalt = "0";
        else
            startSalt = start.substring(0, 1);
        byte[] startRowKey = Bytes.toBytes(startSalt + "-" + scanStart);
        byte[] endRowKey = Bytes.toBytes(startSalt + "-" + scanStop);
        TableSplit tableSplit;
        try {
            HRegionLocation regionLocation = regionLocator.getRegionLocation(startEndKeys.getFirst()[i]);
            String hostnamePort = regionLocation.getHostnamePort();
            tableSplit = new TableSplit(table.getName(), scan, startRowKey, endRowKey, hostnamePort);
        } catch (IOException ex) {
            throw new HBaseRetrievalException("Problem while trying to find region location for region " + i, ex);
        }
        return tableSplit;
    }
}
So, this creates an InputSplit. I would like to know whether this split is created with correct parameters. How to do that?

If the class is part of a framework, then you shouldn't test it directly, as the framework has tested it for you. If you still want to test the behaviour of this object, look at the effects it has on other objects. More specifically: mock the object, have it do stuff and check if the affected objects (which you can control) carry out the expected behaviour or are in the correct state.
For more details you should probably update your question with the framework you're using and the class of said framework you wish to test.

This is possibly one of those cases where you shouldn't be testing it directly. This object is supposedly USED for something, yes? If it's not created correctly, some part of your code will break, no?
At some point or another, your application depends on this created object to behave in a certain way, so you can test it implicitly by testing that these procedures that depend on it are working correctly.
This can save you from coupling your more abstract use cases to the internal workings and types of the framework.
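One way to act on this advice, sketched under assumptions: keep the framework object out of the test and check only the logic you own. In the question's code that logic is the salt and row-key derivation, which can be pulled into a small pure helper and tested without mocks. The SplitKeys class and its methods are hypothetical, and the use of UTF-8 assumes that Bytes.toBytes(String) encodes strings as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class SplitKeys {
    /** Salt is "0" for an empty region start key, otherwise its first character. */
    static String salt(String regionStartKey) {
        return regionStartKey.isEmpty() ? "0" : regionStartKey.substring(0, 1);
    }

    /** Builds "<salt>-<scanBoundary>" as bytes, mirroring the start/end row keys in getInputSplit. */
    static byte[] rowKey(String regionStartKey, String scanBoundary) {
        String key = salt(regionStartKey) + "-" + scanBoundary;
        return key.getBytes(StandardCharsets.UTF_8); // assumption: same encoding as Bytes.toBytes
    }
}
```

getInputSplit would then pass the results of rowKey straight to the TableSplit constructor, and a plain unit test can cover the empty-key and non-empty-key cases without constructing any framework object.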

What is the best way to perform index ranged searches in OrientDB from Java?

We are using OrientDB in the embedded mode, and are hoping to access it directly with Java api calls (not using the SQL-ish language). We have an index, and need to perform a ranged search on it. Here is the only way I have found so far:
String startAt = createInternalOIndexSearchableKey(actualKey);
Index<Edge> index = graph.getIndex(indexName, Edge.class);
OrientIndex orientIndex = (OrientIndex) index;
OIndex oIndex = orientIndex.getUnderlying();
boolean INCLUSIVE = true;
boolean ASCENDING = true;
OIndexCursor cursor = oIndex.iterateEntriesMajor(startAt, INCLUSIVE, ASCENDING);
while(cursor.hasNext())
{
    Entry<Object, OIdentifiable> entry = cursor.nextEntry();
    // ...process the entry here
}
It feels uncomfortable to be deviating so far from the normal public API. Especially the implementation of createInternalOIndexSearchableKey:
private String createInternalOIndexSearchableKey(String actualKey)
{
    // NOTE: Keys passed to OIndex.iterateEntriesMajor must
    // be in the (undocumented) format: EdgeLabel!=!ActualKey
    return KEY_CAN_DOWNLOAD_PUBCODETIMESTAMP + "!=!" + actualKey;
}
Is there a better way to do this?

OIndex and OIndexCursor are part of the public API of the document database, so don't worry, you can use them.
However, the main aim of that API is to provide flexibility to the SQL engine and other internal components, so it is not very convenient.
I would recommend using SQL queries: they provide the same level of flexibility and are more compact, which makes them more convenient to use.

Rest-Assured XSD References Other XSDs

I am programming an XML validator according to schemas using Rest-Assured. However, I am having trouble handling XSDs that reference other XSDs, because I retrieve the original XSD from a URL using GET.
I have been trying to implement my own parsing to consolidate the XSDs(Strings) into one XSD(String), but it is becoming a recursive monster, and extremely inefficient/difficult. To see the algorithm, look at the end of the post.
I have two questions:
1) My problem is that I am using GET to retrieve the XSD, so it's not within the namespace. Is there a way to GET all referenced XSDs and consolidate them using Rest-Assured? I wouldn't have a clue about how to go about this.
2) Is there a better way to handle includes in general? As you can see, my algorithm is very costly and overcomplicated (especially the ref attribute), and I'm sure something will break easily if I change my test cases.
My algorithm(Pseudo-Code to avoid complexity) so far is like the following:
boolean xmlValid(String xmlAddress, String xsdAddress){
    LinkedList XSDList = new LinkedList;
    XSDList.add(xsdAddress);
    xsdString = getExternalXSDStrings(XSDList);
    try{ //No PseudoCode here
        RestAssured.expect().
            statusCode(200).
            body(
                RestAssuredMatchers.matchesXsd(xsdString)).
        when().
            get(xmlAddress);
    }catch Exceptions{...}
}
String getExternalXSDStrings(LinkedList xsdReferences, String prevString){
    LinkedList recursiveXSDReferences = new LinkedList();
    for(xsdRef:xsdReferences){
        xsdAddress = "http://..." + xsdRef;
        Open InputStream From URL;
        while(inputLine != null){
            if(prologFlag) //Do Nothing, this is to avoid multiple prologs ;
            else if(includeFlag){
                if(refFlag) Note Reference;
                else recursiveXSDReferences.add(includeReference);
            }else if(refFlag){
                referenceDefinition = Extract Reference Element Definition;
                xsdString = xsdString + referenceDefinition;
            }else{
                xsdString = xsdString + inputLine;
            }
        }
        Close input stream;
    }
    xsdString = prevString + xsdString;
    if(xsdReferences.length > 0) return getExternalXSDStrings(recursiveXSDReferences , xsdString);
    else return xsdString;
}
Thank you very much in advance!

Perhaps you can make use of the XmlConfig in the detailed configuration. This gives you access to configure features, namespaces and so on. For example, if you want to disable the loading of external DTDs you could do:
given().config(RestAssured.config().xmlConfig(xmlConfig().disableLoadingOfExternalDtd())). ..
So perhaps you could look at how the "disableLoadingOfExternalDtd" method is implemented to get some hints.
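As an alternative to stitching XSD strings together by hand, here is a minimal sketch that uses only the JDK's javax.xml.validation package rather than any Rest-Assured API. The idea is to hand the schema to the parser by location instead of by content: when the schema source carries a system ID (a URL or file URI), relative xs:include and xs:import locations can be resolved against it by the parser itself. The class and method names are hypothetical, and the response body would still have to be fetched separately, for example with Rest-Assured.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class XsdIncludeValidator {
    /**
     * Validates xml against the schema found at schemaLocation (a URL or file URI).
     * Returns false both for invalid documents and for schemas that fail to load.
     */
    static boolean isValid(String xml, String schemaLocation) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // The system ID is the base against which relative schemaLocation values are resolved.
            Schema schema = factory.newSchema(new StreamSource(schemaLocation));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

The trade-off is that the parser performs the extra fetches itself, so this only works when the referenced XSDs are reachable from the same base location as the main one.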
