Getting values from SAX attributes when namespaces are involved - java

I'm using SAX to parse some XML. In my handler's startElement() method I'm trying to read the value of an attribute named xsi:type with something like:
String type = attributes.getValue("xsi:type");
However, it always returns null. This works fine for everything else so I'm assuming that it's due to the namespace prefix. How can I get this value?

This may help. The loop below prints the name and value of every attribute found, which is useful for finding the name to use in the query.
if (attributes.getLength() > 0) {
    for (int i = 0; i < attributes.getLength(); i++) {
        System.out.print("name: " + attributes.getQName(i));
        System.out.println(" value: " + attributes.getValue(i));
    }
}
Also take a look at the getURI method of the Attributes interface.

Try asking SAX what it thinks the attribute's qName is:
for (int i = 0; i < attributes.getLength(); i++) {
    String qName = attributes.getQName(i);
    System.out.println("qName for position " + i + ": " + qName);
}
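
Neither answer above shows a namespace-based lookup, so here is a minimal sketch of one. It assumes the parser is created with namespace awareness turned on and that the xsi prefix is bound to the standard XMLSchema-instance URI. The class name XsiTypeDemo, the helper readXsiType, and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XsiTypeDemo {
    // Namespace URI conventionally bound to the "xsi" prefix
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Hypothetical helper: returns the first xsi:type value found, or null
    static String readXsiType(String xml) {
        final String[] found = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this, namespace URIs and local names are not reported
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    // Look the attribute up by (namespace URI, local name) instead of by prefix
                    String type = attributes.getValue(XSI_NS, "type");
                    if (type != null && found[0] == null) {
                        found[0] = type;
                    }
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return found[0];
    }

    public static void main(String[] args) {
        String xml = "<item xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:type=\"book\"/>";
        System.out.println(readXsiType(xml));
    }
}
```

Looking the attribute up by URI and local name does not depend on which prefix the document happens to use.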


How to turn a doc with 10k independent JSON objects into a JSON array in Java

I am currently working on migrating a database from a non-SQL source to an SQL database. The non-SQL source outputs the data in a JSON doc that is just a series of independent JSON objects. I am using JSONObject within Java, and that (to my understanding) can only recognize the topmost object within the document. To get around this issue I am writing code to convert the independent objects into an array.
The current method I am using involves converting the JSON doc into a string, counting curly brackets to find objects, and then inserting them into an array.
for (int i = 0; i < doc.length(); i++) {
    char currentChar = doc.charAt(i);
    if (currentChar == '{') {
        int jsonStart = i;
        int openBrace = 1;
        int closeBrace = 0;
        // Advance until the braces balance; braces inside string values are counted too
        while (openBrace > closeBrace) {
            i++;
            currentChar = doc.charAt(i);
            if (currentChar == '{') {
                openBrace++;
            }
            if (currentChar == '}') {
                closeBrace++;
            }
        }
        int jsonEnd = i;
        String currentString = doc.substring(jsonStart, jsonEnd + 1);
        JSONObject currentJSONObject = new JSONObject(currentString);
        returnJSONArray.put(currentJSONObject);
    }
}
Due to size, the database had to be divided into multiple 10k object documents. The code worked well until one of the documents had braces stored within the value. So I added some code to watch for values and ignore those based on quotation marks beneath the close curly bracket counter.
if (currentChar == '"') {
    i++;
    currentChar = mongoExport.charAt(i);
    // Skip ahead to the next quotation mark (escaped quotes are not handled here)
    while (!(currentChar == '"')) {
        i++;
        currentChar = mongoExport.charAt(i);
    }
}
This worked for the document with value pairs that contained curly brackets, but when I tested it against the rest of the documents I got a "String index out of range: big number" error in one of the other documents, which traces back to the while loop looking for an end quotation mark. From what I can figure, this means that there are also values that contain quotation marks. I tried some code to check for escape characters before quotation marks, but that changed nothing. I can't check through these documents manually; they are far too long for that. Is there a way for me to handle these strings? Also, was there a far easier method I could have used that I was unaware of from the beginning?
You don't need to parse manually, even with the javax.json package. Something like:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.json.Json;
import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonReader;
import javax.json.JsonValue;
...
private static final String jsonString = "[" +
"{\n" +
"\"id\":123,\n" +
"\"name\":\"Bob Marley\",\n" +
"\"address\":{\n" +
"\"street\":\"123 Main St\",\n" +
"\"city\":\"Anytown\",\n" +
"\"state\":\"CO\",\n" +
"\"zipcode\":80205\n" +
"},\n" +
"\"phoneNumbers\":[\"3032920200\"],\n" +
"\"role\":\"Developer\"\n" +
"},\n" +
"{\n" +
"\"id\":456,\n" +
"\"name\":\"Tommy Tutone\",\n" +
"\"address\":{\n" +
"\"street\":\"456 Main St\",\n" +
"\"city\":\"Sometown\",\n" +
"\"state\":\"CO\",\n" +
"\"zipcode\":80205\n" +
"},\n" +
"\"phoneNumbers\":[\"1238675309\"],\n" +
"\"role\":\"Developer\"\n" +
"}\n" +
"]";
...
@GET
@Produces("text/plain")
public String hello() {
    InputStream inputStream = new ByteArrayInputStream(jsonString.getBytes());
    JsonReader jsonReader = Json.createReader(inputStream);
    JsonArray jsonArray = jsonReader.readArray();
    for (JsonValue jsonValue : jsonArray) {
        JsonObject jsonObject = jsonValue.asJsonObject();
        System.out.println("next object id is " + jsonObject.getInt("id"));
        JsonObject addressObject = jsonObject.getJsonObject("address");
        System.out.println("next object city is " + addressObject.getString("city"));
    }
    return "Hello, World!";
}
This gets the first level objects (for example, "id") and nested objects ("address" in this example). I intentionally did not create a POJO type object that would represent the JSON object - you can do that but you'll have to decide if it's worthwhile to have a full object of your data or just pull it with things like getString().
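
The example above reads a document that is already wrapped in an array. For an export like the one in the question, where objects simply follow one another, a scanner that tracks whether it is inside a string literal avoids both failures described: braces inside values and escaped quotation marks. The following is only a sketch using the standard library. The class name JsonObjectSplitter and the sample input are made up, and it assumes the top-level values are all objects.

```java
import java.util.ArrayList;
import java.util.List;

class JsonObjectSplitter {
    // Splits text holding back-to-back top-level JSON objects into one string per object
    static List<String> split(String doc) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current brace nesting depth outside of strings
        int start = -1;           // index of the '{' that opened the current top-level object
        boolean inString = false; // true while scanning inside a "..." literal
        boolean escaped = false;  // true when the previous character in a string was a backslash
        for (int i = 0; i < doc.length(); i++) {
            char c = doc.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;          // this character is escaped, so ignore it
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;         // unescaped quote closes the string
                }
                continue;                     // braces inside strings are never counted
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    objects.add(doc.substring(start, i + 1));
                }
            }
        }
        if (depth != 0 || inString) {
            throw new IllegalArgumentException("Unbalanced braces or unterminated string");
        }
        return objects;
    }

    public static void main(String[] args) {
        String doc = "{\"a\":\"has } brace\"}\n{\"b\":\"quote \\\" inside\",\"c\":{\"d\":1}}";
        for (String object : split(doc)) {
            System.out.println(object);
        }
    }
}
```

Each returned string could then be handed to a JSON library, for instance new JSONObject(...) as in the question, and appended to the array.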

Unable to understand the HLDA Output in MALLET

Below is a snippet of my code:
HierarchicalLDA hlda = new HierarchicalLDA();
hlda.initialize(instances, instances, 5, new Randoms());
hlda.estimate(1000);
hlda.printState(new PrintWriter(new File("Data.txt")));
I am unable to understand the meaning of both the console output and what is printed in the "Data.txt" file. I have already scoured the MALLET site but haven't found anything helpful. Any help or suggestion would be greatly appreciated.
Thanks in advance!
In hLDA each document samples a path through a tree of topics. Each token exists on one "level" of that path. The printState method gives you the ids of each tree node in the path for the document, followed by information about the word: the numeric ID for the word, the string for that id, and the level in the path. The code below shows the order in which these are written:
node = documentLeaves[doc];
for (level = numLevels - 1; level >= 0; level--) {
    path.append(node.nodeID + " ");
    node = node.parent;
}
for (token = 0; token < seqLen; token++) {
    type = fs.getIndexAtPosition(token);
    level = docLevels[token];
    // The "" just tells java we're not trying to add a string and an int
    out.println(path + "" + type + " " + alphabet.lookupObject(type) + " " + level + " ");
}
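
Going by the loops above, each line of the state file holds the node ids of the path (written starting from the document's leaf and walking up through the parents), then the word's numeric id, the word itself, and its level. A small sketch for reading such a line back in might look like this. The class HldaStateLine and the sample line are hypothetical, and it assumes the word contains no whitespace.

```java
import java.util.ArrayList;
import java.util.List;

class HldaStateLine {
    final List<Integer> pathNodeIds = new ArrayList<>(); // leaf first, root last
    int typeId;   // numeric id of the word
    String word;  // string for that id
    int level;    // level of this token in the path

    // Parses one whitespace-separated line: <node ids...> <type id> <word> <level>
    static HldaStateLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 4) {
            throw new IllegalArgumentException("Expected at least one node id plus three fields: " + line);
        }
        HldaStateLine result = new HldaStateLine();
        int n = parts.length;
        // Everything before the last three fields belongs to the path
        for (int i = 0; i < n - 3; i++) {
            result.pathNodeIds.add(Integer.parseInt(parts[i]));
        }
        result.typeId = Integer.parseInt(parts[n - 3]);
        result.word = parts[n - 2];
        result.level = Integer.parseInt(parts[n - 1]);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample line for a three-level tree
        HldaStateLine parsed = parse("12 5 0 431 music 2 ");
        System.out.println(parsed.pathNodeIds + " " + parsed.typeId + " " + parsed.word + " " + parsed.level);
    }
}
```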

Display Stanford NER confidence score

I'm extracting named entities from news articles using the Stanford NER CRFClassifier and, in order to implement active learning, I would like to know the confidence score of each class for each labelled entity.
Example of the desired output:
LOCATION(0.20) PERSON(0.10) ORGANIZATION(0.60) MISC(0.10)
Here is my code for extracting named-entities from a text :
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(classifier_path);
String annnotatedText = classifier.classifyWithInlineXML(text);
Is there a workaround to get these values along with the annotations?
I found it myself. The CRFClassifier documentation says:
Probabilities assigned by the CRF can be interrogated using either the
printProbsDocument() or getCliqueTrees() methods.
The first method is not useful to me since it only prints the values to the console, and I want to be able to access the data. So I read how that method is implemented and copied part of its behaviour, like this:
List<CoreLabel> classifiedLabels = classifier.classify(sentences);
CRFCliqueTree<String> cliqueTree = classifier.getCliqueTree(classifiedLabels);
for (int i = 0; i < cliqueTree.length(); i++) {
    CoreLabel wi = classifiedLabels.get(i);
    for (Iterator<String> iter = classifier.classIndex.iterator(); iter.hasNext();) {
        String label = iter.next();
        int index = classifier.classIndex.indexOf(label);
        double prob = cliqueTree.prob(i, index);
        System.out.println("\t" + label + "(" + prob + ")");
    }
    String tag = StringUtils.getNotNullString(wi.get(CoreAnnotations.AnswerAnnotation.class));
    System.out.println("Class : " + tag);
}
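
The loop above prints one label per line. To get the single-line display asked for in the question, the label and probability pairs could be collected into an ordered map and formatted afterwards. This is a sketch using only the standard library. The class LabelScoreFormatter is made up, and the numbers are the illustrative ones from the question, not real classifier output.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

class LabelScoreFormatter {
    // Formats label -> probability pairs as "LABEL(0.12) LABEL(0.34) ..."
    static String format(Map<String, Double> probabilities) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            // Locale.ROOT keeps the decimal separator a dot regardless of system locale
            joiner.add(String.format(Locale.ROOT, "%s(%.2f)", entry.getKey(), entry.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so labels print in the order they were added
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("LOCATION", 0.20);
        probabilities.put("PERSON", 0.10);
        probabilities.put("ORGANIZATION", 0.60);
        probabilities.put("MISC", 0.10);
        System.out.println(format(probabilities));
    }
}
```

In the loop from the answer, the println inside the inner loop would be replaced by a put into such a map for each token.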

In java trying to extract XMLNS using a Regexpression

I have been trying for a few hours to get this right, and I really can't seem to do it...
Given a string
"xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\""
what is the correct expression to "save" the http://www.openarchives.org/OAI/2.0/oai-identifier bit?
Thanks in advance, really having trouble getting this right.
String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
Pattern p = Pattern.compile(".*\\\"(.*)\\\".*");
Matcher m = p.matcher(validXML);
System.out.println(m.group(1));
This is not printing anything. Be aware that this attempt was just to get the string inside the quotes; I was going to worry about the other part once I got that working. Too bad I never got that working. Thanks
Regular expressions can be expensive, so don't use them when you don't need to. There are many other ways to parse a string.
String validXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
String start = "xmlns:oai-identifier=\"";
String end = "\"";
int location = validXml.indexOf(start);
String result;
// indexOf returns -1 when nothing is found; 0 is a valid match position
if (location >= 0) {
    result = validXml.substring(location + start.length());
    int endIndex = result.indexOf(end);
    if (endIndex >= 0) {
        result = result.substring(0, endIndex);
    } else {
        throw new Exception("Could not find end!");
    }
} else {
    throw new Exception("Could not find start!");
}
System.out.println(result);
I think the problem might be that the first .* in your regular expression is too greedy and is matching more characters than you'd like.
Try changing ".*\\\"(.*)\\\".*" to be "xmlns.*=\"(.*)\".*" and see whether that works.
If it doesn't work at first, you can also try reinstating the quote escaping. Off the top of my head, I think you don't need to escape them, but I'm not 100% sure.
Note also that this will only match a single namespace declaration, not each one in the validXML variable in your example. You'll have to split the string in order to use this on an arbitrary number of xmlns:.*= attributes.
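
To collect every declaration without splitting the string first, a pattern that excludes quotes from the captured value can be applied repeatedly with find(). Note that group() may only be called after a successful matches() or find(); calling it earlier throws IllegalStateException, which is what happens in the question's snippet. The following is a sketch, not a general XML parser. The class XmlnsRegexDemo is made up, and it assumes double-quoted values and prefixed xmlns: declarations only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlnsRegexDemo {
    // Group 1: the prefix after "xmlns:"; group 2: everything up to the next double quote
    private static final Pattern XMLNS = Pattern.compile("xmlns:([\\w.-]+)=\"([^\"]*)\"");

    // Returns prefix -> namespace URI for every xmlns:prefix="..." found in the text
    static Map<String, String> extract(String xml) {
        Map<String, String> namespaces = new LinkedHashMap<>();
        Matcher matcher = XMLNS.matcher(xml);
        // find() advances to the next match each time; group() is valid only after it returns true
        while (matcher.find()) {
            namespaces.put(matcher.group(1), matcher.group(2));
        }
        return namespaces;
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\" "
                + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
                + "</feed>";
        extract(validXML).forEach((prefix, uri) -> System.out.println(prefix + " -> " + uri));
    }
}
```

Using [^\"]* instead of .* stops the capture at the first closing quote, so a greedy match cannot run on into the next attribute.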
Since you are reading XML, you might be using DOM, so you can extract the namespace from the prefix name using lookupNamespaceURI() once you parse the document with the setNamespaceAware() option set to true:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(validXML)));
String namespace = doc.lookupNamespaceURI("oai-identifier");
It's simpler and you don't have to do any string parsing.

uncaught syntax error

I am using the following code for local storage.
for (int i = 0; i < files.length; i++)
{
    System.out.println("base = " + files[i].getName() + "\n i=" + i + "\n");
    AudioFile f = AudioFileIO.read(files[i]);
    Tag tag = f.getTag();
    //AudioHeader h = f.getAudioHeader();
    int l = f.getAudioHeader().getTrackLength();
    String s1 = tag.getFirst(FieldKey.ALBUM);
    out.print("writeToStorage("+s1+","+s1+");");
}
I am getting "Uncaught SyntaxError: Unexpected identifier" as an error.
I'm guessing you meant Java rather than JavaScript?
Your unexpected identifier is here: out.print. You need System. in front of it.
The reason for this is that out is not defined in your code. You need to access it through the static variable in the System class, which is why you write System.out.
Alternatively, you could set a variable out equal to System.out as shorthand, although I don't tend to. This also lets you switch out to a different type of output stream without having to refactor your code much.
Have you added the following?
import static java.lang.System.out;
You probably need to output quotes in the last line to surround the s1 values.
"writeToStorage("+s1+","+s1+");"
->
"writeToStorage('"+s1+"','"+s1+"');"
Btw for the same reason you have to fix the other line too:
"base = " + files[i].getName() + "...
->
"base = '" + files[i].getName() + "'...

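Adding quotes by hand still breaks if the album name itself contains a quote or a backslash. A small escaping helper is one way to guard against that. This is only a sketch: the class JsCallBuilder and its methods are made up, the escaping covers a few common characters rather than every case, and writeToStorage is the JavaScript function name taken from the question.

```java
class JsCallBuilder {
    // Wraps a value in single quotes and escapes characters that would end or break the literal
    static String jsString(String value) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash
                case '\'': sb.append("\\'"); break;  // single quote
                case '\n': sb.append("\\n"); break;  // line feed
                case '\r': sb.append("\\r"); break;  // carriage return
                default: sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    // Builds the JavaScript call that the question's loop prints
    static String writeToStorageCall(String key, String value) {
        return "writeToStorage(" + jsString(key) + "," + jsString(value) + ");";
    }

    public static void main(String[] args) {
        System.out.println(writeToStorageCall("Abbey Road", "Abbey Road"));
        System.out.println(writeToStorageCall("It's Time", "It's Time"));
    }
}
```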