I've written a Java program that ingests data from a .csv file and converts that data into RDF/XML. I used the Sesame framework when writing this program, and the program successfully does what it was written to do.
However, I am trying to unit test this program using JUnit, and I need to test a method which converts RDF triples (in Turtle format) to RDF/XML. To show that the method works correctly, I would like to convert the RDF/XML back into triples and compare them to the original triples I passed into the method. So far, I have not found anything in Sesame's documentation that does this. Any suggestions?
I just solved the problem a few minutes ago. Here's my solution:
@Test
public void testWriteStmtToRDFPos(){
    RDFParser parser = new RDFXMLParser();
    String baseURI = "";
    Model origStmts = new LinkedHashModel();
    Model processedStmts = new LinkedHashModel();
    StatementCollector collector = new StatementCollector(processedStmts);
    parser.setRDFHandler(collector);
    origStmts.add(sexOffend, predicate, object);
    try{
        converter.writeStmtToRDF(origStmts, rdfFile);
        FileReader reader = new FileReader(rdfFile);
        parser.parse(reader, baseURI);
        // The collector has filled processedStmts while parsing the RDF/XML file,
        // so the round-tripped statements can be compared directly to the originals.
        assertEquals(origStmts, processedStmts);
    }catch(FileNotFoundException e){
        e.printStackTrace();
        fail();
    }catch(Exception e){
        e.printStackTrace();
        fail();
    }
}
When you set the collector on the parser above, it simply collects any statements that the parser ingests. After parsing, you can compare processedStmts (the collected statements) with origStmts. This wasn't immediately obvious, but it is really useful once you find it!
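One caveat, in case it helps anyone: if your statements ever contain blank nodes, a plain equals() comparison of the two models can fail even when the data round-trips correctly. If your Sesame version has it (I believe it was added around 2.8), org.openrdf.model.util.Models.isomorphic is the safer comparison; a minimal sketch of that assertion:

assertTrue(Models.isomorphic(origStmts, processedStmts));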
I'm trying to write some unit tests for Kafka Streams and have a number of quite complex schemas that I need to incorporate into my tests.
Instead of just creating objects from scratch each time, I would ideally like to instantiate them using some real data and run tests on that. We use Confluent with records in Avro format, and I can extract both the schema and a text, JSON-like representation of a record from the Control Center application. The JSON is valid JSON, but it's not really in the form you'd write if you were hand-writing JSON representations of the data, so I assume it's some text rendering of the underlying Avro.
I've already used the schema to create a Java SpecificRecord class (price_assessment) and would like to use the JSON string copied from the Control Center message to populate a new instance of that class to feed into my unit test's InputTopic.
The code I've tried so far is
var testAvroString = "{JSON copied from Control Center topic}";
Schema schema = price_assessment.getClassSchema();
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = null;
try {
DatumReader<price_assessment> reader = new SpecificDatumReader<price_assessment>();
decoder = decoderFactory.get().jsonDecoder(schema, testAvroString);
return reader.read(null, decoder);
} catch (Exception e)
{
return null;
}
which is adapted from another SO answer that was using GenericRecords. When I try running this, though, I get the exception Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null on the reader.read(...) step.
I'm not massively familiar with streams testing or Java, and I'm not sure exactly what I've done wrong. This is written in Java 17 with Kafka Streams 3.1.0, though I'm flexible on versions.
The solution that I've managed to come up with is the following, which seems to work:
private static <T> T avroStringToInstance(Schema classSchema, String testAvroString) {
    GenericRecord genericRecord = null;
    try {
        // Decode the JSON dump against the schema into a generic record.
        Decoder decoder = DecoderFactory.get().jsonDecoder(classSchema, testAvroString);
        DatumReader<GenericData.Record> reader = new GenericDatumReader<>(classSchema);
        genericRecord = reader.read(null, decoder);
    } catch (Exception e) {
        return null;
    }
    // Deep-copy the generic record into the generated SpecificRecord type (unchecked cast).
    var specific = (T) SpecificData.get().deepCopy(genericRecord.getSchema(), genericRecord);
    return specific;
}
My homework assignment is to read a URL and print all hyperlinks at that URL to a file. I also need to submit a JUnit test case with at least one assertion. I have looked at the different forms of Assert, but I just can't come up with any use of them that applies to my code. Any help steering me in the right direction would be great.
(I'm not looking for anyone to write the test case for me, just a little guidance on what direction I should be looking in)
public void saveHyperLinkToFile(String url, String fileName)
throws IOException
{
URL pageLocation = new URL(url);
Scanner in = new Scanner(pageLocation.openStream());
PrintWriter out = new PrintWriter(fileName);
while (in.hasNext())
{
String line = in.next();
if (line.contains("href=\"http://"))
{
int from = line.indexOf("\"");
int to = line.lastIndexOf("\"");
out.println(line.substring(from + 1, to));
}
}
in.close();
out.close();
}
Try to decompose your method into simpler ones:
List<URL> readHyperlinksFromUrl(URL url);
void writeUrlsToFile(List<URL> urls, String fileName);
You could already test your first method by saving a sample document as a resource and running it against that resource, comparing the result with the known list of URLs.
You can also test the second method by re-reading that file.
But you can decompose things further:
void writeUrlsToWriter(List<URL> urls, Writer writer);
Writer createFileWriter(String fileName);
Now you can test writeUrlsToWriter by passing in a StringWriter and checking what was written to it, asserting the equality of writer.toString() with the expected value. Note that the methods are becoming simpler and simpler.
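For illustration only, a minimal sketch of such a test, assuming a writeUrlsToWriter(List<URL>, Writer) method as outlined above (the one-URL-per-line output format is an assumption):

@Test
public void writesEachUrlOnItsOwnLine() throws Exception {
    List<URL> urls = Arrays.asList(new URL("http://example.com/a"), new URL("http://example.com/b"));
    StringWriter writer = new StringWriter();

    writeUrlsToWriter(urls, writer);

    String expected = "http://example.com/a" + System.lineSeparator()
                    + "http://example.com/b" + System.lineSeparator();
    assertEquals(expected, writer.toString());
}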
It would actually be a very good exercise to write the whole thing test-first, or even to play ping-pong with yourself.
Good luck and happy coding.
Before I proceed to my question, please note that I am not working on any client-server application that would require serialization; the program I am trying to customize stores one big instance of one big class in a .dat file. I have read about this issue (memory leaks with ObjectOutputStream and ObjectInputStream) and gathered that I would probably need to do one of the following:
use the ObjectOutputStream.reset() method after writing the class instance in the .dat file, so that it doesn't hold the reference anymore;
re-write the code without using serialization;
split the file and read it in chunks;
change the JVM memory parameter by using -Xmx (a sample invocation follows this list).
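For context, -Xmx is passed on the java command line when launching the application; the heap size, classpath, and main class below are only placeholders:

java -Xmx1024m -cp <your classpath> <your main class>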
So, I was provided with a class that generates a language model and saves it with a .dat extension; the code was probably optimized for small model files (the two model files provided as examples are both around 10 MB), but I generated a much larger model, around 40 MB. Then there is another class in another folder, totally independent of the first one, that uses this model, and the model has to be loaded using ObjectInputStream. Here comes the problem: a classic "OutOfMemoryError: Java heap space".
Writing the object:
try {
    // Create an output stream to the file.
    FileOutputStream file_output = new FileOutputStream(file);
    ObjectOutputStream o = new ObjectOutputStream(file_output);
    o.writeObject(this);
    // Closing the ObjectOutputStream flushes it and closes the underlying file stream.
    o.close();
}
catch (IOException e) {
    System.err.println("IO exception = " + e);
}
Reading the object:
InputStream model = null;
ModelGeneration oRead = null;
ObjectInputStream p = null;
try {
model = new FileInputStream(filename);
BufferedInputStream buf = new BufferedInputStream(model);
p = new ObjectInputStream(buf);
oRead = (ModelGeneration) p.readObject();
p.reset();
} catch (IOException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
} finally {
try {
model.close();
} catch (Exception e) {
e.printStackTrace();
}
}
I tried to use the reset() method, but it is useless because we load only one instance of one class at a time; nothing else is needed. This is also why I can't split the file: only one class instance is stored in the .dat file.
Changing the heap space seems like a worse solution than optimizing the code.
I would really appreciate your advice on what I can do.
Btw the code is here : http://svn.apache.org/repos/asf/uima/addons/trunk/Tagger/, I only implemented the required classes for a different language.
P.S. It works fine if I create a smaller model, but I would prefer the bigger one.
If I run a command like this from the command line
./opennlp TokenNameFinder en-ner-person.bin "input.txt" "output.txt"
I'll get person names printed in output.txt, but I want to train my own models so that I can extract my own entities.
E.g.
what is the risk value on icm2500.
Delivery of prd_234 will be arrived late.
Watson is handling router_34.
If I pass in these lines, it should parse them and extract the product entities: icm2500, prd_234, router_34, etc. are all products (we could save this information in a file and use it as a kind of lookup for the models or for OpenNLP).
Can anyone please tell me how to do this?
You'll need to train your own model by annotating some sentences in the OpenNLP format. For the example sentences you posted, the format would look like this:
what is the risk value on <START:product> icm2500 <END>.
Delivery of <START:product> prd_234 <END> will be arrived late.
Watson is handling <START:product> router_34 <END>.
Make sure each sentence ends with a newline, and if there are newlines within a sentence, escape them somehow.
Once you make a file like this out of your data, you can use the Java API to train the model like this:
public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
            new PlainTextByLineStream(new FileInputStream("your file in the above format"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    TokenNameFinderModel model;
    try {
        // "product" is the entity type used in the <START:product> annotations above.
        model = NameFinderME.train("en", "product", sampleStream, TrainingParameters.defaultParams(),
                null, Collections.<String, Object>emptyMap());
    } finally {
        sampleStream.close();
    }

    File modelFile = new File("en-ner-product.bin"); // wherever you want the model written
    BufferedOutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null)
            modelOut.close();
    }
}
Now you can use the model with the NameFinder.
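As a rough sketch (the file name and the tokenized sentence are just placeholders, and this belongs in a method that can throw IOException):

TokenNameFinderModel model = new TokenNameFinderModel(new FileInputStream("en-ner-product.bin"));
NameFinderME nameFinder = new NameFinderME(model);

String[] tokens = {"Watson", "is", "handling", "router_34", "."};
Span[] spans = nameFinder.find(tokens);
for (Span span : spans) {
    // Print the entity type and the first token of the match.
    System.out.println(span.getType() + ": " + tokens[span.getStart()]);
}
nameFinder.clearAdaptiveData(); // clear adaptive data between documents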
Because you may have a definitive, and possibly short, list of product names, you might consider a simple regex approach.
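If you go that route, even plain java.util.regex may be enough; a minimal sketch, with the pattern guessed from the example sentences above:

Pattern productPattern = Pattern.compile("\\b(?:icm\\d+|prd_\\d+|router_\\d+)\\b");
Matcher matcher = productPattern.matcher("Delivery of prd_234 will be arrived late.");
while (matcher.find()) {
    System.out.println("product: " + matcher.group());
}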
Here are the OpenNLP docs that cover the NameFinder a bit:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.tool
I read the claims from the Sun people about the wonderful space economy of not only using FastInfoset, but using it with an external vocabulary. The code for this purpose is included in the most recent version (1.2.8), but it is not exactly fully documented.
For many files, this works just great for me. However, we've come up with an XML file which, when serialized from DOM with the vocabulary I created (using the generator in the FI library) and then read back into DOM, does not match the original. The mismatches are all in PC-data.
I just call setVocabulary on the serializer and setExternalVocabulary with a map from URI to vocabulary on the reader.
I had to invent my own mechanism to actually serialize a vocabulary; there didn't seem to be one anywhere in the FI library.
One fiddly bit of business is that the org.jvnet.fastinfoset.Vocabulary class is what the generator gives you, but it's not what the parsers and serializers eat. I made arrangements to serialize these, and then use the code below to turn them into the needed objects:
private static void initializeAnalysis() {
InputStream is = FastInfosetUtils.class.getResourceAsStream(ANALYSIS_VOCAB_CLASSPATH);
try {
ObjectInputStream ois = new ObjectInputStream(is);
analysisJvnetVocab = (SerializableVocabulary) ois.readObject();
ois.close();
} catch (IOException e) {
throw new RuntimeException(e);
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
}
analysisSerializerVocab = new SerializerVocabulary(analysisJvnetVocab.getVocabulary(), false);
analysisParserVocab = new ParserVocabulary(analysisJvnetVocab.getVocabulary());
}
and then, to actually write a document:
SerializerVocabulary fullVocab = new SerializerVocabulary();
fullVocab.setExternalVocabulary(ANALYSIS_VOCAB_URI, analysisSerializerVocab, false);
// pass fullVocab to setVocabulary.
and to read:
Map<Object, Object> vocabMap = new HashMap<Object, Object>();
vocabMap.put(ANALYSIS_VOCAB_URI, analysisParserVocab);
// pass map into setExternalVocabulary
I could easily imagine that my recipe for creating serialization vocabularies is not right; it's not as though I was following a tutorial. Does anyone happen to know?
UPDATE
Since no one around here had anything to add to this question, I made a test case and filed a bug report. Somewhat to my surprise, it turned out that it was, in fact, a bug, and a fix has been made.