Text normalization in Python matching Java's Normalizer.Form.NFKD

A field in the table is normalized using Java as shown below,
String name = customerRecord.getName().trim();
name = name.replaceAll("œ", "oe");
name = name.replaceAll("æ", "ae");
name = Normalizer.normalize(name, Normalizer.Form.NFKD).replaceAll("[^\\p{ASCII}]", "");
name = name.toLowerCase();
Now I'm trying to query the same DB using Python. How do I do Normalizer.normalize(name, Normalizer.Form.NFKD) in Python so that it is compatible with the way the data was written?

An almost complete translation of the above Java code to Python is as follows (wrapped in a function so the final return is valid):
import unicodedata

ASCII_REPLACEMENTS = {
    'œ': 'oe',
    'æ': 'ae',
}

def normalize_name(search_term):
    text = ''.join(ASCII_REPLACEMENTS.get(c, c) for c in search_term)
    ascii_term = (
        unicodedata.normalize('NFKD', text)
        .encode('ascii', errors='ignore')
        .decode()
    )
    return ascii_term.lower()
ASCII_REPLACEMENTS should be amended with whatever characters don't get translated correctly by unicodedata.normalize compared to Java's Normalizer.normalize(name, Normalizer.Form.NFKD). This ensures compatibility between the two.
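A quick stdlib-only check of why the replacement table is needed at all: NFKD decomposes accented letters into a base letter plus a combining mark, but ligatures like œ and æ have no compatibility decomposition, so the ASCII filter would silently drop them (the helper name to_ascii here is just for illustration):

```python
import unicodedata

def to_ascii(s):
    # NFKD-normalize, then drop anything outside ASCII — the same idea as
    # the Java replaceAll("[^\\p{ASCII}]", "") step.
    return unicodedata.normalize('NFKD', s).encode('ascii', errors='ignore').decode()

# Accented letters survive: 'é' decomposes to 'e' + U+0301 (combining acute).
print(to_ascii('café'))  # cafe
# Ligatures do not: 'œ' has no NFKD decomposition, so it is dropped entirely.
print(to_ascii('œuf'))   # uf
```

This is exactly why the Java code does the replaceAll("œ", "oe") / replaceAll("æ", "ae") passes before normalizing.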


How can I translate a TupleExpr or a ParsedTupleQuery into the Query String?

I want to parse a query using rdf4j's SPARQLParser, modify the underlying query tree (=TupleExpr) and translate it back into a query string. Is there a way to do that with rdf4j?
I tried the following, but it didn't work:
SPARQLParser parser = new SPARQLParser();
ParsedQuery originalQuery = parser.parseQuery(query, null);
if (originalQuery instanceof ParsedTupleQuery) {
TupleExpr queryTree = originalQuery.getTupleExpr();
queryTree.visit(myQueryModelVisitor());
originalQuery.setTupleExpr(queryTree);
System.out.println(queryTree);
ParsedQuery tsQuery = new ParsedTupleQuery(queryTree);
System.out.println(tsQuery.getSourceString());
}
the printed output is null.
You'll want to use the org.eclipse.rdf4j.queryrender.sparql.experimental.SparqlQueryRenderer which is specifically designed to transform a TupleExpr back into a SPARQL query string.
Roughly, like this:
SPARQLParser parser = new SPARQLParser();
ParsedQuery originalQuery = parser.parseQuery(query, null);
if (originalQuery instanceof ParsedTupleQuery) {
TupleExpr queryTree = originalQuery.getTupleExpr();
queryTree.visit(myQueryModelVisitor());
originalQuery.setTupleExpr(queryTree);
System.out.println(queryTree);
ParsedQuery tsQuery = new ParsedTupleQuery(queryTree);
String transformedQuery = new SparqlQueryRenderer().render(tsQuery);
}
Note that this component is still experimental, and does not have guaranteed complete coverage of all SPARQL 1.1 features.
As an aside, the reason getSourceString() returns null here is that it is designed to return the input source string from which the parsed query was generated. Since in your case you've just created a new ParsedQuery object from scratch, there is no source string.

Java Spark: How to get value from a column which is JSON formatted string for entire dataset?

Need some help here. I am trying to read data from Hive/CSV. There is a column whose type is string and whose value is a JSON-formatted string. It is something like this:
| Column Name A |
|----------------------------------------------------------|
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|
How can I get the value of key_2 and insert it to a new column?
I tried to create a new function to get the value via Gson:
private BigDecimal getValue(final String columnValue) {
    JsonObject jsonObject = JsonParser.parseString(columnValue).getAsJsonObject();
    return jsonObject.get("key").getAsJsonObject().get("key_1").getAsJsonObject().get("key_2").getAsJsonArray().get(0).getAsBigDecimal();
}
But how can I apply this method to the whole dataset?
I was trying to achieve something like this:
Dataset<Row> ds = souceDataSet.withColumn("New_column", getValue(sourceDataSet.col("Column Name A")));
But it cannot be done as the data types are different...
Could you please give any suggestions?
Thx!
------------------Update---------------------
As @Mck suggested, I used get_json_object.
As my value is wrapped in double quotes:
"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"
I used substring to remove the leading quote and make the new string like this:
{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}
Code for substring
Dataset<Row> dsA = sourceDataSet.withColumn("Column Name A", expr("substring(`Column Name A`, 2, length(`Column Name A`))"));
I used dsA.show() and confirmed the dataset looks correct.
Then I used following code try to do it
Dataset<Row> ds = dsA.withColumn("New_column",get_json_object(dsA.col("Column Name A"), "$.key.data.key_2[0]"));
which returns null.
However, if the data is this:
{"key":{"data":{"key_2":[456]}}}
I can get value 456.
Any suggestions why I get null?
Thx for the help!
Use get_json_object, trimming both the leading and the trailing quote. Your substring removed only the leading quote, so the trailing quote remained and the string was no longer valid JSON, which is why you got null:
ds.withColumn(
"New_column",
get_json_object(
col("Column Name A").substr(lit(2), length(col("Column Name A")) - 2),
"$.key.data.key_2[0]")
).show(false)
+----------------------------------------------------------+----------+
|Column Name A |New_column|
+----------------------------------------------------------+----------+
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|456 |
+----------------------------------------------------------+----------+
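The root cause can be reproduced outside Spark with a plain-Python sketch (stdlib json only, not Spark code): stripping only the leading quote leaves a trailing quote, so the string is no longer valid JSON, while stripping both quotes lets the JSON path resolve:

```python
import json

raw = '"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"'

# substring(col, 2, length(col)) removes only the leading quote:
# the trailing quote remains, parsing fails, and get_json_object yields null.
try:
    json.loads(raw[1:])
    still_valid = True
except json.JSONDecodeError:
    still_valid = False
print(still_valid)  # False

# substr(lit(2), length(col) - 2) strips both quotes -> valid JSON,
# and the path $.key.data.key_2[0] resolves.
doc = json.loads(raw[1:-1])
print(doc['key']['data']['key_2'][0])  # 456
```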

Java Tensorflow + Keras Equivalent of model.predict()

In python you can simply pass a numpy array to predict() to get predictions from your model. What is the equivalent using Java with a SavedModelBundle?
Python
model = tf.keras.models.Sequential([
# layers go here
])
model.compile(...)
model.fit(x_train, y_train)
predictions = model.predict(x_test_maxabs) # <= This line
Java
SavedModelBundle model = SavedModelBundle.load(path, "serve");
model.predict() // ????? // What does it take as an input? Tensor?
TensorFlow Python automatically converts your NumPy array to a tf.Tensor. In TensorFlow Java, you manipulate tensors directly.
Now the SavedModelBundle does not have a predict method. You need to obtain the session and run it, using the SessionRunner and feeding it with input tensors.
For example, based on the next generation of TF Java (https://github.com/tensorflow/java), your code ends up looking like this (note that I'm making a lot of assumptions here about x_test_maxabs since your code sample does not explain clearly where it comes from):
try (SavedModelBundle model = SavedModelBundle.load(path, "serve")) {
try (Tensor<TFloat32> input = TFloat32.tensorOf(...);
Tensor<TFloat32> output = model.session()
.runner()
.feed("input_name", input)
.fetch("output_name")
.run()
.expect(TFloat32.class)) {
float prediction = output.data().getFloat();
System.out.println("prediction = " + prediction);
}
}
If you are not sure what the names of the input/output tensors in your graph are, you can obtain them programmatically by looking at the signature definition:
model.metaGraphDef().getSignatureDefMap().get("serving_default")
You can try Deep Java Library (DJL).
DJL internally uses TensorFlow Java and provides a high-level API to make inference easy:
Criteria<Image, Classifications> criteria =
Criteria.builder()
.setTypes(Image.class, Classifications.class)
.optModelUrls("https://example.com/squeezenet.zip")
.optTranslator(ImageClassificationTranslator
.builder().addTransform(new ToTensor()).build())
.build();
try (ZooModel<Image, Classifications> model = ModelZoo.load(criteria);
        Predictor<Image, Classifications> predictor = model.newPredictor()) {
    Image image = ImageFactory.getInstance().fromUrl("https://myimage.jpg");
    Classifications result = predictor.predict(image);
}
Check out the GitHub repo: https://github.com/awslabs/djl
There is a blog post: https://towardsdatascience.com/detecting-pneumonia-from-chest-x-ray-images-e02bcf705dd6
And the demo project can be found here: https://github.com/aws-samples/djl-demo/blob/master/pneumonia-detection/README.md
In the 0.3.1 API:
val model: SavedModelBundle = SavedModelBundle.load("path/to/model", "serve")
val inputTensor = TFloat32.tensorOf(..)
val function: ConcreteFunction = model.function(Signature.DEFAULT_KEY)
// You can cast the result to the type you expect; the type of the returned tensor
// can be checked via the signature: model.function("serving_default").signature().toString()
val result: Tensor = function.call(inputTensor)
After you get a result Tensor of any subtype, you can iterate over its values. In my example, I had a TFloat32 with shape (1, 56), so I found the max value via result.get(0, idx).

Jena Model converts my RDF type explicit declaration to implicit and messes with the format

I have the following code that creates an RDF resource with some set properties and prints it on console.
String uri = "http://krweb/";
String name = "Giorgos Georgiou";
String phone = "6976067554";
String age = "27";
String department = "ceid";
String teaches = "java";
Model model = ModelFactory.createOntologyModel();
model.setNsPrefix("krweb", uri);
Resource giorgosgeorgiou = model.createResource(uri+name.toLowerCase().replace(" ", ""), model.createResource(uri+"Professor"));
Property has_name = model.createProperty(uri+"has_name");
Property has_phone = model.createProperty(uri+"has_phone");
Property has_age = model.createProperty(uri+"has_age");
Property member_of = model.createProperty(uri+"member_of");
Property teach = model.createProperty(uri+"teaches");
giorgosgeorgiou.addProperty(teach, model.createResource(uri+teaches));
giorgosgeorgiou.addProperty(member_of, model.createResource(uri+department));
giorgosgeorgiou.addProperty(has_age,age);
giorgosgeorgiou.addProperty(has_phone,phone);
giorgosgeorgiou.addProperty(has_name,name);
//giorgosgeorgiou.addProperty(RDF.type, model.createResource(uri+"Professor"));
model.write(System.out,"RDF/XML");
I want the model printed in this format:
<rdf:Description rdf:about="http://krweb/giorgosgeorgiou">
<rdf:type rdf:resource="http://krweb/Professor"/>
<krweb:has_name>Giorgos Georgiou</krweb:has_name>
<krweb:has_phone>6976067554</krweb:has_phone>
<krweb:has_age>27</krweb:has_age>
<krweb:member_of rdf:resource="http://krweb/ceid"/>
<krweb:teaches rdf:resource="http://krweb/java" />
</rdf:Description>
Instead I get this:
<krweb:Professor rdf:about="http://krweb/giorgosgeorgiou">
<krweb:has_name>Giorgos Georgiou</krweb:has_name>
<krweb:has_phone>6976067554</krweb:has_phone>
<krweb:has_age>27</krweb:has_age>
<krweb:member_of rdf:resource="http://krweb/ceid"/>
<krweb:teaches rdf:resource="http://krweb/java"/>
</krweb:Professor>
Somehow, the rdf type property gets converted to some implicit declaration and is presented in what I suppose is a "pretty" format. Is there a way to bypass this?
Internally the RDF data is held as triples - no knowledge of how they were formatted on input is stored.
The default output is pretty RDF/XML.
To get the plain, flat format, use RDFFormat.RDFXML_PLAIN:
RDFDataMgr.write(System.out, model, RDFFormat.RDFXML_PLAIN);

OrientDB - Java create a field as EMBEDDEDLIST type

I'm trying to create a field of type EMBEDDEDLIST from Java.
But when I create it, it is treated as a LINK.
If I define the field via Studio as EMBEDDEDLIST with a linked class, the Java code works properly.
My code:
String fieldName = "trialEmbedded";
List<ODocument> fieldDataItem = doc.getData().field(fieldName);
DataItem di = DataItemFactory.create(dtValidita, importo, descrizione, db);
if (fieldDataItem == null) {
fieldDataItem = new ArrayList<ODocument>();
}
fieldDataItem.add(di.getData());
doc.setField(fieldName, fieldDataItem);
When I save the doc variable (type ODocument), on the DB (querying via Studio) I get in the column "trialEmbedded" a link (orange box with a clickable #rid); if I specify the field as EMBEDDEDLIST, it works properly.
I resolved it in a very simple way.
I used the signature of setField that takes an OType parameter, like this:
this.data.field(fieldName, fieldDataItem, OType.EMBEDDEDLIST);
