freebase MQL query to match any field - java

My application gets in input a certain amount of String, suppose the name of the "object" I'm looking for, and other fields like the year when an artist was born or the last album he made.
By the way the application has no knowledge on the type of the object in input, so what I'm trying to do is making an MQL query that, given the name of the object and other values (in any field, as I don't know the type of what I'm querying), returns me the type of what I searched. Once I get its type, I could for example make a better query asking for specifical fields.
As we all know, an example is worth thousand words, so let's assume my input is "The Police" and "So Lonely", one of their songs. I just know "The Police" is the name of what I'm looking for, but I don't know nothing about "So Lonely", so I should insert it someway in the query to get better results, without the knowledge of its type.
My first basic query is:
[{
"name": "The Police",
"type": []
}]
and it works, but I can't refine my search including so lonely, that could narrow the search output.
Any hint?

Another thing I need to accomplish, once I do the above (and still no idea on how to do it!) would be printing a summary of the entity i found. For example above, I would not have ALL the information about "The Police", but only basic general fields, that could be albums, the year they were born.
Is this possible to do something like that without the knowledge of what I found?

Related

Identifying personnal information from column description

I have a question about the identification of GDPR (General Data Protection Regulation) related sentences.
Is there a tool / method in Python, Java, ... that identifies whether a database column contains personnally identifiable information from its description only ?
We may think about using word embedding to get the "most_similar" or "most_similar_cosmul" words given a sentence and afterwards identifying keywords related to GDPR (biometric, personnal, id, photo...) but the results depend on the robustness of the word embedding model.
Thank you in advance,
There is no such thing as "personally identifiable information" in GDPR. The term (from GDPR article 4(1)) is "personal data", defined as:
any information relating to an identified or identifiable natural person
and it doesn't itself have to be identifying to qualify. What's an "identifiable natural person"? GDPR says:
an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person
The key thing that turns regular "data" into "personal data" here is that "one or more factors" phrase. A single field, such as a phone number, could reasonably be considered as uniquely identifying a person. By itself a postal code probably doesn't, but when combined with a street address and a first name, we'd be very close to being able to identify someone, and hence all other data would become "personal". It's hard to evaluate whether a collection of fields is enough to uniquely identify someone or not – you might think that first name and city might not identify an individual, given "John" and "London", but "Esmerelda" and "Ulaanbaatar" might be pretty easy to track down, and it's the "worst case" that counts.
For a simpler example: A colour value such as #663399 by itself is just plain "data", is not "personal data", and is not subject to GDPR. That exact same value stored as "favourite colour" in a field in a table linking that data to a person is personal data. "City" in a table of cities is not personal data, but a "city" field in a user table is.
In short, you're not going to be able to do what you want. You can't tell whether a field is personal data or not from its name because you have insufficient context.

Interpret Queries of Lucene

I was wondering if there is any way to interpret Queries of Lucene in simple terms?
For example :
Example # 1:
Input Query - name:John
Output - Interpreted as : Find all entries where attribute "name" is equal "John".
Example # 2:
Input Query - name:John AND phoneNumber:1234
Output - Interpreted as : Find all entries where attribute "name" is equal to "John" and attribute "phoneNumber" is equal to "1234".
Any tutorials in this regard will be helpful,
Thanks
The Lucene documentation does a pretty decent job in explaining basic queries and their interpretation. It seems as though that's all you're looking for; once you get into some of the more advanced query types, it gets hairy, but the documentation should always be your first stop; it's fairly comprehensive.
Edit: Ah, you want automated query explanation. I don't know of any that currently exist; I think you'll have to write your own, but if you're starting with standard QueryParser Syntax, I think the best input for your interpreter would be the output of QueryParser.parse(). That breaks down the free text into Lucene query objects that shouldn't be too difficult to wrap in a utility function that outputs a plain-English string for each one.

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords: "berlin house john" name: "berlin house john" name" author: "berlin house john" name"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On another hand, if i switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words, in all three fields, which of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure, all words appear in all fields I would suggest copying all relevant fields into one field at index time and query this one instead. To do so, you need to introduce a new field and then use copyField for all sourcefields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
An similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also as for me defaultOperator works same as specified operator in query. I can't said why it is, its just my experience.
Please try query with param {!q.op=AND}
I guess you use default query parser, fix me if I am wrong

Lucene search on a Hibernate List field

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, If a user searches on TestClass for "Foo Snafu" then the record will be found, I am guessing because the token Foo and the token Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. you could just append the index to the field name you are already using. While I don't have complete information on your needs, of course, doing something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your phrasequery doesn't cross from one list item to the next (ie, if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.

Searching multiple fields with Lucene

I'm having some trouble with a search I'm trying to implement. I need for a user to be able to enter a search query into a web interface and for the back-end Java to search for the query in a number of fields. An example of this might be best:
Say I have a List containing "Person" objects. Say each object holds two String fields about the person:
FirstName: Jack
Surname: Smith
FirstName Mary
Surname: Jackson
If a user enters, "jack", I need the search to match both objects, the first on Surname, and the second on FirstName.
I've been looking at using a MultiFieldQueryParser but can't get the fields set up right. Any help on this or pointing to a good tutorial would be greatly appreciated.
MultiFieldQueryParser is what you want, as you say.
Make sure:
The field names are always used consistently
The same Analyzer is used on both fields, and also on the query parser
You won't find partial words by default, so if you search for jack you won't find jackson. (You can search for jack* in that case.)
Regarding field name, I always set up an enum for my field names, then use e.g. MyFieldEnum.firstname.name() when passing field names to Lucene, so that if I make a spelling mistake the compiler can catch it, and it's also a good place to put Javadoc so you can see what the fields are for, and also a place where you can see the complete list of fields you wish to support in your Lucene documents.

Categories