Identifying personal information from column description - java

I have a question about identifying sentences related to the GDPR (General Data Protection Regulation).
Is there a tool or method in Python, Java, ... that identifies whether a database column contains personally identifiable information from its description only?
One idea is to use word embeddings to get the "most_similar" or "most_similar_cosmul" words for a given sentence, and then look for keywords related to GDPR (biometric, personal, id, photo...), but the results depend on the robustness of the word embedding model.
Thank you in advance,

There is no such thing as "personally identifiable information" in GDPR. The term (from GDPR Article 4(1)) is "personal data", defined as:
"any information relating to an identified or identifiable natural person"
and it doesn't itself have to be identifying to qualify. What's an "identifiable natural person"? GDPR says:
"an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person"
The key thing that turns regular "data" into "personal data" here is that "one or more factors" phrase. A single field, such as a phone number, could reasonably be considered as uniquely identifying a person. By itself a postal code probably doesn't, but when combined with a street address and a first name, we'd be very close to being able to identify someone, and hence all other data would become "personal". It's hard to evaluate whether a collection of fields is enough to uniquely identify someone or not – you might think that first name and city might not identify an individual, given "John" and "London", but "Esmerelda" and "Ulaanbaatar" might be pretty easy to track down, and it's the "worst case" that counts.
For a simpler example: A colour value such as #663399 by itself is just plain "data", is not "personal data", and is not subject to GDPR. That exact same value stored as "favourite colour" in a field in a table linking that data to a person is personal data. "City" in a table of cities is not personal data, but a "city" field in a user table is.
In short, you're not going to be able to do what you want. You can't tell whether a field is personal data from its name or description alone, because that gives you insufficient context.

Related

How to predict correct country name for user provided country name?

I am planning to do some cleanup on my data.
Situation: I have a data set with a country field. It contains country names entered by users, so it may contain spelling mistakes or different names for the same country (for example US / U.S.A / United States for USA). I also have a list of correct country names.
What I want: to predict which country in my list an input most likely refers to. For example, if "U.S." is given, it should be changed to "USA" (the correct country name in our list).
Is there any way I can do this using Java, OpenNLP or any other method?
You can use the Getty API. It will give you abbreviations of country names, so experiment with it.
OR
You can use the Levenshtein distance to find the closest country name in your list.
Try this out; it should help.
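The Levenshtein suggestion above can be sketched in plain Java as follows. This is only a minimal illustration, not a tuned solution: the class name CountryMatcher, the helper names and the normalisation step (uppercasing and dropping everything except letters) are assumptions made for this example, and the country list is made up.

```java
import java.util.List;
import java.util.Locale;

class CountryMatcher {
    // Dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: uppercase and keep letters only, so "U.S.A" becomes "USA".
    static String normalize(String s) {
        return s.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
    }

    // Returns the entry of 'countries' with the smallest distance to 'input'
    // (the first such entry on ties), or null if the list is empty.
    static String closest(String input, List<String> countries) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String key = normalize(input);
        for (String country : countries) {
            int d = levenshtein(key, normalize(country));
            if (d < bestDistance) {
                bestDistance = d;
                best = country;
            }
        }
        return best;
    }
}
```

With the list USA, India, Germany, France, the input "U.S.A" resolves to "USA" and "Germny" to "Germany". Edit distance only helps with spelling variants: "United States" is not close to "USA" in this sense, so in practice you would probably also want a maximum-distance threshold and a separate mechanism for synonyms.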
You can try attaching Google's location autocomplete API to your text box or select element.
If you use this API, users get Google-style autocomplete suggestions while typing.
If you have city or state information that is already sanitized, you could look up the country from it.
You could also define aliases in your list of country names and point the aliases to the preferred notation. For example, US, United States and USA are all aliases of U.S.A. You could have the program append to the alias database so that it improves as it is used. You might have to make multiple passes over the data, and a certain amount of manual work is involved.
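A minimal sketch of the alias idea, assuming an in-memory map as the "alias database". The class name CountryAliases and the way lookup keys are normalised (lowercase, punctuation dropped, whitespace collapsed) are choices made for this example only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class CountryAliases {
    // Maps a normalised alias to the preferred (canonical) country name.
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    // Assumed normalisation: trim, lowercase, drop punctuation, collapse spaces.
    static String key(String raw) {
        return raw.trim()
                  .toLowerCase(Locale.ROOT)
                  .replaceAll("[^a-z ]", "")
                  .replaceAll("\\s+", " ");
    }

    // Registers a new alias; can be called as unknown spellings are reviewed by hand.
    void addAlias(String alias, String canonical) {
        aliasToCanonical.put(key(alias), canonical);
    }

    // Returns the canonical name, or empty if the input is not a known alias.
    Optional<String> resolve(String raw) {
        return Optional.ofNullable(aliasToCanonical.get(key(raw)));
    }
}
```

Inputs that resolve to an empty result can be collected for manual review and then fed back through addAlias, which is the "improves as it is used" loop described above.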

Using Stanford NER for extracting Address from a text document?

I was looking at Stanford NER and thinking of using its Java APIs to extract a postal address from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills.
The approach I have in mind is:
Define the postal address as a named entity using LOCATION and other primitive named entities.
Define segmentation and other sub-processes.
I am trying to find an example pipeline for this (the detailed steps required). Has anyone done this before? Suggestions welcome.
To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.
Raj Vardhan wrote about the plan to work on "finding street address in a sentence":
Here is an approach I have thought of:
Find the event-anchor in a sentence
Select outgoing-edges in the SemanticGraph from that event-node with relations such as "prep-in" or "prep-at".
IF the dependent value in the relation has POS tag as NNP
a) Find outgoing-edges from dependent value's node with relations such as "nn"
b) Connect all such nodes in increasing order of occurrence in the sentence.
c) PRINT resulting value as Location where the event occurred
This is obviously with certain assumptions such as direct dependency between the event-anchor and location in a sentence.
Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).

freebase MQL query to match any field

My application receives a number of strings as input: the name of the "object" I'm looking for, plus other values such as the year an artist was born or the last album they made.
However, the application has no knowledge of the type of the object, so what I'm trying to do is write an MQL query that, given the name of the object and other values (in any field, as I don't know the type of what I'm querying), returns the type of what I searched for. Once I have its type, I could, for example, write a better query asking for specific fields.
As we all know, an example is worth a thousand words, so let's assume my input is "The Police" and "So Lonely", one of their songs. I only know that "The Police" is the name of what I'm looking for; I know nothing about "So Lonely", so I need to include it in the query somehow to get better results, without knowing its type.
My first basic query is:
[{
  "name": "The Police",
  "type": []
}]
and it works, but I can't refine my search by including "So Lonely", which could narrow the search output.
Any hints?
Another thing I need to accomplish once the above works (and I still have no idea how to do it!) is printing a summary of the entity I found. In the example above, I would not want ALL the information about "The Police", only basic general fields, such as their albums or the year the band was formed.
Is it possible to do something like that without knowing the type of what I found?

Lucene search on a Hibernate List field

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, if a user searches on TestClass for "Foo Snafu", the record will be found, I am guessing because the token Foo and the token Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. You could just append the list index to the field name you are already using. While I don't have complete information on your needs, of course, doing something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your PhraseQuery doesn't cross from one list item to the next (i.e., if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.
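To see why a gap or delimiter between list members stops a phrase from matching across them, here is a small plain-Java simulation of token positions. It does not use Lucene at all: the class PositionGapDemo, its methods and the whitespace tokenisation are assumptions for illustration, and the matcher only models an exact phrase (no slop).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PositionGapDemo {
    // Records, for every term, the positions it occupies when all values are
    // appended to one field. 'gap' extra positions are skipped between values.
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String value : values) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                positions.computeIfAbsent(token, t -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += gap; // a gap of 0 means values run straight into each other
        }
        return positions;
    }

    // Exact phrase match: the terms must sit at consecutive positions.
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String phrase) {
        String[] terms = phrase.toLowerCase(Locale.ROOT).split("\\s+");
        for (int start : positions.getOrDefault(terms[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions.getOrDefault(terms[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

Indexing "Bar Foo" and "Snafu Bar" with a gap of 0 puts foo at position 1 and snafu at position 2, so the phrase "Foo Snafu" matches even though it spans two list members. With a gap of, say, 100, snafu moves to position 102 and the phrase no longer matches, while "Snafu Bar" still does.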

Searching multiple fields with Lucene

I'm having some trouble with a search I'm trying to implement. I need for a user to be able to enter a search query into a web interface and for the back-end Java to search for the query in a number of fields. An example of this might be best:
Say I have a List containing "Person" objects. Say each object holds two String fields about the person:
FirstName: Jack
Surname: Smith
FirstName: Mary
Surname: Jackson
If a user enters "jack", I need the search to match both objects, the first on FirstName and the second on Surname.
I've been looking at using a MultiFieldQueryParser but can't get the fields set up right. Any help on this or pointing to a good tutorial would be greatly appreciated.
MultiFieldQueryParser is what you want, as you say.
Make sure:
The field names are always used consistently
The same Analyzer is used on both fields, and also on the query parser
You won't find partial words by default, so if you search for jack you won't find jackson. (You can search for jack* in that case.)
Regarding field names, I always set up an enum for them and use e.g. MyFieldEnum.firstname.name() when passing field names to Lucene. That way the compiler catches spelling mistakes, the enum is a good place to put Javadoc explaining what each field is for, and it gives you a single place to see the complete list of fields you support in your Lucene documents.
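The enum idea and the point about partial words can be sketched without Lucene as below. This is a hand-rolled stand-in, not MultiFieldQueryParser: the class MultiFieldSearchDemo, the person helper and the search method are hypothetical, each field is treated as a single lowercase token, and only a trailing * is understood as a prefix wildcard.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MultiFieldSearchDemo {
    /** Field names; use name() wherever a field name string is required. */
    enum PersonField { firstname, surname }

    // Builds a tiny "document" keyed by the enum rather than by raw strings.
    static Map<PersonField, String> person(String first, String last) {
        Map<PersonField, String> doc = new EnumMap<>(PersonField.class);
        doc.put(PersonField.firstname, first);
        doc.put(PersonField.surname, last);
        return doc;
    }

    // Returns the indexes of documents where any field matches the query.
    // "jack" must equal a whole field value; "jack*" matches any value starting with "jack".
    static List<Integer> search(List<Map<PersonField, String>> docs, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        boolean prefix = q.endsWith("*");
        String term = prefix ? q.substring(0, q.length() - 1) : q;
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            for (PersonField field : PersonField.values()) {
                String value = docs.get(i).get(field).toLowerCase(Locale.ROOT);
                if (prefix ? value.startsWith(term) : value.equals(term)) {
                    hits.add(i);
                    break; // one matching field is enough for this document
                }
            }
        }
        return hits;
    }
}
```

For the two people in the question, searching for "jack" finds only Jack Smith, while "jack*" finds both Jack Smith and Mary Jackson, which mirrors the behaviour described for whole-term versus wildcard queries.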
