Let us say a user is typing text in an EditText. As the user types, I want to extract keywords from that text.
For example, if the user types "I am having headache", it should extract "headache" as a keyword.
Please let me know how I can do this efficiently in Android.
Update: I do not know what the keywords are; they have to be extracted from the text the user enters.
First of all, you should define what you will consider keywords. Either:
a. A limited list of words which are the keywords.
b. A limited list of words which are not the keywords.
That list can be in an ArrayList<String> in your code.
When the user changes the text in the EditText (see EditText.addTextChangedListener(new TextWatcher(){...})), get the text and split() it into a String[] using the space character as a delimiter. Then search for each word of the array in your list (option a or b above) to check whether it is there or not. When you get a hit, you have found a keyword entered by the user.
The resulting keywords can be kept temporarily in another ArrayList<String> for you to use after you finish scanning the input.
Note: I have proposed an ArrayList to keep the list, considering that it won't be a long one. For more complex scenarios the list can be kept in a HashMap or a TreeMap, along the lines of what #Deepakkaku commented, so the search is quicker.
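A minimal sketch of this approach, assuming a hypothetical R.id.input EditText and an illustrative keyword list (a HashSet is used for the quicker lookup mentioned in the note):

    import android.text.Editable;
    import android.text.TextWatcher;
    import android.widget.EditText;
    import java.util.*;

    // Illustrative keyword list (option a above).
    final Set<String> keywords = new HashSet<>(Arrays.asList("headache", "fever", "cough"));
    final List<String> found = new ArrayList<>();

    EditText input = (EditText) findViewById(R.id.input); // R.id.input is a placeholder
    input.addTextChangedListener(new TextWatcher() {
        @Override public void beforeTextChanged(CharSequence s, int start, int count, int after) {}
        @Override public void onTextChanged(CharSequence s, int start, int before, int count) {}

        @Override
        public void afterTextChanged(Editable s) {
            found.clear();
            // Split on whitespace and check each word against the list.
            for (String word : s.toString().toLowerCase().split("\\s+")) {
                if (keywords.contains(word)) {
                    found.add(word); // a keyword entered by the user
                }
            }
        }
    });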
There can be two approaches to this problem:
Hardcoding the keywords or non-keywords you are interested in. #Juan's answer is the way to go here.
Second option is using some machine-learned model, which I guess is what you are looking for, given your machine-learning tag.
Option 1 requires a set of keywords defined ahead of time, which your question says you don't have, so it won't work in this case. Here's a solution for option 2.
Create a model.
You have to create a dataset of labeled examples.
You have to define a vocabulary for your entire dataset.
You have to define and train a model. If you have enough data, you can start from scratch. Otherwise, it is recommended to use transfer learning: for example, look at NLP models such as word2vec or sentiment-analysis models, and read up on transfer learning. TF Hub makes transfer learning easy.
Once you have trained the model, you have to work out how to convert it to run efficiently on Android for inference. You have choices in TensorFlow Lite, Caffe2, etc. If you use TensorFlow, it is recommended to convert the model to TensorFlow Lite for efficient inference.
You have to build your Android app with the appropriate runtime (TFLite, Caffe2, etc.) and bundle the model in. You can use ML Kit to take care of the download for you if you don't want to bundle.
Add the hooks to the model in your activity by listening for changes in your EditText and calling the model inference. You likely want the interpreter to be loaded ahead of time, before the first inference call, for efficiency, as in the sketch below.
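A rough sketch of the inference hook with TensorFlow Lite; the model file name "keywords.tflite", the input encoding, and VOCAB_SIZE are assumptions, since a real model defines its own shapes:

    import android.content.res.AssetFileDescriptor;
    import org.tensorflow.lite.Interpreter;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Inside your Activity. Load the interpreter once, ahead of the first call.
    private Interpreter interpreter;

    private void loadModel() throws IOException {
        // "keywords.tflite" bundled under assets/ is a placeholder name.
        AssetFileDescriptor fd = getAssets().openFd("keywords.tflite");
        FileInputStream in = new FileInputStream(fd.getFileDescriptor());
        MappedByteBuffer model = in.getChannel().map(FileChannel.MapMode.READ_ONLY,
                fd.getStartOffset(), fd.getDeclaredLength());
        interpreter = new Interpreter(model);
    }

    // Called from a TextWatcher whenever the EditText changes.
    // 'encoded' is the text converted to your model's input encoding.
    private float[] classify(float[] encoded) {
        float[][] output = new float[1][VOCAB_SIZE]; // VOCAB_SIZE is hypothetical
        interpreter.run(new float[][]{ encoded }, output);
        return output[0];
    }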
I want to code a simple project in Java in order to keep track of my watched/owned TV shows, movies, books, etc.
Searching and retrieving the metadata from an API (themovieDB, Google Books) is already working.
How would I store some of this metadata together with user-input (like progress or rating)?
I'm planning on displaying the data in a table-like form (example). Users should also be able to search the local data by multiple attributes. Is there an easy way to do this? I've already thought about a database, since that seemed like the easiest solution.
Any suggestions?
Thanks in advance!
You can use a lightweight database such as H2, HSQLDB or SQLite. These databases can be embedded in the Java app itself and do not require a separate server.
If you have little data, you can also save it as XML or JSON using any XML or JSON parser (e.g. Gson).
Your DB table will have various attributes fetched from the API as well as user inputs. You can write queries on top of these databases to fetch and show the various results.
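For the JSON option mentioned above, a minimal sketch with Gson; the MediaEntry fields are illustrative, not a fixed schema:

    import com.google.gson.Gson;
    import com.google.gson.reflect.TypeToken;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative entry combining API metadata and user input.
    class MediaEntry {
        String title;    // from the API
        String type;     // "movie", "show", "book", ...
        int rating;      // user input
        String progress; // user input
    }

    class Library {
        private static final Gson GSON = new Gson();

        static void save(List<MediaEntry> entries, Path file) throws IOException {
            Files.write(file, GSON.toJson(entries).getBytes(StandardCharsets.UTF_8));
        }

        static List<MediaEntry> load(Path file) throws IOException {
            String json = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return GSON.fromJson(json, new TypeToken<List<MediaEntry>>(){}.getType());
        }
    }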
Either write everything to files, or store everything on a database. It depends on what you want though.
If you choose to write everything to files, you'll have to implement both the writing and the reading to suit your needs. You'll also have to deal with read/write bugs and performance issues yourself.
If you choose a database, you'll just have to implement the high level read and write methods, i.e., the methods that format the data and store it on the appropriate tables. The actual reading and writing is already implemented and optimized for performance.
Overall, databases are usually the smart choice. Be careful which one you choose, though: some types might be better for reading, while others are better for writing. You should carefully evaluate what's best for your problem's domain.
There are many ways to accomplish this but as another user posted, a database is the clear choice.
However, if you're looking to make a program to learn with, or something simple for personal use, you could also use a multi-dimensional array of strings to hold the name of the program as well as any other metadata fields, and treat the array like a table in Excel. This is not the best way to do it, but you can get away with very simple code. To search, you would only need to loop through the array elements and check that the name of the program (i.e. movieArray[x][0]) matches the search string. Once located, you can perform actions on or edit the other array indexes pertaining to that movie.
For a little more versatility, you could create a class to hold the movie information, with fields for any metadata. The advantage here is that the metadata fields can be of different types, rather than having to conform to the array's type, and they're packaged together in an instance of the class. If you're getting the info from an API, you can create or update the objects from the API response. These objects can be stored in an ArrayList and searched with a loop that checks for a certain value, e.g.:
for (Movie m : movieArrayList) {
    if (m.getTitle().equals("Arrival")) {
        return m;
    }
}
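For reference, a minimal Movie class along those lines (the fields are illustrative):

    // Illustrative holder class; fields can be of any type, unlike a String[][].
    public class Movie {
        private final String title;
        private final double rating;

        public Movie(String title, double rating) {
            this.title = title;
            this.rating = rating;
        }

        public String getTitle()  { return title; }
        public double getRating() { return rating; }
    }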
Alternatively, of course, for large scale a database would be the best answer, but it all depends on what this is really for and what its needs will be in the real world.
I have been working on an app and have encountered some limitations related to my lack of experience with Java I/O and data persistence. Basically, I need to store information on a few Spinner objects. So far I have saved the information for each Spinner into a text file using the format:
//Blank Line
Name //the first drop-down entry of the spinner
Type //an enum value
Entries //a semicolon-separated list of the drop-down entry String values
//Blank line
And then, assuming this rigid syntax is always followed, I've extracted this information from the saved .txt whenever the app is started. But things such as editing these entries and working with certain aspects of the Scanner have been an absolute nightmare. If anything is off by even one line or space of blankness, BAM! everything is ruined. There must be a better way to store information for easy access: something with some searchability, something that won't be erased the moment the app closes, and that isn't so lax in its layout that the most minor of changes destroys everything.
Any recommendations for how to save a simple String, a simple int, and an array of Strings outside the app? I am looking for a recommendation from an experienced developer here. I have seen the storage options, but am unsure which would be best for just a few simple things. Everything I need could be represented in a 3 × n table, where n is the number of spinners.
Since your requirements are so minimal, I think the shared-preferences approach is probably the best option. If your requirements were more complicated, then using a database would start to make more sense.
Using shared preferences for simple data like yours really is as simple as the example shown on the storage options page.
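A minimal sketch, assuming placeholder file and key names; the entries array is joined on ';' (matching your current format) because a StringSet would not preserve order:

    import android.content.Context;
    import android.content.SharedPreferences;
    import android.text.TextUtils;

    // 'context', 'name', 'type' and 'entries' stand in for your own values.
    SharedPreferences prefs = context.getSharedPreferences("spinners", Context.MODE_PRIVATE);

    // Save: the name, the enum's ordinal, and the entries joined on ';'.
    prefs.edit()
         .putString("name", name)
         .putInt("type", type.ordinal())
         .putString("entries", TextUtils.join(";", entries))
         .apply();

    // Load.
    String savedName      = prefs.getString("name", "");
    int savedType         = prefs.getInt("type", 0);
    String[] savedEntries = prefs.getString("entries", "").split(";");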
I was wondering if you know any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers with some specific keywords defined. How could I do an automatic mapping, so that each reviewer reviews the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website):
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing one according to your needs. You need to scan the papers and, according to your keywords, search and match the tokens accordingly. Then the sentences with these keywords can be separated out and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword you need will fall under some part of speech. You can then tag your complete document, search for those tags, and sort out the sentences accordingly.
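A small sketch with the Stanford tagger; the model path is a placeholder for whichever tagger model you download:

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    // Placeholder path to a downloaded tagger model.
    MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

    // Produces tokens like "algorithm_NN" or "distributed_JJ".
    String tagged = tagger.tagString("A distributed algorithm for wireless sensor networks");

    // Keep only the nouns (NN*) as candidate keywords.
    for (String token : tagged.split("\\s+")) {
        String[] parts = token.split("_");
        if (parts.length == 2 && parts[1].startsWith("NN")) {
            System.out.println(parts[0]);
        }
    }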
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, like filters and analyzers. For example, you can:
remove the stop words depending on the language of the documents (e.g. for english: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are considered as a single instance);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask for some code samples if Lucene is the way you choose :)
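For instance, a minimal index-and-search sketch; the field names are illustrative, and this follows the Lucene 5.x-era API (details shift between versions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.RAMDirectory;

    StandardAnalyzer analyzer = new StandardAnalyzer();
    RAMDirectory index = new RAMDirectory(); // in-memory; use FSDirectory for disk

    // Index each paper's keywords as a document.
    IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer));
    Document doc = new Document();
    doc.add(new TextField("title", "Some Paper", Field.Store.YES));
    doc.add(new TextField("keywords", "machine learning; classification", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    // Search with a reviewer's keywords; the top hits are candidate assignments.
    Query query = new QueryParser("keywords", analyzer).parse("classification");
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
    for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("title"));
    }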
Background:
I am developing a program that iterates over all the movies & TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.
I extract the movie name by removing all the unnecessary text, such as '.avi', '720p', etc., from the file name.
I am using Java.
Problem:
Some folders contain movie files such as:
Episode 301 Rainforest Schmainforest.avi
Episode 302 Spontaneous Combustion.avi
The word 'Episode' and numbers are valid and common words in movie titles, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.
Another folder might be:
720p.S5.E1.cripple fight.avi
720p.S5.E2.towelie.avi
Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these arbitrary patterns so I can extract the keywords. It would be unfeasible to write a regex for each case.
Summary:
Is there a tool or API that I can use to find complex repetitive patterns (it must be able to match sequences of numbers)? [something like a longest-common-subsequence library]
Well, you could simply take all the filtered names in your dir, and do a simple word-count. You could give extra weight to words that occur in (roughly) the same spot every time.
In the end you'd have a count and a weight, and you'd need to decide where to draw the line. It probably won't be every file in the dir (because of images or samples, maybe), but if most files share a certain word, it's not "the" or something like that, and maybe it always appears "at the start" or "in the second spot", you can filter it.
But this wouldn't work for, random example, Friends episodes. They're all called "The one where...". That would be filtered out by every sane version of your sought-after algorithm.
The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.
The only thing you can do is make a blacklist of stuff you want to filter, like you already seem to do with the avi/720p thing.
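Still, if you want to try the word count itself, here is a sketch; the directory path and the "more than half the files" cutoff are arbitrary choices:

    import java.io.File;
    import java.util.*;

    File dir = new File("/path/to/show"); // placeholder path
    String[] names = dir.list();

    // Count how many file names each word occurs in.
    Map<String, Integer> counts = new HashMap<>();
    for (String name : names) {
        String cleaned = name.replaceAll("\\.[^.]+$", "")  // strip the extension
                             .replaceAll("[._]", " ")      // separators to spaces
                             .toLowerCase();
        for (String word : new HashSet<>(Arrays.asList(cleaned.split("\\s+")))) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Words appearing in more than half the files count as repetitive.
    Set<String> repetitive = new HashSet<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > names.length / 2) {
            repetitive.add(e.getKey());
        }
    }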
I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.
Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.
Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately quite a few of the algorithms that are developed in active research areas are often encumbered by patents and such...
I have a kind-of answer posted here: http://pastebin.com/Eb0cQyKd
I wanted to remove non-unique parts of file names, such as '720dpi', 'Episode', 'xvid', 'ac3', without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept changing.
The code linked above uses Python (not Java) to remove all non-unique words in a file name.
Basically it creates a list of all the words used in the file names, and any word that comes up in most of the files is put into a dictionary. Then it iterates through the files and deletes all these dictionary words from them.
The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.
I have used it a lot recently and it works well.
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: Moneyamount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and have it learn from the input whenever the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or use the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% would cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// paramInput is the comment the user typed.
String myString = paramInput.toUpperCase(); // case-insensitive match
if (myString.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you will only have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and change the value of the Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
P.S. Neural network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching, I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit?
A machine learning algorithm such as an artificial neural network (ANN) doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. "which of the provided pre-trained classes does the input represent?", not just "does the input represent an X?"), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available, to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There is no higher-level information provided that describes the word's meaning. This means that a different word which hashes to a numerically close value can be misclassified if there is insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand, an easy solution to program would be to maintain a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of Sections that the typed keywords belong to, suggest several Sections, and let the user pick one; the feedback you get from user selections would improve future suggestions. Another is to calculate a Bayesian probability (the probability that a word belongs to Section X given the previously stored mappings) for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread.
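A sketch of the stored-mapping idea; the stop-word list is illustrative, and a full Bayesian treatment would normalise these counts as described above:

    import java.util.*;

    public class SectionSuggester {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("and", "or", "the", "at", "a", "for"));

        // keyword -> (Section -> number of co-occurrences)
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        // Call whenever the user confirms a comment/Section pair.
        public void learn(String comment, String section) {
            for (String word : comment.toLowerCase().split("\\s+")) {
                if (STOP_WORDS.contains(word)) continue;
                counts.computeIfAbsent(word, k -> new HashMap<>())
                      .merge(section, 1, Integer::sum);
            }
        }

        // Suggest the Section most often seen with the comment's keywords.
        public Optional<String> suggest(String comment) {
            Map<String, Integer> votes = new HashMap<>();
            for (String word : comment.toLowerCase().split("\\s+")) {
                Map<String, Integer> perSection = counts.get(word);
                if (perSection != null) {
                    perSection.forEach((sec, n) -> votes.merge(sec, n, Integer::sum));
                }
            }
            return votes.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey);
        }
    }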