NLP discover city, state and Name from the given text [closed] - java

I have a text file which was generated from an image using OCR (optical character recognition). The file contains records of information, where part of each record is text of the format customer name, city, and state. A sample of the text is below:
Benjamin Meeks Decatur , GA
Sometimes the text is split across multiple lines, but it is always in the given order. I have a static list of cities and states, but some cities and states may still fall outside that list. The comma between the city and the state may or may not be present. The cities and states are mostly from the USA, UK, Canada, Australia, etc.
A friend told me that natural language processing can be used to extract these categories of text from the given input. I am new to NLP, so I am looking for suggestions on which NLP techniques I can apply to extract the city, state, and name.
I have googled, and Apache OpenNLP seems to be a good library.
Thanks.

If you want to start with NLP, I think OpenNLP is a good choice; another Java option is Stanford CoreNLP. If you are familiar with Python, then go with NLTK.
For your problem, Named Entity Recognition (NER) is what you should look into. It is better to first learn the basics of NLP and then apply this specific technique.
The OpenNLP manual already has a chapter about this; as you can see, you can also train your own model to recognize exactly what you want, using machine-learning techniques.
For OpenNLP there already exist pre-trained models for Location, Organization, Person, etc. (here)
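To make that concrete, here is a minimal Java sketch that runs the pre-trained OpenNLP person and location models over your sample line. The model file names and paths are assumptions based on the standard downloads, so adjust them to wherever you save the models.
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class NerExample {
    public static void main(String[] args) throws Exception {
        // Pre-trained models downloaded from the OpenNLP models page
        // (file names are assumptions; adjust the paths to your setup).
        try (InputStream personStream = new FileInputStream("en-ner-person.bin");
             InputStream locationStream = new FileInputStream("en-ner-location.bin")) {

            NameFinderME personFinder = new NameFinderME(new TokenNameFinderModel(personStream));
            NameFinderME locationFinder = new NameFinderME(new TokenNameFinderModel(locationStream));

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize("Benjamin Meeks Decatur , GA");

            printSpans("person", personFinder.find(tokens), tokens);
            printSpans("location", locationFinder.find(tokens), tokens);
        }
    }

    private static void printSpans(String label, Span[] spans, String[] tokens) {
        for (Span span : spans) {
            StringBuilder sb = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                sb.append(tokens[i]).append(' ');
            }
            System.out.println(label + ": " + sb.toString().trim());
        }
    }
}
Note that mapping the detected location back onto your static city/state list (and deciding which part is city and which is state) is still something you have to do yourself on top of the model output.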

Related

Algorithm of crawling Top10 PR/Alexa sites [closed]

I'm trying to write a script which will crawl the current top 10 PR/Alexa sites. Since PR/Alexa rankings change frequently, my script should take care of that: a site that is not in the top 10 today could be there tomorrow.
I don't know how to start. I know crawling concepts, but here I'm stuck. It could also be the top 50 or even the top 500 sites, which I can of course configure.
I read about the Google spider, but it's very complicated for this simple task. How do Google, Yahoo and Bing crawl billions of sites around the web? I'm just curious: what is the starting point, I mean how can Google identify a newly launched site?
OK, these are very deep details which I can read about later. Right now I'm more concerned about my problem: how could I crawl the top 10 PR sites?
Can you provide a sample program so that I can understand better?
It's rather simple to fetch the top 25 sites (if I understood correctly what you want to do).
Code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the Alexa top-sites page and print the name of each listed site.
b = BeautifulSoup(urlopen("http://www.alexa.com/topsites").read(), "html.parser")
paragraphs = b.find_all('p', {'class': 'desc-paragraph'})
for p in paragraphs:
    print(p.a.text)
Output:
Google.com
Facebook.com
Youtube.com
Yahoo.com
Baidu.com
Wikipedia.org
(...)
But bear in mind that the law in some countries may be stricter about this kind of scraping. Do it at your own risk.
Alexa has a paid API you can use
There is also a free API (though I haven't been able to find any documentation for it anywhere).
http://data.alexa.com/data?cli=10&url=%YOUR_URL%
You can also query for more data the following way:
http://data.alexa.com/data?cli=10&dat=snbamz&url=%YOUR_URL%
The letters in dat determine which info you get; this dat string is the one I've found which seems to return the most options. Also, cli changes the output completely: this option makes it return an XML document with quite a lot of information.
EDIT: This API is the one used by the Alexa toolbar.
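As a rough illustration, here is a minimal Java sketch that calls that endpoint and reads the rank out of the returned XML. The POPULARITY element and its TEXT attribute are what the toolbar XML has typically contained, but the format is undocumented, so treat those names as assumptions and inspect the actual response first.
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AlexaRank {
    public static void main(String[] args) throws Exception {
        String site = "stackoverflow.com"; // example site
        URL url = new URL("http://data.alexa.com/data?cli=10&dat=snbamz&url=" + site);

        // Parse the returned XML document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(url.openStream());

        // The rank has typically been exposed as <POPULARITY TEXT="..."/>
        // (assumption: verify against the actual response).
        NodeList popularity = doc.getElementsByTagName("POPULARITY");
        if (popularity.getLength() > 0) {
            Element el = (Element) popularity.item(0);
            System.out.println(site + " rank: " + el.getAttribute("TEXT"));
        } else {
            System.out.println("No rank information found for " + site);
        }
    }
}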

Best way to output Stanford NLP results [closed]

Hi folks: I'm using the Stanford CoreNLP software to process hundreds of letters by different people (each about 10KB). After I get the output, I need to further process it and add information at the level of tokens, sentences, and letters. I'm quite new to NLP and was wondering what the typical or best way would be to output the pipeline results from Stanford CoreNLP to permit further processing?
I'm guessing the typical approach would be to output to XML. If I do, I estimate that will take about a GB of disk space, and I wonder, then, how quick and easy it would be to load that much XML back into Java for further processing and adding of information?
An alternative might be to have CoreNLP serialize the annotation objects it produces and load those back for processing. An advantage: not having to figure out how to convert a sentence parse string back into a tree for further processing. A disadvantage: annotation objects contain a lot of different types of objects that I'm still quite rough on manipulating, and the documentation on these in Stanford CoreNLP seems slim to me.
This is really a matter of what you want to do afterwards. Serialization is probably the most straightforward and fastest approach; the downside is that you need to understand the CoreNLP data structures.
If you want to read the output in another language or load it into your own data structures, save it as XML instead.
I would go with the first option.
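To make the two options concrete, here is a minimal sketch, assuming the classic Annotation pipeline with the standard English models on the classpath. It writes the same document once as XML and once as a serialized Java object; treat it as a starting point, not the recommended setup.
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class OutputExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Benjamin Meeks lives in Decatur, GA.");
        pipeline.annotate(document);

        // Option 1: XML output (portable, but large and has to be re-parsed later).
        try (FileOutputStream xmlOut = new FileOutputStream("letter.xml")) {
            pipeline.xmlPrint(document, xmlOut);
        }

        // Option 2: Java serialization of the Annotation (compact and quick to reload,
        // but ties you to the CoreNLP data structures).
        try (ObjectOutputStream objOut =
                 new ObjectOutputStream(new FileOutputStream("letter.ser"))) {
            objOut.writeObject(document);
        }
    }
}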

social media slang identifier [closed]

I am doing a project on a social media slang identifier. I have to identify abbreviations in different comments. The problem is that in one comment "GM" may mean "Good Morning", while in another comment the same "GM" means "General Manager".
So I need to differentiate between these two meanings, even though the text looks the same in both cases (i.e. GM).
I am really confused by this problem and have no idea how to approach it.
Can anyone help me with this?
This is a hard problem. You need some semantic algorithm to make this distinction.
You cannot infer the meaning just from the syntax or just from the textual representation.
Google "disambiguation natural language processing". You will see lots of resources.
This is just to give you a hint. As said the problem is broad and complex.
This sounds like a very complex issue.
From my understanding, you would need quite a large dictionary of these abbreviations and also the lexical field (a.k.a. semantic field) in which each of them is used.
To detect the lexical field you could also group the speakers into "work related", "colleagues from university" or "drinking buddies", and maybe define a standard for these groups, so that data from other users can be reused. To get a feel for this, it may help to think of a close synonym of slang: argot.
So, for instance, if someone says "the GM's feedback was actually pretty good", not only do you understand that GM is used as an ordinary noun, but "feedback" also belongs to the "business" lexical field.
The actual time frame and the data you'd be working with would be useful to know, and I will edit this answer accordingly.
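As a toy illustration of that lexical-field idea (the keyword lists below are invented for the example, not taken from any real resource), you could score each expansion of an abbreviation by how many of its field keywords appear in the comment:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SlangDisambiguator {
    // Hypothetical mini-dictionary: each expansion of "GM" mapped to keywords
    // from its lexical field (made up purely for illustration).
    private static final Map<String, Set<String>> FIELDS = Map.of(
        "Good Morning", Set.of("morning", "today", "sleep", "coffee", "everyone"),
        "General Manager", Set.of("feedback", "meeting", "company", "report", "office")
    );

    static String disambiguate(String comment) {
        Set<String> words = new HashSet<>(Arrays.asList(comment.toLowerCase().split("\\W+")));
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : FIELDS.entrySet()) {
            int score = 0;
            for (String keyword : entry.getValue()) {
                if (words.contains(keyword)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(disambiguate("the GM's feedback was actually pretty good"));
        System.out.println(disambiguate("GM everyone, hope you had your coffee"));
    }
}
A real system would replace the hand-written keyword sets with something learned from data (word embeddings, co-occurrence statistics, or per-group language models), but the scoring-by-context principle is the same.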

Turn HTML into XML and parse it -- Android Apps [closed]

I have been learning how to build Android apps this summer. I am currently trying to work on XML parsing, which falls under Java in this case. I have a few questions that are mostly conceptual and one specific one.
First, most of the examples I have seen use pages that are already in XML. Can I take a page in regular HTML format, have the program turn it into XML, and then parse it? Or is that what is normally done anyway?
Secondly, I could use a little explanation of how the parser actually works and stores the data, so that I know how to extract it from whatever it is saved in once the parsing is done.
For my specific example, I am trying to work with some weather data from the NWS. My program will take the data from this page and, after some user input, take you to a page like this, which sometimes has various alerts. I want to select certain ones. This is what I could use help with; I haven't really coded anything for it yet because I don't know what I am doing.
If I need to clarify or rephrase anything here, let me know and I am happy to. I am trying to be a good contributor here!
Yes, you can parse HTML and there are many parsers available. There is a question about it here: Parse HTML in Android, and an answer about parsing HTML here: https://stackoverflow.com/a/7114346/826657
It's generally a bad idea, though: HTML tag names aren't descriptive, so you end up writing lots of code that searches attributes for a specific piece of data. Prefer XML whenever it is available; it saves a lot of code and time.
Here is a post from Coding Horror which explains why parsing HTML this way is, in general, a bad idea:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Here is an article which explains parsing an XML document using XmlPullParser: http://www.ibm.com/developerworks/library/x-android/
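As a rough sketch of that approach (the feed URL below is a placeholder and the <title> tag is just an example; check the actual NWS feed you end up using and adjust the tag names), pulling element text with XmlPullParser looks roughly like this:
import java.io.InputStream;
import java.net.URL;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

public class AlertTitles {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at the actual XML/Atom alert feed you use.
        URL feed = new URL("https://example.com/alerts.xml");
        try (InputStream in = feed.openStream()) {
            XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
            parser.setInput(in, null);

            int event = parser.getEventType();
            while (event != XmlPullParser.END_DOCUMENT) {
                if (event == XmlPullParser.START_TAG && "title".equals(parser.getName())) {
                    // nextText() returns the element's text and advances to its END_TAG.
                    System.out.println(parser.nextText());
                }
                event = parser.next();
            }
        }
    }
}
On Android the XmlPullParser classes are bundled with the platform; in plain Java you would add a pull-parser implementation such as kXML to the classpath.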

Wikipedia articles' first sentence and Java [closed]

I have to find the definitions of >200 words. I would like to use Wikipedia to search for the article titled with a given word from the list and then extract the raw text of its definition (the first sentence of the article).
In my project, I have a JList which contains words (simple and compound). I would like to find a definition for each word using Wikipedia (I chose this encyclopedia because the words are extracted from a specialized corpus).
My question is: how can I retrieve definitions from a Wikipedia dump? I found JWPL but I did not find an example which helps me to use it.
Another question is: if I have Wikipedia offline (using WikiTaxi), how can I extract definitions from it using Java?
Wikipedia content is Creative Commons licensed (see their terms of use for what is permissible).
Wikipedia does already have an API, which would probably be better for your purposes than developing your own. More info on the API here.
The other thing worth considering: if you want definitions, perhaps you would be better off using Wiktionary? Wiktionary also has its own API.
Here is an example API Call to get the wiki text on "stack overflow"
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=stack_overflow&rvprop=content
Here is an example query to return the word "stack" from wiktionary:
http://en.wiktionary.org/w/api.php?action=query&prop=revisions&titles=stack&rvprop=content
You may still need to parse the output, but it gets you what you want...
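As an illustration, here is a rough Java sketch that asks the API for a plain-text intro extract instead of raw wikitext. It relies on the TextExtracts parameters (prop=extracts&exintro&explaintext) being available on en.wikipedia, and it simply prints the JSON response rather than parsing it, so treat it as a starting point.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiDefinition {
    public static void main(String[] args) throws Exception {
        String word = "stack overflow"; // example word from the list
        String url = "https://en.wikipedia.org/w/api.php?action=query"
                + "&prop=extracts&exintro&explaintext&exsentences=1&format=json"
                + "&titles=" + URLEncoder.encode(word, StandardCharsets.UTF_8.name());

        // Print the raw JSON response; a JSON library (e.g. org.json) would be
        // used to pull out the "extract" field in a real program.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}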
If you wanted to do a quick and dirty screen scrape, their URLs are fairly easy to construct. The url would basically be http://en.wikipedia.org/wiki/ + a sanitized word (e.g. spaces replaced with _ etc)
An example url made up on the spot would be http://en.wikipedia.org/wiki/Stack_overflow which will take you directly to the Stack Overflow entry on wikipedia.
The body content in Wikipedia begins at the comment <!-- bodycontent --> and is contained within a div with the id mw-content-ltr. You would likely be looking for the first <p> tag.
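If you do go the scraping route, a small jsoup sketch might look like the one below; whether mw-content-ltr appears as an id or a class has varied, and Wikipedia's markup changes over time, so verify the selector against the live page first.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikiScrape {
    public static void main(String[] args) throws Exception {
        // Sanitize the word as described above: spaces become underscores.
        String word = "Stack overflow";
        String url = "https://en.wikipedia.org/wiki/" + word.trim().replace(' ', '_');

        Document doc = Jsoup.connect(url).get();

        // Grab the first paragraph inside the main content container
        // (container name taken from the answer above; it may have changed).
        Element firstParagraph = doc.select("#mw-content-ltr p, .mw-content-ltr p").first();
        if (firstParagraph != null) {
            System.out.println(firstParagraph.text());
        }
    }
}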
