Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
I have to find the definitions of more than 200 words. I would like to use Wikipedia to look up the article titled with each word from the list and then extract the raw text of its definition (the first sentence of the article).
In my project, I have a JList that contains words (simple and compound). I would like to find a definition for each word using Wikipedia (I chose this encyclopedia because the words are extracted from a specialized corpus).
My question is: how can I retrieve definitions from a Wikipedia dump? I found JWPL, but I could not find an example that shows how to use it.
Another question is: if I have Wikipedia offline (using WikiTaxi), how can I extract definitions from it using Java?
Wikipedia is Creative Commons licensed (see their terms of use for what is permissible).
Wikipedia already has an API, which would probably suit your purposes better than developing your own. More info on the API here.
The other thing worth considering: if you want definitions, perhaps you would be better off using Wiktionary? Wiktionary also has its own API.
Here is an example API Call to get the wiki text on "stack overflow"
Here is an example query to return the word "stack" from wiktionary:
You may still need to parse the output, but it gets you what you want...
If you wanted to do a quick and dirty screen scrape, their URLs are fairly easy to construct. The url would basically be + a sanitized word (e.g. spaces replaced with _ etc)
An example url made up on the spot would be which will take you directly to the Stack Overflow entry on wikipedia.
The body content in Wikipedia begins at this comment <!-- bodycontent --> and is contained within a div with this id: mw-content-ltr. You would likely be looking for the first <p> tag.
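The URL construction described above can be sketched as follows. This is only an illustration: the class name WikiUrls and the method articleUrl are made up here, and the base URL passed in the example (the English Wikipedia article prefix) is an assumption, since the answer does not spell it out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds an article URL from a word or phrase.
final class WikiUrls {

    // baseUrl is supplied by the caller; the value used in main is an assumption.
    static String articleUrl(String baseUrl, String word) {
        // Trim, then replace each run of whitespace with a single underscore.
        String title = word.trim().replaceAll("\\s+", "_");
        // Percent-encode anything that is not safe in a URL (underscores are left alone).
        return baseUrl + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(articleUrl("https://en.wikipedia.org/wiki/", "stack overflow"));
    }
}
```

For more than 200 words, the same method could be called in a loop over the list model, with the API or a page fetch applied to each resulting URL.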
Closed 7 years ago.
I have an XML document that looks like this:
<title> randomString </title>
<someInfo> info </someInfo>
<subElementTrash> trash </subElementTrash>
<someInfo1> info1 </someInfo1>
<subtrash> trash </subtrash>
<date> 19.03.15 </date>
I need to extract only title, some /subElement1/subInfo, /subElement1/subInfo1 and date. The rest should be automatically stored somewhere, but without the elements that were already extracted. I should also be able to marshal it back to the original XML.
It would be great if it can be done using annotation mapping.
Can someone give me the right direction to search?
You are asking about parsing, but then you want data extraction, data transformation and finally storage in some undefined form. This is a very broad question with many possible approaches.
You can parse XML in Java using DOM, SAX, or StAX.
You can use XPath to extract the interesting information, but it will not divide your document into the interesting bit and the 'rest'.
You can define XSLT templates and run them with a Java Transformer in order to split your input document into the interesting part and 'the rest'.
You can use JAXB to map the XML onto a Java model (using your favourite annotation mapping), and then build further representations containing the interesting part and 'the rest'. You can then save both representations to different XML files.
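One way to combine the DOM and XPath options is to read each interesting value and detach its node, so that whatever stays in the DOM is 'the rest'. The sketch below uses only JDK classes; the class name XmlSplitter, its method names, and the root wrapper element used in the usage note are assumptions, since the question does not show the enclosing element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls selected elements out of a DOM and keeps the remainder.
final class XmlSplitter {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the trimmed text of the first element matching the expression
    // and removes that element from the document; returns null if nothing matches.
    static String extractAndRemove(Document doc, String xpathExpr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node node = (Node) xpath.evaluate(xpathExpr, doc, XPathConstants.NODE);
        if (node == null) {
            return null;
        }
        String value = node.getTextContent().trim();
        node.getParentNode().removeChild(node);
        return value;
    }

    // Serializes what is left in the document (the 'rest') without an XML declaration.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling extractAndRemove(doc, "/root/title") and extractAndRemove(doc, "/root/date") and then serialize(doc) would give the two values plus the leftover XML. Marshalling back to the original would still require remembering where each removed node sat, which this sketch does not do.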
Closed 9 years ago.
I am looking for a standard technology to drive the generation of an XML document based on an XSD and a set of rules. Basically I have XSDs that tell me what the XML should look like and what elements are mandatory or optional. What is not in the XSDs is a set of business rules that say things like "if such element's value is this, that other element is actually mandatory" or "if such element's value is that, that other element should be omitted".
What I have in mind is something that would process the XSDs along with the rules (maybe expressed in something like XPath) and call back my code to generate the mandatory values. The structure of the final document would change dynamically depending on the values of the elements driving the conditions.
I guess I could do something close to what I want with XSLT. I'd generate all the values and then use an XSLT to enforce the conditions. But in my case some values may take a long time to produce, so I want to avoid computing unnecessary values, meaning values that will later be discarded by the business rules.
Does such a technology exist? FYI I am coding in Java but I am hoping to find a generic technology if possible.
The problem you described can probably be handled by Schematron. It can be used with XML Schema, and if you already know XPath and XSLT you won't find it difficult to understand. It can specify complex relationships between unrelated nodes based on values and context, beyond the abilities of XML Schema.
You can find the specification and many tutorials on the Schematron website.
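To give a feel for the kind of condition involved, here is a small sketch that checks one "if this value, then that element is mandatory" rule with a plain XPath 1.0 boolean expression from Java. It is not Schematron itself, only an illustration of the style of XPath assertion such rules are built from; the class name ConditionalRule and the element names order, type and deadline are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one business rule written as an XPath boolean expression.
final class ConditionalRule {

    // Example rule (element names are assumptions):
    // "if type is 'express', a deadline element must be present".
    static final String EXPRESS_NEEDS_DEADLINE =
            "not(/order/type = 'express') or boolean(/order/deadline)";

    static boolean holds(String xml, String assertion) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Boolean) xpath.evaluate(assertion, doc, XPathConstants.BOOLEAN);
    }
}
```

A check like this validates a document after the fact. Avoiding the computation of values that a rule would discard, as the question asks, would mean evaluating the driving conditions before generating the dependent elements.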
Closed 8 years ago.
I have a text file which was generated from an image using OCR (optical character recognition). The file contains records of information, where part of each record contains text in the format: customer name, city and state. A sample of the text is below:
Benjamin Meeks Decatur , GA
Sometimes the text may be split across multiple lines. The text will always be in the given order. I have a static list of cities and states, but some cities and states in the records may not be in the list. The comma between the city and the state may or may not be present. The cities and states are mostly in the USA, UK, Canada, Australia, etc.
A friend told me that natural language processing can be used to extract categories of text from a given input. I am new to NLP, so I am here for suggestions on which NLP techniques I can apply to extract the city, state and name.
I have searched for an NLP library, and Apache OpenNLP seems to be a good one.
If you want to start with NLP, I think OpenNLP is a good choice; another Java option could be Stanford NLP. If you are familiar with Python, then go with NLTK.
As for your problem, I think Named Entity Recognition is what you should look for. It is better to learn the basics of NLP first and then use this specific technique.
However, here you can already find the OpenNLP chapter about this; as you can see, you can also train your own model to recognize exactly what you want, using machine learning techniques.
For OpenNLP there are already some pre-trained models for Location, Organization, Person, etc. (here)
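Before training a model, it may be worth trying a plain dictionary lookup against the static list of cities and states mentioned in the question. The sketch below is not Named Entity Recognition and does not use OpenNLP; it is a simple baseline using only the JDK, and the names RecordSplitter, Parsed and split are invented for the example. It assumes the record ends with the state, preceded by the city, as the question describes.

```java
import java.util.List;

// Hypothetical baseline: splits "name city [,] state" using known city and state lists.
final class RecordSplitter {

    static final class Parsed {
        final String name;
        final String city;
        final String state;

        Parsed(String name, String city, String state) {
            this.name = name;
            this.city = city;
            this.state = state;
        }
    }

    // Returns null when the record does not end with a known city and state.
    static Parsed split(String record, List<String> cities, List<String> states) {
        // Drop commas and collapse whitespace; this also joins records split across lines.
        String text = record.replace(',', ' ').trim().replaceAll("\\s+", " ");
        for (String state : states) {
            if (!endsWithWord(text, state)) {
                continue;
            }
            String beforeState = text.substring(0, text.length() - state.length()).trim();
            // Prefer the longest matching city so that a multi-word city wins over its suffix.
            String best = null;
            for (String city : cities) {
                if (endsWithWord(beforeState, city) && (best == null || city.length() > best.length())) {
                    best = city;
                }
            }
            if (best != null) {
                String name = beforeState.substring(0, beforeState.length() - best.length()).trim();
                return new Parsed(name, best, state);
            }
        }
        return null;
    }

    // True if text ends with word (ignoring case) and the match starts at a word boundary.
    private static boolean endsWithWord(String text, String word) {
        int start = text.length() - word.length();
        if (start < 0 || !text.regionMatches(true, start, word, 0, word.length())) {
            return false;
        }
        return start == 0 || text.charAt(start - 1) == ' ';
    }
}
```

Records for which split returns null are the ones outside the list, and those are the cases where a trained name finder would be needed.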
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See for guidance.
Closed 9 years ago.
I have a huge problem with parsing an XML file to a different format.
I'm trying to get all the related data like stated in this link:
(I searched stackoverflow before and found this link)
I use the interface XMLReader to parse and the XML Serializer for the output.
I just need to convert my XML with a DTD to another XML with a different DTD. The difference is that, instead of elements from my source XML, most of the children are now attributes in the target XML. There are no new elements, only a different arrangement.
Does anyone have an idea how to handle this with a SAX parser?
You can use XMLFilters for that. See Elliotte Rusty Harold's book for explanation and examples:
The basic idea of filters is that an XMLReader, instead of receiving
XML text directly from a file, socket, or other source, receives
already parsed events from another XMLReader. It can change these
events before passing them along to the client application through the
usual methods of ContentHandler and the other callback interfaces. For
example, it can add a unique ID attribute to every element or delete
all elements in the SVG namespace from the input stream.
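The "unique ID attribute" example from the quote can be sketched with XMLFilterImpl from the JDK. This is a minimal illustration rather than the book's code; the class names IdFilterDemo and IdFilter and the id values (e1, e2, ...) are made up here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class IdFilterDemo {

    // Filter that adds a sequential "id" attribute to every start-element event
    // before passing the event on to the downstream ContentHandler.
    static final class IdFilter extends XMLFilterImpl {
        private int next = 1;

        IdFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            AttributesImpl copy = new AttributesImpl(atts);
            copy.addAttribute("", "id", "id", "CDATA", "e" + next++);
            super.startElement(uri, localName, qName, copy);
        }
    }

    // Parses the XML through the filter and records "elementName#id" for each element seen.
    static List<String> run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        List<String> seen = new ArrayList<>();
        IdFilter filter = new IdFilter(reader);
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add(qName + "#" + atts.getValue("id"));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }
}
```

Turning child elements into attributes, as the question asks, would follow the same pattern, except that the filter has to buffer the parent's start event until its children have been read.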
BTW the mkyong tutorial glosses over how the characters method works. That tends to bite a lot of people when they find their element data getting truncated. There's a better tutorial on Oracle's site.
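The truncation happens because a SAX parser may deliver the text of one element in several characters calls. A common way to cope is to append every chunk to a buffer and read it only in endElement, as in this sketch (the class name TextCollector and its methods are invented for the example):

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each leaf element, keyed by element name.
final class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append instead of assigning.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            values.put(qName, text);
        }
        buffer.setLength(0); // avoid leaking this text into the parent element
    }

    static Map<String, String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

If the same element name occurs more than once, later values overwrite earlier ones in this sketch.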
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have been learning how to build android apps this summer. I am currently trying to work on xml parsing which falls under java in this case. I have a few questions that are mostly conceptual and one specific one.
First, most of the examples I have seen use pages that are already in XML. Can I take a page in regular HTML format, have the program convert it to XML, and then parse it? Or is that what is normally done anyway?
Secondly, I could use a little explanation on how the parser actually works and saves the data so I will better know how to use it (extract it from whatever it is saved in), when the parsing is done.
So for my specific example I am trying to work with some weather data from the NWS. My program will take the data from this page, and after some user input take you to a page like this, which sometimes will have various alerts. I want to select certain ones. This is what I could use help with. I haven't really coded anything on that yet because I don't know what I am doing.
If I need to clarify or rephrase anything in here, I am happy to; just let me know. I am trying to be a good contributor on here!
Yes, you can parse HTML, and many parsers are available. There is a question about it here, Parse HTML in Android, and an answer here about parsing HTML.
It is a bad idea, though: HTML tag names are not descriptive, so you will have to write a lot of code that searches attributes to find a specific piece of data. Prefer XML whenever you can; it saves a lot of code and time.
Here is a post from CodingHorror which says that, in general, parsing HTML is a bad idea.
Here is something which explains parsing an XML document using XmlPullParser.
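To show how a pull parser works and where the data ends up, here is a sketch that runs on a desktop JVM. XmlPullParser belongs to the Android SDK, so this example swaps in the JDK's StAX XMLStreamReader, which follows the same pull model: your code asks for the next event in a loop and stores whatever it wants in ordinary Java collections. The class name AlertTitles and the element names entry and title are assumptions about the feed, not taken from the NWS page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: collects the title of every entry element in a feed.
final class AlertTitles {

    static List<String> titles(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        boolean inEntry = false;

        // The parser does nothing on its own; each call to next() advances one event.
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if ("entry".equals(reader.getLocalName())) {
                    inEntry = true;
                } else if (inEntry && "title".equals(reader.getLocalName())) {
                    // Reads the text up to the matching end tag.
                    result.add(reader.getElementText().trim());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "entry".equals(reader.getLocalName())) {
                inEntry = false;
            }
        }
        reader.close();
        return result;
    }
}
```

Selecting only certain alerts then becomes a filter over the returned list, or an extra condition inside the loop before adding a title.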