Extracting data from sites [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to extract data from sites. I have already extracted information from sites using the article extractor, but now I want to get information about the events in a particular place: given a location as input, I want the events held there. For example, from the site "http://www.indianevents.org/events-Rajasthan-14.htm" I should be able to extract all events, festivals, etc.
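import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLFetcher;

// str holds the address of the page to fetch
URL url = new URL(str);
InputSource is = HTMLFetcher.fetch(url).toInputSource();
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
String news = ArticleExtractor.INSTANCE.getText(doc); // main article text only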

Consider Apache Tika to extract the text content.
You can then use the Stanford POS tagger to process the text sentence by sentence, and NLP techniques can help identify event information.
Although this might sound simple when written down, trust me, it is difficult.
Good luck. :)
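
As a rough illustration of the "split into sentences, then look for event information" idea, here is a minimal sketch that uses only the JDK. It is not Tika or the Stanford tagger: it swaps in `java.text.BreakIterator` for sentence splitting and a simple month-name pattern as a stand-in for real NLP. The class name `EventSentenceFinder`, the heuristic, and the sample text are all invented for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class EventSentenceFinder {
    // Assumed heuristic: a sentence is an event candidate if it names a month.
    private static final Pattern MONTH = Pattern.compile(
        "\\b(January|February|March|April|May|June|July|August"
        + "|September|October|November|December)\\b");

    /** Splits plain text into trimmed sentences using the JDK's BreakIterator. */
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Keeps only the sentences that match the month-name heuristic. */
    static List<String> eventCandidates(String text) {
        List<String> hits = new ArrayList<>();
        for (String sentence : sentences(text)) {
            if (MONTH.matcher(sentence).find()) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the extracted page content.
        String text = "The town is known for its fairs. "
            + "A camel fair takes place in November. Entry is free.";
        eventCandidates(text).forEach(System.out::println);
    }
}
```

A pattern like this will miss events without a month name and will produce false positives (for example a sentence starting with "May"), which is exactly where a tagger or a named-entity recognizer would have to take over.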

Related

How can I import some information result in website to my program [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm working on a Java project that looks up whois information for websites, so I started searching for a good website that provides whois lookups. It would be easier for me if I could find an XML service.
This is the site that provides the information. I want to get all of the results into my program.
Does anyone have an idea?

First step: load the page.
String content;
try (InputStream in = new URL("http://toolbar.netcraft.com/site_report?url=http://louisville.edu").openStream()) {
    content = IOUtils.toString(in, StandardCharsets.UTF_8); // IOUtils is from Apache Commons IO
}
Second step: parse the XML. Use DOM and read a tutorial, for example "DOM XML Parser Example".
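
The second step could look roughly like the sketch below, assuming the service really returns well-formed XML (an HTML page would generally need an HTML parser instead). The element names `whois`, `domain`, and `registrar` and the class name `WhoisXmlReader` are invented for illustration; a real service defines its own schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhoisXmlReader {
    /** Returns the text of the first element with the given tag name, or null if absent. */
    static String firstText(String xml, String tag) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Wrap parser/IO problems so callers do not need checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical response body; in practice this would be the downloaded content.
        String xml = "<whois><domain>example.org</domain>"
            + "<registrar>Example Registrar</registrar></whois>";
        System.out.println(firstText(xml, "domain"));
        System.out.println(firstText(xml, "registrar"));
    }
}
```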

Extract data from Microsoft Word to a database table [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
We receive Word documents that are practically forms. Users fill in the answers to the questions in the documents, so each document is essentially a set of key-value pairs of question and answer.
Now I would like to extract the answers and store them in a database table, mapping each answer to the appropriate column (question). What is the best way to do this? Is there a library that can help me achieve this?
Thanks

I would consider unzipping the .docx file and extracting the information from the embedded .xml file. You can find out more about the Word 2010 format here:
http://blogs.msdn.com/b/chrisrae/archive/2010/10/06/where-is-the-documentation-for-office-s-docx-xlsx-pptx-formats-part-2-office-2010.aspx
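
A minimal sketch of the unzip-and-read approach with only the JDK (Java 9+ for `readAllBytes`) might look like this. It relies on the main body of a .docx living in the `word/document.xml` entry, with paragraphs as `w:p` elements; the class name `DocxParagraphs` and the tiny in-memory sample are invented for the example, and the sample is not a complete Word file. Mapping the extracted paragraphs to database columns is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DocxParagraphs {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads word/document.xml from a .docx stream and returns one string per w:p paragraph. */
    static List<String> read(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.getName().equals("word/document.xml")) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
                List<String> out = new ArrayList<>();
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    // getTextContent concatenates the text of all runs in the paragraph.
                    out.add(paragraphs.item(i).getTextContent());
                }
                return out;
            }
            throw new IllegalArgumentException("no word/document.xml entry found");
        } catch (IOException | ParserConfigurationException | SAXException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Builds a tiny archive containing only word/document.xml, for demonstration. */
    static byte[] sample() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Name:</w:t></w:r>"
            + "<w:r><w:t xml:space=\"preserve\"> Alice</w:t></w:r></w:p>"
            + "<w:p><w:r><w:t>City: Pune</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        read(new ByteArrayInputStream(sample())).forEach(System.out::println);
    }
}
```

With a real file, `read(new FileInputStream("form.docx"))` would be the entry point (the file name is a placeholder). Forms built with tables or content controls would need a walk over the relevant elements rather than plain paragraphs.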

How to categorize a text paragraph into predefined categories? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have a list of categories such as Sports, Game, Religion, Finance, Market Rates, I.T, Health, Adult, Business, B2B, government, politics, education, etc.
Now I want to categorise a text paragraph into these categories. I extract the whole text from a particular URL and want to assign it to one of my categories. At the moment I'm using DBpedia, and I have tried many other technologies, but unfortunately I have not reached my goal yet. Can someone help me, please? I would be grateful.

There is an old but very good paper that covers the task of text categorization. It can be very useful for you as an introduction:
Machine Learning in Automated Text Categorization, Fabrizio Sebastiani, 2002
http://orb.essex.ac.uk/CE/CE807/Readings/sebastiani02.pdf
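
To make the idea of learning categories from labelled examples concrete, here is a small sketch of a multinomial Naive Bayes classifier with Laplace smoothing, one of the simpler probabilistic approaches to text categorization. It is only an illustration, not the method the paper recommends; the class name `NaiveBayesCategorizer` and the training sentences are invented, and a useful model would need many labelled documents per category.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NaiveBayesCategorizer {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // category -> token total
    private final Map<String, Integer> docCounts = new HashMap<>();               // category -> training docs
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Adds one labelled training document. */
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : tokens(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the category with the highest log-probability, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            // log P(category) + sum of log P(word | category)
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokens(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size())); // Laplace smoothing
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesCategorizer nb = new NaiveBayesCategorizer();
        nb.train("Sports", "the team won the football match");
        nb.train("Sports", "the striker scored a late goal");
        nb.train("Finance", "the bank raised interest rates");
        nb.train("Finance", "stock market prices fell sharply");
        nb.train("Health", "the doctor prescribed medicine for the fever");
        System.out.println(nb.classify("interest rates and stock prices"));
    }
}
```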

PDF to Lucene document conversion using PDFBox [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
PDFBox provides classes to convert a PDF to a Lucene document. Does it preserve the formatting of the document? By formatting I mean: does it store details about the location, font type/size, and other such options?

By default, it will remove all formatting and extract only textual content and make it searchable. This content can be searched, and the original PDF can be maintained external to the index and returned with search results when a hit has been found. Rebuilding a PDF from the Lucene index may not be the best approach, if that is your intent.
PDFBox is quite capable of extracting metadata, though, and it can certainly be used to index formatting / font / etc data, if you wish to be able to search on that sort of thing.

How to parse this XML using Java [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
<tickettypes>
  <eventid>8</eventid>
  <eventname>air</eventname>
  <tablename>tbl_tickets</tablename>
  <ticketid>1</ticketid>
  <name>Platinum</name>
  <price>200.00</price>
  <printable>Y</printable>
  <ticketid>2</ticketid>
  <name>Gold</name>
  <price>150.00</price>
  <printable>Y</printable>
  <ticketid>3</ticketid>
  <name>Silver</name>
  <price>100.00</price>
  <printable>Y</printable>
  <ticketid>4</ticketid>
  <name>Test</name>
  <price>50.00</price>
  <printable>Y</printable>
  <surveys>
    <surveyid>0</surveyid>
    <surveyname>No Survey entered</surveyname>
    <surveyid>1</surveyid>
    <surveyname>Advertisement</surveyname>
    <surveyid>2</surveyid>
    <surveyname>Friends</surveyname>
    <surveyid>3</surveyid>
    <surveyname>Web Reference</surveyname>
    <surveyid>4</surveyid>
    <surveyname>News Paper</surveyname>
    <surveyid>5</surveyid>
    <surveyname>portals</surveyname>
  </surveys>
</tickettypes>
Can anyone please help me parse the XML file above?

The best way to read XML in Android is to use the Simple XML Framework. If you want a step-by-step guide, take a look at the blog post I wrote on the topic. If you already know Simple, it should take you about ten minutes to write the code that parses this.

There are quite a few examples of this online; the first page of a Google search "parse xml in java" was filled with examples.
http://www.java-samples.com/showtutorial.php?tutorialid=152
http://www.totheriver.com/learn/xml/xmltutorial.html
http://www.developerfusion.com/code/2064/a-simple-way-to-read-an-xml-file-in-java/
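
For the XML in the question specifically, a plain DOM walk from the JDK could be sketched as follows. The document is flat (each ticket is a run of sibling `ticketid`, `name`, `price`, `printable` elements rather than a wrapping element), so the sketch assumes that every `ticketid` starts a new ticket and that the `surveys` element ends the ticket list. The class name `TicketTypesParser` and the map-per-ticket representation are choices made for this example only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TicketTypesParser {
    /** Groups the flat ticketid/name/price/printable siblings into one map per ticket. */
    static List<Map<String, String>> tickets(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map<String, String>> result = new ArrayList<>();
        Map<String, String> current = null;
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between elements
            }
            String tag = n.getNodeName();
            if (tag.equals("ticketid")) {
                current = new LinkedHashMap<>(); // a ticketid starts a new ticket
                result.add(current);
            } else if (tag.equals("surveys")) {
                current = null;                  // tickets end where surveys begin
            }
            if (current != null) {
                current.put(tag, n.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the XML from the question.
        String xml = "<tickettypes><eventid>8</eventid><eventname>air</eventname>"
            + "<ticketid>1</ticketid><name>Platinum</name><price>200.00</price><printable>Y</printable>"
            + "<ticketid>2</ticketid><name>Gold</name><price>150.00</price><printable>Y</printable>"
            + "<surveys><surveyid>0</surveyid><surveyname>No Survey entered</surveyname></surveys>"
            + "</tickettypes>";
        for (Map<String, String> ticket : tickets(xml)) {
            System.out.println(ticket);
        }
    }
}
```

The surveys could be read the same way by walking the children of the `surveys` element and pairing each `surveyid` with the `surveyname` that follows it.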
