Extracting word count From an XML File

Extracting word count From an XML File - java

(This question is related to the previous question I posted earlier on stackoverflow...here is the link
Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario)
The question is that keeping the above case in mind, instead of getting sentences, if i would like to get the words written by each participant in all sentences. For Example. if the word 'Budget' is used ten times in total and seven times by participant 'Dolske' and three times by others. So I need the list of all words and how many times it is written by each participant? Also the list of words in each turn?
What is the best strategy to achieve this? Any sample codes?
The XML is attached here (you can also check it in the referred question)
"(495584) Firefox - search suggestions passes wrong previous result to form history"
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> &gt; (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> &gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
Help will be highly appreciated...

Get all of the values, better use sTax parser, it is good for such kind of tasks. Then split all of the senteces in words and do whatever you want.
Like create a model with Class Turn, where you store the author and the sentences, write services for this class and go on. :)
To split sentence in words, use split() or StringTokenizer, but tokenizer is deprecated. To use split, create a temp array, like
stringArray = sentence.toString().split(" ");
or like "sentence.getValue()", whatever.
where in method parameter you put the regEx. In your case it is a simple space, cause it splits the sentence. Then you could just go over the words and count what you need.
In case of ArrayList, use List.toArray() to get your list in the array view.

Related

Java XML Parsing into object

We have a requirement of parsing xml data in java.
The xml data looks like this:
xml_data
1> The first question here is how to parse such data.I have tried the code suggested in this link :
xmlparse with Node
and have partial success. The only doubt here is how can i can to reach "" tag.
2>The second question is if the data is parsed how can we can store the values in single object,so that we can get length and values of "CRAWL", "Test", "Check"(Tables) separately(from single object)?
Please help me as how can we achieve this?
Abhi

Get historic prices by ISIN from yahoo finance

I have the following problem:
I have around 1000 unique ISIN numbers of stock exchange listed companies.
I need the historic prices of these companies starting with the earliest listing until today on a daily basis.
However, as far as my research goes, yahoo can only provide prices for stock ticker symbols, which I do not have.
Is there a way to get for example for ISIN: AT0000609664, which is the company Porr the historic prices from yahoo automatically via their api?
I appreciate your replies!

The Answer:
To get the Yahoo ticker symbol from an ISIN, take a look at the yahoo.finance.isin table, here is an example query:
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.isin where symbol in ("DE000A1EWWW0")&env=store://datatables.org/alltableswithkeys
This returns the ticker ADS.DE inside an XML:
<query yahoo:count="1" yahoo:created="2015-09-21T12:18:01Z" yahoo:lang="en-US">
<results>
<stock symbol="DE000A1EWWW0">
<Isin>ADS.DE</Isin>
</stock>
</results>
</query>
<!-- total: 223 -->
<!-- pprd1-node600-lh3.manhattan.bf1.yahoo.com -->
I am afraid your example ISIN won't work, but that's an error on Yahoos side (see Yahoo Symbol Lookup, type your ISINs in there to check if the ticker exists on Yahoo).
The Implementation:
Sorry, I am not proficient in Java or R anymore, but this C# code should be almost similar enough to copy/paste:
public String GetYahooSymbol(string isin)
{
string query = GetQuery(isin);
XDocument result = GetHttpResult(query);
XElement stock = result.Root.Element("results").Element("stock");
return stock.Element("Isin").Value.ToString();
}
where GetQuery(string isin) returns the URI for the query to yahoo (see my example URI) and GetHttpResult(string URI) fetches the XML from the web. Then you have to extract the contents of the Isin node and you're done.
I assume you have already implemented the actual data fetch using ticker symbols.
Also see this question for the inverse problem (symbol -> isin). But for the record:
Query to fetch historical data for a symbol
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.historicaldata where symbol in ("ADS.DE") and startDate = "2015-06-14" and endDate = "2015-09-22"&env=store://datatables.org/alltableswithkeys
where you may pass arbitrary dates and an arbitrary list of ticker symbols. It's up to you to build the query in your code and to pull the results from the XML you get back. The response will be along the lines of
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="71" yahoo:created="2015-09-22T20:00:39Z" yahoo:lang="en-US">
<results>
<quote Symbol="ADS.DE">
<Date>2015-09-21</Date>
<Open>69.94</Open>
<High>71.21</High>
<Low>69.65</Low>
<Close>70.79</Close>
<Volume>973600</Volume>
<Adj_Close>70.79</Adj_Close>
</quote>
<quote Symbol="ADS.DE">
<Date>2015-09-18</Date>
<Open>70.00</Open>
<High>71.43</High>
<Low>69.62</Low>
<Close>70.17</Close>
<Volume>3300200</Volume>
<Adj_Close>70.17</Adj_Close>
</quote>
......
</results>
</query>
<!-- total: 621 -->
<!-- pprd1-node591-lh3.manhattan.bf1.yahoo.com -->
This should get you far enough to write your own code. Note that there are possibilities to get data as .csv format with &e=.csv at the end of the query, but I don't know much about that or if it will work for the queries above, so see here for reference.

I found a Web-Service which provides historic data based on date range. Please have a look
http://splice.xignite.com/services/Xignite/XigniteHistorical/GetHistoricalQuotesRange.aspx

Using Lucene, how to index TXT files into different fields?

I am using the NSF data whose format is txt. Now I have indexed these data and can send a query and got several results. But how can I search something in a selected field (eg. title) ? Because all of these NSF data are totally plain txt file. I do not think Lucene can recognize which part of the file is a "title" or something else. Should I firstly transfer the txt files to XML files (with tags telling Lucene which part is "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone please give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----

You have to split the text into the several parts. You can use the resulting strings to create a field for each part of the text, i.e. title.
Create your lucene document with the fields like this:
Document doc = new Document();
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries and mixing multiple fields in the query also possible: +title:dna +abstract:"some example text" -number:935353

Extracting a column from a paragraph from a csv file using java

MAJOR ACC NO,MINOR ACC NO,STD CODE,TEL NO,DIST CODE
7452145,723456, 01,4213036,AAA
7254287,7863265, 01,2121920,AAA
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FOOCTYP,FOOCDES,FOOCAMT,FSTD,FTELNO,FNORECON,FXFRACCN,FLANGIND,CUR
12345,71234,7643234,AAA,001,DX,WLR Promotion - Insitu /Pre-Cabled PSTN Connection,-37.87,,,0,,E,EUR
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FORDNO,FREF,FCHGDES,FCHGAMT,CUR,FORENFRM,FORENTO
3242241,72349489,2345352,AAA,001,30234843P ,1,NEW CONNECTION - PRECABLED CHARGE,37.87,EUR,2123422,201201234
12123471,7618412389,76333232,AAA,001,3123443P ,2,BROKEN PERIOD RENTAL,5.40,EUR,201234523,20123601
I have a csv file something like the one above and I want to extract certain columns from it. For example I want to extract the first column of the first paragraph. I'm kind of new to java but I am able to read the file but I want to extract certain columns from different paragraphs. Any help will be appreciated.

How to retrieve the Field that "hit" in Lucene

Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But, it would be nice to get more info, like "what was complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out it to get the Document, then get the list of IndexableFields, then loop over all of them and see if the field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there)
This seems completely insane cause isn't that what Lucene just did? Shouldn't the Hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and starting at the APIs hasn't revealed anything straightforward, but I feel like I must be searching on the wrong stuff.

You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field: fields){
Query query = new WildcardQuery(new Term(field, "*alue1"));
Explanation ex = searcher.explain(query, docID);
if (ex.isMatch()){
//Your query matched field
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Extracting word count From an XML File - java

Related

Java XML Parsing into object

Get historic prices by ISIN from yahoo finance

Using Lucene, how to index TXT files into different fields?

Extracting a column from a paragraph from a csv file using java

How to retrieve the Field that "hit" in Lucene

Categories

Resources