populate my data structure with thousands of real english words - java

I need to test my data structure (in Java), which is like a dictionary - it holds a key/value map. I would like to know how you test your data structures. I would like to insert real English words into my data structure and then find them. Is there a way to download all the English words, so that I can read that file and populate my structure? Once it is populated, I can perform many searches and produce some real statistics on how long a search takes.

There are indeed several open-source dictionaries for the English language, e.g. the WordNet database.
That said, I must insist that the English language is not a “closed” language, nor does it have one true official definition. As such, there is no dictionary that contains “all the English words,” and such a dictionary can never exist: English words are made up all the time, and once enough people use them, they become part of the English language. Case in point: “to google.”

Perhaps Project Gutenberg would be helpful. I've used them on past CS projects. They provide plain text files (e.g. The Valley of Fear), which should be easy to process. You may want to skip over the headers to avoid skewing the results.
This will let you test your dictionary by keeping, say, a word->count mapping (a Map<String, Integer>) of the words in the file.
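For example, here is a minimal sketch (assuming you've saved a Gutenberg text locally as valley-of-fear.txt, a made-up file name) that builds such a map with a plain HashMap; swap in your own structure in its place:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();   // replace with the dictionary structure under test
        // "valley-of-fear.txt" is a placeholder for whatever Gutenberg file you downloaded
        for (String line : Files.readAllLines(Paths.get("valley-of-fear.txt"))) {
            // split on anything that isn't a letter; lower-case so "The" and "the" count together
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        System.out.println("distinct words: " + counts.size());
        System.out.println("occurrences of 'the': " + counts.get("the"));
    }
}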

If you're on Linux, you could use the contents of /usr/share/dict/words; there's also WordNet, an English word database.
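To get the kind of timing statistics the question asks about, a rough sketch along these lines works (HashMap is only a stand-in for the data structure under test; for serious numbers, use a proper benchmarking harness such as JMH):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictBenchmark {
    public static void main(String[] args) throws IOException {
        // swap HashMap for the dictionary structure you are testing
        Map<String, String> dict = new HashMap<>();
        List<String> words = Files.readAllLines(Paths.get("/usr/share/dict/words"));
        for (String w : words) {
            dict.put(w, "definition of " + w);   // dummy values
        }

        Collections.shuffle(words);              // search in random order
        long start = System.nanoTime();
        for (String w : words) {
            dict.get(w);
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d lookups, %.1f ns per lookup%n",
                words.size(), (double) elapsed / words.size());
    }
}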

If you have key-value pairs, you probably don't want a simple list of words; you want words mapped to definitions, or to words in other languages.
If you don't mind parsing a text file, IDP has a bunch of files available for download royalty-free.

Related

How Can I Access the Brown Corpus in Java (aka outside of NLTK)

I'm trying to write a program that makes use of natural-language parts of speech in Java. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load the data into a Java program and sum up occurrences of words (and the percentage of the time each one is a given part of speech).
I do not want to use a Java library like the Stanford one, I want to play with the corpus data myself.
Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/
All the files are zip files. The data format is described on the Brown Corpus Wikipedia page. I dunno what else to say. From there things should be obvious.
EDIT: if you want the original source data, I think there are some corpora out there that include it. However, the point is usually to let someone else do the sampling. Also, note this from the Wikipedia entry: "Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words." So the data for the Brown Corpus is essentially randomized; even if you had the original texts, you might not be able to guess where they sampled.
Data is data. The NLTK data is not in an obscure, encrypted, or difficult format. Just write Java code to read it. You might find a shortcut in WEKA, or you might not.
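To make "just write Java code to read it" concrete: each token in a Brown corpus file is word/tag, so tallying how often each word carries each tag is just a couple of maps. A sketch (the path brown/ca01 assumes you've unzipped the corpus; ca01 is one of its files):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class BrownTags {
    public static void main(String[] args) throws IOException {
        // word -> (tag -> count)
        Map<String, Map<String, Integer>> tagCounts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("brown/ca01"), StandardCharsets.ISO_8859_1)) {
            for (String token : line.trim().split("\\s+")) {
                int slash = token.lastIndexOf('/');
                if (slash <= 0) continue;                 // skip blank lines and stray tokens
                String word = token.substring(0, slash).toLowerCase();
                String tag = token.substring(slash + 1);
                tagCounts.computeIfAbsent(word, k -> new HashMap<>())
                         .merge(tag, 1, Integer::sum);
            }
        }
        // from these counts you can compute the % likelihood of each tag per word
        System.out.println(tagCounts.get("the"));
    }
}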
If you don't want to mess with the NLTK interface: The Brown corpus has been deposited at the Internet Archive (archive.org). On https://archive.org/details/BrownCorpus you'll find a link to a zip archive containing the entire corpus. (Also a torrent link, but it doesn't seem worth the trouble for 3.2 MB.)

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client does work in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching them, but I would have no idea how to pull out information from a paragraph, such as the licensor and the restrictions on the agreement. These are not structured records but just long contracts. Even if I were to search for 'Licensor', it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful; JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this.
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
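As a hedged example of that idea: if a contract happens to phrase it as "Licensor: Some Name" (an assumption; real contracts will vary), a regex along these lines pulls out candidate names, which you can then review by hand:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LicensorExtractor {
    // assumes "Licensor: Capitalized Name" style phrasing; adjust per document format
    private static final Pattern LICENSOR =
            Pattern.compile("Licensor\\s*[:\\-]?\\s*([A-Z][\\w.&']*(?:\\s+[A-Z][\\w.&']*)*)");

    public static void main(String[] args) {
        String text = "This agreement is made between the Licensor: Acme Corp. and the Licensee: Foo Ltd.";
        Matcher m = LICENSOR.matcher(text);
        while (m.find()) {
            System.out.println("candidate licensor: " + m.group(1));
        }
    }
}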
For getting text from an image, try the APIs linked on these pages:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once the text has been extracted from the PDF it's just a string, so you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and building a StringBuffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
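For instance, with Jericho the extraction is only a few lines (a sketch; contract.html is a placeholder file name, and the exact Source constructor you use depends on where the HTML comes from):

import java.io.FileReader;
import java.io.IOException;
import net.htmlparser.jericho.Source;

public class HtmlText {
    public static void main(String[] args) throws IOException {
        // parse the HTML and keep only the visible text, dropping the tags
        Source source = new Source(new FileReader("contract.html"));
        String plainText = source.getTextExtractor().toString();
        System.out.println(plainText);
    }
}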
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
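A minimal sketch of that, using Tika's simple facade (the file name is a placeholder):

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractText {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Tika detects the format (PDF, Word, HTML, plain text, ...) and returns the plain text
        String text = tika.parseToString(new File("contract.pdf"));
        System.out.println(text);
    }
}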
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances made by dedicated research teams spending many person-years, computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (thousands), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.

Do I need the original files used to create an index in Lucene?

First of all, as I explained in my profile, I'm not a native English speaker, so I hope you can forgive any grammar mistakes.
I'm trying to develop with the Apache Lucene API in Java. I'm able to write some index and search methods, but I'm still confused about how it works behind the scenes.
From what I know, Lucene doesn't care about the origin of the data. It just takes the data and indexes it. Let me ask with a simple example:
I want to index words from a .txt based dictionary. Once Lucene has made its indexes, do I need the source .txt dictionary anymore? How, exactly, do the indexes work?
Do the indexes contain the necessary content to perform searches without the original source? Or do the indexes only contain directions to where words are stored in the original source .txt dictionary file? I'm a little confused.
Once you have indexed everything, Lucene does not refer back to, or have any further need of, the source documents. Everything it needs to operate is saved in its index directory. Many people use Lucene to index files, others database records, others online resources. Whatever your source, you always have to bring in the data yourself (or with some third-party tool) and construct Documents for Lucene to index, and nothing about a Document says anything about where it came from. So not only does Lucene not need to refer back to the original data sources, it couldn't find them even if you wanted it to.
Many implementations do, however, rely on having the original source present. It's not at all unusual to set up Lucene to index everything but store only a file name, a database id, or some similar pointer to the original source. This lets you perform an effective full-text search through Lucene while leaving storage of the full content to some other system.
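A sketch of the difference in code (Lucene 5+ style API; field names, values and paths are just examples). The word and definition fields are both indexed and stored, so searches can be answered entirely from the index; the sourceFile field shows the "pointer back to the original" pattern from the second paragraph:

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexDictionary {
    public static void main(String[] args) throws IOException {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index-dir")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // indexed AND stored: the index alone can show these values in search results
            doc.add(new TextField("word", "serendipity", Field.Store.YES));
            doc.add(new TextField("definition", "the faculty of making happy accidental discoveries", Field.Store.YES));
            // stored-only pointer back to the original source, if you choose to keep it
            doc.add(new StringField("sourceFile", "dictionary.txt", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}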

Java: ArrayList, String manipulation and Parsing

I am writing an application that stores references for books, journals, sites and so on. I mean, I have already done most of it.
What I want is a suggestion on the best way to implement the following.
What format of text file should I use to store the library? Not the file type, but the format. I am using a simple text file at the moment, but I am planning to implement a format like the one below.
<book><Art of Human Hacking><New York><2011><1>
<journal><Achieving Maximum Speed In 802.11n><London><2009><260-265>
The first tags, <book> and <journal>, are type identifiers. I have used an ArrayList. Should I use a multi-dimensional ArrayList and store each item like below?
[[Art of Human Hacking,New York,2011,1][Achieving Maximum Speed In 802.11n,London,2009,260-265]]
I have used StringTokenizer, but it cannot handle values that contain spaces. How do I fix this?
I have already implemented all the features, including listing all, listing unique, searching, editing, removing and adding. But everything only works for content without spaces.
You should use an existing serializer instead of writing your own, unless the project forbids it.
For compatibility and human readability, CSV would be your best bet. Use an existing CSV parser to get your escaping correct (not that hard to write yourself, but difficult enough to warrant using an existing parser to avoid bugs). Google turned up: http://sourceforge.net/projects/javacsv/
If human editing is not a priority, then JSON is also a good format (it is human readable, but not quite as simple as CSV, and won't display in Excel, for instance).
Then you have binary protocols, such as native Java serialization, or even Google protocol buffers. These may be faster, but obviously lose the ability to view the data store and would probably complicate your debugging.
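To make the CSV suggestion concrete, the two sample records above would simply become one comma-separated line each; the fields keep their internal spaces, which also sidesteps the StringTokenizer problem. A naive sketch (a real CSV library additionally handles titles that contain commas or quotes):

import java.util.Arrays;
import java.util.List;

public class CsvShapeDemo {
    public static void main(String[] args) {
        // one record per line: type,title,city,year,pages
        List<String> lines = Arrays.asList(
                "book,Art of Human Hacking,New York,2011,1",
                "journal,Achieving Maximum Speed In 802.11n,London,2009,260-265");
        for (String line : lines) {
            // naive split is fine here, but use a CSV parser once fields may contain commas
            String[] f = line.split(",");
            System.out.printf("type=%s  title=%s  city=%s  year=%s  pages=%s%n",
                    f[0], f[1], f[2], f[3], f[4]);
        }
    }
}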

Indexing semi-structured data

I would like to index a set of documents that will contain semi structured data, typically key-value pairs something like #author Joe Bloggs. These keywords should then be available as searchable attributes of the document which can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I already can, and so would like something more robust than a custom regex extraction.
Any help would be greatly appreciated.
Niall
I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information and should therefore be searchable as such, i.e. searchable according to attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up a few things:
I created separate search fields for all the attributes that should be searchable. In my case those were, for example, 'method name', 'commentary' or 'class name'.
It can be advantageous to have the content of these fields overlap; however, this will inflate your index (though only linearly with the amount of redundant data in the searchable fields).
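As a sketch of what "separate search fields" means in Lucene terms (the field names here are only examples): each extracted attribute, such as the #author value, goes into its own field so it can be queried individually (e.g. author:bloggs) alongside an ordinary full-text field:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class SemiStructuredDoc {
    // builds a Lucene Document with one field per extracted attribute,
    // plus a catch-all field for ordinary word searches over the whole text
    public static Document build(String author, String title, String fullText) {
        Document doc = new Document();
        doc.add(new TextField("author", author, Field.Store.YES));      // e.g. query author:bloggs
        doc.add(new TextField("title", title, Field.Store.YES));        // e.g. query title:lucene
        doc.add(new TextField("contents", fullText, Field.Store.NO));   // normal full-text search
        return doc;
    }
}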
