how to parse a file - java

Alright, I have an assignment and I don't know how to parse the file. Is StringTokenizer my best option?
The file has commas, newlines and spaces. S is the starting state, the lowercase a is the input and the uppercase A is the next state. Should I parse the file into separate variables and run it through a switch case to simulate a state machine?
This is the file
‘Ends in a
2
S, a, A
S, b, S
A, a, A
A, b, S
F: A
aba
bbaabba
bbabab
aaaab
b
a
Thank you so much, because I just can't seem to get started...

My biggest question is how can I parse the file?
Like any other text file. There are literally millions of examples on how to do this on the web.
I would look for examples using the Scanner class.
I am not very good at parsing files. Especially in this situation.
With practice it will get easier. Doing this assignment will help.
Should I use delimiters?
The file has delimiters so I don't see why you wouldn't.
comma and newline?
Your file has commas, newlines and spaces.
and put the states into an array and the inputs (a, b) into a second array?
Java is an object-oriented programming language. Perhaps using Collections like Map and objects is a better choice.
Should I check for digits, isalpha?
I would just assume the file is formatted correctly and read numbers when you expect to have a number and strings when you expect to have a word/token.
lower case and uppercase alpha?
Not sure if this is a consideration.
I am thinking I need a switch and a couple of cases to handle the state transitions?
If your states were hard-coded in Java, I would say yes. However, your states are being read from a text file and stored in a data structure. In this case it's simpler not to use switches.
Can someone explain how I should go about handling this file so I can process it?
Read it, store the data in a structure, process the inputs.
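I am also confused on how to handle the F: A in that file..
This is information you need to record: it lists the accepting (final) states, which decide whether your DFA accepts an input string once the whole string has been consumed.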

Java is an object-oriented language so build a series of classes that reflect the real world.
Example:
What do you have? And what do they need to be able to do?
DFA
- has a series of states
- needs to be able to accept/reject input strings
State
- has a collection of inputs to look for and states to transition to based on input
- needs to be able to check for a token and transition to a new state
These considerations govern how you should lay out your classes (members and methods). So you should make a DFA class, and it should have a method: public boolean process(String input).
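A minimal sketch of how the pieces above could fit together, assuming the file layout shown in the question: a description line, a line with the number of states (read but not used here), transition lines of the form "from, symbol, to", one "F:" line listing the accepting states, then one input string per line. The class name Dfa, the load helper, and the choice to treat the source state of the first transition as the start state are assumptions made for illustration, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

class Dfa {
    // transitions.get(state).get(symbol) is the next state
    private final Map<String, Map<Character, String>> transitions = new HashMap<>();
    private final Set<String> acceptStates = new HashSet<>();
    private String startState; // assumption: source state of the first transition added

    void addTransition(String from, char symbol, String to) {
        if (startState == null) {
            startState = from;
        }
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(symbol, to);
    }

    void addAcceptState(String state) {
        acceptStates.add(state);
    }

    /** Runs the input through the machine; true if it ends in an accepting state. */
    public boolean process(String input) {
        String current = startState;
        for (char c : input.toCharArray()) {
            Map<Character, String> row = transitions.get(current);
            if (row == null || !row.containsKey(c)) {
                return false; // no transition defined for this symbol
            }
            current = row.get(c);
        }
        return acceptStates.contains(current);
    }

    /** Fills the given Dfa from the scanner and returns the input strings that follow the F: line. */
    static List<String> load(Scanner in, Dfa dfa) {
        in.nextLine(); // description line, e.g. "Ends in a"
        in.nextLine(); // number of states (assumed meaning), not needed by this sketch
        List<String> inputs = new ArrayList<>();
        boolean seenAcceptLine = false;
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                continue;
            }
            if (seenAcceptLine) {
                inputs.add(line);
            } else if (line.startsWith("F:")) {
                for (String s : line.substring(2).trim().split("[,\\s]+")) {
                    dfa.addAcceptState(s);
                }
                seenAcceptLine = true;
            } else {
                String[] parts = line.split("\\s*,\\s*");
                dfa.addTransition(parts[0], parts[1].charAt(0), parts[2]);
            }
        }
        return inputs;
    }
}
```

To use it, open the file with new Scanner(new File(...)), call Dfa.load, then call process on each returned string. Because the transitions live in a map rather than in a switch statement, a different machine only needs a different file.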

Related

How to generate all possible sentence from given tokens in Java

I am trying to generate all possible sentences from given tokens. It is a transliteration program. I have various possibilities for each token to be transliterated and I want to generate all possible sentences. E.g. if the sentence is token1 token2 token3, and token1 can be represented in 3 ways after transliteration, token2 in 2 ways and token3 in 4 ways, then there are 3 x 2 x 4 = 24 possible sentences. I have built a general tree and then perform a depth-first traversal to generate all possible sentences. The problem is that when the sentence becomes long, the number of possibilities increases and I get a "java.lang.OutOfMemoryError: Java heap space" error.
Is there any other way to generate all possible sentences? In some instances I need to generate millions of sentences. Please help!

You can't generate them all at once like that.
Depending on what you need them for, you should either do whatever that is or write them to a file.
Another thought, which still might not work, would be to not store every possible value but to store a set of references/relationships. You can make this much more complex with n-grams and Markov chains, or simply have a set of references, or even just a list of array indexes.
So besides using storage space as a memory buffer, you can turn the control flow around: instead of foo calling gen for the full set, have gen call foo after each one is generated.
[EDIT: looking back on this, (I was interested to see any other answers) I want to clarify that the function foo is whatever you're using them for and the function gen generates them (just in case it isn't clear, and especially for anyone whose first language isn't English)]
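As a rough sketch of the "gen calls foo" idea: the generator below walks the choices depth-first and hands each finished sentence to a callback instead of collecting them, so only one sentence is held in memory at a time. The names SentenceGenerator, generate and sink are made up for this illustration; sink plays the role of foo.

```java
import java.util.List;
import java.util.function.Consumer;

class SentenceGenerator {
    /** Calls sink once for every combination of one choice per token position. */
    static void generate(List<List<String>> options, Consumer<String> sink) {
        generate(options, 0, new StringBuilder(), sink);
    }

    private static void generate(List<List<String>> options, int index,
                                 StringBuilder prefix, Consumer<String> sink) {
        if (index == options.size()) {
            sink.accept(prefix.toString()); // hand the sentence over, keep nothing
            return;
        }
        int mark = prefix.length();
        for (String choice : options.get(index)) {
            if (mark > 0) {
                prefix.append(' ');
            }
            prefix.append(choice);
            generate(options, index + 1, prefix, sink);
            prefix.setLength(mark); // undo this choice before trying the next one
        }
    }
}
```

The callback could write each sentence to a file or score it on the spot. Memory use then depends on the sentence length, not on the number of sentences, although the running time still grows with the number of combinations.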

how to find whether a substring in file is already present in hashmap?

I have a HashMap (a Guava BiMap) in which keys and values are both strings. I want to write a program which parses a given file and replaces all the strings in the file that are also keys in the BiMap with the corresponding values from the BiMap.
For example: I have a file called test.txt that has the following text
Java is a set of several computer software and specifications developed by Sun Microsystems.
and my BiMap has
"java i" => "value1"
"everal computer" => "value2" etc..
So now I want my program to take test.txt and the BiMap as input and give an output which looks something like this
value1s a set of svalue2 software and specifications developed by Sun Microsystems.
Please point me towards any algorithm which can do this. The program takes large files as input, so brute force may not be a good idea.
Edit: I'm using fixed length strings for keys and values.
That example was just intended to show the operation.
Thanks.

For a batch operation like this, I would avoid putting a lot of data into memory. Therefore I'd recommend that you write the new content into a new file. If the file in the end must be the exact same file, you can still replace one file with the other at the end of the process. Read, write and flush each new line separately, and you won't have any memory issues.
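A sketch of that line-by-line approach, leaning on the asker's note that all keys have the same fixed length: at each position, the next keyLength characters are looked up in the map, so each line is scanned once. The class and method names are hypothetical, a plain Map stands in for the BiMap, matching is case-sensitive, and matches that span a line break are not handled.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;

class StreamingReplacer {
    /** Replaces every fixed-length key found in the line, scanning left to right. */
    static String replaceLine(String line, Map<String, String> replacements, int keyLength) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            if (i + keyLength <= line.length()) {
                String value = replacements.get(line.substring(i, i + keyLength));
                if (value != null) {
                    out.append(value);
                    i += keyLength; // skip past the replaced key
                    continue;
                }
            }
            out.append(line.charAt(i));
            i++;
        }
        return out.toString();
    }

    /** Copies source to target one line at a time, so only one line is in memory. */
    static void replaceAll(Reader source, Writer target,
                           Map<String, String> replacements, int keyLength) throws IOException {
        BufferedReader in = new BufferedReader(source);
        BufferedWriter out = new BufferedWriter(target);
        String line;
        while ((line = in.readLine()) != null) {
            out.write(replaceLine(line, replacements, keyLength));
            out.newLine();
        }
        out.flush(); // the writer's buffer is bounded, so flushing once at the end is enough
    }
}
```

In practice source and target would be a FileReader on the original file and a FileWriter on a temporary file, which is moved over the original once the copy has finished.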

What type of Trie is this?

I want to add words to an open-source Java word-splitting program for Khmer (a language that does not have spaces between words). The developers have not worked on it in a long time, and I haven't been able to contact them for details (http://sourceforge.net/projects/khmer/files/Khmer%20Word%20Breaking/Khmer%20Word%20Breaking%20program%20V1.0/). Supposedly the list was created from a Khmer dictionary, and I would like to re-create the file to include more words.
Can anyone identify what format the word dictionary is in (I believe it is some type of Trie)? Here are the first few lines:
0ឳមអគណជយឍឫហកដពទឱលថឦឡញឩខនឧផប។ឋវឭឈឃឥឌឰឪសងចភធឯតឆរ
1ទ
0ក
1
1ីែមគួណជយ៍ៀហកទុលេញ៉ឺនំឹៃូឈឃោាឿសងចិ្ធើតៅរ
1គនសងរ
0ទ
0ា
0យ
0ព
0ន
1
1រ
0ា
0ស
0ី
1
And does anyone know how I would go about making a new one (I have a large wordlist, but I am not sure how to get it into this format).
Thanks!

After a quick look through the code, I have a theory.
Create a SearchTree which extends TreeItem. For each word in your dictionary, call addWord from TreeItem. When the iteration is done, call export on SearchTree. Use new file as the word input file.
Additionally, there may be an undocumented parameter for khwrdbrk.jar, --create, that will read the words for the new tree from standard input.
Again, just a theory, but let me know what happens if you test it out.

is there a dictionary i can download for java?

Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.

Is there a dictionary i can download
for java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a
few random letters and sees if they
can be rearranged into a real word by
checking them against the dictionary
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's OK for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
- parse the dictionary file and create a map: (letters in sorted order, set of matching words)
- then for any number of random letters: sort them and see if you have an entry in the map (if you do, the entry's value contains all the words that you can make with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take, say, nine random letters and get "hsotrerca". You sort them to get "acehorrst", and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is to sort your letters and then use an O(1) map lookup.
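The sorted-letters map described above could look roughly like this. AnagramIndex and its methods are names invented for the sketch, and lowercasing the input is an added assumption about how the dictionary should be normalized.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AnagramIndex {
    // key: the word's letters in sorted order; value: every word made of exactly those letters
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Sorts the letters, e.g. "orchestra" becomes "acehorrst". */
    static String key(String letters) {
        char[] chars = letters.toLowerCase().toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    /** Call once per dictionary word while reading the file. */
    void add(String word) {
        index.computeIfAbsent(key(word), k -> new TreeSet<>()).add(word);
    }

    /** Returns all known words that use exactly these letters, or an empty set. */
    Set<String> lookup(String letters) {
        return index.getOrDefault(key(letters), Collections.emptySet());
    }
}
```

After adding "carthorse" and "orchestra", lookup("hsotrerca") returns both words. Sorting costs O(k log k) for k letters, and the map lookup itself is constant time on average.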
To handle more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words that may be correct but misspelled) [like, say, using the Soundex, Metaphone or Double Metaphone algorithms] and then use things like the Levenshtein edit-distance algorithm to check candidates against known good words (or the much more complicated tree built on Levenshtein edit distance that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
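For reference, a compact sketch of the Levenshtein edit distance using the standard dynamic-programming recurrence with two rows. The class name EditDistance is made up; this is the textbook algorithm, not code taken from Jazzy or any other library.

```java
class EditDistance {
    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, levenshtein("kitten", "sitting") is 3. A spellchecker would typically compute this only against a short list of candidates, since comparing against every dictionary word is slow for large dictionaries.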
As a funny side note, an optimized dictionary representation can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you've read correctly: less than 10 bits per word) and yet allow very fast lookup.

Dictionaries are usually programming-language agnostic. If you try to google it without using the keyword "java", you may get better results. E.g. searching for "free dictionary download" gives, among others, dicts.info.

OpenOffice dictionaries are easy to parse line-by-line.
You can read it in memory (remember it's a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt"), StandardCharsets.UTF_8); (from commons-io; pass whichever charset your file actually uses)
Thus you get a List of all words. Alternatively you can use the LineIterator if you encounter memory problems.

If you are on a unix like OS look in /usr/share/dict.

Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html

Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely dig through sourceforge.net to see if there are any or how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary

Finite State Machine program

I am tasked with creating a small program that can read in the definition of a FSM from input, read some strings from input and determine if those strings are accepted by the FSM based on the definition. I need to write this in either C, C++ or Java. I've scoured the net for ideas on how to get started, but the best I could find was a Wikipedia article on Automata-based programming. The C example provided seems to be using an enumerated list to define the states, that's fine if the states are hard coded in advance. Again, I need to be able to actually read the number of states and the definition of what each state is supposed to do. Any suggestions are appreciated.
UPDATE:
I can make the alphabet small (e.g. { a b }) and adopt other conventions such as the
start state is always state 0. I'm allowed to impose reasonable restrictions on the number of
states, e.g. no more than 10.
Question summary:
How do I implement an FSA?

First, get a list of all the states (N of them), and a list of all the symbols (M of them). Then there are 2 ways to go, interpretation or code-generation:
1. Interpretation. Make an NxM matrix, where each element of the matrix is filled in with the corresponding destination state number, or -1 if there is none. Then just have an initial state variable and start processing input. If you get to state -1, you fail. If you run out of input symbols without getting to the success state, you fail. Otherwise you succeed.
2. Code generation. Print out a program in C or your favorite compiler language. It should have an integer state variable initialized to the start state. It should have a for loop over the input characters, containing a switch on the state variable. You should have one case per state, and at each case, have a switch statement on the current character that changes the state variable.
3. If you want something even faster than 2, and that is sure to get you flunked (!), get rid of the state variable and instead use goto :-) If you flunk, you can comfort yourself in the knowledge that that's what compilers do.
P.S. You could get your F changed to an A if you recognize loops etc. in the state diagram and print out corresponding while and if statements, rather than using goto.
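To make the code-generation option concrete, this is roughly what the printed program could look like for a hypothetical two-state machine over { a b } that accepts strings ending in a (state 0 is the start state, state 1 is the accepting state). The machine and the class name GeneratedMachine are assumptions chosen for illustration; a real generator would emit one outer case per state read from the definition.

```java
class GeneratedMachine {
    static boolean accepts(String input) {
        int state = 0; // start state
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case 0:
                    switch (c) {
                        case 'a': state = 1; break;
                        case 'b': state = 0; break;
                        default: return false; // symbol outside the alphabet
                    }
                    break;
                case 1:
                    switch (c) {
                        case 'a': state = 1; break;
                        case 'b': state = 0; break;
                        default: return false;
                    }
                    break;
                default:
                    return false; // unreachable for a well-formed machine
            }
        }
        return state == 1; // accept only if we stopped in the accepting state
    }
}
```

The nested switches are easy for a program to print mechanically, which is the point of this option. The structure of the machine ends up in the code instead of in a data table.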

One non-hardcoded way to represent an automaton is as a transition matrix, which records, for each current state and each input character, what the next state is.
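A small sketch of the transition-matrix idea, assuming states are numbered from 0, state 0 is the start state (as the question's update allows), and -1 marks a missing transition. MatrixFsm and its method names are invented for the example.

```java
import java.util.Arrays;

class MatrixFsm {
    private final int[][] next;        // next[state][symbolIndex], -1 means "no transition"
    private final boolean[] accepting; // accepting[state] is true for accept states
    private final String alphabet;     // the position of a character is its column index

    MatrixFsm(int stateCount, String alphabet) {
        this.alphabet = alphabet;
        this.next = new int[stateCount][alphabet.length()];
        for (int[] row : next) {
            Arrays.fill(row, -1);
        }
        this.accepting = new boolean[stateCount];
    }

    /** The symbol must be part of the alphabet passed to the constructor. */
    void setTransition(int from, char symbol, int to) {
        next[from][alphabet.indexOf(symbol)] = to;
    }

    void setAccepting(int state) {
        accepting[state] = true;
    }

    boolean accepts(String input) {
        int state = 0; // assumption: the start state is always state 0
        for (char c : input.toCharArray()) {
            int column = alphabet.indexOf(c);
            if (column < 0) {
                return false; // character outside the alphabet
            }
            state = next[state][column];
            if (state < 0) {
                return false; // no transition defined
            }
        }
        return accepting[state];
    }
}
```

Reading a definition then comes down to calling setTransition once per line of the input, and each step of accepts is a single array lookup.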

You haven't actually asked a question. You'll get more and better help if you have a specific question for a specific task (but still give the overall goal). The question should be narrow in scope (e.g. not "How can I implement an FSA?").
As for how to represent an FSA (which seems to be what you're having difficulties with), read on.
Start by considering the definition of an FSM: it's an alphabet ∑, a set of states S, a start state s0, a set of accept states A and a transition function δ from a state and a symbol to a state. You have to be able to determine these properties from the input. Any states not reachable by the transition function can be dropped to produce an equivalent FSM. The minimal set of states and alphabet are thus implicit in the transition function; you could make your FSM easier to use (and harder to implement, but not much harder) by not requiring either ∑ or S in the input.
You don't need to use the same representation for states that the input uses. You could use unsigned integers for your internal representation, as long as you have a map from integers to strings and strings to integers so you can convert between the internal representation and external representation. This way, your transition function can be stored as an array, so the transition step can be performed in constant time.
A simpler approach would be to use the external representation as your internal representation. With this option, the transition function would be stored as a map from strings and symbols to strings. The transition step would probably be O(log(|S|+|∑|)), given the performance of most map data structures. If symbols are represented as integers (e.g. chars), the transition function could be represented as a map from strings to an array of strings, giving O(log(|S|)) performance.
Yet another option, modeled after the graph view of an FSM, is to create a class for states. A state has a name (the external representation). States are responsible for transitions; send a symbol to a state and get back another state.
class State {
property name;
State& transition(Symbol s);
void setTransition(Symbol s, State& to);
}
Store the set of states as a map from names to states.
There you go, three different places to start, each with a different way to represent states.
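Since the asker may use Java, here is one way the State pseudocode above might translate. Treating a symbol as a char, returning null when no transition is defined, and the StateGraph helper that stores states by name are all choices made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class State {
    final String name; // the external representation
    private final Map<Character, State> transitions = new HashMap<>();

    State(String name) {
        this.name = name;
    }

    void setTransition(char symbol, State to) {
        transitions.put(symbol, to);
    }

    /** Returns the next state, or null if this state has no transition for the symbol. */
    State transition(char symbol) {
        return transitions.get(symbol);
    }
}

class StateGraph {
    // the set of states, stored as a map from names to states
    private final Map<String, State> states = new HashMap<>();

    /** Looks up a state by name, creating it the first time the name is seen. */
    State state(String name) {
        return states.computeIfAbsent(name, State::new);
    }
}
```

Creating states on first use means a transition line can mention a target state before that state's own transitions have been read.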

Stop thinking about everything at once. Do one thing at a time:
- come up with a language for the state machine
- come up with a language for the stimulus
- create a sample file of one state machine in that language
- create a sample file of stimulus
- come up with a class for state
- come up with a class for transition
- come up with a class for state machine as a set of states and transitions
- add a method to handle violations to the state class
- code a little parser for the state machine language
- code another parser for the stimulus language
- initial state
- some output thing like WriteLn here and there
- main method
- compile
- run
- debug
- done

The way the OpenFst toolkit does it is: A FSM has a vector of states, each of which has a vector of arcs. Each arc has an input (and output) label, a target state ID and a weight. You could take a look at the code. Maybe it will inspire you.
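A loose Java sketch of that layout, a list of states where each state owns a list of arcs. It only mirrors the description above and is not OpenFst's actual API: the names ArcFsm, Arc, addState and addArc are invented, state 0 is assumed to be the start state, and the output label and weight are stored but not used by accepts.

```java
import java.util.ArrayList;
import java.util.List;

class ArcFsm {
    /** One outgoing edge: input label, output label, target state ID and weight. */
    static final class Arc {
        final char input;
        final char output;
        final int target;
        final double weight;

        Arc(char input, char output, int target, double weight) {
            this.input = input;
            this.output = output;
            this.target = target;
            this.weight = weight;
        }
    }

    private final List<List<Arc>> arcsByState = new ArrayList<>(); // index is the state ID
    private final List<Boolean> isFinal = new ArrayList<>();

    /** Adds a state and returns its ID (0 for the first state added). */
    int addState(boolean accepting) {
        arcsByState.add(new ArrayList<>());
        isFinal.add(accepting);
        return arcsByState.size() - 1;
    }

    void addArc(int from, char input, char output, int target, double weight) {
        arcsByState.get(from).add(new Arc(input, output, target, weight));
    }

    /** Follows the first arc whose input label matches each character. */
    boolean accepts(String s) {
        if (arcsByState.isEmpty()) {
            return false;
        }
        int state = 0; // assumption: the first state added is the start state
        for (char c : s.toCharArray()) {
            Arc taken = null;
            for (Arc arc : arcsByState.get(state)) {
                if (arc.input == c) {
                    taken = arc;
                    break;
                }
            }
            if (taken == null) {
                return false;
            }
            state = taken.target;
        }
        return isFinal.get(state);
    }
}
```

Compared with a full matrix, the arc lists only store transitions that exist, which matters when the alphabet is large and most states have few outgoing edges.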

If you're using an object-oriented language like Java or C++, I'd recommend that you start with objects. Before you worry about file formats and the like, get a good object model for a finite state automaton and how it behaves. How will you represent states, transitions, events, etc.? Will your FSA be a Composite? Once you have that sort of thing working you can get the file formats right. Anything will do: XML, text, etc.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
