Parse text- Scanner or BufferedReader? - java

For my data structures class, the first project requires a text file of songs to be parsed.
An example of input is:
ARTIST="unknown"
TITLE="Rockabye Baby"
LYRICS="Rockabye baby in the treetops
When the wind blows your cradle will rock
When the bow breaks your cradle will fall
Down will come baby cradle and all
"
I'm wondering the best way to extract the Artist, Title and Lyrics to their respective string fields in a Song class. My first reaction was to use a Scanner, take in the first character, and based on the letter, use skip() to advance the required characters and read the text between the quotation marks.
If I use this, I'm losing out on buffering the input. The full song text file has over 422K lines of text. Can the Scanner handle this even without buffering?

For something like this, you should probably just use Regular Expressions. The Matcher class supports buffered input.
The find method takes an offset, so you can just parse them at each offset.
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html
Regex is a whole world into itself. If you've never used them before, start here http://download.oracle.com/javase/tutorial/essential/regex/ and be prepared. The effort is so very worth the time required.

If the source data can be parsed using one token look ahead, StreamTokenizer may be a choice. Here is an example that compares StreamTokenizer and Scanner.

In this case, you could use a CSV reader, with the field separator '=' and the field delimiter '"' (double quote). It's not perfect, as you get one row for ARTIST, TITLE, and LYRICS.

Related

How to keep spaces where they were after modifying string

So I have to get words from a text file, change them, and put them into a new text file.
The problem I'm having is, lets say the first line of the file is
hello my name is bob
the modified result should be:
ellohay myay amenay isay bobay
but instead, the result ends up being
ellomynameisbobhay
so scanner has .nextLine() but I want to have a method that is .nextWord() or something, so that it will recognize something as a word until it has a space after it. how can I create this?
nextLine() gives you the whole line.
What you should use is just next(), that will give you the next word.
Also see String.split() or StringTokenizer if you wanted to post-process whole lines. It sound s as though in your situation just using the scanner is fine, but I though i'd mention it because I assumed you'd have just used those methods if you knew about them.

Suggested ways of reading a text file with inconsistent formatting

I'm trying to read a text file of numbers as a double array and after various methods (usually resulting in an input format exception) I have come to the conclusion that the text file I am trying to read is inconsistent with it's delimiting.
The majority of the text format is in the form "0.000,0.000" so I have been using a Scanner and the useDelimiter(",") to read in each value.
It turns out though (this is a big file of numbers) that some of the formatting is in the form "0.000 0.000" (at the end of a line I presume) which of course produces an input format exception.
This is an open question really, I'm a pretty basic Java programmer so I would just like to see if there are any suggestions/ways of performing this. Is Scanner the correct class to go on this?
Thank you for your time!
Read file as text line-by-line. Then split line into parts:
String[] parts = line.split("[ ,]");
Now iterate over the parts and call Double.parseDouble() for each part.
Scanner allows any Java Regex Pattern to function as a delimiter. You should be able to use any number of delimiters by doing the following:
scanner.setDelimiter("[,\\s]"); // Will match commas and whitespace
I'd like to comment this in instead of making it a separate answer, but my reputation is too low. Apologies, Alex.
You mentioned having two different delimited characters used in different instances, not a combination of the two as a single delimiter.
You can use the vertical bar as logical OR in a regular expression.
scanner.setDelimiter("[,|\\s]"); //Will match commas or whitespace as appropriate
line by line:
String[] parts = line.split("[,|\\s]");

Java Splitting a String

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
That I have been trying to split in java. The file is comma delimeted but some of the strings have commas within them and I don't want them to get split up. Currently in the above example
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how to best split this up by comma's. Not all of the strings have the " " for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.
http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.
I would think that this would be a multi step process. First, find all the comma's in quotes from your original string, replace it with something like {comma}. You can do this with some regex. Then on the new string, split the new string with the comma symbol(,). Then go through your list, and replace the {comma} with the comma symbol {,}.

PushbackReader Equivalents

A somewhat vague question, I apologize in advance.
I'm building the tokenizing portion of a small parser with help of the book Building Parsers with Java. It uses PushbackReader and the String contained within as a way to first detect the first character of the given string then sends the PushbackReader to the appropriate state (the state then builds the token as a separate object containing a String).
PushbackReader seems to only be used if no other characters of use are found within the the stream. It then unreads the last character.
Is it possible to do the same thing with a CharBuffer's append? Preferably something that doesn't require the buffer to be predefined.
Based on what I see, he chose PushbackReader for two reasons:
He needed a reader that could handle individual characters.
He needed to backup in the stream because when tokenizing he needed to see one character or more ahead to decide if the current char was part of the token.
For example with the method WhitespaceState.nextToken he is skipping whitespace characters. He pulls off a character and looks at it. If it is a whitespace char he pulls the next char. When he finally pulls a character that is not whitespace, he puts it back into the stream so the next method that looks at the stream will be looking at the correct character.
While you could replace it with something more simple that has just two methods, read(), and unread(), you have to remember that by doing so you will probably be
Reading in the entire input, and then processing the input. So if you have a large file you will be eating up memory to store it.
Reading the input once as a stream, but storing the char(s) from unread() and passing them around in a separate structure.
With PushbackReader, he is reading and processing through the input once, he does not have to buffer the entire input, nor is he having to store the unread() characters and pass them around separately

Parsing of data structure in a plain text file

How would you parse in Java a structure, similar to this
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is "\n" - limited. Space is used between the words.
The structure is more or less defined. Ambuguity may sometimes be, though, case
number of fields in each line of information may be different, sometimes there may not
be some block of data, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?
Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text
As no one recommended any library, my suggestion would be : use REGEX.
From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).
If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());

Categories