I have a Java program that reads lines of a text file into a buffer. When the buffer is full it outputs the lines, so that after all lines have been through the buffer the output is partially sorted.
The output will be in blocks of lines, so I need a way to mark the end of each block in the output. Since the output is lines of text, I'm not sure what character to use as a marker, because the text can contain any character. I'm thinking of using the ASCII NUL or unit separator, but I'm not sure this would be reliable since either could also appear in the text.
You could use a Map, so that you can set a key for every buffer group, something like this:
Map<Integer, Buffer> myMap = new HashMap<>();
If you are not sure how to discriminate lines, I suggest you take a look at a sentence tokenizer tool, which is usually used in NLP. These programs contain patterns that discriminate lines from each other. That way, you can send all your data through and get the lines without worrying about which character to use. There are plenty of libraries for Java that do the job well (assuming your text is in English).
Related
I'm trying to read a text file of numbers as a double array, and after trying various methods (usually resulting in an input format exception) I have come to the conclusion that the text file I am trying to read is inconsistent in its delimiters.
The majority of the text format is in the form "0.000,0.000" so I have been using a Scanner and the useDelimiter(",") to read in each value.
It turns out though (this is a big file of numbers) that some of the formatting is in the form "0.000 0.000" (at the end of a line I presume) which of course produces an input format exception.
This is an open question really. I'm a pretty basic Java programmer, so I would just like to see if there are any suggestions or ways of doing this. Is Scanner the right class for this?
Thank you for your time!
Read file as text line-by-line. Then split line into parts:
String[] parts = line.split("[ ,]");
Now iterate over the parts and call Double.parseDouble() for each part.
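The steps above can be sketched as follows. This is only an illustration: the class and method names are made up, and the lines are passed in as an Iterable so they could come from Files.readAllLines or BufferedReader.lines(). Empty parts are skipped because a separator pair such as ", " produces an empty string when splitting on a single character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class DelimitedDoubles {
    // Parses every number on every line, accepting a comma or a space as separator.
    static List<Double> parseLines(Iterable<String> lines) {
        List<Double> values = new ArrayList<>();
        for (String line : lines) {
            for (String part : line.split("[ ,]")) {
                if (part.isEmpty()) {
                    continue; // adjacent separators (or an empty line) yield empty parts
                }
                values.add(Double.parseDouble(part));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("0.000,1.500 2.250", "3.000, 4.000");
        System.out.println(parseLines(sample));
    }
}
```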
Scanner allows any Java Regex Pattern to function as a delimiter. You should be able to use any number of delimiters by doing the following:
scanner.useDelimiter("[,\\s]+"); // Matches runs of commas and whitespace; the + avoids empty tokens
I'd like to comment this in instead of making it a separate answer, but my reputation is too low. Apologies, Alex.
You mentioned having two different delimited characters used in different instances, not a combination of the two as a single delimiter.
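You can use the vertical bar as a logical OR in a regular expression. Note that it must sit outside a character class: inside square brackets, | is just a literal character.
scanner.useDelimiter("(,|\\s)+"); // Matches commas or whitespace as appropriate
Or, line by line:
String[] parts = line.split("(,|\\s)+");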
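For reference, here is a minimal Scanner-based sketch of the same idea. The class name is hypothetical, and the Locale.US call is my own assumption: it is there so that "0.000" is read with a decimal point regardless of the machine's default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper, not part of the original answers.
final class ScannerDoubles {
    // Reads all doubles from text whose values are separated by commas and/or whitespace.
    static List<Double> readAll(String text) {
        List<Double> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useLocale(Locale.US);     // assumption: numbers use '.' as decimal separator
            scanner.useDelimiter("(,|\\s)+"); // one or more commas or whitespace characters
            while (scanner.hasNextDouble()) {
                values.add(scanner.nextDouble());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(readAll("0.000,1.500 2.250\n3.000, 4.000\n"));
    }
}
```

To read from a file instead, the Scanner could be constructed from a File or Path; the loop stops at the first token that is not a number.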
I need to write a parser for text files (at least 20 kB each), and I need to determine whether any words from a given set (about 400 words and numbers) appear in the file. So I am looking for the most efficient way to do this (if a match is found, I need to do some further processing of that line and the line before it).
What I currently do is exclude lines that certainly contain no information (a kind of metadata line) and then compare word by word, but I don't think comparing word by word is the most efficient approach.
Can anyone please provide some tips/hints/ideas/...
Thank you very much
It depends on what you mean by "efficient".
If you want a very straightforward way to code it, keep in mind that the String class in Java has the method String.contains(CharSequence sequence).
So you could read the file content into a String and then iterate over the keywords you want to check, using contains() to see whether any of them appear in the String.
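A rough sketch of that approach, with hypothetical names. The file content is assumed to be in memory already (for example via new String(Files.readAllBytes(path), StandardCharsets.UTF_8)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class ContainsSearch {
    // Returns the keywords that occur anywhere in the content, in keyword order.
    static List<String> keywordsFoundIn(String content, List<String> keywords) {
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            if (content.contains(keyword)) { // substring match, not whole-word match
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String content = "first line\nsecond line with alpha\n";
        System.out.println(keywordsFoundIn(content, Arrays.asList("alpha", "beta")));
    }
}
```

Keep in mind that contains() matches substrings, so the keyword "cat" would also be found inside "concatenate", and that each keyword triggers a separate pass over the whole content.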
How about the following:
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at a time
For each line in the file:
    Tokenize it into words
    For each word in the line:
        If the word is contained in keywords (keywords.contains(word)):
            Process the current line
            If a previous line is available:
                Process the previous line
    Keep track of the previous line (prevLine = line;)
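In Java, the outline above might look roughly like this. The class, the Match type, and the whitespace tokenization are assumptions made for illustration; "processing" is reduced to collecting each matching line together with the line before it, and the inner loop stops at the first hit so a line is reported only once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of the original answer.
final class KeywordScan {
    // One hit: the matching line plus the line before it (null for the first line).
    static final class Match {
        final String previousLine;
        final String line;

        Match(String previousLine, String line) {
            this.previousLine = previousLine;
            this.line = line;
        }
    }

    static List<Match> scan(Iterable<String> lines, Set<String> keywords) {
        List<Match> matches = new ArrayList<>();
        String prevLine = null;
        for (String line : lines) {
            // Assumption: words are separated by whitespace.
            for (String word : line.trim().split("\\s+")) {
                if (keywords.contains(word)) { // hash lookup per word
                    matches.add(new Match(prevLine, line));
                    break; // one hit per line is enough here
                }
            }
            prevLine = line;
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("alpha", "42"));
        List<String> lines = Arrays.asList("header", "foo alpha bar", "nothing here", "total 42");
        for (Match m : scan(lines, keywords)) {
            System.out.println(m.previousLine + " -> " + m.line);
        }
    }
}
```

Each word costs one hash lookup, so the work grows with the size of the file rather than with the number of keywords.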
Simplifying my problem a bit, I have a set of text files with "records" that are delimited by double newline characters. Like
'multiline text'
'empty line'
'multiline text'
'empty line'
and so forth.
I need to transform each multiline unit separately and then perform mapreduce on them.
However, I am aware that with the default wordcount setting in the hadoop code boilerplate, the input to the value variable in the following function is just a single line and there are no guarantees that the input is contiguous with the previous input line.
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException ;
And I need it to be that the input value is actually one unit of the double newline delimited multiline text.
Some searching turned up a RecordReader class and a getSplits method but no simple code examples that I could wrap my head around.
An alternative solution is to just replace all newline characters in the multiline text with space characters and be done with it. I'd rather not do this because there's quite a bit of text and it's time consuming in terms of runtime. I also have to modify a lot of code if I do this so dealing with it through hadoop would be most attractive for me.
If your files are small in size, then they won't get split. Essentially each file is one split assigned to one mapper instance. In this case, I agree with Thomas. You can build your logical record in your mapper class, by concatenating strings. You can detect your record boundary by looking for an empty string coming in as value to your mapper.
However, if the files are big and get split, then I don't see any other option but to implement your own text input format class. You could clone the existing Hadoop LineRecordReader and LineReader Java classes. You have to make a small change in your version of the LineReader class so that the record delimiter is two newlines instead of one. Once this is done, your mapper will receive multiple lines as its input value.
What's the problem with it? Just put the previous lines into a StringBuilder and flush it when you reach a new record.
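The buffering idea can be shown in plain Java, outside of Hadoop. This is only a sketch with made-up names. In a mapper the same logic would presumably live across calls to map(), with a final flush when the mapper is closed, and it is only valid when all lines of a record reach the same mapper in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of the original answer and not Hadoop API code.
final class RecordGrouper {
    // Joins consecutive non-empty lines into one record; an empty line ends a record.
    static List<String> group(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                flush(current, records);
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        flush(current, records); // the last record may not be followed by an empty line
        return records;
    }

    private static void flush(StringBuilder current, List<String> records) {
        if (current.length() > 0) {
            records.add(current.toString());
            current.setLength(0); // reuse the builder for the next record
        }
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("a", "b", "", "c", "", "", "d")));
    }
}
```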
When you are using textfiles, they won't get split. For these cases it uses FileInputFormat, which only parallelizes to the number of files available.
What is the most efficient way to split a file in Java?
I'd like to get it grid ready...
(Edit)
Modifying the question.
Basically after scouring the net I understand that there are generally two methods followed for file splitting....
Just split them by the number of bytes
I guess the advantage of this method is that it is fast, but say I have all the data in a line and suppose the file split puts half the data in one split and the other half the data in another split, then what do I do ??
Read them line by line
This will keep my data intact, fine, but I suppose this ain't as fast as the above method
Well, just read the file line by line and start saving it to a new file. Then when you decide it's time to split, start saving the lines to a new place.
Don't worry about efficiency too much unless it's a real problem later.
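A minimal sketch of the line-by-line approach. The class name, the "part-N.txt" file naming, and the fixed lines-per-part rule are all assumptions; the "time to split" decision could just as well be based on bytes written.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the original answer.
final class LineSplitter {
    // Copies source into files of at most linesPerPart lines each and returns their paths.
    static List<Path> split(Path source, Path outputDir, int linesPerPart) throws IOException {
        if (linesPerPart < 1) {
            throw new IllegalArgumentException("linesPerPart must be at least 1");
        }
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                String line;
                int linesInPart = 0;
                while ((line = reader.readLine()) != null) {
                    if (writer == null || linesInPart == linesPerPart) {
                        if (writer != null) {
                            writer.close(); // finish the current part
                        }
                        Path part = outputDir.resolve("part-" + parts.size() + ".txt");
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        parts.add(part);
                        linesInPart = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInPart++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return parts;
    }
}
```

Because the split only ever happens between lines, no line is cut in half, which addresses the concern about splitting by byte count.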
My first impression is that you have something like a comma separated value (csv) file. The usual way to read / parse those files is to
read them line by line
skip headers and empty lines
use String#split(String reg) to split a line into values (reg is chosen to match the delimiter)
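As a sketch of those three steps, with hypothetical names and under the assumption that the file has exactly one header line (the first non-empty line) and no quoted fields:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of the original answer.
final class SimpleCsv {
    // Splits each data line into values; skips empty lines and the first non-empty (header) line.
    static List<String[]> parse(Iterable<String> lines, String delimiterRegex) {
        List<String[]> rows = new ArrayList<>();
        boolean headerSkipped = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // empty line
            }
            if (!headerSkipped) {
                headerSkipped = true; // assumption: a single header line
                continue;
            }
            // A limit of -1 keeps trailing empty values such as the last field in "amy,".
            rows.add(line.split(delimiterRegex, -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("name,age", "", "bob,3", "amy,");
        for (String[] row : parse(lines, ",")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Values that contain the delimiter inside quotes are not handled by a plain split; a CSV library would be needed for that.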
How would you parse in Java a structure, similar to this
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is "\n" - limited. Space is used between the words.
The structure is more or less defined. There may sometimes be ambiguity, though: the
number of fields in each line of information may differ, sometimes a block of data may
be missing, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?
Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text
As no one recommended any library, my suggestion would be : use REGEX.
From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).
If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());
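A simplified form of the line-by-line idea might look like the sketch below. It only tracks which block the parser is currently in, and it assumes (from the sample in the question) that a block header is any line starting with a backslash. The class name and the returned map are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the original answer.
final class BlockParser {
    // Groups data lines under the most recent header line, keeping block order.
    static Map<String, List<String>> parse(Iterable<String> lines) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null; // state: lines of the block being read, if any
        for (String line : lines) {
            if (line.startsWith("\\")) {
                // Header line: strip the leading and trailing backslashes to get the name.
                String name = line.replaceAll("^\\\\+|\\\\+$", "").trim();
                current = new ArrayList<>();
                blocks.put(name, current); // a repeated name replaces the earlier block
            } else if (current != null && !line.trim().isEmpty()) {
                current.add(line); // data line of the current block
            }
        }
        return blocks;
    }
}
```

Each block's lines can then be split further, for example with the indexOf and substring calls shown above for the password lines.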