Identifying points in lines of text

I have a Java program that reads lines of a text file into a buffer. When the buffer is full it outputs the lines, so that after all lines have passed through the buffer the output is partially sorted.
The output will be in blocks of lines, so I need a way to mark the end of each block in the output. Since the output is lines of text that can contain any character, I'm not sure what to use as a marker. I'm thinking of using the ASCII null or unit separator character, but I'm not sure this would be reliable, since either could also appear in the text.

You could use a Map, so that each buffer group gets its own key, something like this:
Map<Integer, Buffer> myMap = new HashMap<>();
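
A minimal sketch of that idea, assuming each flushed buffer is stored as a list of lines under a running block number, so no in-band marker character is needed. The class and method names (BlockCollector, add, flush) are hypothetical, and sorting each buffer before storing it is only an assumption about what "partially sorted" means in the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collects lines into blocks keyed by block number.
final class BlockCollector {
    private final int capacity;
    private final List<String> buffer = new ArrayList<>();
    private final Map<Integer, List<String>> blocks = new LinkedHashMap<>();

    BlockCollector(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Adds one line; a full buffer is stored as a new block.
    void add(String line) {
        buffer.add(line);
        if (buffer.size() == capacity) {
            flush();
        }
    }

    // Sorts and stores whatever is buffered as the next block (no-op when empty).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> block = new ArrayList<>(buffer);
        Collections.sort(block); // assumption: each block is sorted on its own
        blocks.put(blocks.size(), block);
        buffer.clear();
    }

    Map<Integer, List<String>> blocks() {
        return blocks;
    }
}
```

Call flush() once more after the last line so that a partially filled buffer also becomes a block. The block boundaries then live in the map keys rather than in the text itself.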

If you are not sure how to discriminate lines, I suggest you take a look at a sentence tokenizer, a tool commonly used in NLP. These programs contain patterns that discriminate lines from each other. That way, you can send all your data through and get the lines back without worrying about which character to use. There are plenty of Java libraries that do this job well (assuming your text is in English).

Related

Java parsing text file

I need to write a parser for text files (at least 20 KB), and I need to determine whether any words from a set of about 400 words and numbers appear in the file. So I am looking for the most efficient way to do this (if a match is found, I need to do some further processing of the matching line and its previous line).
What I currently do is exclude lines that certainly contain no information (metadata-like lines) and then compare word by word, but I don't think that comparing word by word alone is the most efficient approach.
Can anyone please provide some tips, hints, or ideas?
Thank you very much

It depends on what you mean by "efficient".
If you want a very straightforward way to code it, keep in mind that the String class in Java has the method String.contains(CharSequence sequence).
You could then put the file content into a String and iterate over the keywords you want to check, using contains() to see whether any of them appear in the String.
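
A rough sketch of that approach, assuming the whole file already sits in one String. The names ContainsSearch and findKeywords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ContainsSearch {
    // Returns the keywords that occur anywhere in the text.
    // This is a case-sensitive substring match, not a whole-word match.
    static List<String> findKeywords(String text, List<String> keywords) {
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }
}
```

Note that contains() matches substrings, so the keyword "cat" is also found inside "concatenate". It also does not tell you which line matched, which matters if the previous line has to be processed too.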

How about the following:
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at a time
For each line in file:
    Tokenize to words
    For each word in line:
        If word is contained in keywords (keywords.contains(word))
            Process actual line
            If previous line is available
                Process previous line
    Keep track of previous line (prevLine = line;)
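
The steps above could look roughly like this in Java. It is only a sketch: "processing" is replaced by collecting the matching line together with its previous line, tokenizing means splitting on whitespace, and the names KeywordScanner and scan are hypothetical.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class KeywordScanner {
    // For every line containing a keyword, returns {matchingLine, previousLine};
    // previousLine is null when the match is on the first line.
    static List<String[]> scan(BufferedReader reader, Set<String> keywords) {
        List<String[]> hits = new ArrayList<>();
        String prevLine = null;
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            for (String word : line.trim().split("\\s+")) {
                if (keywords.contains(word)) {
                    hits.add(new String[] { line, prevLine });
                    break; // one hit per line is enough
                }
            }
            prevLine = line; // keep track of the previous line
        }
        return hits;
    }
}
```

Each lookup in the HashSet takes constant time on average, so the cost grows with the number of words in the file rather than with the number of keywords.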

Processing paragraphs in text files as single records with Hadoop

Simplifying my problem a bit, I have a set of text files with "records" that are delimited by double newline characters. Like
'multiline text'
'empty line'
'multiline text'
'empty line'
and so forth.
I need to transform each multiline unit separately and then perform mapreduce on them.
However, I am aware that with the default wordcount setting in the hadoop code boilerplate, the input to the value variable in the following function is just a single line and there are no guarantees that the input is contiguous with the previous input line.
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException;
And I need it to be that the input value is actually one unit of the double newline delimited multiline text.
Some searching turned up a RecordReader class and a getSplits method but no simple code examples that I could wrap my head around.
An alternative solution is to just replace all newline characters in the multiline text with space characters and be done with it. I'd rather not do this because there's quite a bit of text and the replacement is time consuming at runtime. I would also have to modify a lot of code, so dealing with it through Hadoop would be most attractive for me.

If your files are small in size, then they won't get split. Essentially each file is one split assigned to one mapper instance. In this case, I agree with Thomas. You can build your logical record in your mapper class, by concatenating strings. You can detect your record boundary by looking for an empty string coming in as value to your mapper.
However, if the files are big and get split, then I don't see any other option but to implement your own text input format class. You could clone the existing Hadoop LineRecordReader and LineReader Java classes. You have to make a small change in your version of the LineReader class so that the record delimiter is two newlines instead of one. Once this is done, your mapper will receive multiple lines as its input value.

What's the problem with it? Just put the previous lines into a StringBuilder and flush it when you reach a new record.
When you are using textfiles, they won't get split. For these cases it uses FileInputFormat, which only parallelizes to the number of files available.
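
To illustrate the StringBuilder idea, here is a plain-Java sketch with no Hadoop classes involved. It assumes the lines of one file arrive in order and that an empty line ends a record. ParagraphAssembler and assemble are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphAssembler {
    // Groups consecutive non-empty lines into records; an empty line ends a record.
    static List<String> assemble(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                if (current.length() > 0) {
                    records.add(current.toString()); // flush the finished record
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // flush the last record
        }
        return records;
    }
}
```

In a mapper the same buffering would happen across calls to map(), and the last record would still have to be flushed once the input ends. This only works when a record never straddles two input splits.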

Java File Splitting

What is the most efficient way to split a file in Java?
Like to get it grid ready...
(Edit)
Modifying the question.
Basically after scouring the net I understand that there are generally two methods followed for file splitting....
Just split them by the number of bytes
I guess the advantage of this method is that it is fast, but say I have all the data in a line and the split puts half of that line in one part and the other half in another part. What do I do then?
Read them line by line
This will keep my data intact, fine, but I suppose this isn't as fast as the above method

Well, just read the file line by line and start saving it to a new file. Then when you decide it's time to split, start saving the lines to a new place.
Don't worry about efficiency too much unless it's a real problem later.
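
Something along these lines, as a sketch only. It assumes the "time to split" is a fixed number of lines per part, that the text is UTF-8, and that parts are named part-0.txt, part-1.txt and so on. LineSplitter and split are made-up names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class LineSplitter {
    // Copies input into part files with at most linesPerFile lines each.
    static List<Path> split(Path input, Path outputDir, int linesPerFile) {
        if (linesPerFile < 1) {
            throw new IllegalArgumentException("linesPerFile must be positive");
        }
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                String line;
                int linesInPart = 0;
                while ((line = reader.readLine()) != null) {
                    if (writer == null || linesInPart == linesPerFile) {
                        // Time to split: close the current part and open the next one.
                        if (writer != null) {
                            writer.close();
                        }
                        Path part = outputDir.resolve("part-" + parts.size() + ".txt");
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        parts.add(part);
                        linesInPart = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInPart++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }
}
```

Because the split only ever happens between lines, no line ends up half in one part and half in another.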

My first impression is that you have something like a comma-separated values (CSV) file. The usual way to read / parse those files is to
read them line by line
skip headers and empty lines
use String#split(String reg) to split a line into values (reg is chosen to match the delimiter)
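
A small sketch of those three steps. It assumes the first non-empty line is the only header and that fields never contain the delimiter inside quotes. SimpleCsv and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleCsv {
    // Splits each data line into values, skipping blank lines and the header line.
    static List<String[]> parse(List<String> lines, String delimiterRegex) {
        List<String[]> rows = new ArrayList<>();
        boolean headerSkipped = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip empty lines
            }
            if (!headerSkipped) {
                headerSkipped = true; // assumption: first non-empty line is the header
                continue;
            }
            rows.add(line.split(delimiterRegex, -1)); // -1 keeps trailing empty values
        }
        return rows;
    }
}
```

The delimiter is a regular expression, so a character such as | has to be escaped as "\\|". For real CSV data with quoted fields a dedicated CSV library is usually the safer choice.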

Parsing of data structure in a plain text file

How would you parse in Java a structure, similar to this
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is terminated by "\n". Words are separated by spaces.
The structure is more or less defined. There can be some ambiguity, though: the number of fields in each line of information may differ, a block of data may sometimes be missing, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?

Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text

As no one recommended any library, my suggestion would be : use REGEX.

From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
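
For example, a sketch with Scanner that separates the leading digits of each token from the rest. TokenSplitter and split are hypothetical names, and returning the result as two-element arrays is just one way to hold it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class TokenSplitter {
    // For each whitespace-separated token returns {leadingDigits, rest}.
    // Tokens that do not start with a digit come back as {"", token}.
    static List<String[]> split(String text) {
        List<String[]> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String token = scanner.next();
                int i = 0;
                while (i < token.length() && Character.isDigit(token.charAt(i))) {
                    i++;
                }
                result.add(new String[] { token.substring(0, i), token.substring(i) });
            }
        }
        return result;
    }
}
```

With the sample data, "1JohnRide" becomes {"1", "JohnRide"} and a bare "1" becomes {"1", ""}. A token such as "123232aDDF" is split into "123232" and "aDDF", which may or may not be what the format intends.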

This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).

If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());
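
The state machine part could start out as small as the sketch below, which has only two states: no header seen yet, and inside a block. It assumes every line starting with a backslash is a block header and that blank lines can be ignored. BlockParser and parse are made-up names.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BlockParser {
    // Maps each header name to the lines that follow it, up to the next header.
    // A repeated header name replaces the earlier block of the same name.
    static Map<String, List<String>> parse(BufferedReader reader) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null; // null means no header has been seen yet
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("\\")) {
                // Header line: drop the backslashes to get the block name.
                String name = line.replace("\\", "").trim();
                current = new ArrayList<>();
                blocks.put(name, current);
            } else if (current != null && !line.trim().isEmpty()) {
                current.add(line); // data line of the current block
            }
        }
        return blocks;
    }
}
```

Each block's lines can then be split further, for instance with indexOf as shown above, depending on which block they belong to.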
