I need to write a parser for textfiles (at least 20 kb), and I need to determine if words out of a set of words appear in this textfile (about 400 words and numbers). So I am looking for the most efficient possibilitie to do this (if a match is found, i need to do some further processing of this and it's previous line).
What I currently do, is to exclude lines that do not contain any information for sure (kind of metadata lines) and then compare word by word - but i don't think that only comparing word by word is the most efficient possibility.
Can anyone please provide some tips/hints/ideas/...
Thank you very much
It depends on what you mean with "efficient".
If you want a very straightforward way to code it, keep in mind that the String object in java has method String.contains(CharSequence sequence).
Then, you could put the file content into a String and then iterate on your keywords you want to check to see if any of those appear in String, using the method contains().
How about the following:
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at once
For each line in file:
Tokenize to words
For each word in line:
If word is contained in keywords (keywords.containes(word))
Process actual line
If previous line is available
Process previous line
Keep track of previous line (prevLine = line;)
Related
I need to read through multiple files and check for all occurrences of words that start with a specific pattern and replace it in all the files with another word. For example, I need to find all words beginning with 'Hello' in a set of files which may contain words like 'Hellotoall' and then I want the word to be replaced with 'Greetings', just an example. I have tried:
content = content.replaceAll("/Hello(\\w)+/g", "Greetings");
This code results in : Greetingstoall, but I want the whole word to be replaced with 'Greetings', i.e. if the file has a line:
Today i say Hellotoall present here. After replacement the line should be like: Today i say Greetings present here.
How can I achieve such a requirement with a better regex.
You need just "Hello(\\w)*".
isn't the output Greetingsoall? The match would be Hellot - so first thing is that you may want to replace + with *
As talex pointed out, there is sed syntax mixed in, which doesn't work with Java.
content.replaceAll("Hello\w*", "Greetings")
I've been currently assigned to read from a .txt, and make a structure with what I've read. This is an example of what I should read:
Name1$Surname1$Programming&5.0#Mathematics&6.5#Algebra&7.2#History&6.7#Biology&6.9
I have no problems whatsoever when it comes to read the first two strings, however, from that point on i don't know how to manage, in order to properly split it and make new objects with them.
Any tips/ suggestions on how to do it pls?
Weird structure.
Split at '$'
First element of that split is the Name, second the Surname.
Split the third element at '#'.
Split each element of the result of step 3 again at '&' to get course and grade.
See here how to split strings.
I am working on Java project [Maven].
I am confused in one point. I don't know what is logiclaly corect.
Problem is as follows :-
Sentence is given, and from their I have extract some particular words.
Solution that I found
I make one regex and put in Constants class. Whenever I have to add more words, I simply appended words in regex.
This solves the problem.
I am confused here
I am thinking, if I put numbers of text files in resources folder where each text file denotes one regex expression.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is it a good idea ? If not please suggest any other.
Why would you save regex's in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality but also overhead of supporting regexes).
Also, why do you need new files for each word? That seems like you could just have one file with a word per line that is all of the words you're interested in. This would be much more simple for a user to understand than 100 files with one regex per file.
As my understanding, you want to find some key words from the input string. And those key words could be extened according your requirments.
your current solution is to make this regex (?:A|B|C|D) in your Constant class, wheneveer it's required, you'll add more key words in this regex.
If my understanding is not wrong, maybe, one suggestion is to put this regex in your properties file, like this
REGEX = (?:city|Animal|plant|student)
if too long, it's could be like this
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furnature|others)
Your second idea, if my understanding is not wrong, is to put the keywords as the file name, and those files are put in one resource folder. therefore, you could obtain those files name to compose the final regexp. If your regex are always fixed as the (?:A|B|C|D) format, then this solution is good & convenient. (Every time, you add one new keyword file, you don't need to modify any source code & property file)
I have a text file with thousands and thousands of lines of gibberish, Hidden somewhere inside is a string of words in english.
What would be the most efficient way to search through the text without having to read it line by line?
Is there a script I could write to read through the file?
I can post the file if anyones interested?
edit: If someone would be willing to show me how to check for words with a BufferedReader in Java that would be really cool!
If you know nothing more than that there is one streak of valid english words somewhere in the file, you will have to read in the file and check each word against a set of valid words (dictionary). On the first hit, you continue to read in the file until the first non-valid word occurs.
This assumes there are no accidentally valid words within the gibberish. In that case, you would have to find all streaks of valid words, and then probably have a human (you) decide which is the right one.
edit: another thing you can do is define a minimum streak length n if you know that the string of words you are looking for consists of a minimum on n valid words. This could at least spare you dealing with all the false positive 1-word-streaks of single accidentally valid words within the gibberish.
What will be the most eficient way to split a file in Java ?
Like to get it grid ready...
(Edit)
Modifying the question.
Basically after scouring the net I understand that there are generally two methods followed for file splitting....
Just split them by the number of bytes
I guess the advantage of this method is that it is fast, but say I have all the data in a line and suppose the file split puts half the data in one split and the other half the data in another split, then what do I do ??
Read them line by line
This will keep my data intact, fine, but I suppose this ain't as fast as the above method
Well, just read the file line by line and start saving it to a new file. Then when you decide it's time to split, start saving the lines to a new place.
Don't worry about efficiency too much unless it's a real problem later.
My first impression is that you have something like a comma separated value (csv) file. The usual way to read / parse those files is to
read them line by line
skip headers and empty lines
use String#split(String reg) to split a line into values (reg is chosen to match the delimiter)