Best way to store words for a given scenario

I am working on a Java project (Maven).
I am confused about one point: I don't know which approach is logically correct.
The problem is as follows:
A sentence is given, and from it I have to extract some particular words.
Solution that I found
I created one regex and put it in a Constants class. Whenever I have to add more words, I simply append them to the regex.
This solves the problem.
I am confused here
I am thinking of putting a number of text files in the resources folder, where each text file holds one regex.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is this a good idea? If not, please suggest an alternative.

Why would you save regexes in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality, and the overhead, of supporting arbitrary regexes).
Also, why do you need a new file for each word? You could just have one file listing every word you're interested in, one word per line. That would be much simpler for a user to understand than 100 files with one regex per file.
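A minimal sketch of the one-word-per-line idea, assuming the words are plain literals rather than regexes. The class name KeywordExtractor is hypothetical, and the \b word boundaries are my own assumption so that "city" does not match inside "cities"; drop them if substring matches are wanted. The regex stays an internal detail, and each word is escaped with Pattern.quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds the alternation regex from a word list
// so callers never see or edit the regex itself.
class KeywordExtractor {
    private final Pattern pattern;

    // words: one entry per line of the word file (assumed non-empty)
    KeywordExtractor(List<String> words) {
        String alternation = words.stream()
                .map(String::trim)
                .filter(w -> !w.isEmpty())   // skip blank lines
                .map(Pattern::quote)          // treat each word literally
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Returns every keyword occurrence in the sentence, in order.
    List<String> extract(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

In a real project the list could come from something like Files.readAllLines on a resource file (the file name is up to you); adding a word then means adding a line to that file, with no code change.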

As I understand it, you want to find some keywords in the input string, and the set of keywords may be extended as your requirements change.
Your current solution is to keep the regex (?:A|B|C|D) in your Constants class and add more keywords to it whenever required.
If my understanding is correct, one suggestion is to put this regex in your properties file, like this:
REGEX = (?:city|Animal|plant|student)
If it gets too long, it can be continued on the next line like this:
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furniture|others)
Your second idea, if I understand it correctly, is to use each keyword as a file name and put those files in one resource folder, so that you can read the file names to compose the final regex. If your regex always has the fixed (?:A|B|C|D) form, this solution is good and convenient: every time you add a new keyword file, you don't need to modify any source code or properties file.
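As a rough sketch of the properties-file variant, the snippet below loads the REGEX key with java.util.Properties and compiles it. The class name RegexFromProperties is hypothetical, and the text is passed in as a string only to keep the example self-contained; in a project it would normally be read from a file on the classpath. Properties handles the trailing-backslash line continuation and strips the leading whitespace of the continued line.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper: reads the REGEX entry from properties-format text.
class RegexFromProperties {
    static Pattern load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        String regex = props.getProperty("REGEX");
        if (regex == null) {
            throw new IllegalArgumentException("REGEX key is missing");
        }
        return Pattern.compile(regex);
    }
}
```

One caveat with this approach: backslash is an escape character in the properties format, so a regex token such as \b has to be written as \\b in the file, otherwise the backslash is silently dropped when the file is loaded.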

Related

Find words in multiple files starting with a specific set of characters and replace the whole word with another word

I need to read through multiple files and check for all occurrences of words that start with a specific pattern and replace it in all the files with another word. For example, I need to find all words beginning with 'Hello' in a set of files which may contain words like 'Hellotoall' and then I want the word to be replaced with 'Greetings', just an example. I have tried:
content = content.replaceAll("/Hello(\\w)+/g", "Greetings");
This code results in: Greetingstoall, but I want the whole word to be replaced with 'Greetings'. That is, if the file has the line:
Today i say Hellotoall present here. then after replacement the line should be: Today i say Greetings present here.
How can I achieve this with a better regex?

You need just "Hello(\\w)*".

isn't the output Greetingsoall? The match would be Hellot - so first thing is that you may want to replace + with *
As talex pointed out, there is sed syntax mixed in, which doesn't work with Java.
content = content.replaceAll("Hello\\w*", "Greetings");
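A small runnable sketch of the fix, using the sample line from the question. The class and method names are hypothetical, and the leading \b is my own addition so that only words that begin with Hello are replaced (not, say, "SayHello"). Note that in Java the slashes and the trailing g of the original pattern are treated as literal characters, so that pattern only matches text that actually contains "/Hello.../g".

```java
// Hypothetical demo class for the whole-word replacement.
class HelloReplaceDemo {
    // Replaces every word starting with "Hello" by "Greetings".
    static String replaceHelloWords(String content) {
        // \w* (zero or more) also covers the bare word "Hello"
        return content.replaceAll("\\bHello\\w*", "Greetings");
    }

    public static void main(String[] args) {
        String line = "Today i say Hellotoall present here.";
        System.out.println(replaceHelloWords(line));
    }
}
```

String.replaceAll already replaces every match, so no equivalent of the g flag is needed.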

grouping files in java by file name

I have a set of files, thousands of them. Let's say the files are named like this:
abc a
abc bnd nm
abc vcb
abc
abc something
nmn as
nmn af
nmn bvf
I need to group these files by partial name match. So, in this example I will have two groups: the group [abc] and the group [nmn]. Any suggestions?

Edit: Turns out there's a method in String that makes this much easier than regex: String.startsWith(String prefix)
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#startsWith(java.lang.String)
Regex is useful, but overkill for something like this. My bad for overthinking this at first...
(old answer below)
Sounds like a job for regex!
http://docs.oracle.com/javase/7/docs/api/java/util/regex/package-summary.html
String.matches() would help too.
Basically, you'll want to create regexes that match the first section of your file names and then anything else following it. In the example in your question, the regexes would be "abc.*" for abc______ and "nmn.*" for nmn_____. These may not be 100% correct syntax, but that's the general idea. The rules in the link (look at the Pattern class) would give you pretty much all you need.
What you'd do is create two new Sets, one for each prefix. Then loop through the original set, and based on the regex put the file name into one set or the other.

Create a hash map with a String as the key and, as the value, a list of File objects (if you want to store the files themselves) or Strings (if you only need the file names).
Then check each name: if it starts with Anti, put it in group 1 under whatever key you like;
if it starts with Philips, put it in group 2 under whatever key you like.
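A minimal sketch of the map-based grouping, under the assumption (mine, not stated in the question) that the group key is the first space-separated token of the file name, as in the abc / nmn example. If the prefixes are known in advance, the key lookup could be replaced by a series of startsWith checks. FileGrouper and groupByPrefix are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups file names by their first token.
class FileGrouper {
    static Map<String, List<String>> groupByPrefix(List<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            String trimmed = name.trim();
            int space = trimmed.indexOf(' ');
            // A name without spaces (e.g. "abc") is its own key
            String key = space < 0 ? trimmed : trimmed.substring(0, space);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }
}
```

This makes a single pass over the names, so it should scale to thousands of files without needing one regex per group.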

Using regex to match String to name in external file

I am writing a program that is supposed to detect whether a type name is contained within an external file. For example, if a string is equal to "anitasugarland" and the external file contains the name "ANITA" then is there any way to confirm if there is a name match? The problem I'm having is if I just use Java's "startsWith" then it matches on other names like "An" or other names. As you can see this can cause inaccuracies. So is there a way using regex or word boundaries to check if a first name in the string matches the one in the external file? As of now this really has me stumped. If someone could take a look at this or provide a possible solution I would very much appreciate it!
Thank you!

Need some ideas on how to accomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?

There are several solutions you can use other than hacking together your own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.

I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer
Notice that some lines here hide further complexity. For example, to check whether the buffer is a literal you need to check for true, false, and null.
In addition, there are more bugs in the pseudo-code: it will also pick up things like the e and L parts of literals (e in floating-point literals, L in long literals) as identifiers. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
There are also a few more things to handle if you want it to be accurate; for example, you have to make sure you work with Unicode. I would strongly recommend studying the lexical structure of the language so you don't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
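The pseudo-code above might be turned into Java roughly as follows. This is only a sketch under the question's preconditions (no comments, identifiers made of letters only). The RESERVED set is deliberately partial and purely illustrative, and the class and method names are hypothetical. It also skips the character after a backslash inside a literal, per the note about \" and \', and ignores empty buffers; it does not handle the e and L literal suffixes mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the character-by-character identifier scanner.
class IdentifierScanner {
    // Partial list of keywords and literals, for illustration only
    private static final Set<String> RESERVED = Set.of(
            "class", "public", "private", "static", "void", "new", "return",
            "if", "else", "for", "while", "int", "import", "package",
            "true", "false", "null");

    static List<String> scan(String source) {
        List<String> identifiers = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean inChar = false;    // "character mode"
        boolean inString = false;  // "string mode"
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if ((inChar || inString) && c == '\\') {
                i++;               // skip the escaped character, e.g. \" or \'
                continue;
            }
            if (c == '\'' && !inString) {
                inChar = !inChar;
            } else if (c == '"' && !inChar) {
                inString = !inString;
            }
            if (Character.isLetter(c) && !inChar && !inString) {
                buffer.append(c);
            } else {
                flush(buffer, identifiers);
            }
        }
        flush(buffer, identifiers); // handle an identifier at end of input
        return identifiers;
    }

    // Adds the buffered word unless it is empty or reserved, then clears it.
    private static void flush(StringBuilder buffer, List<String> out) {
        if (buffer.length() > 0 && !RESERVED.contains(buffer.toString())) {
            out.add(buffer.toString());
        }
        buffer.setLength(0);
    }
}
```

Run on the line from the question, this should yield Logger, getLogger, MyHash, getName, log, Level and SEVERE, with class dropped as a keyword.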

Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own Java parser is going to be incredibly difficult. You'll find there are a lot of edge cases you're just not prepared for. To handle all of them, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, see if you can find an ANTLR example that contains a preexisting Java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Parsing of data structure in a plain text file

How would you parse a structure similar to this in Java?
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose it comes from a text buffer (a plain file).
Each line of text is terminated by "\n". A space is used between words.
The structure is more or less defined. There may sometimes be ambiguity, though: the number of fields in each line may differ, some blocks of data may be missing, and the number of lines in each block may vary as well.
The question is how to do this most effectively.
The first solution that comes to mind is to use regular expressions.
But are there other solutions? Problem-oriented ones? Maybe some Java library that has already been written?

Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text

As no one recommended any library, my suggestion would be : use REGEX.

From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
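A rough sketch of the Scanner idea, assuming each token looks like a number immediately followed by its data, as in "1JohnRide 2MarySwanson". TokenSplitter and parseNumbered are hypothetical names, and tokens without a leading digit are simply skipped here, which may or may not suit the real data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper: splits whitespace-delimited tokens into number + data.
class TokenSplitter {
    static Map<Integer, String> parseNumbered(String line) {
        Map<Integer, String> result = new LinkedHashMap<>();
        try (Scanner sc = new Scanner(line)) {
            while (sc.hasNext()) {
                String token = sc.next();
                int i = 0;
                // Count the leading digits of the token
                while (i < token.length() && Character.isDigit(token.charAt(i))) {
                    i++;
                }
                if (i == 0) {
                    continue; // no leading number: not a numbered field
                }
                // Assumes the number fits in an int
                int index = Integer.parseInt(token.substring(0, i));
                result.put(index, token.substring(i)); // may be empty
            }
        }
        return result;
    }
}
```

For a line such as "1 password1" the number and the data arrive as separate tokens, so that layout would need an extra step to pair a bare number with the token that follows it.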

This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).

If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be extracted from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());
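The line-based state machine could be sketched like this. The one assumption I am making about the format is that a line starting with a backslash is a block header and every other non-blank line belongs to the most recent header. BlockParser is a hypothetical name, and the parser keeps raw lines, leaving the per-line splitting (as in the indexOf example above) to the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collects the lines of each block under its header name.
class BlockParser {
    static Map<String, List<String>> parse(Reader input) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null; // state: lines of the block being read
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                if (line.startsWith("\\")) {
                    // Header line: strip the backslashes to get the block name
                    String name = line.replace("\\", "").trim();
                    current = new ArrayList<>();
                    blocks.put(name, current);
                } else if (current != null) {
                    current.add(line);
                }
                // Data before the first header is ignored in this sketch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }
}
```

A missing block then just shows up as an absent key, and blocks may have any number of lines. Note that two blocks with the same name would overwrite each other in this version.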
