Parsing of data structure in a plain text file - java

How would you parse in Java a structure, similar to this
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is "\n" - limited. Space is used between the words.
The structure is more or less defined. Ambuguity may sometimes be, though, case
number of fields in each line of information may be different, sometimes there may not
be some block of data, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?

Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text

As no one recommended any library, my suggestion would be : use REGEX.

From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).

This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).

If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());

Related

Best way to store words for given scenerio

I am working on Java project [Maven].
I am confused in one point. I don't know what is logiclaly corect.
Problem is as follows :-
Sentence is given, and from their I have extract some particular words.
Solution that I found
I make one regex and put in Constants class. Whenever I have to add more words, I simply appended words in regex.
This solves the problem.
I am confused here
I am thinking, if I put numbers of text files in resources folder where each text file denotes one regex expression.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is it a good idea ? If not please suggest any other.
Why would you save regex's in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality but also overhead of supporting regexes).
Also, why do you need new files for each word? That seems like you could just have one file with a word per line that is all of the words you're interested in. This would be much more simple for a user to understand than 100 files with one regex per file.
As my understanding, you want to find some key words from the input string. And those key words could be extened according your requirments.
your current solution is to make this regex (?:A|B|C|D) in your Constant class, wheneveer it's required, you'll add more key words in this regex.
If my understanding is not wrong, maybe, one suggestion is to put this regex in your properties file, like this
REGEX = (?:city|Animal|plant|student)
if too long, it's could be like this
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furnature|others)
Your second idea, if my understanding is not wrong, is to put the keywords as the file name, and those files are put in one resource folder. therefore, you could obtain those files name to compose the final regexp. If your regex are always fixed as the (?:A|B|C|D) format, then this solution is good & convenient. (Every time, you add one new keyword file, you don't need to modify any source code & property file)

Java library for cleaning up user-entered title to make it show up in a URL?

I am doing a web application. I would like to have a SEO-friendly link such as the following:
http://somesite.org/user-entered-title
The above user-entered-title is extracted from user-created records that have a field called title.
I am wondering whether there is any Java library for cleaning up such user-entered text (remove spaces, for example) before displaying it in a URL.
My target text is something such as "stackoverflow-is-great" after cleanup from user-entered "stackoverflow is great".
I am able to write code to replace spaces in a string with dashes, but not sure what are other rules/ideas/best practices out there for making text part of a url.
Please note that user-entered-title may be in different languages, not just English.
Thanks for any input and pointers!
Regards.
What you want is some kind of "SLUGifying" the prhase into a URL, so it is SEO-friendly.
Once I had that problem, I came to use a solution provided in maddemcode.com. Below you'll find its adapted code.
The trick is to properly use the Normalize JDK class with some little additional cleanup. The usage is simple:
// casingchange-aeiouaeiou-takesexcess-spaces
System.out.println(slugify("CaSiNgChAnGe áéíóúâêîôû takesexcess spaces "));
// these-are-good-special-characters-sic
System.out.println(slugify("These are good Special Characters šíč"));
// some-exceptions-123-aeiou
System.out.println(slugify(" some exceptions ¥123 ã~e~iõ~u!##$%¨&*() "));
// gonna-accomplish-yadda
System.out.println(slugify("gonna accomplish, yadda, 완수하다, 소양양)이 있는 "));
Function code:
public static String slugify(String input) {
return Normalizer.normalize(input, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "")
.replaceAll("[^ \\w]", "").trim()
.replaceAll("\\s+", "-").toLowerCase(Locale.ENGLISH);
}
In the source page (http://maddemcode.com/java/seo-friendly-urls-using-slugify-in-java/) you can take a look at where this comes from. The small snippet above, though, works the same.
As you can see, there are some exceptional chars that aren't converted. To my knowledge, everyone that translates them, uses some kind of map, like Djago's urlify (see example map here). You need them, I believe your best bet is making one.
It seems you want to URL-encode a string. It's possible in core Java, without using external libraries. URLEncoder is the class you need.
Languages other than English shouldn't be a problem as the class allows you to specify the character encoding, which takes care of special characters like accents, etc.

Normalizing/unaccenting text in Java

How can I normalize/unaccent text in Java? I am currently using java.text.Normalizer:
Normalizer.normalize(str, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
But it is far from perfect. For example, it leaves Norwegian characters æ and ø untouched. Does anyone know of an alternative? I am looking for something that would convert characters in all sorts of languages to just the a-z range. I realize there are different ways to do this (e.g. should æ be encoded as 'a', 'e' or even 'ae'?) and I'm open for any solution. I prefer to not write something myself since I think it's unlikely that I will be able to do this well for all languages. Performance is NOT critical.
The use case: I want to convert a user entered name to a plain a-z ranged name. The converted name will be displayed to the user, so I want it to match as close as possible what the user wrote in his original language.
EDIT:
Alright people, thanks for negging the post and not addressing my question, yay! :) Maybe I should have left out the use case. But please allow me to clarify. I need to convert the name in order to store it internally. I have no control over the choice of letters allowed here. The name will be visible to the user in, for example, the URL. The same way that your user name on this forum is normalized and shown to you in the URL if you click on your name. This forum converts a name like "Bășan" to "baan" and a name like "Øyvind" to "yvind". I believe it can be done better. I am looking for ideas and preferably a library function to do this for me. I know I can not get it right, I know that "o" and "ø" are different, etc, but if my name is "Øyvind" and I register on an online forum, I would likely prefer that my user name is "oyvind" and not "yvind". Hope that this makes any sense! Thanks!
(And NO, we will not allow the user to pick his own user name. I am really just looking for an alternative to java.text.Normalizer. Thanks!)
Assuming you have considering ALL of the implications of what you're doing, ALL the ways it can go wrong, what you'll do when you get Chinese pictograms and other things that have no equivalent in the Latin Alphabet...
There's not a library that I know of that does what you want. If you have a list of equivalencies (as you say, the 'æ' to 'ae' or whatever), you could store them in a file (or, if you're doing this a lot, in a sorted array in memory, for performance reason) and then do a lookup and replace by character. If you have the space in memory to store the (# of unicode characters) as a char array, being able to run through the unicode values of each character and do a straight lookup would be the most efficient.
i.e., /u1234 => lookupArray[1234] => 'q'
or whatever.
so you'll have a loop that looks like:
StringBuffer buf = new StringBuffer();
for (int i = 0; i < string.length(); i++) {
buf.append(lookupArray[Character.unicodeValue(string.charAt(i))]);
}
I wrote that from scratch, so there are probably some bad method calls or something.
You'll have to do something to handle decomposed characters, probably with a lookahead buffer.
Good luck - I'm sure this is fraught with pitfalls.

Is there a tool to take a block of text and turn it into Java StringBuffer code?

I have several blocks of text that I need to be able to paste inline in my code for some unit tests. It would make the code difficult to read if they were externalized, so is there some web tool where I can paste in my text and it will generate the code for a StringBuffer that preserves it's formatting? Or even a String, I'm not that picky at this point.
This seems like a code generator like this must exist somewhere on the web. I tried to Google one, but I have yet to come up with a set of search terms that don't fill my results with Java examples and documentation.
I suppose I could write one myself, but I'm in a bit of a time crunch and would rather not duplicate effort.
If I understood it correctly, any text editor which supports regexps should make it an easy task. For instance Notepad++ - just replace ^(.+)$ with "\1"+, then copy the result to the code, remove the last + and add String s = to the beginning :)
If you want to externalize then, use a properties file or something like that to read the text.
If you are looking for a simple tool to break up your text into concatenated strings that are joined together by stringbuffer then, most modern IDE will help you do it automatically. Here's how.
Copy the block of text in the IDE
Surround it in double quotes and assign to a String type variable. (This step may not be required)
Enter carriage returns wherever you want to wrap the text to next line and the IDE will automatically break the literals, concatenate them using double quotes "" and add them together
All modern compilers will internally convert "addas" + "addasfdas" literals to a String using StringBuffer.
The squirrel SQL client has a function called convert to string buffer it works nice.

Need some ideas on how to acomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking your-own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
if l is '
toggle "character mode"
if l is "
toggle "string mode"
if l is a letter AND "character mode" is off AND "string mode" is off
add l to end of buffer
else
if buffer is NOT a keyword or a literal
add buffer to list of identifiers
clean buffer
Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for both true, false, and null.
In addition, there are more bugs in the pseudo-code - it will find identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Categories