Cleaning a text with Regular expression

I am reading a text file and want to extract the correct tokens from the text, but I have a problem with the dot at the end of sentences. My code is below; query is the input string:
query = query.replaceAll("[^\\p{L}\\s0-9-_/.]", "");
query = query.replaceAll("\t", " ");
query = query.replaceAll("\r", " ");
query = query.replaceAll("\n", " ");
StringTokenizer tokens = new StringTokenizer(query, " ");
while(tokens.hasMoreTokens()){
String str=tokens.nextToken();
String regex = "\\d+.\\d+";
if(!str.matches(regex)) // <- second problem
System.out.println(str);
}
For example, the input text is the following line:
THE WORLD OF UNIQUE VENDING CARTS. fy_lkaris#yahoo.com www.ubc_lib?9867.come/homepage 876454 9890-9999-9099.
I want the following strings as output:
THE WORLD OF UNIQUE VENDING CARTS
fy_lkaris#yahoo.com
www.ubc_lib?9867.come/homepage
9890-9999-9099
But my actual output has a dot at the end of the first and last lines.
I cannot simply delete the dot (.), since that would remove it from every position:
THE
WORLD
OF
UNIQUE
VENDING
CARTS.ff_lashkariyahoo.com *<-problem*
www.unb_lib9867.come/homepage
9890-9999-9099. *<-problem*
Also, I want to delete only numbers like 4,764,90.900, not ones like 76-098-098, and I could not find anything better than using the matches function. Is there a way to solve this problem too?
Could you please help me?

The problem is the presence of an unescaped hyphen in the middle of the character class. A hyphen can safely be left unescaped only when it is placed at the start or end of a character class.
Use this:
query = query.replaceAll("[^\\p{L}\\s0-9_/.-]", "");
When a hyphen appears in the middle, it can act as a range operator. In your case it sits between the digit 9 (ASCII: 57) and the underscore (ASCII: 95).
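As a minimal sketch of the suggestion above, the following shows the character class with the hyphen moved to the end, where it is read as a literal hyphen. The class and method names are hypothetical and only serve the illustration.

```java
class HyphenInClass {
    // Hypothetical helper: removes every character that is not a letter,
    // whitespace, digit, underscore, slash, dot, or hyphen.
    static String keepAllowed(String s) {
        // The hyphen is the last character in the class, so it is a literal.
        return s.replaceAll("[^\\p{L}\\s0-9_/.-]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepAllowed("9890-9999-9099."));
        System.out.println(keepAllowed("a#b?c!"));
    }
}
```

Note that # and ? are still removed by this class; they have to be listed explicitly if they should survive.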

I found a way to solve my problems. I changed my code to the following and it works.
query = query.replaceAll("[^\\p{L}\\s0-9-_/.#]", "");
query = query.replaceAll("\t", " ");
query = query.replaceAll("\r", " ");
query = query.replaceAll("\n", " ");
StringTokenizer tokens = new StringTokenizer(query, " ");
while(tokens.hasMoreTokens()){
String str=tokens.nextToken();
str = str.replaceAll("\\.\\B" , " "); // <- new line
String regex = "\\d+.\\d+";
if(!str.matches(regex)) // <- second problem
System.out.println(str);
}
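For the remaining "second problem", one possible approach is sketched below. It assumes that a token counts as a plain number when it consists only of digit groups joined by dots or commas; the class and method names are hypothetical. Note that the dot in "\\d+.\\d+" is unescaped, so it matches any character and would also match a token such as 76-098.

```java
import java.util.ArrayList;
import java.util.List;

class TokenCleaner {
    // Hypothetical helper: splits on whitespace, strips trailing dots,
    // and drops tokens that are plain numbers such as 876454 or 4,764,90.900.
    static List<String> cleanTokens(String query) {
        List<String> result = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            // Remove only the dots at the very end of the token.
            String t = token.replaceAll("\\.+$", "");
            if (t.isEmpty()) {
                continue;
            }
            // Digit groups separated by an escaped dot or a comma.
            if (t.matches("\\d+(?:[.,]\\d+)*")) {
                continue;
            }
            result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "THE WORLD OF UNIQUE VENDING CARTS. fy_lkaris#yahoo.com "
                + "www.ubc_lib?9867.come/homepage 876454 9890-9999-9099.";
        for (String token : cleanTokens(input)) {
            System.out.println(token);
        }
    }
}
```

Dots inside a token, as in www.ubc_lib?9867.come/homepage, are left alone because only trailing dots are stripped.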

Related

cannot split a specific kind of strings using Java

I am working in Java. I have a list of parameters stored in a string which comes from Excel. I want to split it only at the starting hyphen of every new line. This string is stored in every Excel cell and I am trying to extract it using Apache POI. The format is as below:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
What I want
array or arraylist which looks like this
[I am string one,
I am string two,
I am string-three with new line,
I am string-four,
I am string five]
What I Tried
I tried to use split function like this:
String[] newline_split = text.split("-");
but the output I get is not what I want
My O/P
[, I am string one,
I am string two,
I am string, // wrong
three // wrong
with new line, // wrong
I am string, // wrong!
four, // wrong!
I am string five]
I might have to tweak split function a bit but not able to understand how, because there are so many hyphens and new lines in the string.
P.S.
If I try splitting only at new lines, then the line - I am string-three \n with new line breaks into two parts, which again is not correct.
EDIT:
Please know that this data inside string is incorrectly formatted just like what is shown above. It is coming from an excel file which I have received. I am trying to use apache poi to extract all the content out of each excel cell in a form of a string.
I intentionally tried to keep the format like what client gave me. For those who are confused about description inside A, I have changed it because I cannot post the contents on here as it is against privacy of my workplace.

You can:
1. Remove line separators (replace them with a space) if they are not followed by - on the next line: .replaceAll("\\R(?!-)", " ") should do the trick.
\R (written as "\\R" in a string literal) can be used since Java 8 to represent line separators.
(?!...) is the negative look-ahead mechanism. It ensures that there is no - after the place where it is used, and it does not include that character in the match, so a - that was checked by it will not be removed.
2. Then remove the - placed at the start of each line (let's also include the whitespace that follows it, to trim the start of the string). In other words, replace each - placed
after a line separator: can be represented by "\\R"
after the start of the string: can be represented by ^
This should do the trick: .replaceAll("(?<=\\R|^)-\\s*","")
3. Split on the remaining line separators: .split("\\R")
Demo:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
String[] split = text.replaceAll("\\R(?!-)", " ")
.replaceAll("(?<=\\R|^)-\\s*","")
.split("\\R");
for (String s: split){
System.out.println("'"+s+"'");
}
Output (surrounded with ' to show start and end of results):
'I am string one'
'I am string two'
'I am string-three with new line'
'I am string-four'
'I am string five'

This is how I would do it:
import java.util.*;
public class MyClass {
public static void main(String args[]) {
String A = "- I am string one \n" +
" -I am string two\n" +
" - I am string-three \n" +
" with new line\n" +
" -I am string-four\n" +
"- I am string five";
String[] s2 = A.split("\r?\n");
List<String> lines = new ArrayList<String>();
String line = "";
for (int i = 0; i < s2.length; i++) {
String ss = s2[i].trim();
if (i == 0) { // first line MUST start with "-"
line = ss.substring(1).trim();
} else if (ss.startsWith("-")) {
lines.add(line);
ss = ss.substring(1).trim();
line = ss;
} else {
line = line + " " + ss;
}
}
lines.add(line);
System.out.println(lines.toString());
}
}
I hope it helps.
A little explanation:
I will process line by line, trimming each one.
If it starts with '-' it means the end of the previous line, so I include it in the list. If not, I concatenate with the previous line.

It looks as if you are splitting at the FIRST - of each line, so you need to remove every instance of a "newline -":
str = str.replace("\n-", "\n");
Then remove the initial "-":
str = str.substring(1);
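The same idea, treating a line break followed by a hyphen as the only real separator, can be sketched as a single split. This is an illustration under the assumption that every item starts with - at the beginning of a line; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class HyphenSplit {
    // Hypothetical helper: one list entry per hyphen-prefixed item.
    static List<String> splitItems(String text) {
        List<String> items = new ArrayList<>();
        // Split only at line breaks that are directly followed by a hyphen.
        for (String chunk : text.split("\\R(?=-)")) {
            String item = chunk
                    .replaceFirst("^-\\s*", "")      // drop the leading hyphen
                    .replaceAll("\\s*\\R\\s*", " ")  // join continuation lines
                    .trim();
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        String text = "- I am string one\n-I am string two\n- I am string-three\n"
                + "with new line\n-I am string-four\n- I am string five";
        System.out.println(splitItems(text));
    }
}
```

Hyphens inside an item, as in string-three, are untouched because the split pattern only looks at hyphens that follow a line break.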

Parsing a Log File to Display Data from Multiple Lines Using Regular Expressions

So I'm trying to parse a bit of code here to get message text from a log file. I'll explain as I go. Here's the code:
// Print to interactions
try
{
// assigns the input file to a filereader object
BufferedReader infile = new BufferedReader(new FileReader(log));
sc = new Scanner(log);
while(sc.hasNext())
{
String line=sc.nextLine();
if(line.contains("LANTALK")){
Document doc = Jsoup.parse(line);
Element idto = doc.select("MBXTO").first();
Element msg = doc.select("MSGTEXT").first();
System.out.println(" to " + idto.text() + " " +
msg.text());
System.out.println();
} // End of if
} // End of while
try
{
// Print to output file
sc = new Scanner (log);
while(sc.hasNext())
{
String line=sc.nextLine();
if(line.contains("LANTALK")){
Document doc = Jsoup.parse(line);
Element idto = doc.select("MBXTO").first();
Element msg = doc.select("MSGTEXT").first();
outFile.println(" to " + idto.text() + " " +
msg.text());
outFile.println();
outFile.println();
} // End of if
} // End of while
} // end of try
I'm getting input from a log file, here's a sample of what it looks like and the lines that I'm filtering out:
08:25:20.740 [D] [T:000FF0] [F:LANTALK2C] <CMD>LANMSG</CMD>
<MBXID>1124</MBXID><MBXTO>5760</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>
</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>and I talked to him and he
gave me a credit card number</MSGTEXT>
08:25:20.751 [+] [T:000FF0] [S:1:1:1124:5607:5] LANMSG [15/2 | 0]
08:25:20.945 [+] [T:000FF4] [S:1:1:1124:5607:5] LANMSGTYPESTOPPED [0/2 | 0]
08:25:21.327 [+] [T:000FE8] [S:1:1:1124:5607:5] LANMSGTYPESTARTED [0/2 | 0]
So far, I've been able to filter the line that contains the message (LANMSG). And from that, I've been able to get the id number of the recipient (MBXTO). But the next line contains the sender's id, which I need to pull out and display. ([S:1:1:1124:SENDERID:5]). How should I do this? Below is a copy of the output I'm getting:
to 5760 and I talked to him and he gave me a credit card number
And here's what I need to get:
SENDERID to 5760 and I talked to him and he gave me a credit card number
Any help you guys could give me on this would be great. I'm just not sure how to go about getting the information I need.

Your question isn't clear enough, and it seems you have not used regex in this code. Remember to specify what you have tried before asking.
Anyways the regex you're searching for is:
(\d{2}:\d{2}:\d{2}\.\d{3})\s\[D\].+<MBXID>(\d+)<\/MBXID><MBXTO>(\d+)<\/MBXTO>.+<MSGTEXT>(.+)<\/MSGTEXT>
Working example in Regex101
It should capture:
$1: 08:25:20.740
$2: 1124
$3: 5760
$4: and I talked to him and he
gave me a credit card number (Note that it also captures \n, or newline, characters).
(Also, you'll use matcher.group(number) instead of $number in Java.)
And then you can use these substitution (group reference) terms to get your formatted output.
E.g.: $1 [$2] to [$3] $4
Should return:
08:25:20.740 [1124] to [5760] and I talked to him and he
gave me a credit card number
Remember, when you implement the regex in your Java code, you must escape all the backslashes (\), which is why this regex looks bigger:
Pattern pattern = Pattern.compile("(\\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s\\[D\\].+<MBXID>(\\d+)<\\/MBXID><MBXTO>(\\d+)<\\/MBXTO>.+<MSGTEXT>(.+)<\\/MSGTEXT>", Pattern.MULTILINE | Pattern.DOTALL);
// DOTALL makes '.' also match the line breaks inside a message. MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern; the repeated find() calls below are what return each LANMSG in turn.
Matcher matcher = pattern.matcher(input);
while (matcher.find()){
String output = matcher.group(1) + " [" + matcher.group(2) + "] to [" + matcher.group(3) + "] " + matcher.group(4);
System.out.println(output);
}
As for your second problem, you have already edited it out of the question, but I'll still answer:
You can parse the $2 and $3 and make them return an integer:
int id1 = Integer.parseInt(matcher.group(2));
int id2 = Integer.parseInt(matcher.group(3));
This way you can create a method to return a name for these IDs. e.g.: UserUtil.getName(int id)
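To also pick up the sender ID from the line that follows the message, the pattern can be extended along these lines. This is a sketch, not a drop-in replacement: it assumes, as the question states, that the sender ID is the fourth colon-separated field of the [S:...] entry, and the class and method names are hypothetical. Reluctant quantifiers (.*?) are used so that each match stays within one message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LanTalkParser {
    // Groups: 1 = recipient (MBXTO), 2 = message text, 3 = sender ID.
    private static final Pattern MESSAGE = Pattern.compile(
            "<MBXTO>(\\d+)</MBXTO>.*?<MSGTEXT>(.*?)</MSGTEXT>"
            + ".*?\\[S:\\d+:\\d+:\\d+:(\\d+):\\d+\\]",
            Pattern.DOTALL);

    // Hypothetical helper: returns one "SENDER to RECIPIENT text" line per message.
    static String format(String log) {
        StringBuilder out = new StringBuilder();
        Matcher m = MESSAGE.matcher(log);
        while (m.find()) {
            // Collapse line breaks inside the message text into single spaces.
            String text = m.group(2).replaceAll("\\s*\\R\\s*", " ");
            out.append(m.group(3)).append(" to ").append(m.group(1))
               .append(' ').append(text).append('\n');
        }
        return out.toString();
    }
}
```

One limitation to keep in mind: if a message is not followed by an [S:...] entry, the match may run on to the entry of a later message.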

Split string after n amount of digits occurrence

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so name would be "Folder Name". At the moment I'm first using string.replaceAll() to get rid of special characters and then there is this 4-digit sequence. I would like to split string on that point. How can I achieve this?
Currently my code looks something like this:
// Parsing string if regex p matches folder's name
if(b) {
//System.out.println("Folder: \" " + name + "\" contains special characters.");
String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Getting rid of all punctuations and symbols.
//System.out.println("Parsed: " + name + " > " + result);
// If string matches regex p2
if(b2) {
//System.out.println("Folder: \" " + result + "\" contains release year.");
String parsed_name[] = result.split("20"); // This is the line i would like to split when 4-digits in row occur.
//System.out.println("Parsed: " + result + " > " + parsed_name[0]);
movieNames.add(parsed_name[0]);
}
Or maybe there is an even easier way to do this? Thanks in advance!

You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst( "\\.\\d{4}.*", "" ).
replaceAll( "[\\p{P}\\p{S}&&[^']]+", " " );
//=> Folder Name
replaceFirst removes everything from the first DOT that is followed by 4 digits onwards
replaceAll replaces each run of punctuation and symbols (except the apostrophe) with a single space
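If an actual split at the 4-digit sequence is preferred, as the question asks, a sketch could look like this. It assumes the first standalone run of exactly four digits marks the end of the name; the class and method names are hypothetical.

```java
class FolderNameParser {
    // Hypothetical helper: returns the part of the folder name before the
    // first standalone 4-digit sequence.
    static String movieName(String folder) {
        // Turn every run of punctuation and symbols into one space.
        String spaced = folder.replaceAll("[\\p{P}\\p{S}]+", " ");
        // Split at the first 4-digit word; limit 2 keeps the rest in one piece.
        String[] parts = spaced.split("\\b\\d{4}\\b", 2);
        return parts[0].trim();
    }

    public static void main(String[] args) {
        System.out.println(movieName("Folder.Name.1234.Some.Info.Here-ToBeParsed"));
    }
}
```

When the name contains no 4-digit sequence, split returns the whole string, so the full cleaned name is returned.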

I want to check if a word or a set of words exists in a String

My requirement is to check if a group of words or a single word is present in a larger string. I tried using String.contains() method but this fails in case the larger string has new line character. Currently I am using a regex mentioned below. But this works for only one word. The searched text is a user entered value and can contain more than one word. This is an android application.
String regex = ".*.{0}" + searchText + ".{0}.*";
Pattern pattern = Pattern.compile(regex);
pattern.matcher(largerString).find();
Sample String
String largerString ="John writes about this, and John writes about that," +
" and John writes about everything. ";
String searchText = "about this";

Why not just replace line breaks with spaces, and on top of that, convert it all to lower case?
String s = "hello";
String originalString = "Does this contain \n Hello?";
String formattedString = originalString.toLowerCase().replace("\n", " ");
System.out.println(formattedString.contains(s));
Edit: Thinking about it, I don't really understand how line breaks make a difference...
Edit 2: I was right. Line breaks don't matter.
String s = "hello";
String originalString = "Does this contain \nHello?";
String formattedString = originalString.toLowerCase();
System.out.println(formattedString.contains(s));

Here is code that uses Pattern and Matcher with the search text itself as the pattern, without wrapping it in extra regex:
String largerString = "John writes about this, and John writes about that," +" and John writes about everything. ";
String searchText = "about this";
Pattern pattern = Pattern.compile(searchText);
Matcher m = pattern.matcher(largerString);
if(m.find()){
System.out.println(m.group().toString());
}
Result:
about this
I hope it will help you.
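Since the search text is entered by the user and the larger string may contain line breaks, a more defensive variant is sketched below. It quotes each word with Pattern.quote so that regex metacharacters are taken literally, and allows any run of whitespace (including newlines) between words. Ignoring case is an assumption, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PhraseSearch {
    // Hypothetical helper: true if the words of phrase occur in text in order,
    // separated by any whitespace, ignoring case.
    static boolean containsPhrase(String text, String phrase) {
        StringBuilder regex = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+"); // any whitespace, including line breaks
            }
            regex.append(Pattern.quote(word)); // treat user input literally
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String largerString = "John writes about\nthis, and John writes about that";
        System.out.println(containsPhrase(largerString, "about this"));
    }
}
```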

Intelligent String Parsing in java

I have an email subject line that I need to parse. I need to find the first occurrence of any word from a given list of words and get the next word, which can be separated from it by
('=' or ',' or ';' or 'blank' or '.').
for example
list of word for customer ["customer","client","kunden","kd.nr."]
list of word for Order ["order","auftrag","auftragsnummer","auftragnr."]
separator : [= , ; .]
subjectline: Customer 2013ABC has send an Aufrag 2056899A for Motif=A
I need to parse the information like
customer=2013ABC
order=2056899A
Motif=A
I am using Java 7 so Scanner class can be used as well.
Thanks for any tips in advance

You can achieve this by using regular expressions; here is some sample code:
Pattern p = Pattern.compile(".*(customer|client|kunden|kd\\.nr\\.)[=,;\\. ]*(\\w*).*(order|auftrag|auftragsnummer|auftragnr\\.)[=,;\\. ]*(\\w*).*[ ](.*)$", Pattern.CASE_INSENSITIVE);
String subject = "subjectline: kd.nr. 2013ABC has send an Auftrag 2056899A for Motif=A";
Matcher m = p.matcher(subject);
if(m.matches()) {
System.out.println(m.group(1) + " : " + m.group(2) );
System.out.println(m.group(3) + " : " + m.group(4));
System.out.println(m.group(5));
}
Hope this helps.
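A variation that looks up each keyword list separately, so the order of customer and order in the subject does not matter, might look like the sketch below. It avoids lambdas so it also compiles on Java 7; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectParser {
    // Hypothetical helper: returns the word following the first occurrence of
    // any keyword, or null if none of the keywords is found.
    static String valueAfter(String subject, String... keywords) {
        StringBuilder alternatives = new StringBuilder();
        for (String keyword : keywords) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(keyword)); // "kd.nr." taken literally
        }
        // Keyword, then one or more separators (= , ; . or blank), then the value.
        Pattern p = Pattern.compile(
                "(?:" + alternatives + ")[=,;. ]+(\\w+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String subject = "Customer 2013ABC has send an Auftrag 2056899A for Motif=A";
        System.out.println("customer=" + valueAfter(subject, "customer", "client", "kunden", "kd.nr."));
        System.out.println("order=" + valueAfter(subject, "order", "auftrag", "auftragsnummer", "auftragnr."));
    }
}
```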
