Regular expression illegal character in Java

I've been looking through the Internet and, after a big headache, can't find why this regular expression is wrong:
"\"\w*&&[\p{Punct}]\"["+sepChar+"]\"\w*&&[\p{Punct}]\""
I'm trying to read a master data file with the following pattern (quotes included):
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
and split each line with the regular expression above.
So, for example:
"Hello:John":"Hello:World":"Hello:Mark"
should be split into:
{"Hello:John", "Hello:World", "Hello:Mark"}

The backslash is the escape character in Java string literals. You need to write two backslashes \\ in the literal to get a single backslash in the regex.
Try:
"\"\\w*&&[\\p{Punct}]\"["+sepChar+"]\"\\w*&&[\\p{Punct}]\""
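Even with the backslashes doubled, && only acts as an intersection operator inside a character class, so the pattern above may still not split the line as intended. As a hedged alternative (this is not the asker's regex), the sketch below splits on the separator only where it sits directly between two double quotes, using lookarounds and Pattern.quote. The class and method names are invented for illustration, and it assumes a value never contains a quote-separator-quote sequence itself.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedFieldSplitter {
    // Splits a line such as "A":"B":"C" on sepChar, but only where sepChar
    // is directly preceded and directly followed by a double quote.
    static String[] splitQuoted(String line, String sepChar) {
        // Pattern.quote makes the separator literal, even if it is a regex metacharacter
        String regex = "(?<=\")" + Pattern.quote(sepChar) + "(?=\")";
        return line.split(regex);
    }

    public static void main(String[] args) {
        String line = "\"Hello:John\":\"Hello:World\":\"Hello:Mark\"";
        // The colons inside the values are left alone; only the ones between quotes split
        System.out.println(Arrays.toString(splitQuoted(line, ":")));
    }
}
```

The lookarounds keep the quotes in the resulting fields, which matches the expected result shown in the question.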

Ok.
Thanks to @kevin-bowersox for the help.
It seems that Oracle has done a great job improving Java with version 7.
With this code:
File file = new File(someFile);
// BufferedReader wraps a Reader, so the File has to go through a FileReader
BufferedReader br = new BufferedReader(new FileReader(file));
String line = null;
while((line = br.readLine()) != null){
//todo
}
If your file has been formatted with a constant pattern, for example:
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
It reads:
"TEXTVALUE-->TEXTVALUE-->TEXTVALUE"
where '-->' stands for tabs ('\t')
So, at the end, my solution is:
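public ArrayList<String[]> getSplittedTextFromFile(String filePath) throws FileNotFoundException, IOException{
ArrayList<String[]> ret = null;
if (!filePath.isEmpty()){
File input = new File(filePath);
// BufferedReader wraps a Reader, so the File has to go through a FileReader;
// try-with-resources closes the reader when done
try (BufferedReader br = new BufferedReader(new FileReader(input))) {
String line = null;
while((line = br.readLine()) != null){
// each line is split on tabs
String[] aSplit = line.split("\\t");
if (ret == null)
ret = new ArrayList<>();
ret.add(aSplit);
}//while
}//try
}//fi
// null if the path was empty or the file had no lines
return ret;
}//fnc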

Related

BufferedReader multiple lines as one String

I am trying to read multiple lines from a file into an ArrayList as a String.
What I aim to do is have the program read the file line by line until the reader sees a specific symbol (- in this case) and save those rows as one single String. The code below instead makes every row a new string that it then adds to the list.
BufferedReader br = null;
br = new BufferedReader(new FileReader(file));
String read;
while ((read = br.readLine()) != null) {
String[] splited = read.split("-");
carList.add(Arrays.toString(splited));
}
for (String carList2 : carList) {
System.out.println(carList2);
System.out.println("x");
}

First, you need to check if the read line contains "-".
If it doesn't, concatenate the line with the previous ones.
If it does, concatenate only the first part of the line with the previous line.
This is a quick implementation:
BufferedReader br = null;
br = new BufferedReader(new FileReader(file));
String read;
String concatenatedLine = "";
while ((read = br.readLine()) != null) {
String[] splited = read.split("-");
// if the line doesn't contain "-", splited[0] and read are equal
concatenatedLine += splited[0];
if (splited.length > 1) {
// if the line contains "-", there will be more than 1 element
carList.add(concatenatedLine); // add the completed string to the list
// store the second part of the line, in order to prepend it to the next ones
concatenatedLine = splited[1];
}
}
Note that the output may not be what you expect if a line contains more than one -, and that whatever remains in concatenatedLine after the loop still has to be added to the list.
Also, concatenating Strings with + in a loop is not the most efficient way to do it, but I'll let you find out more about that.
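The approach described above can be sketched with a StringBuilder instead of String concatenation. This is only an illustration under some assumptions: the class and method names are made up, lines are joined without any separator (as in the answer's code), and split is called with a limit of -1 so that a "-" at the start or end of a line, or several "-" on one line, each close a record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class RecordJoiner {
    // Joins consecutive lines into one record; every "-" ends the current record.
    static List<String> joinUntilDash(BufferedReader br) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String read;
        while ((read = br.readLine()) != null) {
            // limit -1 keeps trailing empty strings, so "abc-" yields ["abc", ""]
            String[] parts = read.split("-", -1);
            current.append(parts[0]);
            for (int i = 1; i < parts.length; i++) {
                // a "-" was seen: the record collected so far is complete
                records.add(current.toString());
                current.setLength(0);
                current.append(parts[i]);
            }
        }
        // keep whatever follows the last "-"
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for new FileReader(file)
        String sample = "Volvo\nV70\n-Saab\n95-\nAudi";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            for (String record : joinUntilDash(br)) {
                System.out.println(record);
            }
        }
    }
}
```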

It's not very clear to me what output you want.
If you would like to have each customer on one string without "-"
then you could try the following code:
while ((read = br.readLine()) != null) {
String splited = read.replace("-", " ");
carList.add(splited);
}

String Tokenizer Not Registering String on Second Line as a Token When Reading File Despite Using .nextToken()

I am trying to read a file. The file in question has two strings, each on its own line, like this:
COMETQ
HVNGAT
I am trying to assign each string to its own String variable. However, when I run my code (below), I get a NoSuchElementException for the second .nextToken().
BufferedReader f = new BufferedReader(new FileReader("ride.in"));
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("ride.out")));
StringTokenizer st = new StringTokenizer(f.readLine());
String comet = st.nextToken();
String group = st.nextToken();
Can someone help me figure out what's wrong? Thank you!
Note: this is a USACO training page problem. I am just trying to seek help to debug the file reading, not solve the problem.

You only gave it one line:
new StringTokenizer(f.readLine());
You'll have to read all the lines from the file first, then pass the resulting string to the constructor.
Note: in this case, you don't even have to use the StringTokenizer. Just use the BufferedReader and call readLine() once per line.
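A minimal sketch of reading the two strings with the BufferedReader alone is shown below. The class and method names are hypothetical, and a StringReader stands in for the FileReader on "ride.in" so the example runs without the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class TwoLineReader {
    // Reads the first two lines; each readLine() call consumes exactly one line.
    static String[] readTwoLines(BufferedReader f) throws IOException {
        String comet = f.readLine();
        String group = f.readLine();
        return new String[] { comet, group };
    }

    public static void main(String[] args) throws IOException {
        // In the original program this would be new FileReader("ride.in")
        try (BufferedReader f = new BufferedReader(new StringReader("COMETQ\nHVNGAT\n"))) {
            String[] names = readTwoLines(f);
            System.out.println(names[0] + " " + names[1]);
        }
    }
}
```

Note that readLine() returns null once the input is exhausted, so a file with fewer than two lines would leave the second value null.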

StringTokenizer should be used when the text contains delimiters and you want to split it. You can also use the split() method.
Syntax:
StringTokenizer stringTokenizer = new StringTokenizer(text, delimiter);
For example :
StringTokenizer stringTokenizer = new StringTokenizer("abc, def", ",");
But in your file, there is no such delimiter present in the string. So, StringTokenizer is of no use here.
I have tested with this :
BufferedReader bufferedReader = new BufferedReader(new FileReader(new File("F:/test.txt")));
String line;
String extracted = "";
while ((line = bufferedReader.readLine()) != null) {
StringTokenizer stringTokenizer = new StringTokenizer(line);
while (stringTokenizer.hasMoreElements()) {
extracted = extracted + stringTokenizer.nextElement().toString() +",";
}
}
bufferedReader.close();
String[] splits = extracted.split(",");
String comet = splits[0];
String group = splits[1];
System.out.println(comet + " " + group);
Output :
COMETQ HVNGAT
Hope this helps you :)

Java code reads UTF-8 text incorrectly

I'm having a problem reading UTF-8 characters in my code (running on Eclipse).
I have a text file with a few lines in it, for example:
אך 1234
NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry.
That is, a Hebrew word and then a number.
I need to separate the word from the number somehow. I tried this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "[ ]+";
String[] tokens = content.split(delims);
}
The problem is that for some reason, the code reads content (the first line in the file) as follows:
אך\t1234
...meaning that the space isn't in its correct place.
I suppose I could tokenize the text using the \t, but I'm not sure I should do it, as the file isn't being read correctly...
Does anyone have any idea why this happens?
Thanks so much :-)

I think you are splitting on a space when there is actually a tab there.
Can you try this, which splits on any whitespace character:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "\\s";
String[] tokens = content.split(delims);
}
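A related point worth checking: FileReader decodes with the platform default charset, which before Java 18 is not necessarily UTF-8. The sketch below is one hedged way to combine an explicit charset with whitespace splitting. The class name, the helper method, and the temporary sample file are all assumptions made for illustration, and the Hebrew word is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8LineTokens {
    // Trims leading/trailing whitespace (including a leading tab), then
    // splits on runs of any whitespace, so both spaces and tabs work.
    static String[] tokens(String content) {
        return content.trim().split("\\s+");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file: a tab, a Hebrew word, a tab, a number
        Path text = Files.createTempFile("sample", ".txt");
        Files.write(text, "\t\u05D0\u05DA\t1234\n".getBytes(StandardCharsets.UTF_8));

        // Name the charset explicitly instead of relying on the platform default
        try (BufferedReader br = Files.newBufferedReader(text, StandardCharsets.UTF_8)) {
            String content;
            while ((content = br.readLine()) != null) {
                String[] parts = tokens(content);
                System.out.println(parts.length + " tokens, number = " + parts[1]);
            }
        }
        Files.delete(text);
    }
}
```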

parsing a text file using a java scanner

I am trying to create a method that parses a text file and returns a string that is the url after the colon. The text file looks as follows (it is for a bot):
keyword:url
keyword,keyword:url
So each line consists of a keyword and a url, or multiple keywords and a url.
Could anyone give me a bit of direction as to how to do this?
I believe I need to use a Scanner, but I couldn't find anyone trying to do anything similar.
Thank you.
Edit: my attempt using the suggestions below. It doesn't quite work. Any help would be appreciated.
public static void main(String[] args) throws IOException {
String sCurrentLine = "";
String key = "hello";
BufferedReader reader = new BufferedReader(
new FileReader(("sites.txt")));
Scanner s = new Scanner(sCurrentLine);
while ((sCurrentLine = reader.readLine()) != null) {
System.out.println(sCurrentLine);
if(sCurrentLine.contains(key)){
System.out.println(s.findInLine("http"));
}
}
}
output:
hello,there:http://www.facebook.com
null
whats,up:http:/google.com
sites.txt:
hello,there:http://www.facebook.com
whats,up:http:/google.com

You should read the file line by line with a BufferedReader, as you are doing. I would then recommend parsing each line with a regex.
The pattern
(?<=:)http://[^\\s]++
will do the trick. This pattern says:
http://
followed by one or more non-whitespace characters [^\\s]++ (the ++ is a possessive quantifier)
and preceded by a colon (?<=:)
Here is a simple example using a String to proxy your file:
public static void main(String[] args) throws Exception {
final String file = "hello,there:http://www.facebook.com\n"
+ "whats,up:http://google.com";
final Pattern pattern = Pattern.compile("(?<=:)http://[^\\s]++");
final Matcher m = pattern.matcher("");
try (final BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(file.getBytes("UTF-8"))))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
m.reset(line);
while (m.find()) {
System.out.println(m.group());
}
}
}
}
Output:
http://www.facebook.com
http://google.com

Use BufferedReader; for text parsing you can use regular expressions.

You should use the split method (lowercase in Java), with a limit of 2 so that only the first colon splits the line:
String[] strCollection = yourScannedStr.split(":", 2);
String extractedUrl = strCollection[1];
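As a rough sketch of how the split approach could serve the keyword lookup from the question, the example below splits each line at the first colon and then checks the comma-separated keywords. The class name, method name, and the choice to return null when nothing matches are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class SiteLookup {
    // Returns the URL of the first line whose keyword list contains key,
    // or null if no line matches.
    static String urlFor(String key, List<String> lines) {
        for (String line : lines) {
            // limit 2: split only at the first colon, so "http://..." stays intact
            String[] parts = line.split(":", 2);
            if (parts.length < 2) {
                continue; // no colon on this line, skip it
            }
            if (Arrays.asList(parts[0].split(",")).contains(key)) {
                return parts[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello,there:http://www.facebook.com",
                "whats,up:http://google.com");
        System.out.println(urlFor("hello", lines));
        System.out.println(urlFor("up", lines));
    }
}
```

In a real program the list would be filled from the BufferedReader loop already shown in the question.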

Reading a .txt file using Scanner class in Java
http://www.tutorialspoint.com/java/java_string_substring.htm
That should help you.

Check line for unprintable characters while reading text file

My program must read text files line by line.
The files are in UTF-8.
I am not sure the files are correct; they may contain unprintable characters.
Is it possible to check for this without going down to the byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (available since Java 7):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check whether a string has unprintable characters, you can use a regular expression:
[^\p{Print}]
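One caveat: in Java, \p{Print} is a POSIX class that by default covers only printable US-ASCII, so the pattern above would also flag legitimate non-ASCII text such as Hebrew. The sketch below swaps in a different technique, the Unicode "Other" category \p{C}, with tabs allowed through. The class and method names are made up, and what counts as "unprintable" is an assumption you may want to adjust.

```java
import java.util.regex.Pattern;

class UnprintableCheck {
    // \p{C} is the Unicode "Other" category (control, format, unassigned, ...);
    // the intersection with [^\t] lets tab characters through.
    private static final Pattern UNPRINTABLE = Pattern.compile("[\\p{C}&&[^\\t]]");

    // True if the line contains at least one character matched by UNPRINTABLE.
    static boolean hasUnprintable(String line) {
        return UNPRINTABLE.matcher(line).find();
    }

    public static void main(String[] args) {
        // BEL (U+0007) is a control character
        System.out.println(hasUnprintable("abc\u0007def"));
        // Hebrew letters, a tab, and digits are all accepted
        System.out.println(hasUnprintable("\u05D0\u05DA\t1234"));
    }
}
```

Since readLine() strips the line terminator, the check can be applied directly to each line returned by the reader.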

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

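I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
// 1. Stream the lines lazily and collect them into an array
Stream<String> lines = Files.lines(Paths.get(fileName));
String[] asArray = lines.toArray(String[]::new);
// 2. Read all lines into a List at once
List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
readAllLines.forEach(s -> System.out.println(s));
// 3. Use a Scanner; hasNextLine()/nextLine() read line by line,
// whereas hasNext()/next() would read whitespace-separated tokens
File file = new File(fileName);
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) {
System.out.println(scanner.nextLine());
}
}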

The answer by @T.J.Crowder is for Java 6. In Java 7 the valid answer is the one by @McIntosh, though the use of Charset.forName for UTF-8 is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
This closely resembles the Guava approach posted by Skeet above, and of course the same caveats apply. That is, for big files (Java 7), read line by line instead:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. It is then up to you to check every char of the file and decide whether you consider it printable or not.
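If you are not sure the file is properly encoded in the first place, one option is a strict decoder that reports malformed input instead of silently replacing it. The sketch below does work on bytes, which the question hoped to avoid, so treat it as an optional pre-check. The class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validator {
    // Returns true only if the whole byte array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // thrown for malformed or unmappable byte sequences
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte
        byte[] bad = { (byte) 0xC3, (byte) 0x28 };
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

For a file, the bytes could come from Files.readAllBytes, with the same memory caveat as the other whole-file approaches.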

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
