How would you treat or store the letter 'CH' in Java code for, say, frequency analysis? I haven't found any alphabet library that handles the two-character letter 'CH'. Storing it in a char is not an option, and text-processing algorithms generally scan one character at a time, so I would need to look ahead to match the pair. There is no single 'CH' character in Unicode either. Is there any other character set where 'CH' exists as one character?
Another approach would be to replace 'CH' with '1' in the input files and treat '1' as a regular character. But then I would lose the option of character-code arithmetic ('a' - 't' is meaningless because 'ch' is missing from ASCII).
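One way to do the look-ahead is to treat every letter as a String instead of a char and peek at the next character whenever a 'c' is seen. The following is only a minimal sketch under that assumption; the class name DigraphFrequency and the method count are made up for illustration and do not come from any library.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class DigraphFrequency {
    // Counts letters in the text, treating "ch" (in any case) as one letter.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        String lower = text.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < lower.length()) {
            char c = lower.charAt(i);
            String letter;
            // Look one character ahead to detect the pair "ch".
            if (c == 'c' && i + 1 < lower.length() && lower.charAt(i + 1) == 'h') {
                letter = "ch";
                i += 2;
            } else {
                letter = String.valueOf(c);
                i += 1;
            }
            // Skip whitespace, digits and punctuation.
            if (Character.isLetter(letter.charAt(0))) {
                freq.merge(letter, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("Chata chce cukr"));
    }
}
```

If arithmetic on letter positions is still needed, the position of each letter in an explicit alphabet list (with "ch" as one entry) could stand in for the char code.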
I have a huge file and that file contains a lot of illegal characters like in the image below, but these are not all. They are of many different kinds so it's not possible to search for them all and replace them.
Is there a way I can remove these characters? I've tried a lot of solutions, like converting to ANSI or using regular expressions, but they didn't work. Please help.
EDIT: Even if anyone can tell me how to remove these characters in java, that will be fine too.
Instead of removing specific characters it's easier to implement a white-list filter if you know which types of characters you are expecting.
As per this answer, which explains how to remove emoticons, you can try:
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");
To see which \p{} groups are available, look at the "Classes for Unicode scripts, blocks, categories and binary properties" section of the java.util.regex.Pattern docs:
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
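To make the effect of the white-list concrete, here is a small sketch that applies the same filter expression to a sample string. The class name WhitelistFilter and the sample input are assumptions for illustration only.

```java
final class WhitelistFilter {
    // Same character class as in the answer: keep letters, marks, numbers,
    // punctuation, separators, format characters, surrogates and whitespace.
    private static final String CHARACTER_FILTER =
            "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";

    static String clean(String input) {
        return input.replaceAll(CHARACTER_FILTER, "");
    }

    public static void main(String[] args) {
        // \u2603 is a snowman symbol (category So), which is not white-listed.
        System.out.println(clean("Hello, w\u00f6rld! \u2603 42"));
    }
}
```

Note that this white-list has no \p{S} entry, so symbols such as + or $ are removed as well; add the categories you expect to keep.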
I have a Java web application that takes input from the user. Once I have the input I need to parse it, and how I parse it depends on the kind of input. I decided to use Java's Pattern class to recognize some predefined input formats.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in it ([^,]+), but that would mean a lot of changes to the parsing function, which I would like to avoid. So, in addition to the comma, it should check that:
each item starts with a letter
each item has a minimum of 3 characters (letters combined with numbers)
an item can have one dot
there is a minimum of 1 comma (updated)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part requires, with the addition that it can contain at least one dash.
I've tried multiple online regex generators but didn't get it right. I would much appreciate any help.
Here is what I have for the case where it is just a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and Invalid inputs
for a) Enumeration
A03,B24.1,A25.7 Valid
A03,B24.1 Valid
A03,B24.1-B25.1 Invalid, because an enumeration must not contain a dash
A03 Invalid, because there is no comma
A03,B24.1 Valid
A03 Invalid
for b) Mixed
everything that an enumeration allows, with the addition that it can contain dashes too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a dot and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma. There must be at least one comma, hence the + in (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
Regex101 Demo
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
Regex101 Demo
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to #diginoise and #anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
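For use from Java, the two expressions need their backslashes doubled inside string literals. Below is a sketch of how they might be wrapped; the class name CodeListPatterns and the helper method names are made up for illustration, and matches() is used so that the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

final class CodeListPatterns {
    // One code: a letter, at least two more letters/digits, optional ".xyz" part.
    private static final String CODE =
            "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]{1,})?";

    // (a) Enumeration: codes separated by commas, at least one comma.
    static final Pattern ENUMERATION =
            Pattern.compile(CODE + "(?:," + CODE + ")+");

    // (b) Mixed: codes separated by commas or dashes, at least one separator.
    static final Pattern MIXED =
            Pattern.compile(CODE + "(?:[,-]" + CODE + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEnumeration("A03,B24.1,A25.7")); // true
        System.out.println(isEnumeration("A03,B24.1-B25.1")); // false
        System.out.println(isMixed("A03,B24.1-B35.5,A25.7")); // true
    }
}
```

As written, the mixed pattern also accepts a plain range with no comma, such as A03-B04, since either separator satisfies the "at least one" rule.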
As I said in the comments, I would split the input on commas and verify each segment separately. Your domain, ICD-10-CM codes, is very well defined, and I would be very wary of any input that could be invalid yet pass validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake, not to mention astronomical damages and ambulance-chasing lawyers. So avoid the easy solution and implement the bombproof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Load them into memory; any code that is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to save space, but stored verbatim all the codes take under 300 kB on disk, so a few MB at most in memory. That is a small price to pay for people not having the left kidney removed instead of the right one.
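A sketch of the set-lookup idea, assuming the input is a comma-separated list. The class name CodeWhitelist is hypothetical, and the set holds only the handful of sample codes mentioned above; a real application would load the complete list from a file at startup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CodeWhitelist {
    // Sample entries only, taken from the example in the text. The real set
    // would contain every valid code, loaded once when the application starts.
    private static final Set<String> VALID = new HashSet<>(
            Arrays.asList("O09.7", "O09.70", "O09.71", "O09.72", "O09.73"));

    // Returns true only if every comma-separated segment is a known code.
    static boolean allValid(String input) {
        // Limit -1 keeps trailing empty segments, so "O09.70," is rejected.
        for (String code : input.split(",", -1)) {
            if (!VALID.contains(code.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allValid("O09.70,O09.71")); // true
        System.out.println(allValid("O09.70,O09.1"));  // false
    }
}
```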
I am currently writing a program that reads a text file containing several million digits and searches it for a string of 6 digits entered by the user. Several constraints make this difficult:
Must use BufferedReader
Each character can only be read once (I got it working with a bunch of nested if statements, but the way I did it violated this rule)
Cannot use any methods from the string class (so I can't put the read characters together and compare to the original string with .equals()). I have already broken up the original string into the 6 individual characters.
Not allowed to store read characters into an array of any kind, only into character variables (of which there should be 6)
Once a match has been found, it is to report the location to the user (I just need to keep a count variable that I increment with every character read) and continue on until the end of the file is reached. There can be multiple matches in the file.
Any help with this would be great, I'm at a loss for what to do.
You have a haystack to search, say 98712365478932145697, and a needle to find, say 893.
How about:
use BufferedReader.read() to read from the haystack a character at a time
if the character is the first character in your needle, store it in the first character variable
if the next character is the second character in your needle, store it in the second character variable, else, if it's the first character in your needle, start over and store it in the first character variable
if the next character is the third character in your needle, store it in the third character variable, else, if it's the first character in your needle, start over
etc
if you fill the last character variable, you have found the needle in the haystack, you can stop here or start over and look for another occurrence
I won't write the code as it's fairly trivial and this sounds like homework, but that should give you a nudge.
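For readers who want to see one shape this can take: the restart-on-mismatch steps above can miss a match when the needle repeats its own prefix (for example 112 in 1112), so the sketch below swaps in a slightly different technique, a sliding window held in six char variables that is shifted on every read. It is only an illustration under the stated constraints; the class name SixDigitSearch and the method findAll are made up, and the reported position is the zero-based offset where a match starts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SixDigitSearch {
    // n1..n6 are the six needle characters; each input character is read once.
    static List<Integer> findAll(BufferedReader in,
            char n1, char n2, char n3, char n4, char n5, char n6) {
        List<Integer> starts = new ArrayList<>();
        // The last six characters read; '\0' never equals a digit.
        char c1 = 0, c2 = 0, c3 = 0, c4 = 0, c5 = 0, c6 = 0;
        int count = 0; // number of characters read so far
        try {
            int next;
            while ((next = in.read()) != -1) {
                // Shift the window left by one and append the new character.
                c1 = c2; c2 = c3; c3 = c4; c4 = c5; c5 = c6;
                c6 = (char) next;
                count++;
                if (c1 == n1 && c2 == n2 && c3 == n3
                        && c4 == n4 && c5 == n5 && c6 == n6) {
                    starts.add(count - 6); // offset of the first matched char
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return starts;
    }

    public static void main(String[] args) {
        BufferedReader in =
                new BufferedReader(new StringReader("98712365478932145697"));
        System.out.println(findAll(in, '7', '8', '9', '3', '2', '1')); // prints [9]
    }
}
```

The list of offsets is there only to make the result easy to check; printing each position as it is found would work just as well.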
I have a scenario where I store data with ASCII code.
Example:
"UKI:PPP1ZZ.General to File¦WB"
Also I have a scenario where unknowingly some special characters get stored at the end of the line.
Example:
"UKI:PPP1ZZ.General.File.WELL ".
As can be seen in my second example, there is a special character after 'WELL', which gets stored in the database as a trailing special character by my Talend ETL job.
I wrote the following Java expression, to be used in Talend, to clean up the trailing special characters:
row1.sheetname.replaceAll("[^\\x00-\\x7F]","")
The problem with this expression is that it also removes the special character in my first example, which I want to keep.
I only want to remove the special characters at the end of my lines.
So is there any way to achieve this?
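row1.sheetname = row1.sheetname
.replaceFirst("(?u)([\u0000-\u001f\u007f]|\\P{ASCII})$","");
This removes the last character if it is an ASCII control character or a non-ASCII character (the capital P in \P means "not"). $ anchors the match at the end of the text. To remove a whole run of such trailing characters rather than just one, add + after the closing parenthesis of the group.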
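A standalone sketch of end-of-text trimming, using a positive class for the ASCII control characters plus \P{ASCII}, with + so that a whole run of trailing characters is removed. The class name TrailingCleanup and the sample strings are assumptions for illustration; in Talend only the replaceFirst call would be used.

```java
final class TrailingCleanup {
    // Removes control and non-ASCII characters only at the end of the string.
    static String stripTrailingSpecials(String s) {
        return s.replaceFirst("(?:[\\x00-\\x1F\\x7F]|\\P{ASCII})+$", "");
    }

    public static void main(String[] args) {
        // The broken bar (\u00A6) in the middle is kept.
        System.out.println(stripTrailingSpecials("UKI:PPP1ZZ.General to File\u00A6WB"));
        // The trailing no-break space (\u00A0) is removed.
        System.out.println(stripTrailingSpecials("UKI:PPP1ZZ.General.File.WELL\u00A0"));
    }
}
```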
I need some help on a Java assignment. We are given a scrambled text file, which was scrambled using a substitution cipher, where every letter in the text is simply swapped out for another letter. My program is almost finished, but I'm having trouble figuring out how to write the final "descramble" method, which takes the scrambled text and replaces each letter with its correct substitute in order to reveal the correct text.
These are the instructions provided in the assignment:
The descrambling is done by using the letter in the scrambled text as the index in the char array. For example, if the scrambled text has a letter B, you replace it with the character at index 2 in the array. All whitespace and punctuation from the original file should also be in the descrambled file; only the letters should have been changed. Additionally, if a letter was capitalized in the original file, it should be capitalized in the descrambled file (similarly, lowercase letters should still be lowercase).
I'm not asking to have the answer given to me, since this is for school. I just can't seem to understand these instructions properly. What exactly do I need to do to decode the text? Mostly, I don't understand how I can use a letter as an index into a char array. Aren't indexes always integers?
You didn't say what language you're working in, so I'll use C/Java. You'll want to compute an integer index. Assume for the moment that scrambled_char is an upper case letter then it's:
// index into descrambling array:
int index = scrambled_char - 'A' + 1;
This has value 1 for character A, 2 for B, etc. as the problem says. It sounds like you're being given the descrambling array. For example:
char descramble[] = "_ZYX ... ";
This would cause A to be translated to Z, B to Y, C to X, ...
The descrambled character will be
char descrambled_char = descramble[index];
Now you just need to work out how to handle lower case letters, white space, and punctuation.
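For reference, one possible way to put the pieces together in Java is sketched below. The key is a made-up example (the reversed alphabet, with an unused placeholder at index 0 so that A lands on index 1), and the class name Descrambler is hypothetical; the real array would come from the rest of the assignment.

```java
final class Descrambler {
    // Hypothetical key: index 0 is padding, index 1 replaces A, 2 replaces B, ...
    private static final char[] KEY =
            "_ZYXWVUTSRQPONMLKJIHGFEDCBA".toCharArray();

    static String descramble(String scrambled) {
        StringBuilder out = new StringBuilder(scrambled.length());
        for (int i = 0; i < scrambled.length(); i++) {
            char c = scrambled.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                // Upper case: 'A' - 'A' + 1 == 1, 'B' - 'A' + 1 == 2, ...
                out.append(KEY[c - 'A' + 1]);
            } else if (c >= 'a' && c <= 'z') {
                // Lower case: same index, then restore the lower case.
                out.append(Character.toLowerCase(KEY[c - 'a' + 1]));
            } else {
                // Whitespace and punctuation pass through unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(descramble("Svool, Dliow!")); // prints Hello, World!
    }
}
```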