Regex to extract Content-Type - java

How can I extract the lines with the Content-Type info? In some mails, these headers can span 2, 3, or even 4 lines, depending on how the mail was sent. This is one example:
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
I tried this regex: ^(Content-.*:(.|\n)*)* but it grabs everything.
How should I write my regex in Java to get only this part:
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit

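Pattern regex = Pattern.compile("^Content-Type(?:.|\\s)*?(?=\n\\s*\n)");
This matches everything from Content-Type (at the start of the input) up to the first empty line. The \\s* in the lookahead absorbs the \r of a CRLF line ending, so input terminated with either \n or \r\n works.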

You can try this regex
Pattern regex = Pattern.compile("Content-Type.*?(?=^\\s*\n?\r?$)",
Pattern.DOTALL | Pattern.MULTILINE);

^Content-(.|\n)*?\n\n
This will match up to and including the first blank line (the lazy *? stops at the first \n\n instead of running on to the last one).

Check out the relevant RFCs for the exact definition of headers. IIRC, in essence you need to consider every line that starts with one or more whitespace characters (e.g. space or tab) to be a continuation of the previous header line. I also believe that you should collapse the linebreak and the whitespace into a single whitespace element (note: there might be more complex rules, so check the RFCs).
Only if a new line starts directly with a non-whitespace character is it the next header, and if a linebreak is immediately followed by another linebreak, that ends the header section and starts the body section.
BTW: Why not just use JavaMail instead of reinventing the wheel?
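The folding rule described above can also be applied without a regex. Here is a minimal sketch, assuming that a continuation line starts with a space or tab and that the first empty line ends the header section. The names HeaderUnfolder and parseHeaders are made up for this example, and it simplifies in two ways: header names are treated as case-sensitive, and a repeated header overwrites the earlier one. For real mail, JavaMail remains the safer choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderUnfolder {

    /** Parses the header section (everything before the first empty line) into name -> value. */
    static Map<String, String> parseHeaders(String message) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        // Accept CRLF as well as bare LF line endings.
        for (String line : message.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // an empty line ends the header section
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && currentName != null) {
                // Folded line: append it to the previous header, joined by one space.
                headers.merge(currentName, line.trim(), (a, b) -> a + " " + b);
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    currentName = line.substring(0, colon).trim();
                    headers.put(currentName, line.substring(colon + 1).trim());
                }
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain;\r\n"
                + "\tcharset=\"us-ascii\"\r\n"
                + "Content-Transfer-Encoding: 7bit\r\n"
                + "\r\n"
                + "Lorem ipsum dolor sit amet\r\n";
        parseHeaders(mail).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Looking up "Content-Type" in the returned map then gives the whole header value on one line, however many lines it was folded over.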

This tested script works for me:
import java.util.regex.*;
public class TEST
{
public static void main( String[] args )
{
String subjectString =
"Content-Type: text/plain;\r\n" +
" charset=\"us-ascii\"\r\n" +
"Content-Transfer-Encoding: 7bit\r\n" +
"\r\n" +
"Lorem ipsum dolor sit amet, consectetur adipisicing elit,\r\n" +
"sed do eiusmod tempor incididunt ut labore et dolore magna\r\n" +
"aliqua. Ut enim ad minim veniam, quis nostrud exercitation\r\n" +
"ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n" +
"Duis aute irure dolor in reprehenderit in voluptate velit\r\n" +
"esse cillum dolore eu fugiat nulla pariatur. Excepteur sint\r\n" +
"occaecat cupidatat non proident, sunt in culpa qui officia\r\n" +
"deserunt mollit anim id est laborum.\r\n";
String resultString = null;
Pattern regexPattern = Pattern.compile(
"^Content-Type.*?(?=\\r?\\n\\s*\\n)",
Pattern.DOTALL | Pattern.CASE_INSENSITIVE |
Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regexPattern.matcher(subjectString);
if (regexMatcher.find()) {
resultString = regexMatcher.group();
}
System.out.println(resultString);
}
}
It works for text with both the valid \r\n line terminators and the (invalid, but commonly used in the wild) Unix-style \n line terminators.

Related

Java libraries that facilitate "stitching" together multiple multi-line strings

I am trying to stitch multiple multi-line strings together to create the effect of several columns of text. Consider the three text blocks below:
Lorem ipsum dolor si
t amet, consectetur
adipiscing elit, sed
do eiusmod tempor in
cididunt ut labore e
t dolore magna aliqu
a.
Volutpat consequat m
auris nunc congue ni
si vitae. Sed risus
ultricies tristique
nulla aliquet enim t
ortor at auctor.
Urna porttitor rhonc
us dolor purus non.
Interdum varius sit
amet mattis vulputat
e enim nulla.
The block width is fixed at 20 characters. Ignore the wrapping of words.
What I want to do is stitch or append these separate multi-line strings together to produce the following:
Lorem ipsum dolor si    Volutpat consequat m    Urna porttitor rhonc
t amet, consectetur     auris nunc congue ni    us dolor purus non.
adipiscing elit, sed    si vitae. Sed risus     Interdum varius sit
do eiusmod tempor in    ultricies tristique     amet mattis vulputat
cididunt ut labore e    nulla aliquet enim t    e enim nulla.
t dolore magna aliqu    ortor at auctor.
a.
In this case, the column spacing is 4 characters wide.
Is anyone aware of a Java library or utility that facilitates this? If not implemented in Java, is there anything that could do this, that could be invoked from Java code?
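This can be done with plain string handling, so a library is not strictly required. Below is a minimal sketch, assuming every block line is at most `width` characters long (longer lines are not truncated here and would push the following columns to the right). ColumnStitcher and stitch are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnStitcher {

    /** Places the given multi-line blocks side by side, each padded to width + gap characters. */
    static String stitch(List<String> blocks, int width, int gap) {
        List<String[]> columns = new ArrayList<>();
        int rows = 0;
        for (String block : blocks) {
            String[] lines = block.split("\r?\n");
            columns.add(lines);
            rows = Math.max(rows, lines.length);
        }
        StringBuilder out = new StringBuilder();
        for (int row = 0; row < rows; row++) {
            StringBuilder line = new StringBuilder();
            for (String[] column : columns) {
                // Blocks shorter than the tallest one contribute an empty cell.
                String cell = row < column.length ? column[row] : "";
                line.append(cell);
                for (int i = cell.length(); i < width + gap; i++) {
                    line.append(' ');
                }
            }
            // Drop the padding that follows the last non-empty column.
            out.append(line.toString().replaceAll("\\s+$", "")).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String left = "Lorem ipsum dolor si\nt amet, consectetur\nadipiscing elit, sed";
        String middle = "Volutpat consequat m\nauris nunc congue ni";
        String right = "Urna porttitor rhonc\nus dolor purus non.";
        System.out.print(stitch(Arrays.asList(left, middle, right), 20, 4));
    }
}
```

Padding each cell to width + gap keeps later columns aligned even when an earlier block has run out of lines.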

Check if one List<String> contains specific string from another list

I have a List<String> emails containing emails, of length n, and another List<String> keywords containing keywords, of the same length. These lists should meet the following condition: for each index i, emails.get(i).contains(keywords.get(i))
So, if emails.get(0) == "quick brown fox", then keywords.get(0) == "fox".
if emails.get(5) == "foo bar", then keywords.get(5) == "foo".
How can I check (other than with a for loop) that each email contains its keyword?
First check that both lists have the same size; then use IntStream to compare the corresponding list items:
public static boolean allKeywordsFound(List<String> emails, List<String> keywords) {
return emails.size() == keywords.size() &&
IntStream.range(0, emails.size())
.allMatch(i -> emails.get(i).contains(keywords.get(i)));
}
I see that others have already answered your question correctly, but here's my take on the issue.
I presume you want the emails to be checked in order, so here's a piece of code that uses the Stream API instead of a for loop. Since you didn't specify whether you want a single boolean for all the emails together or one boolean per email (for the keyword at the same position), I also put the emails and the results together in a Map:
//mock data initialization
List<String> emails = new ArrayList<>();
List<String> keywords = new ArrayList<>();
//mock data initialization
emails.add("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua");
emails.add("eu lobortis elementum nibh tellus molestie nunc non blandit massa enim nec dui nunc mattis enim ut tellus elementum sagittis");
emails.add("Dignissim suspendisse in est ante in nibh mauris");
//mock data initialization
keywords.add("consectetur");
keywords.add("Foo");
keywords.add("Dignissim");
//a list holding, for each email, whether it contains the keyword at the same index
List<Boolean> exists = new ArrayList<>();
//fill it by index (emails.indexOf(email) would return the wrong position for duplicate emails)
IntStream.range(0, emails.size())
    .forEach(i -> exists.add(emails.get(i).contains(keywords.get(i))));
//since I don't know what you want to do with the result, I decided to just put them together in a Map
//with the email string as the key and the existence flag as the value
Map<String, Boolean> mapOfTruth = new LinkedHashMap<>();
IntStream.range(0, emails.size())
    .forEach(i -> mapOfTruth.put(emails.get(i), exists.get(i)));
Output
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua = true
eu lobortis elementum nibh tellus molestie nunc non blandit massa enim nec dui nunc mattis enim ut tellus elementum sagittis = false
Dignissim suspendisse in est ante in nibh mauris = true
This code uses Java streams to check whether each email contains its respective keyword. It walks the lists by index, because emails.indexOf(email) would point at the wrong keyword when the same email appears twice.
boolean allEmailsContainKeyword(List<String> emails, List<String> keywords) {
    return !IntStream.range(0, emails.size())
        .mapToObj(i -> emails.get(i).contains(keywords.get(i)))
        .collect(Collectors.toList())
        .contains(false);
}

time complexity Java

The program finds the maximum number of words in a sentence of a given text. A text can contain multiple sentences, and I have to find the sentence with the most words.
I have the following code and need to improve its time complexity;
it should not take more than 5 seconds.
import java.util.*;
import java.io.*;
class Solution {
public int solution(String S) {
// write your code in Java SE 8
List<Integer> wca=new ArrayList<Integer>();
int wc,i;
String[] sent=S.split("\\.+");
while(sent.length!=0){
for(i=0;i<sent.length;i++){
wc=sent[i].split("\\s+").length;
wca.add(wc);
}
}
Collections.sort(wca);
return(wca.get(wca.size()-1));
}
}
You don't need to sort the list to simply find its largest value. In fact you don't need a list at all. Simply store the longest sentence as you go along.
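public int findLongestSentence(String paragraph) {
    String[] sentences = paragraph.split("[.!?]");
    int maxSentenceLength = 0;
    for (String sentence : sentences) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            continue; // skip empty pieces, e.g. between two consecutive terminators
        }
        // trim first so the space after the previous terminator is not counted as a word
        String[] words = trimmed.split("\\s+");
        maxSentenceLength = Math.max(words.length, maxSentenceLength);
    }
    return maxSentenceLength;
}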
This could be made more efficient by not using the split() method, but that would not affect the asymptotic time complexity.
P.S. Informative variable names are important and, along with good code formatting, make your code much easier to read.
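To illustrate the remark about avoiding split(), here is a minimal single-pass sketch. It assumes, like the code above, that '.', '!' and '?' end a sentence and that words are separated by whitespace. The names SentenceWordCounter and maxWordsInSentence are invented for this example.

```java
final class SentenceWordCounter {

    /** Returns the largest number of words found in any sentence of the text. */
    static int maxWordsInSentence(String text) {
        int max = 0;
        int current = 0;        // words seen so far in the current sentence
        boolean inWord = false; // true while the scan is inside a word
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator || Character.isWhitespace(c)) {
                inWord = false;
            } else if (!inWord) {
                // First character of a new word.
                inWord = true;
                current++;
            }
            if (terminator) {
                max = Math.max(max, current);
                current = 0;
            }
        }
        // The last sentence may lack a terminator.
        return Math.max(max, current);
    }

    public static void main(String[] args) {
        System.out.println(maxWordsInSentence("We test coders. Give us a try?")); // prints 4
    }
}
```

This reads each character once and allocates no arrays, but it is still O(n) in the length of the text, just like the split-based version.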
The program counts maximum number of words in a sentence
Suppose you have this text:
Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Split by dot (.)
arr[0]= Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua
arr[1]= Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat
arr[2]= Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
arr[3]= Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
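The number of words is related to the number of spaces, so counting the spaces is easier.
int max = 0; // holds the maximal space count
int index = 0; // holds the index of the sentence with the maximal space count
Iterate over the array:
for (int i = 0; i < arr.length; i++) {
    String sentence = arr[i].trim(); // every piece after the first starts with a space
    int spaces = sentence.length() - sentence.replace(" ", "").length();
    if (spaces > max) {
        max = spaces;
        index = i;
    }
}
At the end of that loop, max holds the largest space count and index the position of that sentence in the array. Assuming words are separated by single spaces, the word count of that sentence is max + 1.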
From what I understand, you want to parse an input text, get the word count of each sentence, and find the sentence with the highest one.
First of all, you are only returning the highest word count, nothing that identifies the sentence itself.
Second (as others have already pointed out), the sorting can be replaced by keeping only the longest sentence so far and replacing it whenever a longer one is found. That would indeed bring it to O(n).
Third, sentences don't only end with periods.
String longest = "";
int longestCount = 0;
for (String s : sentences) {
    String trimmed = s.trim();
    int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    if (count > longestCount) {
        longestCount = count;
        longest = trimmed;
    }
}
return longest;

How to split a string (with params) in Java

I have a string that alternates between text and chapter marks. I'd like to have it in a key-value-array where the key is the chapter name and the value is the chapter content. The text looks like this:
<chapter name="First chapter" />
Lorem ipsum dolor sit amet, consetetur sadipscing elitr.
<chapter name="Second chapter" />
Sed diam nonumy eirmod tempor invidunt ut labore et.
<chapter name="Third chapter" />
Dolore magna aliquyam erat, sed diam voluptua.
The resulting array is supposed to look like this:
[
{"First chapter", "Lorem ipsum dolor sit amet, consetetur sadipscing elitr."},
{"Second chapter", "Sed diam nonumy eirmod tempor invidunt ut labore et."},
{"Third chapter", "Dolore magna aliquyam erat, sed diam voluptua."}
]
How can I do this?
You can use a regular expression to locate the chapter name and the content. Your case is well suited to that.
The link below has a summary of regex in Java.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
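As a rough illustration of the regex approach, here is a minimal sketch. It assumes every chapter mark looks exactly like `<chapter name="..." />` and that a chapter's content runs until the next mark or the end of the text. ChapterSplitter and split are hypothetical names, and a LinkedHashMap stands in for the key-value array so the chapters keep their order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChapterSplitter {

    // Group 1: chapter name. Group 2: text up to the next chapter tag or the end of input.
    private static final Pattern CHAPTER = Pattern.compile(
            "<chapter\\s+name=\"([^\"]*)\"\\s*/>(.*?)(?=<chapter\\s|\\z)",
            Pattern.DOTALL);

    /** Returns chapter name -> chapter content, in document order. */
    static Map<String, String> split(String text) {
        Map<String, String> chapters = new LinkedHashMap<>();
        Matcher matcher = CHAPTER.matcher(text);
        while (matcher.find()) {
            chapters.put(matcher.group(1), matcher.group(2).trim());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String text = "<chapter name=\"First chapter\" />\n"
                + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr.\n"
                + "<chapter name=\"Second chapter\" />\n"
                + "Sed diam nonumy eirmod tempor invidunt ut labore et.\n";
        split(text).forEach((name, content) -> System.out.println(name + " -> " + content));
    }
}
```

Note that two chapters with the same name would overwrite each other in a map; a list of pairs would be needed if names can repeat.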
As suggested by #devd with this posting, the solution to the above case is XPath. There is an example here.

How to match any word but ignore those that starts with multiple whitespaces?

What I am trying to achieve is to match all words in a text, but ignore the words on any line that starts with 4 whitespaces.
Example
Text file to find words:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat.
    This must NOT be matched. Because it has 4 whitespaces at the beginning.
Lorem ipsum dolor sit amet. Ut enim ad minim veniam.
So the words in the following line should NOT be matched:
    This must NOT be matched. Because it has 4 whitespaces at the beginning.
Code
Here is my regex and it can find all words:
\\b[A-Za-z]+\\b
I know that Java's regex syntax has the ^ symbol for negation, but I only know how to use it in simpler expressions.
Maybe the following snippet could be a basis for what you want to achieve.
String[] lines = {"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do",
    "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut",
    "enim ad minim veniam, quis nostrud exercitation ullamco laboris",
    "nisi ut aliquip ex ea commodo consequat.",
    "",
    "    This must NOT be matched. Because it has 4 whitespaces at the beginning.",
    "",
    "Lorem ipsum dolor sit amet. Ut enim ad minim veniam."};
for (String line : lines) {
    // skip every line that starts with four spaces
    if (!line.startsWith("    ")) {
        String[] words = line.split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");
        System.out.println("words = " + Arrays.toString(words));
    }
}
output
words = [Lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, sed, do]
words = [eiusmod, tempor, incididunt, ut, labore, et, dolore, magna, aliqua, Ut]
words = [enim, ad, minim, veniam, quis, nostrud, exercitation, ullamco, laboris]
words = [nisi, ut, aliquip, ex, ea, commodo, consequat]
words = []
words = []
words = [Lorem, ipsum, dolor, sit, amet, Ut, enim, ad, minim, veniam]
PS: the regex has been borrowed from this answer
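The following should do that:
(?<!\\s{4})\\b[A-Za-z]+\\b
It begins with a negative lookbehind, so it won't match a word that has \s{4} directly before it. Note that the lookbehind only looks at the four characters in front of each word, so it excludes the first word of an indented line but not the rest of that line.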
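A lookbehind only inspects the characters directly in front of a word, so one way to skip every word on an indented line is to filter the lines first and then look for words. Here is a minimal regex-only sketch, assuming that "4 whitespaces" means four space characters at the very start of a line. UnindentedWords and words are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnindentedWords {

    // A whole line that does not begin with four spaces.
    private static final Pattern LINE = Pattern.compile("^(?! {4}).*$", Pattern.MULTILINE);
    // The word pattern from the question.
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z]+\\b");

    /** Returns all words found on lines that are not indented by four spaces. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher lines = LINE.matcher(text);
        while (lines.find()) {
            Matcher words = WORD.matcher(lines.group());
            while (words.find()) {
                result.add(words.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet.\n"
                + "\n"
                + "    This must NOT be matched.\n"
                + "\n"
                + "Ut enim ad minim veniam.";
        System.out.println(words(text));
    }
}
```

The negative lookahead at the start of each line rejects the whole indented line, so none of its words ever reach the word pattern.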
