Split pdf into sections based on titles/bookmarks using regex

I am reading a PDF and extracting its text into an ArrayList. I also collect all the bookmarks from the PDF, which in this case are the titles of each section, and add them to a list. I want to extract each section with a regex based on the titles/bookmarks. Below is what I have tried so far.
for (String text1 : texts) {
    for (int title = 0; title < titles.size(); title++) {
        try {
            // Regex pattern to find the text between two titles
            Pattern p = Pattern.compile(titles.get(title) +
                    ".*(\\n.*)+?(?=" + titles.get(title + 1) + ')');
            // the issue here is that the title+1 goes over the size of the titles
            Matcher matcherTitle = p.matcher(text1);
            // System.out.println(p.matcher(text));
            for (int i = 1; i <= matcherTitle.groupCount(); i++) {
                System.out.println(matcherTitle.group(i));
            }
        } catch (Exception e) {
            Pattern p = Pattern.compile(titles.get(title) +
                    "(.|\\n)\\.*");
            Matcher matcherTitle = p.matcher(text1);
            // System.out.println(p.matcher(text).results().toString());
            // System.out.println(titles.get(title) + " This title is the last Title");
        }
    }
}
Assume that titles is the list of titles and texts is the list of text. Unfortunately, nothing is printed. In the end I would like to write each section to a txt file named after the section's title.
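One likely reason nothing is printed: group(i) is called on a Matcher that has never run find(), which throws an IllegalStateException, and the catch block swallows it silently. The following is a minimal sketch of one way to split the text, not a definitive fix. It assumes the whole document is available as a single String and that each title occurs in the text in the same order as in the list. The class name SectionSplitter and the method splitByTitles are hypothetical names chosen for illustration. Titles are wrapped in Pattern.quote so that characters such as parentheses or dots are matched literally, and the last title is handled by matching up to the end of the input instead of reading titles.get(title + 1).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionSplitter {

    // Returns title -> section body, in the order of the titles list.
    static Map<String, String> splitByTitles(String text, List<String> titles) {
        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            // Quote the title so regex metacharacters in it are treated literally.
            String start = Pattern.quote(titles.get(i));
            // Stop before the next title, or at the end of input for the last title.
            String end = (i + 1 < titles.size())
                    ? "(?=" + Pattern.quote(titles.get(i + 1)) + ")"
                    : "\\z";
            // DOTALL lets '.' match line breaks; '.*?' keeps the match as short as possible.
            Matcher m = Pattern.compile(start + "(.*?)" + end, Pattern.DOTALL).matcher(text);
            if (m.find()) { // find() must succeed before group() may be called
                sections.put(titles.get(i), m.group(1).trim());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Intro\nfirst body\nMethods (v1.0)\nsecond body\nResults\nthird";
        Map<String, String> sections =
                splitByTitles(text, List.of("Intro", "Methods (v1.0)", "Results"));
        sections.forEach((title, body) -> System.out.println("[" + title + "] " + body));
    }
}
```

Each map entry could then be written to its own file, for example with java.nio.file.Files.writeString (Java 11+), after replacing any characters in the title that are not valid in a file name. Note that the sketch takes the first occurrence of each title, so a title that also appears inside an earlier body would need extra handling.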

Related

Why does formatting a JList cell using HTML lead to the text being centered in the cell?

I have a JList in which I format the text with HTML before adding it to a JList cell. I'm doing this because I don't want to write a complex cell renderer: I only need two separate lines in each cell, so HTML seemed quicker and easier for such a simple requirement. When I run this, the text is formatted correctly, but it does not start at the edge of the cell. I assume this is because the HTML takes up extra whitespace, in which case I assume I can't fix that?
public static void receiveDataEmailList(String data) {
Scanner scLine = new Scanner(data).useDelimiter("&");
int num = scLine.nextInt();
String[] emails = new String[num];
for (int i = 0; i < num; i++) {
emails[i] = "" + "<html><ul style=\"list-style-type:none;\"><li style=\"font-size:10px\">" + scLine.next() + "</li><li style=\"font-size:8px\">" + "Subject: " + "Hello" + "</li></ul></html>";
}
EmailList.setListOfEmails(emails);
}

Length of String within tags in java

We need to find the length of the tag names within the tags in Java:
{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}
So the length of the Student tag is 7, that of the Subject tag is 7, and that of the Marks tag is 5.
I am trying to split the tags and then find the length of each string within the tag.
But the code I am trying gives me only the first tag name and not the others.
Can you please help me with this?
I am very new to Java. Please let me know if this is a very silly question.
Code part:
System.out.println(
getParenthesesContent("{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}"));
public static String getParenthesesContent(String str) {
return str.substring(str.indexOf('{')+1,str.indexOf('}'));
}

You can use Patterns with this regex \\{([a-zA-Z]*)\\} :
String text = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
Matcher matcher = Pattern.compile("\\{([a-zA-Z]*)\\}").matcher(text);
while (matcher.find()) {
    System.out.println(
        String.format(
            "tag name = %s, Length = %d ",
            matcher.group(1),
            matcher.group(1).length()
        )
    );
}
Outputs
tag name = Student, Length = 7
tag name = Subject, Length = 7
tag name = Marks, Length = 5

You might want to give a try to another regex:
String s = "{Abc}{Defg}100{Hij}100{/Klmopr}{/Stuvw}"; // just a sample String
Pattern p = Pattern.compile("\\{\\W*(\\w++)\\W*\\}");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1) + ", length: " + m.group(1).length());
}
Output you get:
Abc, length: 3
Defg, length: 4
Hij, length: 3
Klmopr, length: 6
Stuvw, length: 5
If you need to use charAt() to walk over the input String, you might want to consider using something like this (I made some explanations in the comments to the code):
String s = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
ArrayList<String> tags = new ArrayList<>();
for (int i = 0; i < s.length(); i++) {
    // Use StringBuilder and its append() method to build the tag name (it's more efficient than "+=")
    StringBuilder sb = new StringBuilder();
    if (s.charAt(i) == '{') { // If start of tag is found...
        while (i < s.length() && !Character.isLetter(s.charAt(i))) { // Skip characters that are not letters
            i++;
        }
        while (i < s.length() && Character.isLetter(s.charAt(i))) { // Append the letters that are found
            sb.append(s.charAt(i));
            i++;
        }
        if (!tags.contains(sb.toString())) { // Add the final String to the ArrayList only if it is not contained there yet
            tags.add(sb.toString());
        }
    }
}
for (String tag : tags) { // Print the Strings contained in the ArrayList and their lengths
    System.out.println(tag + ", length: " + tag.length());
}
Output you get:
Student, length: 7
Subject, length: 7
Marks, length: 5

Yes, use a regular expression: find the pattern and apply it.

Java Regex to find in a Text all the possible pairs of a list

I have a List of Strings containing names and surnames, and I have a free text.
List<String> names; // contains: "jon", "snow", "arya", "stark", ...
String text = "jon snow and stark arya";
I have to find all the names and surnames, possibly with a Java regex (so using Pattern and Matcher objects). So I want something like:
List<String> foundNames; // contains: "jon snow", "stark arya"
I have done this in two possible ways without using regex. They are not static because they are part of a NameFinder class that has a list "names" containing all the names.
public List<String> findNamePairs(String text) {
List<String> foundNamePairs = new ArrayList<String>();
List<String> names = this.names;
text = text.toLowerCase();
for (String name : names) {
String nameToSearch = name + " ";
int index = text.indexOf(nameToSearch);
if (index != -1) {
String textSubstring = text.substring(index + nameToSearch.length());
for (String nameInner : names) {
if (name != nameInner && textSubstring.startsWith(nameInner)) {
foundNamePairs.add(name + " " + nameInner);
}
}
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}
or in a worse (very bad) way (creating all the possible pairs):
public List<String> findNamePairsInTextNotOpt(String text) {
List<String> foundNamePairs = new ArrayList<String>();
text = text.toLowerCase();
List<String> pairs = getNamePairs(this.names);
for (String name : pairs) {
if (text.contains(name)) {
foundNamePairs.add(name);
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}

You can create a regex using the list of names and then use find to find the names. To ensure you don't have duplicates, you can check if the name is already in the list of found names. The code would look like this.
List<String> names = Arrays.asList("jon", "snow", "stark", "arya");
String text = "jon snow and Stark arya and again Jon Snow";
StringBuilder regexBuilder = new StringBuilder();
for (int i = 0; i < names.size(); i += 2) {
    regexBuilder.append("(")
                .append(names.get(i))
                .append(" ")
                .append(names.get(i + 1))
                .append(")");
    if (i != names.size() - 2) regexBuilder.append("|");
}
System.out.println(regexBuilder.toString());
Pattern compile = Pattern.compile(regexBuilder.toString(), Pattern.CASE_INSENSITIVE);
Matcher matcher = compile.matcher(text);
List<String> found = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
    String match = matcher.group().toLowerCase();
    if (!found.contains(match)) found.add(match);
    start = matcher.end();
}
for (String s : found) System.out.println("found: " + s);
If you want the match to be case sensitive, just remove the flag in Pattern.compile(). If all matches have the same capitalization, you can omit the toLowerCase() in the while loop as well.
But make sure that the list contains an even number of elements (name and surname), as the for-loop will throw an IndexOutOfBoundsException otherwise. The order also matters in my code: it will only find the name pairs in the order in which they occur in the list. If you want to match both orders, you can change the regex generation accordingly.
Edit: As it is unknown whether an entry is a name or a surname, and which entries form a name/surname pair, the regex generation must be done differently.
StringBuilder regexBuilder = new StringBuilder("(");
for (int i = 0; i < names.size(); i++) {
regexBuilder.append("(")
.append(names.get(i))
.append(")");
if (i != names.size() - 1) regexBuilder.append("|");
}
regexBuilder.append(") ");
regexBuilder.append(regexBuilder);
regexBuilder.setLength(regexBuilder.length() - 1);
System.out.println(regexBuilder.toString());
This regex will match any of the given names followed by a space and then again any of the names.
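To show the "any name, a space, any name" idea end to end, here is a minimal sketch. It is one possible variant rather than the answer's exact code. The class name NamePairFinder and the method findNamePairs are hypothetical, the names list is assumed to be non-empty, and the word boundaries (\b) and Pattern.quote calls are additions that were not part of the regex above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NamePairFinder {

    // Finds every "<name> <name>" pair in the text, in either order, without duplicates.
    static List<String> findNamePairs(List<String> names, String text) {
        // Build "jon|snow|..." with each name quoted so it is matched literally.
        String alternation = names.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Any name, one space, any name; \b keeps matches from starting or ending mid-word.
        Pattern pattern = Pattern.compile(
                "\\b(?:" + alternation + ") (?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        // LinkedHashSet removes duplicates while keeping the order of first occurrence.
        Set<String> found = new LinkedHashSet<>();
        while (matcher.find()) {
            found.add(matcher.group().toLowerCase());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        List<String> names = List.of("jon", "snow", "stark", "arya");
        String text = "jon snow and Stark arya and again Jon Snow";
        for (String pair : findNamePairs(names, text)) {
            System.out.println("found: " + pair);
        }
    }
}
```

Because any name may appear in either position, this pattern would also accept a repeated name such as "jon jon"; filtering those out would be an extra step.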

extract text between matches using java

Here's my input text
1. INTRODUCTION
This is a test document. This document lines can span multiple lines.
This is another line.
2. PROCESS
This is a test process. This is another line.
3. ANOTHER HEADING
...
I want to extract text between the main titles, 1,2,3 and so on. I am using this regular expression to match the titles - ^[ ]{0,2}?[0-9]{0,2}\\.(.*)$
How do I extract text between matches?
EDIT
I tried using this code -
while(matcher.find()) {
}
If I look ahead for the next match's starting index in this while loop, it will change the state of the Matcher. How do I get the text in between using String.substring? I need the end of the current match and the beginning of the next match to take a substring.

How do I extract text between matches?
Do you mean between 1. INTRODUCTION and 2. PROCESS and so on? If so, if the next line is not a "header" line, add the text to some buffer. If it is a header, add the buffer to a running list and then clear the buffer.
Something like (in pseudo code)
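List<String> content
currentContent = ""
while line = readNextLine()
    if not matched header
        currentContent += line
    else
        // found new header: add the collected content to the list, then clear it
        if currentContent != ""
            content.add(currentContent)
            currentContent = ""
// after the loop, add whatever was collected under the last header
if currentContent != ""
    content.add(currentContent)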
Edit: as one big string
// Split the text into lines
String[] bits = yourString.split("\\n");
String currentContent = ""; // Text between headers
List<String> content = new ArrayList<String>(); // Running list of text between headers
// Loop through each line
for (String bit : bits) {
    Matcher m = yourPattern.matcher(bit);
    if (m.matches()) {
        // Found a header
        if (currentContent.length() != 0) {
            content.add(currentContent);
            currentContent = "";
        }
    } else {
        // Not a header, just append the line (split() removed the line break, so put it back)
        currentContent += bit + "\n";
    }
}
// Add the text that follows the last header
if (currentContent.length() != 0) {
    content.add(currentContent);
}
Something like that would work. I suppose you could write a complicated multi-line regex, but this seems easier to me.

How about this:
String text =
      " 1. INTRODUCTION\n"
    + " This is a test document. This document lines can span multiple lines.\n"
    + " This is another line.\n"
    + " 2. PROCESS\n"
    + " This is a test process. This is another line.\n"
    + " 3. ANOTHER HEADING\n";
Pattern pat = Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);
Matcher m = pat.matcher(text);
int start = 0;
while (m.find()) {
    if (start < m.start()) {
        System.out.println("*** paragraphs:");
        System.out.println(text.substring(start, m.start()));
    }
    System.out.println("*** title:");
    System.out.println(m.group());
    start = m.end();
}
The results are:
*** title:
1. INTRODUCTION
*** paragraphs:
This is a test document. This document lines can span multiple lines.
This is another line.
*** title:
2. PROCESS
*** paragraphs:
This is a test process. This is another line.
*** title:
3. ANOTHER HEADING
You may wish to remove the newlines before and after the paragraphs.
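A minimal sketch of the same start/end bookkeeping that also trims the surrounding newlines and keeps each body paired with its title. The class name HeadingSections and the method sections are hypothetical. The heading pattern is a slightly stricter variant of the one in the question, assuming that every heading has at least one digit before the dot.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingSections {

    // Assumed heading shape: up to two spaces, one or two digits, a dot, then the rest of the line.
    private static final Pattern HEADING =
            Pattern.compile("^[ ]{0,2}[0-9]{1,2}\\..*$", Pattern.MULTILINE);

    // Returns heading -> trimmed body text, in document order.
    static Map<String, String> sections(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(text);
        String currentTitle = null;
        int bodyStart = 0;
        while (m.find()) {
            if (currentTitle != null) {
                // The body runs from the end of the previous heading to the start of this one.
                result.put(currentTitle, text.substring(bodyStart, m.start()).trim());
            }
            currentTitle = m.group().trim();
            bodyStart = m.end();
        }
        if (currentTitle != null) {
            // The last section runs to the end of the text.
            result.put(currentTitle, text.substring(bodyStart).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "1. INTRODUCTION\nThis is a test document.\nThis is another line.\n"
                + "2. PROCESS\nThis is a test process.\n3. ANOTHER HEADING\n";
        sections(text).forEach((title, body) ->
                System.out.println(title + " -> [" + body + "]"));
    }
}
```

Only one Matcher is needed: the end of the previous match is remembered in bodyStart, so there is no need to look ahead and disturb the Matcher's state. A body line that itself starts with a number and a dot would be treated as a heading by this pattern.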

Filter words from string

I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. Now at the moment I am doing if(message.contains(omg)) but that doesn't work if someone types zomg or -omg or similar.

Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
    List<String> badWords = Arrays.asList( "omg", "black", "white" );
    String message = "hey guys lol omg -omg mkdj*Omg*ndid";
    for ( String badWord : badWords ) {
        message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
    }
    System.out.println( message );
}

try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")

Dave gave you the answer already, but I will emphasize the point here. You will run into a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. For example, if you filter the word ass inside the word 'classic' and replace it with 'butt', the resulting word will be 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether the word is valid or needs filtering.
I don't quite get what you are trying to do.
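One way to avoid the 'clbuttic' effect without a dictionary is to replace only whole words. The following is a minimal sketch under that assumption. The class name WordFilter and the method replaceWholeWords are hypothetical, and the word list is assumed to be non-empty.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordFilter {

    // Replaces each listed word only where it stands as a whole word, ignoring case.
    static String replaceWholeWords(String message, List<String> words, String replacement) {
        // Quote every word so it is matched literally, then join them into one alternation.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b on both sides means the word must not be embedded in a longer word.
        Pattern pattern = Pattern.compile(
                "\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
        return pattern.matcher(message).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWholeWords("a classic ass", List.of("ass"), "butt"));
    }
}
```

The trade-off is that a word glued to other letters, such as zomg, is no longer caught, so this suits cases where false positives matter more than evasion.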

I ran into this same problem and solved it in the following way:
1) Have a google spreadsheet with all words that I want to filter out
2) Directly download the google spreadsheet into my code with the loadConfigs method (see below)
3) Replace all l33tsp33k characters with their respective alphabet letter
4) Replace all special characters but letters from the sentence
5) Run an algorithm that efficiently checks all the possible combinations of words within a string against the list. Note that this part is key: you don't want to loop over your ENTIRE list every time to see if your word is in the list. In my case, I found every combination within the string input and checked it against a hashmap (O(1) lookup). This way the runtime grows with the length of the string input, not with the size of the list.
6) Check if the word is not used in combination with a good word (e.g. bass contains *ss). This is also loaded through the spreadsheet
7) In our case we are also posting the filtered words to Slack, but you can remove that line obviously.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
 * Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
 * @param input the text to check
 * @return the list of bad words found in the input
 */
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("#","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}
