I use Apache PDFBox to extract text from a PDF file. I am trying to get the line that follows a specific line.
PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println("Text from pdf:" + text);
} else{
log.info("File is encrypted!");
}
document.close();
Sample:
Sentence 1, nth line of file
Needed line
Sentence 3, n+2th line of file
I tried to get all the lines of the file into an array, but that is unreliable because I cannot filter it for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.
Solution 1:
String[] lines = myString.split(System.getProperty("line.separator"));
Solution 2:
String neededline = (String) FileUtils.readLines(file).get("n+2th")
In fact, the source code for the PDFTextStripper class uses the exact same line ending as you, so your first attempt is as close to correct as possible using PDFBox.
You see, the PDFTextStripper getText method calls the writeText method which just writes to an output buffer line by line with the writeString method in the exact same way as you have already tried. The result returned from this method is the buffer.toString().
Therefore, given a well formatted PDF, it would seem the question you are really asking is how to filter an array for specific text. Here are some ideas:
First, capture the lines in an array, as you described.
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Main {
static String[] lines;
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("my2.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
lines = text.split(System.getProperty("line.separator"));
document.close();
}
}
Here's a method to get a complete String by any line number index, easy:
// returns a full String line by number n
static String getLine(int n) {
return lines[n];
}
Here's a linear search method that finds a string match and returns the first line number where found.
// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
int n = 0;
for(String line : lines) {
if(line.indexOf(filter) != -1) {
return n;
}
n++;
}
return -1;
}
With the above, it is possible to get a line by its line number:
System.out.println(getLine(8)); // line 8 for example
Or, the entire String line that contains your matched search:
System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
This all seems pretty straightforward, but it works only under the assumption that the text can be split into an array of lines by the line separator. If the solution is not as simple as the above ideas, I believe the source of your problem may not be your use of PDFBox but rather the PDF you are trying to text mine.
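Since the question asks for the line that follows a specific line, the two helpers above can be combined into one lookup. The following is only a minimal sketch: the class name and the `lineAfter` helper are hypothetical, the text is hard-coded instead of coming from `PDFTextStripper`, and it splits on `\R` (any line break) rather than on the platform line separator.

```java
class LineAfterMatch {
    // Hypothetical helper: returns the line that follows the first line
    // containing `filter`, or null if there is no match or no following line.
    static String lineAfter(String[] lines, String filter) {
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(filter)) {
                return lines[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by stripper.getText(document)
        String text = "Sentence 1, nth line of file\n"
                + "Needed line\n"
                + "Sentence 3, n+2th line of file";
        // \R matches \n, \r\n and \r, so the split does not depend on the OS
        String[] lines = text.split("\\R");
        System.out.println(lineAfter(lines, "Sentence 1"));
    }
}
```

Returning null instead of indexing blindly avoids an ArrayIndexOutOfBoundsException when the filter text is not found or is on the last line.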
Here's a link to a tutorial that also does what you are trying to do:
https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/
Again, same approach...
I'm writing a bidi String to an MS Word file using Apache POI after wrapping it with the sequence
aString = "\u202E" + aString + "\u202C";
The text renders correctly in the file, and reads fine when I retrieve the string again. But if I modify the file in any way, reading that string suddenly returns true for isBlank().
Thank you in advance for any suggestions/help!
When Microsoft Word stores bidirectional text in its Office Open XML *.docx format, it sometimes uses special elements w:bdo (bidirectional override) that wrap the text runs. Apache POI has not read those elements so far. So if an XWPFParagraph contains such elements, paragraph.getText() will return an empty string.
One could use org.apache.xmlbeans.XmlCursor to really get all text from all XWPFParagraphs, like so:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlCursor;
public class ReadWordParagraphs {
static String getAllTextFromParagraph(XWPFParagraph paragraph) {
XmlCursor cursor = paragraph.getCTP().newCursor();
return cursor.getTextValue();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocument.docx"));
for (XWPFParagraph paragraph : document.getParagraphs()) {
System.out.println(paragraph.getText()); // will not return text in w:bdo elements
System.out.println(getAllTextFromParagraph(paragraph)); // will return all text content of paragraph
}
}
}
In a CSV file I have a record that renders like this:
,"SKYY SPA MARTINI
2 oz. SKYY Vodka
Fresh cucumber
Fresh mint
Splash of simple syrup
Muddle cucumber & mint with syrup.
Add SKYY Vodka and shake with ice.
Strain into a chilled martini glass.
Garnish with a fresh mint sprig and cucumber slice.",
with each line ending in an LF (line feed) character.
I thought the quoted value would be treated as a single string and the line breaks inside it wouldn't be treated as new lines, but this isn't the case, and it is breaking my script. Is there a way to have the reader treat line breaks as record separators only when they are not enclosed in quotes? This is my current code; I couldn't find a setting for the tokenizer that would allow this.
// instantiate description line mapper
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
DefaultLineMapper<LCBOProduct> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(lineTokenizer);
lineMapper.setFieldSetMapper(fieldSetMapper);
// set description line mapper
reader.setLineMapper(lineMapper);
return reader;
Inspired by this CSV regex post, I have written a quick-and-dirty method for doing this:
public static void main(String[] args) {
String line = "\"BEEP\",\"BOOP\",\"TWO SHOTS\rOF VODKA\"\r\"BOOP\",\"BEEP\",\"LEMON\rWEDGES\"";
String quote = "\"";
String splitter = "\r";
String delimiter = ",";
parse(line, delimiter, quote, splitter);
}
public static void parse(String data, String delimiter, String quote, String splitter) {
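// split on `splitter` only when it is followed by an even number of quote characters
String regex = splitter + "(?=(?:[^" + quote + "]*" + quote + "[^" + quote + "]*" + quote + ")*[^" + quote + "]*$)";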
String[] lines = data.split(regex, -1);
List<String[]> records = new ArrayList<String[]>();
for(String line : lines) {
records.add(line.split(delimiter, -1));
}
for(String[] line : records) {
for(String record : line) {
System.out.println("RECORD: " + record); //do whatever
}
}
}
Of course, considering the large size of some CSV files, you will need to chug along with a StringBuilder and likely use myStringBuilder.toString().split(regex, -1); for the parse method.
This is likely not the Spring way of doing things. But as Jim Garrison commented, this is an edge case that I'm not sure if Spring has ways of solving.
A more complex regex may be required if the records start using other nasty characters (commas, quotes, etc.). I don't know what the source of these records could be, but some sanitizing may be in order before splitting the file.
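As an alternative to growing the regex, a small character-by-character scanner can track whether it is inside quotes. This is a different technique from the regex split above, sketched under assumptions: the class and method names are hypothetical, the delimiter is fixed to a comma, a doubled quote inside a quoted field is taken as an escaped quote, and it is not the Spring Batch way of handling this.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareCsv {
    // Hypothetical helper: splits CSV text into records and fields,
    // keeping commas and line breaks that appear inside quoted fields.
    static List<List<String>> parse(String data) {
        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < data.length() && data.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // commas and line breaks are kept as data
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                current.add(field.toString());
                field.setLength(0);
            } else if (c == '\n' || c == '\r') {
                if (c == '\r' && i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                    i++;               // treat \r\n as one line break
                }
                current.add(field.toString());
                field.setLength(0);
                records.add(current);
                current = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // flush the last record if the data does not end with a line break
        if (field.length() > 0 || !current.isEmpty()) {
            current.add(field.toString());
            records.add(current);
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "\"BEEP\",\"BOOP\",\"TWO SHOTS\nOF VODKA\"\n\"BOOP\",\"BEEP\",\"LEMON\nWEDGES\"";
        for (List<String> record : parse(data)) {
            System.out.println("RECORD: " + record);
        }
    }
}
```

Because the whole input is scanned once, this also copes with commas inside quoted fields, which the plain `line.split(delimiter, -1)` step does not.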
I need to read a file completely, split the strings inside it, and store them in variables using Java.
See the example below; my text file contains:
devarajan 1000210 08754540275 600019
ramesh 1000210 08754540275 600019
udhay 1000210 08754540275 600019
I tried using string positions, but it is not working out.
Please find attached sample file as well. Regards
My Code:
public class Program {
public static void main(String[] args) {
String line = "devarajan 1000210 08754540275 600019 ";
String[] words = line.split("\\W+");
for (String word : words) {
System.out.println(word);
}
}
}
Output:
devarajan
1000210
08754540275
600019
My file will contain a list of strings: positions 10-10 will be the name, positions 20-30 the empid, and positions 30-40 the phone number. So when I used the previous snippet I got blank spaces, e.g. "devarajan" and " 1000210"; I need to avoid those blank spaces.
In turn my code is splitting up as soon as it encounters blank space, instead of position
@Twelve, @Kick: I am getting the output as follows for your snippet, but imagine I have a space in my name, e.g. "twelve dollar" instead of "twelvedollar". Then the name will be split and stored in different array positions. That is why I asked whether it is possible to split the string based on position.
Just one way to do it:
// try-with-resources closes the Scanner (and the file) automatically
try (Scanner inFile = new Scanner(new File("myInputFile.txt"))) {
    ArrayList<String[]> arr = new ArrayList<String[]>();
    while (inFile.hasNextLine()) {
        String[] data = inFile.nextLine().split("\\s+"); // or split("\t") if using tabs
        System.out.println(Arrays.toString(data));
        arr.add(data);
    }
}
catch (FileNotFoundException fe) {
    fe.printStackTrace();
}
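Splitting on whitespace breaks names that contain a space, which is the concern raised in the question. If the file really is fixed-width, the fields can be cut by position with substring instead. This is a rough sketch only: the class name, the helper, and the column start offsets (0, 20, 30, 42) are made up for illustration and need to be replaced with the real layout of the file.

```java
import java.util.Arrays;

class FixedWidthSplit {
    // Hypothetical helper: cuts `line` at the given start offsets and trims
    // the padding; the last field runs to the end of the line.
    static String[] splitByPosition(String line, int[] starts) {
        String[] fields = new String[starts.length];
        for (int i = 0; i < starts.length; i++) {
            int begin = Math.min(starts[i], line.length());
            int end = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            fields[i] = line.substring(begin, end).trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed layout: name at 0, empid at 20, phone at 30, last column at 42
        int[] starts = {0, 20, 30, 42};
        String line = String.format("%-20s%-10s%-12s%s",
                "twelve dollar", "1000210", "08754540275", "600019");
        System.out.println(Arrays.toString(splitByPosition(line, starts)));
    }
}
```

The Math.min calls keep short lines from throwing StringIndexOutOfBoundsException; missing columns simply come back as empty strings.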
Using jcsv I'm trying to parse a CSV to a specified type. When I parse it, it says length of the data param is 1. This is incorrect. I tried removing line breaks, but it still says 1. Am I just missing something in plain sight?
This is my input string (the csvString variable):
"Symbol","Last","Chg(%)","Vol",
INTC,23.90,1.06,28419200,
GE,26.83,0.19,22707700,
PFE,31.88,-0.03,17036200,
MRK,49.83,0.50,11565500,
T,35.41,0.37,11471300,
This is the Parser
public class BuySignalParser implements CSVEntryParser<BuySignal> {
@Override
public BuySignal parseEntry(String... data) {
// console says "Length 1"
System.out.println("Length " + data.length);
if (data.length != 4) {
throw new IllegalArgumentException("data is not a valid BuySignal record");
}
String symbol = data[0];
double last = Double.parseDouble(data[1]);
double change = Double.parseDouble(data[2]);
double volume = Double.parseDouble(data[3]);
return new BuySignal(symbol, last, change, volume);
}
}
And this is where I use the parser (right from the example)
CSVReader<BuySignal> cReader = new CSVReaderBuilder<BuySignal>(new StringReader( csvString)).entryParser(new BuySignalParser()).build();
List<BuySignal> signals = cReader.readAll();
jcsv allows different delimiter characters. The default is a semicolon. Use CSVStrategy.UK_DEFAULT to switch to commas.
Also, you have four commas, and that usually indicates five values. You might want to remove the delimiters off the end.
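The effect of the trailing delimiter can be seen with a plain String.split. This is only an illustration with the standard library, not a statement about how jcsv tokenizes a row internally, and the `stripTrailingComma` helper is hypothetical.

```java
import java.util.Arrays;

class TrailingDelimiterDemo {
    // Hypothetical helper: drops a single trailing comma, if present.
    static String stripTrailingComma(String line) {
        return line.endsWith(",") ? line.substring(0, line.length() - 1) : line;
    }

    public static void main(String[] args) {
        String row = "INTC,23.90,1.06,28419200,";
        // A limit of -1 keeps trailing empty strings, so the empty fifth value shows up
        String[] withTrailing = row.split(",", -1);
        String[] cleaned = stripTrailingComma(row).split(",", -1);
        System.out.println(withTrailing.length + " " + Arrays.toString(withTrailing));
        System.out.println(cleaned.length + " " + Arrays.toString(cleaned));
    }
}
```

With the trailing comma the row has five values, the last one empty; without it there are the four values that the `data.length != 4` check in the parser expects.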
I don't know how to make jcsv ignore the first line.
I typically use CSVHelper to parse CSV files, and while jcsv seems pretty good, here is how you would do it with CSVHelper:
Reader reader = new InputStreamReader(new FileInputStream("persons.csv"), "UTF-8");
//bring in the first line with the headers if you want them
List<String> firstRow = CSVHelper.parseLine(reader);
List<String> dataRow = CSVHelper.parseLine(reader);
while (dataRow!=null) {
// ...put your code here to construct your objects from the strings
dataRow = CSVHelper.parseLine(reader);
}
You shouldn't have commas at the end of lines. Generally there are cell delimiters (commas) and line delimiters (newlines). By placing commas at the end of the line it looks like the entire file is one long line.
I have a program that only needs to extract URLs from a text file and save them into another text file. The code calls ExtractHTML2.getURL2(url,input); which simply extracts the HTML code for a given link (this works correctly, so there is no need to include its code here).
EDIT: The code parses a number of pages. For each page, it saves the HTML code in a text file, then parses this text file to extract 10 links.
Now, the following code is supposed to parse the extracted HTML code and extract the URLs. This does not work for me. It does not extract anything.
CODE EDITED:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.*;
public class ExtractLinks2 {
public static void getLinks2(String url, int pages) throws IOException {
{
Document doc;
Element link;
String elementLink=null;
int linkId=1; //represent the Id of the href tag inside the HTML code
//The file that contains the extracted HTML code for the web page.
File input = new File
("extracted.txt");
//To write the extracted links
FileWriter fstream = new FileWriter
("links.txt");
BufferedWriter out = new BufferedWriter(fstream);
// Loop to traverse the pages
for (int z=1; z<=pages; z++)
{
/*get the HTML code for that page and save
it in input (extracted.txt)*/
ExtractHTML2.getURL2(url, input);
//Using parse function from JSoup library
doc = Jsoup.parse(input, "UTF-8");
//Loop for 10 times to extract 10 links per page
for(int e=1; e<=10; e++)
{
link = doc.getElementById("link-"+linkId); //the href tag Id
System.out.println("This is link no."+linkId);
elementLink=link.absUrl("href");
//write the extracted link to text file
out.write(elementLink);
out.write(","); //add a comma
linkId++;
} //end for loop
linkId=1; //reset the linkId
}//end for loop
out.close();
} //end the getLinks function
} //end IOExceptions
} //end ExtractDNs class
As I said, my program does not extract the URLs. I have doubts about my syntax for Jsoup.parse. According to http://jsoup.org/cookbook/input/load-document-from-file there is an optional third argument, which I left out because I think it is not needed in my case. I need to extract from a text file, not an HTML page.
My program is able to extract the text of the href tag if I write eURL = elem.text(); but I don't need the text, I need the URL itself. For example, if I have the following:
<a id="link-1" class="yschttl spt" href="/r/_ylt=A7x9QXi_UOlPrmgAYKpLBQx.;
_ylu=X3oDMTBzcG12Mm9lBHNlYwNzcgRwb3MDMTEEY29sbwNpcmQEdnRpZAM-/SIG=1329l4otf/
EXP=1340719423/**http%3a//www.which.co.uk/technology/computing/guides/how-to-buy
-the-best-laptop/" data-bk="5040.1">How to <b>buy</b> the best <b>laptop</b>
- <b>Laptop</b> <wbr />reviews - Computing ...</a>
I only need "www.which.co.uk" or even better "which.co.uk" if there is a way to do that.
Why does the above program not extract the URLs, and how can I correct the problem?
The problem was in this line:
link = doc.getElementById("link-"+linkId);
It should be:
link = doc.getElementById("link-" + Integer.toString(linkId));
Since linkId is an integer and getElementById takes a String as its parameter, I had to convert the id to a String first, so that the input for getElementById takes the form link-1, link-2, etc.
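The question also asks how to reduce the href value to just "www.which.co.uk" or "which.co.uk". One possible approach, using only the standard library, is sketched below. It rests on assumptions: that the real target URL follows the "**" marker as in the sample href, that it is percent-encoded, and that stripping a leading "www." is acceptable. The class, the helper name, and the shortened href in main are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

class HostFromHref {
    // Hypothetical helper: returns the host of the target URL embedded
    // after "**" in a redirect-style href, without a leading "www.".
    static String targetHost(String href) {
        int marker = href.indexOf("**");
        String target = marker >= 0 ? href.substring(marker + 2) : href;
        String decoded;
        try {
            decoded = URLDecoder.decode(target, "UTF-8"); // turns %3a back into ':'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        String host = URI.create(decoded).getHost();
        if (host == null) {
            return null; // relative or malformed URL, no host to report
        }
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        // Shortened, made-up variant of the href from the question
        String href = "/r/_ylt=abc/EXP=1340719423/**http%3a//www.which.co.uk/technology/"
                + "computing/guides/how-to-buy-the-best-laptop/";
        System.out.println(targetHost(href));
    }
}
```

With jsoup, the input would be the raw attribute value, for example link.attr("href"), rather than the hard-coded string used here.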