Calling a Java class with PHP's exec function - java

I need to split a text into sentences, and I am trying to use Stanford CoreNLP. I have downloaded the library. Since it's a Java library, I am using PHP's exec function (see below) to call it. My PHP script works well and I can parse a text into sentences. Currently, the script needs an input file to parse. My question is whether I can use a PHP string variable instead of an input .txt file. That would be very convenient, since I will be retrieving the text from a MySQL database. If it is not possible, I would need to create a corresponding text file for the command-line input. Any feedback will be greatly appreciated.
Here is my small PHP script:
$text = "Maria and Ted Bobola grow sweet corn. But with little corn to harvest because of a plant-withering drought, the Bobolas were forced to buy corn from Georgia to supply their produce and flower shop just outside Dover. And most years, the strawberry picking season runs three to five weeks, but not this year, store manager Dee Chambers said. ``It was so hot and dry that even though we irrigated, we had only two weeks in the season,'' she said. They did not recoup $70,000 they paid for strawberry plants last year in this year's harvest.";
$parser = "stanford-corenlp-3.5.0.jar";
$class = "edu.stanford.nlp.process.DocumentPreprocessor";
$input = "sample.txt";
$output = "output.txt";
if (exec( "java -cp $parser -Xmx2g $class -file $input", $result))
{
echo "success";
}
else
{
echo "failure";
}
// Optional
echo '<pre>' . print_r($result, true);
What I am trying to do here is replace $input (i.e. the .txt file) with $text (i.e. a PHP variable).

Related

Java call from Python without loading classpath

I am making a Java jar file call from Python.
def extract_words(file_path):
    """
    Extract words and bounding boxes
    Arguments:
        file_path {[str]} -- [Input file path]
    Returns:
        [Document]
    """
    extractor = PDFBoxExtractor(file_path=file_path, jar_path="external/pdfbox-app-2.0.15.jar", class_path="external")
    document = extractor.run()
    return document
And somewhere:
pipe = subprocess.Popen(['java',
                         '-cp',
                         '.:%s:%s' % (self._jar_path, self._class_path),
                         'PrintTextLocations',
                         self._file_path],
                        stdout=subprocess.PIPE)
output = pipe.communicate()[0].decode()
This is working fine. But the problem is the jar is heavy and when I have to call this multiple times in a loop, it takes 3-4 seconds to load the jar file each time. If I run this in a loop for 100 iterations, it adds 300-400 seconds to the process.
Is there any way to keep the classpath alive for Java and not load the jar file every time? What's the most time-efficient way to do this?
You can encapsulate your PDFBoxExtractor in a class by making it a class member. Initialize the PDFBoxExtractor in the constructor of the class, like below:
class WordExtractor:
    def __init__(self):
        # Build the extractor once so the same instance is reused on every call.
        # Note: file_path is not defined in this scope as written; it has to be supplied here.
        self.extractor = PDFBoxExtractor(file_path=file_path, jar_path="external/pdfbox-app-2.0.15.jar", class_path="external")

    def extract_words(self, file_path):
        """
        Extract words and bounding boxes
        Arguments:
            file_path {[str]} -- [Input file path]
        Returns:
            [Document]
        """
        document = self.extractor.run()
        return document
The next step is to create an instance of the WordExtractor class outside the loop.
word_extractor = WordExtractor()
# your loop would go here
while True:
    document = word_extractor.extract_words(file_path)
This is just example code to explain the concept. You may tweak it the way you want as per your requirement.
Hope this helps !

IP as Linux array element throws UnknownHostException but as constant works

I have the following script in the directory /home/test/javacall. It parses a CSV of IP pairs and invokes a shell script that calls an executable jar to get output from these IPs.
In the code below, ip1=${IPArray[0]} causes Java to throw an UnknownHostException.
But if I use the IP directly, as in ip1="10.10.10.10", the Java code works fine. I printed the value with System.out.println from Java and got the same IP displayed in both cases. Only in the ip1=${IPArray[0]} case do I get the exception.
#!/bin/bash
INPUT="IPPairs.csv"
array=()
while IFS="," read var1 var2 ; do
echo $var1 $var2
pairString="$var1***$var2"
array+=("$pairString")
done < $INPUT
for i in "${array[@]}" ; do
echo $i
IPString=$(echo $i | tr '***' ' ')
read -ra IPArray <<< "$IPString"
ip1=${IPArray[0]}
#ip1="10.10.10.10"
ip2=${IPArray[1]}
source /home/test/javacall/javacmd.sh "$ip1" "/home/test/javacall/out.txt" "show running-config all-properties"
done
Exception:
com.jcraft.jsch.JSchException: java.net.UnknownHostException: 10.10.10.10
at com.jcraft.jsch.Util.createSocket(Util.java:349)
at com.jcraft.jsch.Session.connect(Session.java:215)
at com.jcraft.jsch.Session.connect(Session.java:183)
That string (357\273\277) indicates that your csv file is encoded with a Byte-Order Mark (BOM) at the front of the file. The read command is not interpreting the BOM as having special meaning, just passing on the raw characters, so you see them as part of your output.
Since you didn't indicate how your source file is generated, you may be able to adjust the settings on that end to prevent writing the BOM, which is optional in many cases. Alternatively, you can work around it various ways on the script side. These questions both offer some examples:
How can I remove the BOM from a UTF-8 file?
Cygwin command not found bad characters found in .bashrc 357\273\277
But honestly, if you just follow Charles Duffy's advice and run your file through dos2unix before parsing it, it should clean this up for you automatically. i.e.:
...
array=()
dos2unix $INPUT
while IFS="," read var1 var2 ; do
...
Or, building on Charles' version:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4.0+ needed" >&2; exit 1;; esac
INPUT="IPPairs.csv"
declare -A pairs=( )
dos2unix $INPUT
while IFS=$',\r' read -r var1 var2 _ ; do
pairs[$var1]=$var2
done <"$INPUT"
for ip1 in "${!pairs[@]}"; do
ip2=${pairs[$ip1]}
# Using printf %q causes nonprintable characters to be visibly shown
printf 'Processing pair: %q and %q\n' "$ip1" "$ip2" >&2
done
Do note that running dos2unix in your script is not necessarily the best approach, as the file only needs to be converted once. Generally speaking, it shouldn't hurt anything, especially with such a small file. Nonetheless, a better approach would be to run dos2unix as part of whatever process pushes your csv to the server, and keep it out of this script.
System.out.println() only shows visible characters.
If your input file contains DOS newlines, System.out.println() won't show them, but they'll still be present in your command line, and parsed as part of the IP address to connect to, causing an UnknownHostException. Converting it to a UNIX text file, as with dos2unix, or using :set fileformat=unix in vim, is typically the quickest way to fix this.
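To make the invisible characters concrete, here is a minimal Python sketch (not part of the original script). The sample line and the helper name parse_pair are made up for illustration; the only facts taken from above are that octal 357 273 277 is the UTF-8 BOM and that a DOS newline is CR followed by LF.

```python
# Hypothetical first line of a BOM-prefixed, CRLF-terminated CSV file.
# Octal 357 273 277 is the byte sequence EF BB BF, i.e. the UTF-8 BOM.
line = b"\xef\xbb\xbf10.10.10.10,10.10.10.11\r\n"


def parse_pair(raw_line):
    """Decode one CSV line and return (ip1, ip2) without BOM or CR/LF."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = raw_line.decode("utf-8-sig")
    # Strip the CR/LF line ending before splitting on the comma.
    ip1, ip2 = text.rstrip("\r\n").split(",")
    return ip1, ip2


# A naive decode keeps the BOM on the first field and the CR/LF on the last.
naive = line.decode("utf-8").split(",")

# repr() plays the same role as printf %q: it makes hidden characters visible.
print(repr(naive[0]), repr(naive[1]))
print(parse_pair(line))
```

Printing the naive fields with a plain print would look like ordinary IP addresses, which is why the println check in Java showed nothing wrong.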
BTW, if you don't need ordering retained, an associative array is typically a more appropriate data structure to use to store pairs:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4.0+ needed" >&2; exit 1;; esac
declare -A pairs=( )
while IFS=$',\r' read -r var1 var2 _ ; do
pairs[$var1]=$var2
done <"$input"
for ip1 in "${!pairs[@]}"; do
ip2=${pairs[$ip1]}
# Using printf %q causes nonprintable characters to be visibly shown
printf 'Processing pair: %q and %q\n' "$ip1" "$ip2" >&2
done
In the above, using IFS=$',\r' prevents CR characters (from the "CRLF" sequence that makes up a DOS newline) from becoming part of either var1 or var2. (Adding an _ placeholder variable to consume any additional content on a given line of the file adds extra insurance on this point.)

How to specify the column coordinates in tabula command line

I want to extract table data from a PDF, and I am using the command below to do so:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf
But with this command, the data from two columns gets mixed together in some rows, so I want to specify column coordinates to get clean output. I don't know how to find the column coordinates, so any guidance on the right command would be helpful.
Thanks in advance!
You can specify the column coordinates using the -c or --columns parameter. The coordinates you specify will be the coordinates of the delineators between columns. So if one column goes from 10.5 to 13.5 and the next column goes from 13.5 to 17.5 then you only list 13.5. You will also need to turn guess off. You didn't provide an example pdf so I can't provide you with the correct coordinates but your command would look something like this:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf -g False
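As a rough illustration of the "only list the delineators" rule, here is a small Python sketch. The helper names and the column extents are hypothetical (the extents are the 10.5/13.5/17.5 example from the answer), and it assumes adjacent columns share an edge. It only builds the value to pass to -c; it does not call tabula.

```python
def column_separators(column_ranges):
    """Return the x coordinates where one column ends and the next begins.

    column_ranges is a list of (left, right) extents, assumed to be
    adjacent so that each right edge equals the next column's left edge.
    """
    ordered = sorted(column_ranges)
    # Every column except the last contributes its right edge as a separator.
    return [right for _, right in ordered[:-1]]


def columns_argument(column_ranges):
    """Format the separators as a comma-separated value for -c/--columns."""
    return ",".join(str(x) for x in column_separators(column_ranges))


# Hypothetical extents: one column from 10.5 to 13.5, the next from 13.5 to 17.5.
ranges = [(10.5, 13.5), (13.5, 17.5)]
print(columns_argument(ranges))  # prints 13.5
```

The outer edges (10.5 and 17.5 here) are left out, since the area given with -a already bounds the table.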
You can read more about the different options for getting your command just right from the help command:
$ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
       <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
       [-s <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire
                            page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of
                            processing.
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.

How does the following code in Pig (in hadoop ) work using Java regular expressions?

I have a CSV file containing the following data:
396124436476092000,Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse,Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
I have written the following code in PigLatin to input data to alias B using delimiters in REGEX_EXTRACT_ALL. This command outputs all data represented by (.*)
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
I want to know how the regex function works with the expression
'(.*)[,”:-](.*)[“,:-](.*)'
to split data into the schema (tweetid,msg,userid)
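One way to see what the expression does is to try it on a short line. The sketch below uses Python's re module rather than Pig, on the assumption that greedy (.*) groups and these character classes behave the same way in Java regular expressions, and that the whole line has to match. The sample line and the helper name split_line are made up for illustration.

```python
import re

# Same expression as in the Pig script. Each [...] matches exactly one
# delimiter character: a comma, a curly quote, a colon, or a hyphen.
PATTERN = re.compile(r'(.*)[,”:-](.*)[“,:-](.*)')


def split_line(line):
    """Return the three captured groups, or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


# Hypothetical line with three commas, like the first row of the CSV.
sample = "1,first part, second part,user"
print(split_line(sample))  # prints ('1,first part', ' second part', 'user')
```

Because each (.*) is greedy, the first group grabs as much as it can while still leaving two delimiters for the rest of the pattern. On a line whose message itself contains a comma, the first group therefore ends up holding the id plus the start of the message, which would not convert cleanly to a long.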

space in path using java

Hi, I have a big problem. I'm writing a Java program and I have to call an exe file in a folder whose path contains whitespace. The program also takes 2 arguments whose paths always contain whitespace.
Example:
C:\Users\Program File\convert image\convert.exe C:\users\image exe\image.jpeg C:\Users\out put\out.bmp
I have to do this on Windows, but I want to generalize it for every OS.
My code is:
Runtime run = Runtime.getRuntime();
String path_current = System.getProperty("user.dir");
String [] uno = new String[]{"cmd","/c",path_current+"\\\convert\\\convert.exe",path_current+"\\\f.jpeg", path_current+"\\\fr.bmp"};
Process proc2 = run.exec(uno);
proc2.waitFor();
This does not work. I tried removing the String array and inserting a simple String with "\"" before and after the path but that didn't work. How do I resolve this?
You may want to use:
http://commons.apache.org/io/api-1.4/org/apache/commons/io/FilenameUtils.html#separatorsToSystem(java.lang.String)
See also this answer:
Is there a Java utility which will convert a String path to use the correct File separator char?
Remove "cmd" and "/c", and use a single forward slash instead of your triple backslashes.
