I wrote a function to read text from a PDF document, using Scala, Selenium, and PDFBox 2.0.1.
Here is the code:
import org.openqa.selenium.firefox.{FirefoxBinary, FirefoxDriver, FirefoxProfile}
import org.apache.pdfbox.pdfparser.PDFParser
import org.apache.pdfbox.text.PDFTextStripper
import java.io.BufferedInputStream
def pdfreaddata {
  driver.get("https://www.....pdf")
  driver.manage.timeouts.implicitlyWait(50, TimeUnit.SECONDS)
  val url: URL = new URL(driver.getCurrentUrl)
  println(url)
  val fileToParse: BufferedInputStream = new BufferedInputStream(url.openStream())
  val parser: PDFParser = new PDFParser(fileToParse)
  parser.parse()
  val output: String = new PDFTextStripper().getText(parser.getPDDocument)
  println("pdf Value" + output)
  parser.getPDDocument.close()
  driver.manage.timeouts.implicitlyWait(100, TimeUnit.SECONDS)
}
The compiler flags PDFParser on this line:
val parser: PDFParser = new PDFParser(fileToParse)
Error message:
Cannot resolve constructor
I tried the same code in Java and got the same error.
You are using PDFBox 2.x, but you are evidently following the documentation for version 1.x. In 2.0 there is no such constructor; several things changed between the versions, including parsing. Either follow the migration guide or fall back to 1.8, which appears to be better documented and has more material online.
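As a rough sketch of what the 2.0 route might look like, assuming PDFBox 2.0.x is on the classpath: PDDocument.load accepts an InputStream directly, so PDFParser does not have to be constructed by hand. The PdfText object, the extractText name, and its parameter are made up for illustration and are not part of the original code.

```scala
import java.io.{BufferedInputStream, InputStream}
import java.net.URL

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

object PdfText {
  /** Downloads the PDF at `pdfUrl` (hypothetical parameter) and returns its text. */
  def extractText(pdfUrl: String): String = {
    val in: InputStream = new BufferedInputStream(new URL(pdfUrl).openStream())
    try {
      // PDFBox 2.0.x: load the document straight from the stream
      val document = PDDocument.load(in)
      try new PDFTextStripper().getText(document)
      finally document.close() // always release the document
    } finally in.close()
  }
}
```

With Selenium, the argument would be driver.getCurrentUrl, as in the question.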
Using PDFBox 1.8.12 solved the constructor issue. However, even though the PDF was not password protected, it was reported as encrypted. Below is the final Scala code, which decrypts the document with an empty owner password before extracting the text. It might be useful to someone in the future.
def pdfreaddata {
  driver.get("https://www....combo.pdf")
  driver.manage.timeouts.implicitlyWait(50, TimeUnit.SECONDS)
  val url: URL = new URL(driver.getCurrentUrl)
  println(url)
  val fileToParse: BufferedInputStream = new BufferedInputStream(url.openStream())
  val parser: PDFParser = new PDFParser(fileToParse)
  parser.parse()
  val cosDocument: COSDocument = parser.getDocument()
  val pdDocument: PDDocument = new PDDocument(cosDocument)
  // The PDF reports itself as encrypted even without a user password,
  // so decrypt it with the owner password (PDF_OWNER_PASSWORD = "").
  if (pdDocument.isEncrypted()) {
    val sdm: StandardDecryptionMaterial = new StandardDecryptionMaterial(PDF_OWNER_PASSWORD)
    pdDocument.openProtection(sdm)
  }
  val output: String = new PDFTextStripper().getText(pdDocument)
  println("pdf Value" + output)
  // Close the document the text was actually read from.
  pdDocument.close()
  driver.manage.timeouts.implicitlyWait(100, TimeUnit.SECONDS)
}
I have a method that already receives a FilePart[TemporaryFile], and I need to pass it on to another method that sends multipart form data. The code uses Scala Play 2.4.x, and I have to send the request with the ning client, as in the method below:
def sendFile(file: FilePart[TemporaryFile]): Option[Future[Unit]] = {
  val asyncHttpClient: AsyncHttpClient = WS.client.underlying
  val postBuilder = asyncHttpClient.preparePost(s"${config.ocrProvider.host}")
  val multiPartPost = postBuilder
    .addBodyPart(new StringPart("access_token", s"${config.ocrProvider.accessToken}"))
    .addBodyPart(new StringPart("typename", s"${config.ocrProvider.typeName}"))
    .addBodyPart(new StringPart("action", s"${config.ocrProvider.actionUpload}"))
    .addBodyPart(new FilePart(**expects java.io.File not FilePart**)
}
How can I use this parameter and send it as a java.io.File?
You need to write the contents of file: FilePart[TemporaryFile] to disk and then use that file to build the new multipart request. See the "Scala File Upload" example. In short:
val tempFile = new File("/tmp/some/path")
file.ref.moveTo(tempFile)
val filePart = new FilePart(tempFile)
I am trying to send a jar file through a socket, but when I try to open the jar file on the other side, it seems to be corrupted. How can I fix this?
I'm aware that Scala uses Java sockets, but I don't quite understand the answers to questions from people who had the same problem strictly in Java.
Here is my code:
Server:
object server extends App {
  import java.net._
  import java.io._
  import scala.io._
  import scala.io.Source
  val server = new ServerSocket(9999)
  //Master should ping the slave actor to request for jar file
  while (true) {
    val s = server.accept()
    val in = new BufferedSource(s.getInputStream()).getLines()
    val out = new PrintStream(s.getOutputStream())
    val filename = "mapReduce.jar"
    for (line <- Source.fromFile(filename, "ISO-8859-1").getLines) {
      out.println(line)
      // println(line)
    }
    out.flush()
    s.close()
  }
}
Along with the Client:
object client extends App {
  import java.net._
  import java.io._
  import scala.io._
  import java.util.jar._
  val s = new Socket(InetAddress.getByName("localhost"), 9999)
  lazy val in = new BufferedSource(s.getInputStream()).getLines()
  val out = new PrintStream(s.getOutputStream())
  out.println("Give me the jar file!")
  out.flush()
  val file = new File("testmapReduce.jar")
  val bw = new BufferedWriter(new FileWriter(file))
  while (in.hasNext) {
    val buf = in.next()
    bw.write(buf)
    // println(buf)
  }
  s.close()
  bw.close()
  println("Done!")
  val jar = new JarFile(file) //this part fails
}
Source, PrintStream, and similar classes are intended for text, not binary data. They convert the data on both read and write according to the character set in use ("ISO-8859-1" in your case). In addition, getLines strips the line terminators, and your client never writes them back, so bytes are lost along the way.
Do not use them to read or write binary data.
If you just need to send bytes, don't bother interpreting them:
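// `s` is the accepted socket and `filename` the jar path from the question
val f = new FileInputStream(filename)
val bos = new BufferedOutputStream(s.getOutputStream())
// Copy raw bytes until read() signals end of file with -1
Stream.continually(f.read).takeWhile(_ != -1).foreach(b => bos.write(b))
f.close()
bos.close()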
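For a fuller picture, here is a minimal sketch of both sides of a byte-for-byte transfer, using only the JDK. It is one possible arrangement, not the only one: the BinaryTransfer object and its copy, serveOnce, and fetch helpers are invented for this illustration, and the "Give me the jar file!" request line from the question is left out, so the server simply starts sending as soon as a client connects.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}
import java.net.{InetAddress, ServerSocket, Socket}
import java.nio.file.{Files, Path}

object BinaryTransfer {
  /** Copies every byte from `in` to `out` unchanged and returns the byte count. */
  def copy(in: InputStream, out: OutputStream, bufferSize: Int = 8192): Long = {
    val buffer = new Array[Byte](bufferSize)
    var total = 0L
    var n = in.read(buffer)
    while (n != -1) {            // read returns -1 at end of stream
      out.write(buffer, 0, n)    // write only the bytes actually read
      total += n
      n = in.read(buffer)
    }
    out.flush()
    total
  }

  /** Accepts one connection and streams the file at `source` to it as raw bytes. */
  def serveOnce(server: ServerSocket, source: Path): Unit = {
    val socket = server.accept()
    try {
      val in = new BufferedInputStream(Files.newInputStream(source))
      try copy(in, new BufferedOutputStream(socket.getOutputStream))
      finally in.close()
    } finally socket.close()     // closing the socket tells the client we are done
  }

  /** Connects to a local server on `port` and writes everything it sends to `target`. */
  def fetch(port: Int, target: Path): Long = {
    val socket = new Socket(InetAddress.getLoopbackAddress, port)
    try {
      val out = new BufferedOutputStream(Files.newOutputStream(target))
      try copy(socket.getInputStream, out)
      finally out.close()
    } finally socket.close()
  }
}
```

No Reader, Writer, Source, or PrintStream appears anywhere on the path, so no charset conversion or line splitting can alter the bytes, and the received file should be identical to the one sent.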
I am fiddling with a stock ticker app that uses Google's service: I read their page and parse the XML. I can iterate through the XML, but the problem is that Google puts the actual information in an attribute of the tag. For the latest price, for example, I get as far as <last data="30.32" />, but I cannot read the data attribute itself. I tried using @data as the Groovy API describes, but it just comes back blank. Here is my code:
def stockTicket(params) {
    def BASE_URL = "http://www.google.com/ig/api?stock="+params.url
    def stock_url = BASE_URL
    def url = stock_url.toURL().text
    stock_url = urlMaker(stock_url)
    def slurper = new XmlSlurper()
    BufferedReader br = new BufferedReader(new InputStreamReader(stock_url.openStream()))
    String strTemp = ""
    strTemp = br.readLine()
    def records = new XmlSlurper().parseText(url)
    render records.xml_api_reply.finance.last.@data.text()
}
You just need:
records.finance.last.@data
The slurper already points to the root node (xml_api_reply), so the root must not appear in the path.
I'm a Scala/Java noob, so sorry if this has a relatively easy solution, but I'm trying to load a model from an external file (an Apache OpenNLP model) and I'm not sure where I'm going wrong. I have seen how you'd do it in Java; here is what I'm trying in Scala:
import java.io._
val nlpModelPath = new java.io.File( "." ).getCanonicalPath + "/lib/models/en-sent.bin"
val modelIn: InputStream = new FileInputStream(nlpModelPath)
which works fine, but trying to instantiate an object based off the model in that binary file is where I'm failing:
val sentenceModel = new modelIn.SentenceModel // type SentenceModel is not a member of java.io.InputStream
val sentenceModel = new modelIn("SentenceModel") // not found: type modelIn
I've also tried a DataInputStream:
val file = new File(nlpModelPath)
val dis = new DataInputStream(file)
val sentenceModel = dis.SentenceModel() // value SentenceModel is not a member of java.io.DataInputStream
I'm not sure what I'm missing--maybe some method to convert the Stream to some binary object from which I can pull in methods? Thank you for any pointers.
The problem is that you're using the wrong syntax (please don't take it personally, but why not read a beginner Java book, or even just a tutorial, first if you're planning to stick with Java or Scala for some time?).
The code you would write in Java:
SentenceModel model = new SentenceModel(modelIn);
looks similar in Scala:
val model: SentenceModel = new SentenceModel(modelIn)
// or just
val model = new SentenceModel(modelIn)
The problem you will then hit with this syntax is that you forgot to import SentenceModel, so the compiler has no idea what SentenceModel is.
Add
import opennlp.tools.sentdetect.SentenceModel
at the top of your .scala file, and this will fix it.
I am evaluating WebSphere MQ 7. I am traditionally a TibRV guy. One thing I do not like is that the IBM Java client libs require C++ libs in order to run. Is there any way to run the IBM Java client libs without the C++ libs? In other words, is there a pure Java client library for MQ?
I have previously written a JMS client to MQSeries v6 (not your version, I know) without needing to install native libs. The only IBM libraries I required were titled:
com.ibm.mq-6.jar
com.ibm.mqbind.jar
com.ibm.mqjms-6.jar
According to this post they come with the client install. I assume you can install it once, then re-use the jars (any licensing issues and expert opinions aside).
EDIT: In response to your comment, here's the client code I hacked up. It is for reading messages from a queue and blatting them to files. It's written in Scala. I hope it helps somewhat.
import com.ibm.mq._
import java.text._
import java.io._

case class QueueDetails(hostname: String, channel: String,
  port: Int, queueManager: String, queue: String)

class Reader(details: QueueDetails) {
  def read = {
    MQEnvironment.hostname = details.hostname
    MQEnvironment.channel = details.channel
    MQEnvironment.port = details.port
    val props = new java.util.Hashtable[String, String]
    props.put(MQC.TRANSPORT_PROPERTY, MQC.TRANSPORT_MQSERIES)
    MQEnvironment.properties = props
    val qm = new MQQueueManager(details.queueManager)
    val options = MQC.MQOO_INPUT_AS_Q_DEF | MQC.MQOO_INQUIRE
    val q = qm.accessQueue(details.queue, options, null, null, null)
    val depth = q.getCurrentDepth
    val indexFormat = new DecimalFormat(depth.toString.replaceAll(".", "0"))
    def exportMessage(index: Int): Unit = {
      if (index < depth) {
        val msg = new MQMessage
        q.get(msg, new MQGetMessageOptions)
        val msgLength = msg.getMessageLength
        val text = msg.readStringOfByteLength(msgLength)
        val file = new File("message_%s.txt".format(indexFormat.format(index)))
        val writer = new BufferedWriter(new FileWriter(file))
        writer.write(text)
        writer.close
        println(file.getAbsolutePath)
        exportMessage(index + 1)
      }
    }
    exportMessage(0)
    q.close
    qm.disconnect
  }
}
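One detail in the code above that may look odd is the indexFormat line. Because replaceAll treats "." as a regular expression matching any character, it turns the queue depth into a pattern of zeros of the same width, which DecimalFormat then uses to zero-pad each message index. A small standalone sketch of just that trick, with the PaddedIndex object and its method names invented for illustration:

```scala
import java.text.DecimalFormat

object PaddedIndex {
  /** Builds a pattern of zeros as wide as `depth` in decimal, e.g. 120 becomes "000". */
  def patternFor(depth: Int): String =
    depth.toString.replaceAll(".", "0") // "." is a regex here: every character becomes '0'

  /** Formats `index` with leading zeros so that it is as wide as `depth`. */
  def format(index: Int, depth: Int): String =
    new DecimalFormat(patternFor(depth)).format(index.toLong)
}
```

For a queue holding 120 messages, index 7 would come out as 007, so the exported file names have a uniform width and sort in message order.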