I have tens of thousands files in dir demotxt like:
demotxt/
aa.txt
this is aaa1
this is aaa2
this is aaa3
bb.txt
this is bbb1
this is bbb2
this is bbb3
this is bbb4
cc.txt
this is ccc1
this is ccc2
I would like to efficiently make a WordCount for each .txt in this dir with Spark2.4 (scala or python)
# target result is:
aa.txt: (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1)
bb.txt: (this,3), (is,3), (bbb1,1), (bbb2,1), (bbb3,1)
cc.txt: (this,3), (is,3), (ccc1,1), (ccc2,1), (ccc3,1)
code maybe like?
def dealWithOneFile(path2File):
res = wordcountFor(path2File)
saveResultToDB(res)
sc.wholeTextFile(rooDir).map(dealWithOneFile)
Seems using sc.textFile(".../demotxt/") spark will load all files which may cause memory issues,also it treats all files as one which is not expected.
So I wonder how should I do this? Many Thanks!
Here is an approach. Can be with DF or RDD. Here I show RDD using Databricks, as you do not state. Scala as well.
It's hard to explain, but works. Try some input.
%scala
val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name, $"value").as[(String, String)].rdd
val rdd2 = rdd.flatMap(x=>x._2.split("\\s+").map(y => ((x._1, y), 1)))
val rdd3 = rdd2.reduceByKey(_+_).map( { case (x, y) => (x._1, (x._2, y)) } )
rdd3.collect
val rdd4 = rdd3.groupByKey()
rdd4.collect
Related
We write our jenkins pipeline using groovy script. Is there any way to identify the folder size or file size.
Our goal is to identify size of two zip files and calculate the difference between them.
I tried below code but its not working.
stage('Calculate Opatch size')
{
def sampleDir = new File('${BuildPathPublishRoot}')
def sampleDirSize = sampleDir.directorySize()
echo sampleDirSize
}
Getting below error :-
hudson.remoting.ProxyException: groovy.lang.MissingMethodException: No signature of method: java.io.File.directorySize() is applicable for argument types: () values: []
Possible solutions: directorySize()
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:154)
Here's what worked for me. Grab all the files in a directory and sum the lengths.
Please note that you'll need to use quotes (") in order for string interpolation to work, i.e. "${BuildPathPublishRoot}" places the value of the BuildPathPublishRoot variable into the string, whereas '${BuildPathPublishRoot}' is taken literally to be the directory name.
workspaceSize = directorySize("${BuildPathPublishRoot}")
/** Computes bytes in the directory*/
public def directorySize(directory){
long bytes = 0
directory = (directory?:'').replace('\\','/')
directory = (directory =='') ? '' : (directory.endsWith('/') ? directory : "${directory}/")
def files=findFiles(glob: "${directory}*.*")
for (file in files) {
if (!file.isDirectory()){
bytes += file.length
}
}
return bytes
}
I am using below function to list the directories. It works in Azure databricks but when I am adding in IntelliJ project code, it is not able to resolve "union" keyword. Do I need import anything here?
def listLeafDirectories(path: String): Array[String] =
dbutils.fs.ls(path).map(file => {
// Work around double encoding bug
val path = file.path.replace("%25", "%").replace("%25", "%")
if (file.isDir) listLeafDirectories(path)
else Array[String](path.substring(0,path.lastIndexOf("/")+1))
}).reduceOption(_ union _).getOrElse(Array()).distinct
ADB Notebook successful execution:
Sorry, my mistake.
In the screenshot provided highlighting error, I have recurse flag which I was not passing to the function (in same function).
So below one worked:
def listDirectories(dir: String, recurse: Boolean): Array[String] = {
dbutils.fs.ls(dir).map(file => {
val path = file.path.replace("%25", "%").replace("%25", "%")
if (file.isDir) listDirectories(path,recurse)
else Array[String](path.substring(0, path.lastIndexOf("/")+1))
}).reduceOption(_ union _).getOrElse(Array()).distinct
}
I've run into an issue with attempting to parse json in my spark job. I'm using spark 1.1.0, json4s, and the Cassandra Spark Connector, with DSE 4.6. The exception thrown is:
org.json4s.package$MappingException: Can't find constructor for BrowserData org.json4s.reflect.ScalaSigReader$.readConstructor(ScalaSigReader.scala:27)
org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:108)
org.json4s.reflect.Reflector$ClassDescriptorBuilder$$anonfun$6.apply(Reflector.scala:98)
org.json4s.reflect.Reflector$ClassDescriptorBuilder$$anonfun$6.apply(Reflector.scala:95)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
My code looks like this:
case class BrowserData(navigatorObjectData: Option[NavigatorObjectData],
flash_version: Option[FlashVersion],
viewport: Option[Viewport],
performanceData: Option[PerformanceData])
.... other case classes
def parseJson(b: Option[String]): Option[String] = {
implicit val formats = DefaultFormats
for {
browserDataStr <- b
browserData = parse(browserDataStr).extract[BrowserData]
navObject <- browserData.navigatorObjectData
userAgent <- navObject.userAgent
} yield (userAgent)
}
def getJavascriptUa(rows: Iterable[com.datastax.spark.connector.CassandraRow]): Option[String] = {
implicit val formats = DefaultFormats
rows.collectFirst { case r if r.getStringOption("browser_data").isDefined =>
parseJson(r.getStringOption("browser_data"))
}.flatten
}
def getRequestUa(rows: Iterable[com.datastax.spark.connector.CassandraRow]): Option[String] = {
rows.collectFirst { case r if r.getStringOption("ua").isDefined =>
r.getStringOption("ua")
}.flatten
}
def checkUa(rows: Iterable[com.datastax.spark.connector.CassandraRow], sessionId: String): Option[Boolean] = {
for {
jsUa <- getJavascriptUa(rows)
reqUa <- getRequestUa(rows)
} yield (jsUa == reqUa)
}
def run(name: String) = {
val rdd = sc.cassandraTable("beehive", name).groupBy(r => r.getString("session_id"))
val counts = rdd.map(r => (checkUa(r._2, r._1)))
counts
}
I use :load to load the file into the REPL, and then call the run function. The failure is happening in the parseJson function, as far as I can tell. I've tried a variety of things to try to get this to work. From similar posts, I've made sure my case classes are in the top level in the file. I've tried compiling just the case class definitions into a jar, and including the jar in like this: /usr/bin/dse spark --jars case_classes.jar
I've tried adding them to the conf like this: sc.getConf.setJars(Seq("/home/ubuntu/case_classes.jar"))
And still the same error. Should I compile all of my code into a jar? Is this a spark issue or a JSON4s issue? Any help at all appreciated.
I'm trying to link classes from the JDK into the scaladoc-generated doc.
I've used the -doc-external-doc option of scaladoc 2.10.1 but without success.
I'm using -doc-external-doc:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar#http://docs.oracle.com/javase/7/docs/api/, but I get links such as index.html#java.io.File instead of index.html?java/io/File.html.
Seems like this option only works for scaladoc-generated doc.
Did I miss an option in scaladoc or should I fill a feature request?
I've configured sbt as follows:
scalacOptions in (Compile,doc) += "-doc-external-doc:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar#http://docs.oracle.com/javase/7/docs/api"
Note: I've seen the Opts.doc.externalAPI util in the upcoming sbt 0.13. I think a nice addition (not sure if it's possible) would be to pass a ModuleID instead of a File. The util would figure out which file corresponds to the ModuleID.
I use sbt 0.13.5.
There's no out-of-the-box way to have the feature of having Javadoc links inside scaladoc. And as my understanding goes, it's not sbt's fault, but the way scaladoc works. As Josh pointed out in his comment You should report to scaladoc.
There's however a workaround I came up with - postprocess the doc-generated scaladoc so the Java URLs get replaced to form proper Javadoc links.
The file scaladoc.sbt should be placed inside a sbt project and whenever doc task gets executed, the postprocessing via fixJavaLinksTask task kicks in.
NOTE There are lots of hardcoded paths so use it with caution (aka do the polishing however you see fit).
import scala.util.matching.Regex.Match
autoAPIMappings := true
// builds -doc-external-doc
apiMappings += (
file("/Library/Java/JavaVirtualMachines/jdk1.8.0_11.jdk/Contents/Home/jre/lib/rt.jar") ->
url("http://docs.oracle.com/javase/8/docs/api")
)
lazy val fixJavaLinksTask = taskKey[Unit](
"Fix Java links - replace #java.io.File with ?java/io/File.html"
)
fixJavaLinksTask := {
println("Fixing Java links")
val t = (target in (Compile, doc)).value
(t ** "*.html").get.filter(hasJavadocApiLink).foreach { f =>
println("fixing " + f)
val newContent = javadocApiLink.replaceAllIn(IO.read(f), fixJavaLinks)
IO.write(f, newContent)
}
}
val fixJavaLinks: Match => String = m =>
m.group(1) + "?" + m.group(2).replace(".", "/") + ".html"
val javadocApiLink = """\"(http://docs\.oracle\.com/javase/8/docs/api/index\.html)#([^"]*)\"""".r
def hasJavadocApiLink(f: File): Boolean = (javadocApiLink findFirstIn IO.read(f)).nonEmpty
fixJavaLinksTask <<= fixJavaLinksTask triggeredBy (doc in Compile)
I took the answer by #jacek-laskowski and modified it so that it avoid hard-coded strings and could be used for any number of Java libraries, not just the standard one.
Edit: the location of rt.jar is now determined from the runtime using sun.boot.class.path and does not have to be hard coded.
The only thing you need to modify is the map, which I have called externalJavadocMap in the following:
import scala.util.matching.Regex
import scala.util.matching.Regex.Match
val externalJavadocMap = Map(
"owlapi" -> "http://owlcs.github.io/owlapi/apidocs_4_0_2/index.html"
)
/*
* The rt.jar file is located in the path stored in the sun.boot.class.path system property.
* See the Oracle documentation at http://docs.oracle.com/javase/6/docs/technotes/tools/findingclasses.html.
*/
val rtJar: String = System.getProperty("sun.boot.class.path").split(java.io.File.pathSeparator).collectFirst {
case str: String if str.endsWith(java.io.File.separator + "rt.jar") => str
}.get // fail hard if not found
val javaApiUrl: String = "http://docs.oracle.com/javase/8/docs/api/index.html"
val allExternalJavadocLinks: Seq[String] = javaApiUrl +: externalJavadocMap.values.toSeq
def javadocLinkRegex(javadocURL: String): Regex = ("""\"(\Q""" + javadocURL + """\E)#([^"]*)\"""").r
def hasJavadocLink(f: File): Boolean = allExternalJavadocLinks exists {
javadocURL: String =>
(javadocLinkRegex(javadocURL) findFirstIn IO.read(f)).nonEmpty
}
val fixJavaLinks: Match => String = m =>
m.group(1) + "?" + m.group(2).replace(".", "/") + ".html"
/* You can print the classpath with `show compile:fullClasspath` in the SBT REPL.
* From that list you can find the name of the jar for the managed dependency.
*/
lazy val documentationSettings = Seq(
apiMappings ++= {
// Lookup the path to jar from the classpath
val classpath = (fullClasspath in Compile).value
def findJar(nameBeginsWith: String): File = {
classpath.find { attributed: Attributed[File] => (attributed.data ** s"$nameBeginsWith*.jar").get.nonEmpty }.get.data // fail hard if not found
}
// Define external documentation paths
(externalJavadocMap map {
case (name, javadocURL) => findJar(name) -> url(javadocURL)
}) + (file(rtJar) -> url(javaApiUrl))
},
// Override the task to fix the links to JavaDoc
doc in Compile <<= (doc in Compile) map {
target: File =>
(target ** "*.html").get.filter(hasJavadocLink).foreach { f =>
//println(s"Fixing $f.")
val newContent: String = allExternalJavadocLinks.foldLeft(IO.read(f)) {
case (oldContent: String, javadocURL: String) =>
javadocLinkRegex(javadocURL).replaceAllIn(oldContent, fixJavaLinks)
}
IO.write(f, newContent)
}
target
}
)
I am using SBT 0.13.8.
I'm finishing a business card production flow (excel > xml > indesign > single page pdfs) and I would like to insert the employees' names in the filenames.
What I have now:
BusinessCard_01_Blue.pdf
BusinessCard_02_Blue.pdf
BusinessCard_03_Blue.pdf (they are gonna go up to the hundreds)
What I need (I can manipulate the name list with regex easily):
BusinessCard_01_CarlosJorgeSantos_Blue.pdf
BusinessCard_02_TaniaMartins_Blue.pdf
BusinessCard_03_MarciaLima_Blue.pdf
I'm a Java and Python toddler. I've read the related questions, tried this in Automator (Mac) and Name Mangler, but couldn't get it to work.
Thanks in advance,
Gus
Granted you have a map where to look at the right name you could do something like this in Java:
List<Files> originalFiles = ...
for( File f : originalFiles ) {
f.renameTo( new File( getNameFor( f ) ) );
}
And define the getNameFor to something like:
public String getNameFor( File f ) {
Map<String,String> namesMap = ...
return namesMap.get( f.getName() );
}
In the map you'll have the associations:
BusinessCard_01_Blue.pdf => BusinessCard_01_CarlosJorgeSantos_Blue.pdf
Does it make sense?
In Python (tested):
#!/usr/bin/python
import sys, os, shutil, re
try:
pdfpath = sys.argv[1]
except IndexError:
pdfpath = os.curdir
employees = {1:'Bob', 2:'Joe', 3:'Sara'} # emp_id:'name'
files = [f for f in os.listdir(pdfpath) if re.match("BusinessCard_[0-9]+_Blue.pdf", f)]
idnumbers = [int(re.search("[0-9]+", f).group(0)) for f in files]
filenamemap = zip(files, [employees[i] for i in idnumbers])
newfiles = [re.sub('Blue.pdf', e + '_Blue.pdf', f) for f, e in filenamemap]
for old, new in zip(files, newfiles):
shutil.move(os.path.join(pdfpath, old), os.path.join(pdfpath, new))
EDIT: This now alters only those files that have not yet been altered.
Let me know if you want something that will build the the employees dictionary automatically.
If you have a list of names in the same order the files are produced, in Python it goes like this untested fragment:
#!/usr/bin/python
import os
f = open('list.txt', 'r')
for n, name in enumerate(f):
original_name = 'BusinessCard_%02d_Blue.pdf' % (n + 1)
new_name = 'BusinessCard_%02d_%s_Blue.pdf' % (
n, ''.join(name.title().split()))
if os.path.isfile(original_name):
print "Renaming %s to %s" % (original_name, new_name),
os.rename(original_name, new_name)
print "OK!"
else:
print "File %s not found." % original_name
Python:
Assuming you have implemented the naming logic already:
for f in os.listdir(<directory>):
try:
os.rename(f, new_name(f.name))
except OSError:
# fail
You will, of course, need to write a function new_name which takes the string "BusinessCard_01_Blue.pdf" and returns the string "BusinessCard_01_CarlosJorgeSantos_Blue.pdf".