I have an HTTP request in a thread group that reads from a single-column CSV file to get values to populate a parameter in the request URL.
Below is my configuration for these:
There are 30 values in the csv data file.
My goal is to have each thread start at the beginning of the file once it gets to the end, effectively infinitely looping through the data values until the scheduler duration expires.
However, what actually happens is that some requests end up with an unusable value (see screenshot below) and therefore fail.
I have tried this, but that just stops at the 30th iteration, i.e. the end of the CSV data file.
I assume I have some config option(s) wrong, but I can't find anything online to suggest what they might be. Can anyone point me in the right direction (what should I be searching for?) or provide a solution?
Most probably it's a test data issue. Double-check your CSV file and make sure it doesn't contain empty lines; if there are any, remove them and your test should start working as expected.
For small files with only one column you can use the __StringFromFile() function instead - it's much easier to set up and use.
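For example, assuming the CSV is called testdata.csv and sits next to the test plan (both assumptions), the HTTP Request path could call the function directly; it reads the next line on each call and starts again from the beginning of the file once it reaches the end:

    /some/endpoint?id=${__StringFromFile(testdata.csv)}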
This is what I am doing.
I am using a While Controller to iterate over a CSV data file. I set "Stop thread on EOF" to true in the CSV Data Set Config element because I want to read all the data in the file.
Then I'm using the extracted data to make two HTTP requests and compare the responses to see if there are any differences (these are two SOAP requests using the same request body but reading from two different databases). I am using a BeanShell Assertion to compare the responses: if there is no difference I use prev.setSuccessful(true); and get a green light; if there is some difference I use prev.setSuccessful(false); and get a red light in my results tree.
Done this way, as soon as I find a difference my iteration stops, but I would like to continue until I have read all the data in the CSV file, while still getting a red light so I can easily check where I have errors. My CSV file contains thousands of records and I want to make my HTTP requests with all of them. Is that possible even if I have a failed assertion?
This is my project tree.
Thank you!
By default JMeter doesn't break any loops and should just continue in case of error. Check the jmeter.log file; it should contain the reason for stopping the test.
So double-check your Thread Group configuration and make sure that "Action to be taken after a Sampler error" is set to Continue.
Also be aware that using Beanshell is a form of performance anti-pattern; since JMeter 3.1 you're supposed to be using JSR223 test elements and the Groovy language for scripting, so consider migrating to the JSR223 Assertion.
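For reference, here is a minimal JSR223 Assertion sketch (Groovy-compatible syntax); the variable names firstResponse and csvKey are assumptions and would need to be populated earlier in the iteration, e.g. by a JSR223 PostProcessor on the first request:

    // Compare the previously stored response with the current sampler's response.
    String first = vars.get("firstResponse");
    String second = SampleResult.getResponseDataAsString();
    if (first == null || !first.equals(second)) {
        AssertionResult.setFailure(true);   // marks this sampler red without stopping the thread
        AssertionResult.setFailureMessage("Responses differ for record " + vars.get("csvKey"));
    }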
I solved the issue by putting ${__javascript("false")} in the While Controller condition.
I want to read sub-content of a big file from some offset/position.
For example, I have a file of 1M lines and I want to read 50 lines starting from the 100th (line numbers 101 to 150, both inclusive).
I think I should be using PositionalReadable.
https://issues.apache.org/jira/browse/HADOOP-519
I see that FSInputStream.readFully actually uses the seek() method of Seekable.
When I check the underlying implementation of seek(), I see that it uses BlockReader.skip().
Wouldn't BlockReader.skip() read all the data up to the position in order to skip the bytes? The question is: would HDFS load the first 100 lines as well in order to get to the 101st line?
How can I make the position point to any desired offset in the file, like the 10,000th line, without loading the rest of the content? Something like what S3 offers with header offsets.
Here is a similar question I found: How to read files with an offset from Hadoop using Java, but it suggests using seek(), and it is argued in the comments that seek() is an expensive operation that should be used sparingly. Which I guess is correct, because seek() seems to read all the data in order to skip to the position.
The short answer: it may or may not read as much data as skip(n) implies; it depends on the BlockReader implementation.
As you said, seek() internally calls BlockReader.skip(). BlockReader is an interface type and is created via a BlockReaderFactory. The BlockReader implementation that is created is either BlockReaderRemote or BlockReaderLocal. (Strictly speaking, ExternalBlockReader is also possible, but it is excluded here because it is a special case.)
BlockReaderRemote is used when a client reads data from a remote DataNode over the network via RPC over TCP. In this case, if you analyze the skip() method code, you can see that readNextPacket is called repeatedly until the n bytes to be skipped have been consumed. That is, it actually reads the data to be skipped.
BlockReaderLocal is used when the client is on the same machine as the DataNode where the block is stored. In this case, the client can read the block file directly and simply changes dataPos, so that the next read operation effectively performs an offset-based skip.
+Additional information (2023.01.19)
The above applies to both Hadoop 3.x.x and 2.x.x, but the path and name of the implementation have been changed from version 2.8.0 due to a change in the project structure.
< 2.8.0
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
>= 2.8.0
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderRemote.java
Related Jira issues
https://issues.apache.org/jira/browse/HDFS-8057
https://issues.apache.org/jira/browse/HDFS-8925
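For completeness, here is a minimal sketch of offset-based reading through the public client API; the path and the byte offset are assumptions, and note that seek() works on byte offsets, not line numbers, so the first line read after an arbitrary seek may be partial:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsOffsetRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/data/big-file.txt"))) {
                in.seek(1_000_000L);   // jump to a known byte offset
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
                for (int i = 0; i < 50; i++) {
                    String line = reader.readLine();
                    if (line == null) break;
                    System.out.println(line);
                }
            }
        }
    }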
I recommend you look at the SequenceFile format; maybe it will suit your needs.
We use seek() to read from an arbitrary place in a file.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/SequenceFile.Reader.html#seek(long)
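A minimal sketch of seeking inside a SequenceFile follows; the path and the saved position are assumptions, and seek() expects a record boundary (e.g. a value previously returned by getPosition()), while sync(position) jumps to the next sync point after an arbitrary offset:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileSeek {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/data/records.seq");   // assumed location of the SequenceFile
            try (SequenceFile.Reader reader =
                         new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
                long savedPosition = Long.parseLong(args[0]);   // a position saved from reader.getPosition()
                reader.seek(savedPosition);
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            }
        }
    }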
I have a simple XML file which I update every 3 seconds (by completely replacing its contents each time).
I have a problem when the computer restarts for any reason (or I long-press the power button to shut it down): the XML file ends up filled with zero characters, the same length as the proper data, but 0's instead.
I tried saving to a temp file first, validating the data, and replacing the original XML file if the data seems valid. That did not help. It looks like all the validation works fine (SAXBuilder doesn't throw exceptions, I can locate the proper child nodes, etc.), but the file is still corrupted in the end.
I use XMLOutputter and FileWriter to save the data to the temp file.
Then I replace the original with a couple of renameTo() calls.
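Roughly, the save path looks like this (a sketch only; JDOM 2 imports and the file names are assumptions):

    import java.io.File;
    import java.io.FileWriter;
    import org.jdom2.Document;
    import org.jdom2.output.Format;
    import org.jdom2.output.XMLOutputter;

    public class XmlSaver {
        void save(Document doc) throws Exception {
            File tmp = new File("data.xml.tmp");
            File target = new File("data.xml");
            try (FileWriter writer = new FileWriter(tmp)) {
                new XMLOutputter(Format.getPrettyFormat()).output(doc, writer);
                writer.flush();
            }
            // Replace the original with the freshly written temp file.
            // Note: unless the bytes are forced to disk first (e.g. FileOutputStream.getFD().sync()),
            // an abrupt power-off can still leave a zero-filled file on some filesystems.
            target.delete();
            tmp.renameTo(target);
        }
    }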
All works well if I just exit the application or kill the process from task manager. Just the restart/shutdown breaks things.
Any hints on why this is happening will be greatly appreciated.
I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read file(s) one by one with a BufferedReader and a custom Parser class that does some heavy computing on the input data. The input files are 1 to 2.5 GB each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory data store as JSON. These JSON files are smaller in size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the most performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages could also come in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that makes sure that all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output, but I'll throw in some ideas I had as well:
Define a manual split algorithm for the file input? This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50, which will do the job for sure) from the previous split to the next split. Duplicated data would be handled correctly by the Parser class and would not introduce any issues.
The second way might be easier; however, I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with 0x033, 0x034, 0x035 and 0x036, so Spark will process them separately, while actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blogpost, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this will be the easiest solution, however.
You can try to solve this using a custom FileInputFormat that does the following: instead of emitting the file line by line like the default FileInputFormat does, you parse the file and keep track of the records encountered (0x033, 0x034, etc.). In the meantime you may filter out records like 0x0FE (not sure if you want to use them elsewhere). The result of this will be that Spark gets all these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
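A minimal sketch of that keyed approach (the input path and the deriveMessageKey() helper are assumptions; the line layout {timestamp}#{id}#{data_bytes} is taken from the question):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class MessageGrouping {
        // Hypothetical: maps the id of a dependent line (0x033..0x035) and of the
        // composition message (0x036) to one shared logical key.
        static String deriveMessageKey(String id) {
            return id.startsWith("0x03") ? "object-33" : id;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("message-grouping");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///input/messages.log");
                JavaPairRDD<String, Iterable<String>> grouped = lines
                        .mapToPair(line -> {
                            String[] parts = line.split("#");   // {timestamp}#{id}#{data_bytes}
                            return new Tuple2<>(deriveMessageKey(parts[1]), line);
                        })
                        .groupByKey();
                // Each group now holds all physical lines of one logical message
                // and can be handed to the existing parser.
                grouped.foreach(group -> System.out.println(group._1() + " -> " + group._2()));
            }
        }
    }

Note that groupByKey shuffles all lines, so the key has to be chosen so that the dependent lines and their composition message really end up under the same key.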
There are certainly other options. Whichever you choose depends on your use case.
I have a need for my application to be able to read large (very large, 100GB+) text files and process the content in these files, potentially at different times. For instance, it might run for an hour and finish processing a few GBs, and then I shut it down and come back to it a few days later to resume processing the same file.
To do this I will need to read the files in memory-friendly chunks; each chunk/page/block/etc. will be read in, one at a time, and processed before the next chunk is read into memory.
I need the program to be able to mark where it is inside the input file, so if it shuts down, or if I need to "replay" the last chunk being processed, I can jump right to the point in the file where I am and continue processing. Specifically, I need to be able to do the following things:
When the processing begins, scan the file for a "MARKER" (some marker that indicates where we left off processing the last time)
If the MARKER exists, jump to it and begin processing from that point
Else, if the MARKER doesn't exist, then place a MARKER after the first chunk (for now, let's say that a "chunk" is just a line of text, as BufferedReader#readLine() would read in) and begin processing the first chunk/line
For each chunk/line processed, move the MARKER after the next chunk (thus, advancing the MARKER further down the file)
If we reach a point where there are no more chunks/lines after the current MARKER, we've finished processing the file
I tried coding this up myself and noticed that BufferedReader has some interesting methods on it that sound like they're suited for this very purpose: mark(), reset(), etc. But the Javadocs on them are kind of vague and I'm not sure that these "File Marker" methods will accomplish all the things I need to be able to do. I'm also completely open to a 3rd-party JAR/lib that has this capability built in, but Google didn't turn anything up.
Any ideas here?
Forget about markers. You cannot "insert" text without rewriting the whole file.
Use a RandomAccessFile and store the current position you are reading from. When you need to open the file again, just use seek() to find the position.
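A minimal sketch of that approach; the file name is an assumption and loadSavedPosition(), process() and savePosition() are placeholders for your own logic (note that RandomAccessFile.readLine() assumes a single-byte character encoding):

    import java.io.RandomAccessFile;

    try (RandomAccessFile file = new RandomAccessFile("huge-input.txt", "r")) {
        file.seek(loadSavedPosition());               // byte offset saved by a previous run, 0 on the first run
        String line;
        while ((line = file.readLine()) != null) {
            process(line);                            // placeholder for the real chunk processing
            savePosition(file.getFilePointer());      // offset of the first unprocessed byte
        }
    }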
A Reader's "mark" is not persistent; it solely forms part of the state of the Reader itself.
I suggest that you not store the state information in the text file itself; instead, have a file alongside which stores the byte offset of the most recently processed chunk. That'll eliminate the obvious problems involving overwriting data in the original text file.
The marker of the BufferedReader is not persisted across different runs of your application. I would also not change the content of that huge file to mark a position, since this can lead to significant I/O and/or filesystem fragmentation, depending on your OS.
I would use a properties file to store the configuration of the program externally. Have a look at the documentation; the API is straightforward:
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
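For example, a minimal sketch of persisting the resume offset (the file name and the property key are assumptions); the value stored would typically come from RandomAccessFile.getFilePointer() or a byte count you track yourself:

    import java.io.*;
    import java.util.Properties;

    public class ResumeState {
        private static final File STATE_FILE = new File("reader-state.properties");

        static long loadOffset() throws IOException {
            Properties state = new Properties();
            if (STATE_FILE.exists()) {
                try (FileInputStream in = new FileInputStream(STATE_FILE)) {
                    state.load(in);
                }
            }
            return Long.parseLong(state.getProperty("offset", "0"));
        }

        static void saveOffset(long offset) throws IOException {
            Properties state = new Properties();
            state.setProperty("offset", Long.toString(offset));
            try (FileOutputStream out = new FileOutputStream(STATE_FILE)) {
                state.store(out, "resume position of the large input file");
            }
        }
    }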