  • Reading CSV data in a weird format

    I am trying to find an efficient way of processing weather data that is nominally presented as a CSV file but is actually split into two sections. A file of the data for a single location is formatted as follows:

    "lat", "lon", "elev", "tzone"
    55.0, 10.0, 150, CET

    "tstamp", "temp", "wind", "total_solar", "direct_solar"
    "2023-07-01T00:00:00Z", 19.0, 2.5, 0, 0
    "2023-07-01T01:00:00Z", 18.5, 2.4, 0, 0
    "2023-07-01T02:00:00Z", 18.0, 2.6, 0, 0.
    "2023-07-01T03:00:00Z", 17.5, 2.4, 80, 100
    ....

    The point is that there are, in effect, two CSV files joined together, with the first part containing information common to all the time-series observations in the second part.

    It is not difficult to read this data using two separate import delimited blocks, writing out each part to a temporary file and then merging these files. However, this is pretty inefficient when processing a large number of locations because you finish up with a minimum of 3 temporary files per location. It is possible to skip the write and join operations by using the first import delimited block to create local variables which are used to generate lat, lon, elev and tzone variables in the second import delimited block. But this is inefficient because it involves repeated overwriting of local variables. A third option is to use Python to carry out the same transformation.
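
    For concreteness, the two-pass version looks something like this (an untested sketch; the file name is illustrative, and the varnames()/rowrange() values assume the blank separator line counts as row 3):

        tempfile hdr
        import delimited using "site1.csv", varnames(1) rowrange(2:2) clear
        save `hdr'
        import delimited using "site1.csv", varnames(4) clear
        cross using `hdr'          // header part has one row, so cross forms the 1:m join
        save "site1.dta", replace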

    Can anyone think of a more efficient way in Stata of reading this data to generate a single Stata file containing the merged dataset with variables <lat lon elev tzone tstamp temp wind total_solar direct_solar>?

  • #2
    So there are multiple CSV files, and each file represents a time series for a separate location?

    At the end of the day, you are going to have to load each of these files from disk into memory, separate out the two portions, then merge them together. You might be able to save time using frames instead of temporary files. That way you keep the data in memory and minimize the amount of data you send over the SATA cable to and from the hard drive - a major bottleneck. Just don't forget to delete (or, ideally, reuse) frames when you are done with them. If you accumulate too much data in memory, you'll start using virtual memory on the disk, which will slow things way down again.
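
    Roughly what I have in mind, using a constant key so -frlink- and -frget- can copy the header values across (an untested sketch; file and frame names are illustrative):

        frame create hdr
        frame hdr {
            import delimited using "site1.csv", varnames(1) rowrange(2:2) clear
            generate byte key = 1
        }
        frame create obs
        frame obs {
            import delimited using "site1.csv", varnames(4) clear
            generate byte key = 1
            frlink m:1 key, frame(hdr)            // every row links to the single header row
            frget lat lon elev tzone, from(hdr)   // hdr is the link variable frlink creates
            drop key hdr
        }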

    It is possible to skip the write and join operations by using the first import delimited block to create local variables which are used to generate lat, lon, elev and tzone variables in the second import delimited block. But this is inefficient because it involves repeated overwriting of local variables.
    Nah, overwriting locals shouldn't have a meaningful impact on the execution time of the algorithm. If anything, you are taking advantage of the pre-allocated memory for a performance boost. I like this idea as an optimization, but think it might be a bit of a pain to program.
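
    If you do go that route, it needn't be much more than this (an untested sketch; the file name is illustrative):

        import delimited using "site1.csv", varnames(1) rowrange(2:2) clear
        foreach v in lat lon elev tzone {
            local `v' = `v'[1]                    // stash the single header row in locals
        }
        import delimited using "site1.csv", varnames(4) clear
        generate double lat  = `lat'
        generate double lon  = `lon'
        generate double elev = `elev'
        generate tzone = "`tzone'"                // tzone is a string, so quote it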

    A third option is to use Python to carry out the same transformation.
    You might get a performance boost in Python; it depends on the implementation. If you put the data in a pandas DataFrame, that can be pretty slow. If you parse the files at a low level it should be fast, but more difficult (and time-consuming) to program. The biggest bottleneck here is the cable to the hard drive, not the language.
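
    For what it's worth, the pandas route is short if you run it in a python block (an untested sketch; file names are illustrative, and skiprows may need adjusting depending on whether the blank separator line is present):

        python:
        import pandas as pd
        # read the one-row header section, then the time-series section
        hdr = pd.read_csv("site1.csv", nrows=1, skipinitialspace=True)
        obs = pd.read_csv("site1.csv", skiprows=2, skipinitialspace=True)
        for c in hdr.columns:                     # broadcast the header values onto every row
            obs[c] = hdr.at[0, c]
        obs.to_stata("site1.dta", write_index=False)
        end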

    Whatever you decide to do, I'd aim to minimize the time you spend programming this along with the execution time. There's no reason you shouldn't be able to let this thing run as a background process for a couple of hours while you do other things, especially if it will save you 8 hours of programming time.
    Last edited by Daniel Schaefer; 25 Sep 2023, 15:59.



    • #3
      My other piece of advice is to print something like the name of the file you are currently processing to the console. When you're working with a process that has a long runtime, it's psychologically very important to have some kind of output, so that when you check in you can see how much progress has been made. You want to minimize how much you print, so other commands should usually be run -quietly- in a long-running algorithm, but it's very important to have some sense of where you are in the process.
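
      Something like this at the top of the loop is enough (a sketch; `files' and myprocess are placeholders for your file list and worker program):

          local n : word count `files'
          local i = 0
          foreach f of local files {
              local ++i
              display as text "(`i'/`n') processing `f'"
              quietly myprocess "`f'"             // placeholder for the real work
          }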
      Last edited by Daniel Schaefer; 25 Sep 2023, 15:58.



      • #4
        Thank you for your suggestions. I have managed to get most of what I want working using data frames, thus reducing the disk write/read overhead. Still, using data frames would be a whole lot nicer if it were possible to append observations to a dataset stored in a frame. I believe that has been on the Wish List since frames were introduced, so there may be some significant implementation constraint, perhaps relating to memory management.

        I agree with your point about writing something to the console when running lengthy loops with -quietly-. I tend to output a timer reading every x loop iterations. However, even this can blow up in one's face. Recently I had a program using the Mata LP class which seemed to get stuck randomly - and very infrequently - due (I think) to numerical errors. Runs that normally execute in 1 ms would take 1,000 seconds or longer. This wasn't a case of a very large number of steps but of a failure to find an improvement. The trouble was that the timing loops seemed to interfere with attempts to trap the errors by terminating any optimisation that ran for more than, say, 1 second.
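
        The timer pattern I use looks roughly like this (a sketch; `niter' and do_one_location are placeholders for the real loop bounds and work):

            timer clear 1
            timer on 1
            forvalues i = 1/`niter' {
                quietly do_one_location `i'       // placeholder for the real work
                if mod(`i', 100) == 0 {
                    timer off 1
                    quietly timer list 1
                    display as text "`i' iterations, " %9.1f r(t1) " seconds elapsed"
                    timer on 1
                }
            }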



        • #5
          I should add one further point. When writing my previous response I was unaware of the user-written xframeappend and fframeappend commands for appending data frames to a master frame. I have rewritten my code using the latter instead of writing small files to disk and combining them with the -append- command. This reduced the execution time of my loop by 50%. The residual time is largely absorbed by API calls to an external database, so the improvement in the Stata processing time is very large. While the user-written commands are a major contribution, I think most users would wish to see an official -frame append- command.
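
          In outline, the loop now looks like this (a sketch, assuming I have the fframeappend syntax right; the frame names and build_one are placeholders):

              frame create all
              frame create one
              foreach f of local files {
                  frame one: build_one "`f'"           // placeholder: builds the merged data for one location
                  frame all: fframeappend, using(one)  // append frame one in memory, no disk round trip
              }
              frame change all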
