
  • Merging and reshaping data

    Dear all,

    I am currently writing my master's thesis and I am a beginner in Stata.

    I need to merge two datasets that have different time formats.

    The first dataset is a panel given in quarter-hourly intervals. It looks something like this:

    begintime endtime Total_Load
    01jan2015 00:00:00 01jan2015 00:15:00 44840
    01jan2015 00:15:00 01jan2015 00:30:00 42412
    01jan2015 00:30:00 01jan2015 00:45:00 41901
    01jan2015 00:45:00 01jan2015 01:00:00 40710

    The second dataset has a format that looks like this, where the time spans between begintime and endtime are always different:

    begintime endtime Output
    01jan2015 00:00:00 01jan2015 01:15:00 318
    01jan2015 00:00:00 02jan2015 01:00:00 82
    01jan2015 03:00:00 01jan2015 05:00:00 215
    01jan2015 11:00:00 01jan2015 14:00:00 1971

    I need to merge the two datasets; the goal of the analysis is to get the mean of Total_Load and Output per hour of each day.
    My professor told me that I need to use the reshape command to transform the second dataset, since the observations must have a unique identifier to be merged. However, the second dataset has repeated observations in begintime and endtime.
    How can I generate a unique id for such a dataset? Or how can I reshape such a dataset from long to wide?

    I would appreciate any advice. Thank you in advance for any help.

  • #2
    Hi Bianca, welcome to the forum. Unfortunately, I'm not sure there is enough information here to help you. You say you have panel data, right? So what variable identifies the non-temporal units in the first and second dataset?

    If your second dataset is in wide format, you will usually have the same variable repeated over time (or perhaps over your cross-sectional unit in this case?). For example, you might have var1, var2, var3, var4, and var5, where the numbers 1-5 might represent different cross-sectional units. Do you have a data structure like that in your second dataset?
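
    For illustration only, wide data for five hypothetical cross-sectional units (the names and values here are made up) might look like this:

    Code:
    clear
    input float(time var1 var2 var3 var4 var5)
    1 10 12  9 11 10
    2 11 13  9 12 10
    3 12 13 10 12 11
    end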

    At the end of the day, you need a set of variables that uniquely identify each observation in each dataset. In panel data, that is usually a temporal and a cross-sectional id. It might look something like this:

    Code:
    clear
    input int(time cross_section)
    1 1
    1 2
    1 3
    1 4
    2 1
    2 2
    2 3
    2 4
    3 1
    3 2
    3 3
    3 4
    end
    Here you have three days and four members of your panel. You not only need a unique identifier (or a set of variables that uniquely identify the data), but you also need exactly the same identifiers in both data sets. Thanks for the data examples, but could you follow up with a data example from each dataset generated using the -dataex- command? The command is very easy to use and would be a lot of help here. Be sure to include any relevant cross-sectional identifier variable and any variables that follow the wide data pattern I describe above.
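    For instance, assuming the variable names from your post, the command might look like this:

    Code:
    dataex begintime endtime Total_Load in 1/20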
    Last edited by Daniel Schaefer; 20 Dec 2023, 11:44.



    • #3
      Dear Daniel,

      Thank you so much for your response.
      Yes, both datasets are panel data in long format.
      The first dataset identifies the load that is generated by a power plant. The time periods here are straightforward 15-minute intervals. I copied an example of the data via dataex:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str16(begin end) long load
      "01.01.2015 00:00" "01.01.2015 00:15" 42955
      "01.01.2015 00:15" "01.01.2015 00:30" 42412
      "01.01.2015 00:30" "01.01.2015 00:45" 41901
      "01.01.2015 00:45" "01.01.2015 01:00" 41355
      "01.01.2015 01:00" "01.01.2015 01:15" 40710
      "01.01.2015 01:15" "01.01.2015 01:30" 40204
      "01.01.2015 01:30" "01.01.2015 01:45" 39640
      "01.01.2015 01:45" "01.01.2015 02:00" 39324
      "01.01.2015 02:00" "01.01.2015 02:15" 39002
      "01.01.2015 02:15" "01.01.2015 02:30" 38869
      "01.01.2015 02:30" "01.01.2015 02:45" 38783
      "01.01.2015 02:45" "01.01.2015 03:00" 38598
      "01.01.2015 03:00" "01.01.2015 03:15" 38626
      "01.01.2015 03:15" "01.01.2015 03:30" 38459
      "01.01.2015 03:30" "01.01.2015 03:45" 38414
      "01.01.2015 03:45" "01.01.2015 04:00" 38461
      "01.01.2015 04:00" "01.01.2015 04:15" 38559
      "01.01.2015 04:15" "01.01.2015 04:30" 38490
      "01.01.2015 04:30" "01.01.2015 04:45" 38713
      end

      The second dataset looks into redispatch measures of power plants, where the variable of interest is the output. The time periods here are always different and sometimes repeat. To be able to merge the two datasets, my professor advised me to use the reshape command to get a unique identifier, e.g. time 00:00 (1) with output (1), time 00:00 (2) with output (2), etc. (I try to sketch what I understood below, after the data example.)
      An example of the dataset here:

      Code:
       dataex begin end output_mwh power_plant redispatch
      * Example generated by -dataex-. For more info, type help dataex
      
      clear
      input str16(begin end) int output_mwh str47 power_plant str10 redispatch
      "01.01.2015 00:00" "01.01.2015 01:15"   318 "Gebersdorf 2"                    "erhöhen"
      "01.01.2015 00:00" "01.01.2015 01:00"    82 "Zolling 5"                       "erhöhen"
      "01.01.2015 00:00" "02.01.2015 00:00"  8300 "Grosskraftwerk Mannheim AG "     "erhöhen"
      "01.01.2015 00:00" "01.01.2015 01:00"    50 "Heizkraftwerk Heilbronn"         "erhöhen"
      "01.01.2015 00:00" "02.01.2015 00:00" 13901 "Boxberg, Jänschwalde"            "reduzieren"
      "01.01.2015 03:00" "01.01.2015 04:00"    50 "Heizkraftwerk Altbach/Deizisau " "erhöhen"
      "01.01.2015 11:00" "02.01.2015 00:00"  2966 "Zolling 5"                       "erhöhen"
      "01.01.2015 11:00" "01.01.2015 14:00"   150 "Heizkraftwerk Altbach/Deizisau " "erhöhen"
      "01.01.2015 13:00" "01.01.2015 16:00"   600 "Jänschwalde, Schwarze Pumpe"     "reduzieren"
      "01.01.2015 15:15" "02.01.2015 00:00"  1971 "Gebersdorf 2"                    "erhöhen"
      "01.01.2015 17:00" "01.01.2015 19:00"   100 "Heizkraftwerk Altbach/Deizisau " "erhöhen"
      "01.01.2015 18:00" "02.01.2015 00:00"   930 "Moorburg"                        "reduzieren"
      "01.01.2015 20:00" "01.01.2015 21:00"   175 "Gebersdorf 2"                    "erhöhen"
      "01.01.2015 20:00" "02.01.2015 00:00"   670 "Boxberg"                         "reduzieren"
      "01.01.2015 21:00" "02.01.2015 00:00"   150 "Heizkraftwerk Altbach/Deizisau " "erhöhen"
      "02.01.2015 00:00" "03.01.2015 00:00"  3782 "Gebersdorf 2"                    "erhöhen"
      "02.01.2015 00:00" "03.01.2015 00:00"  5682 "Zolling 5"                       "erhöhen"
      "02.01.2015 00:00" "02.01.2015 07:00"  1015 "Isar 2 (SWM)"                    "erhöhen"
      "02.01.2015 00:00" "02.01.2015 03:00"   450 "Grosskraftwerk Mannheim AG "     "erhöhen"
      "02.01.2015 00:00" "03.01.2015 00:00"  5350 "Heizkraftwerk Altbach/Deizisau " "erhöhen"
      end
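      If I understood the suggestion correctly, it would be something like this (just my rough sketch, with seq as a name I made up; I am not sure it is right):

      Code:
      * number the repeated observations within each begin/end interval,
      * then spread them into columns so each interval becomes one row
      bysort begin end (power_plant): gen seq = _n
      reshape wide output_mwh power_plant redispatch, i(begin end) j(seq)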
      I hope it is clearer now.
      Thanks again for any help.



      • #4
        I notice your first dataset does not have (or you do not provide) a power_plant variable. I take it power plants are your cross-sectional unit. Does that mean you actually have a time series in the first dataset (not a panel, because there is no cross-sectional unit) that characterizes an aggregate feature of all of the power plants over a specific 15-minute interval?

        I take it your advisor is essentially recommending that you reshape the second dataset to the wide format so that each power plant measurement is in its own column and each time interval gets its own row; a sketch of that arrangement follows below. Does that sound correct? Or do you want to end up with a situation where each time interval gets its own column and each power plant gets its own row? The second case seems to be what you are describing above, but I'm having trouble seeing how you might merge that with the first dataset, which doesn't have a power plant variable.
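        For concreteness, the first arrangement might be produced by something like this (a sketch only, untested; it assumes no plant appears twice in the same interval, and it encodes the plant names because they contain spaces and commas and so cannot serve directly as variable suffixes):

        Code:
        * one row per (begin, end) interval, one set of columns per plant
        encode power_plant, gen(plant_id)
        drop power_plant  // the names are preserved in plant_id's value labels
        reshape wide output_mwh redispatch, i(begin end) j(plant_id)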

        That might be a start, but it doesn't explain everything you want to do here. The first dataset has 15-minute intervals. The second dataset has intervals that are an hour long, intervals that are a day long, and at least one interval that is an hour and fifteen minutes long. So do you want to associate each 15-minute interval in the first dataset with the corresponding time range in the second dataset that contains it?



        • #5
          Yes, it is correct that the first dataset is a time series with load as the aggregated variable of all power plants in 15-minute intervals. The second dataset is the panel dataset, with different and sometimes repeated time intervals for redispatch measurements of different power plants. In the end, my goal is to get the average (mean) of load (first dataset) in comparison to output (second dataset) and to aggregate it to an hourly or daily basis.
          Yes, he recommended reshaping the second dataset to the wide format so that each time interval gets its own column, to be able to merge it with the corresponding time intervals in the other dataset. I am not sure if there is another way to do it.



          • #6
            I can see two ways to approach this. The first is to combine the two data sets by pairing each 15-minute period in the first data set with every interval in the second one that contains it. In theory this is not hard to do, but if the data sets are large, it might be impossible due to memory limitations, or it might be possible but take an unreasonably long time to finish.
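            To sketch that first approach: the user-written -rangejoin- (by Robert Picard, available from SSC, along with its dependency -rangestat-) does this kind of pairing. Assuming begin and end in both files have already been converted to numeric %tc clock values as shown further below, it might look something like this:

            Code:
            * pair each output interval with every 15-minute load period whose
            * begin time falls inside it (a sketch, not tested on the real files)
            use output_data, clear
            rename (begin end) (span_begin span_end)
            rangejoin begin span_begin span_end using load_data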

            The other way is motivated by your observation that your ultimate goal is an hourly data set matching hours with load and output. This suggests that in the first data set, we aggregate up to the hourly level: in each hour, I imagine that the total load for the hour is just the sum of the load in the four 15-minute intervals that belong to that hour. The harder part is breaking up the intervals in the second data set into one-hour subintervals when they are longer than one hour: in particular, I don't know how the output in, say, an interval that begins at 00:00 and ends at 01:15 gets allocated between the 00:00-01:00 hour and the 01:00-02:00 hour. You or your adviser may know how to do this. But for simplicity here, I'll simply assume that it's done in proportion to the time overlapping the hour.
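            For example, under that assumption, an interval running from 00:00 to 01:15 with output 318 has a duration of 1.25 hours: the 00:00-01:00 hour overlaps it for 1 hour and gets 318 × (1/1.25) = 254.4, while the 01:00-02:00 hour overlaps it for 0.25 hours and gets 318 × (0.25/1.25) = 63.6.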

            Once both data sets have been turned into hourly data sets, they can simply be combined with -merge 1:1-.

            Of course, this is going to involve a fair amount of datetime variable manipulation, so for starters those string variables showing timestamps have to be converted.

            Code:
            //    AGGREGATE LOAD DATA UP TO HOURLY
            use load_data, clear
            foreach v of varlist begin end {
                gen double _`v' = clock(`v', "DMYhm"), after(`v')
                assert missing(_`v') == missing(`v')
                format _`v' %tc
                drop `v'
                rename _`v' `v'
            }
            gen double start_hour = cofC(Cdhms(dofc(begin), hh(begin), 0, 0))  // truncate begin down to the top of its hour
            format start_hour %tcddMonCCYY_HH:MM:SS
            collapse (sum) load, by(start_hour)
            tempfile hourly_load
            save `hourly_load'
            
            //    DISAGGREGATE OUTPUT DATA TO HOURLY LEVEL
            use output_data, clear
            foreach v of varlist begin end {
                gen double _`v' = clock(`v', "DMYhm"), after(`v')
                assert missing(_`v') == missing(`v')
                format _`v' %tc
                drop `v'
                rename _`v' `v'
            }
            //    EXPAND TO HOURLY OBSERVATIONS
            gen `c(obs_t)' obs_no = _n
            gen duration = clockdiff_frac(begin, end, "h")  // interval length in fractional hours
            gen double start_hour = cofC(Cdhms(dofc(begin), hh(begin), 0, 0))
            format start_hour %tcddMonCCYY_HH:MM:SS
            gen expander = ceil(clockdiff_frac(start_hour, end, "h"))  // number of clock hours the interval touches
            expand expander
            by obs_no, sort: replace start_hour = start_hour[1] + msofhours(_n-1)
            gen double finish_hour = start_hour + msofhours(1)
            format finish_hour %tcddMonCCYY_HH:MM:SS
            
            //    NOW ALLOCATE THE output IN PROPORTION TO TIME IN THAT HOUR
            gen double overlap = clockdiff_frac(max(begin, start_hour), min(end, finish_hour), "h")
            gen allocated_output = overlap*output_mwh/duration
            collapse (sum) allocated_output (first) finish_hour, by(start_hour)
            
            //    COMBINE THE DATA SETS
            merge 1:1 start_hour using `hourly_load'
            Added: Replace the filenames load_data and output_data in the -use- commands with the actual names of your data sets.

            In your example data, as is clear, the output data covers a much longer period of time than the load data, so very little of the data ends up paired. But presumably that will either not be the case, or only occasionally so, in your full data sets.

