Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creation of a longitudinal dataset from an original messy dataset

    Hi everyone,

    I would like to create a longitudinal dataset from an uncleaned dataset, and I don't know where to start. I need help, please. Basically, my dataset looks like:




    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str28 municipality_1 str32 municipality_2 str28 municipality_3 str32 municipality_4 str17 power_mw str7 power_kw str29 renewables_type int(publication_date_boe_ddmmyyyy boe_resolution_date_ddmmyyyy)
    "Yebes"               " Horche"               ""                     ""                 "55"     "" "Solar" 23146 23120
    "Yebes"               " Horche"               ""                     ""                 "67.5"   "" "Solar" 23145 23120
    "Yepes"               " Numancia de La Sagra" " Pantoja"             " Añover de Tajo" "242.35" "" "Solar" 23162 23121
    "Yunquera de Henares" " Fontanar"             " Tórtola de Henares" " Guadalajara"     "42,5"   "" "Solar" 23176 23166
    "Zaragoza"            " Teruel"               " Tarragona"           "Barcelona."       "38.2"   "" "Solar" 23224 23204
    end
    format %td publication_date_boe_ddmmyyyy
    format %td boe_resolution_date_ddmmyyyy


    I have some variables representing each municipality involved in a renewable construction project in Spain, from -municipality_1- to -municipality_26-. Some projects could contain 1 or more municipalities involved. I have the full dates of publication of the status of the project (-publication_date_boe_ddmmyyyy-), and the decision (-boe_resolution_date_ddmmyyyy-).

    Here's what I'd ideally like to have. This is an example and does not necessarily reflect the reality given above:
    municipality_1 quarter new_installations_dummy cum_new_installations_dummy new_capacity_installed_mw cumul_new_capacity_installed_mw new_capacity_installed_kw cumul_new_capacity_installed_kw renewables_type (encoded) count_other_municipalities_wo_first municipality_2 municipality_3
    "Yebes" 2018q1 0 0 . . . . . 0
    "Yebes" 2018q2 0 0 . . . . . 0
    "Yebes" 2018q3 1 1 62.55 62.55 . . Solar 1 "Barcelona"
    "Yebes" 2018q4 0 1 . . . . . 0
    "Yebes" 2019q1 0 1 . . . . . 0
    "Yebes" 2019q2 1 2 350.10 412.65 . . Solar 2 "Zaragoza" "Teruel"
    "Yebes" 2019q3 0 2 . . . . . 0
    "Yebes" 2019q4 0 2 . . . . . 0
    ... ... ... ... ... ... ... ... ... ... ... ...
    I'd like to have the same number of quarters. In other words, I'd like the time variable for each municipality to start at 2017q1 up to 2023q4, even if there are a lot of 0s and missing values. Is this possible?
    Otherwise, is it possible to have as many rows as quarters?


    Thank you very much in advance for your help and time!


    Best regards,

    Michael
    Last edited by Michael Duarte Goncalves; 30 Apr 2024, 05:13.

  • #2
    Does this do what you want?. The code assumes you want to assign observations to quarters using the publication date rather than the resolution date.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str28 municipality_1 str32 municipality_2 str28 municipality_3 str32 municipality_4 str17 power_mw str7 power_kw str29 renewables_type int(publication_date_boe_ddmmyyyy boe_resolution_date_ddmmyyyy)
    "Yebes"               " Horche"               ""                     ""                 "55"     "" "Solar" 23146 23120
    "Yebes"               " Horche"               ""                     ""                 "67.5"   "" "Solar" 23145 23120
    "Yepes"               " Numancia de La Sagra" " Pantoja"             " Añover de Tajo" "242.35" "" "Solar" 23162 23121
    "Yunquera de Henares" " Fontanar"             " Tórtola de Henares" " Guadalajara"     "42,5"   "" "Solar" 23176 23166
    "Zaragoza"            " Teruel"               " Tarragona"           "Barcelona."       "38.2"   "" "Solar" 23224 23204
    end
    format %td publication_date_boe_ddmmyyyy
    format %td boe_resolution_date_ddmmyyyy
    
    *Generate the quarterly date from the publication date
    gen qtr=qofd(publication)
    format qtr %tq
    
    encode municipality_1, gen(muni_1)
    
    tempfile raw
    save `raw'
    keep muni_1
    duplicates drop
    expand 28
    bysort muni_1 : gen qtr = yq(2016, 4) + _n
    format qtr %tq
    
    tsset muni_1 qtr
    
    *merge your original data back
    merge 1:m muni_1 qtr using `raw'
    
    list if municipality_1 !=""
    Devra Golbe
    Professor Emerita, Dept. of Economics
    Hunter College, CUNY

    Comment


    • #3
      Good morning Professor Devra Golbe,

      Sorry for the late reply.
      I had a hard time finding this. You solved my problem in one post.

      Thank you very much for your suggestion! It is exactly what I was looking for.

      Kind regards,

      Michael

      Comment


      • #4
        Could anyone explain me why in post #2 expand 28 is used, please?


        I understand that this command replaces each observation with 28 copies of the observation (so the first one is retained, and 27 new observations are created). But why exactly 28?
        Is there a reason for that?

        Thank you in advance for your help!

        All the best,
        Michael

        Edit: Could it be that Prof.Golbe uses expand 28 since there are 28 quarters between 2017q1 and 2023q4?
        Last edited by Michael Duarte Goncalves; 06 May 2024, 04:49.

        Comment


        • #5
          Yes indeed- 28 quarters between 2017q1 and 2023q4. I should have commented that line to explain why I picked that number.
          Devra Golbe
          Professor Emerita, Dept. of Economics
          Hunter College, CUNY

          Comment


          • #6
            Ok, I understand now.

            Thank you so much for your answer, Prof. Devra Golbe.
            Lovely day.

            Michael

            Comment

            Working...
            X