Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Dan Daugaard

    On reflection, the advice I've given above is going in the wrong direction.

    As with so many things in Stata these problems appear complicated only because your data is organized in a wide layout. If you switch to a long layout, they become easy, as the example code below shows.

    If there is some compelling reason you need to go back to wide layout, you can do that at the point that it is required. But think twice before doing that. There are only a small number of things in Stata that are easier to do with wide data. Until you know you're going to do one of them, its best to stick with a long layout.
    Code:
    use events.dta, clear
    generate i = 1 // to match to master
    sort event_month
    // replace the event text with a numeric id and value label
    rename event eventname
    encode eventname, generate(event)
    label list event
    drop eventname
    tempfile events
    save `events'
    
    use master.dta
    generate i = 1 // to match to events
    joinby i using `events'
    drop i
    order id event event_month
    
    generate diff = month-event_month
    
    bysort diff: egen avg_ff = mean(ff) if inlist(diff,-2,-1,0,1,2)
    egen toprint = tag(diff)
    list diff avg_ff if toprint & inlist(diff,-2,-1,0,1,2), noobs
    drop toprint
    
    sort id event month
    list if id==2708 & event==2, noobs abbreviate(12)
    describe
    Code:
    . label list event
    event:
               1 event_1
               2 event_2
    Code:
    . list diff avg_ff if toprint & inlist(diff,-2,-1,0,1,2), noobs
    
      +------------------+
      | diff      avg_ff |
      |------------------|
      |   -2   -.0023612 |
      |   -1    -.002323 |
      |    0   -.0020188 |
      |    1   -.0008208 |
      |    2   -.0016764 |
      +------------------+
    Code:
    . list if id==2708 & event==2, noobs abbreviate(12)
    
      +------------------------------------------------------------------------+
      |   id     event   event_month      month          ff   diff      avg_ff |
      |------------------------------------------------------------------------|
      | 2708   event_2      2015 Jan   2014 Oct    -.005559     -3           . |
      | 2708   event_2      2015 Jan   2014 Nov   -.0023612     -2   -.0023612 |
      | 2708   event_2      2015 Jan   2014 Dec     .000913     -1    -.002323 |
      | 2708   event_2      2015 Jan   2015 Jan   -.0016764      0   -.0020188 |
      | 2708   event_2      2015 Jan   2015 Feb   -.0025547      1   -.0008208 |
      +------------------------------------------------------------------------+
    
    . describe
    
    Contains data
      obs:            20                          
     vars:             7                          
     size:           560                          
    ------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ------------------------------------------------------------------------------------------------
    id              long    %12.0g                
    event           long    %8.0g      event      
    event_month     float   %tm..                 
    month           float   %tm..                 
    ff              float   %9.0g                 
    diff            float   %9.0g                 
    avg_ff          float   %9.0g                 
    ------------------------------------------------------------------------------------------------
    Sorted by: id  event  month
    Here are the data I used, reverse engineered from your example in post #14.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long id float(month ff)
    2708 657     -.005559
    2708 658 -.0023612394
    2708 659  .0009129993
    2708 660 -.0016764477
    2708 661 -.0025546604
    2709 657     -.005559
    2709 658 -.0023612394
    2709 659  .0009129993
    2709 660 -.0016764477
    2709 661 -.0025546604
    end
    format %tmCCYY_Mon month
    save master, replace
    
    clear
    input float event_month str16 event
    658 "event_1"
    660 "event_2"
    end
    format %tmCCYY_Mon event_month
    save events, replace

    Comment


    • #17
      Thank you so much for thinking into this problem William - especially noting the structural problem (wide vs long) I was creating. Thanks for following up with this alternative approach. I'll apply the approach to my full data set (and make improvements to my data set-up, variable names and display formats which became evident through the stages of insight you provided). I think I can see how I will be able to redo the avg_ff (for a range of diff) for categories of events (fyi, there are presently 120 individual events with 3 categories containing 40 specific events). Thank you, Dan

      Comment


      • #18
        Hi William, sorry to be dumb I’m having a small problems as I adapt my data to your suggestions.
        I’m trying to get my events data to look like you and I had, ie
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float event_month str16 event
        658 "event_1"
        660 "event_2"
        end
        format %tmCCYY_Mon event_month
        But I am having a minor challenge, how do you generate a variable containing "event_1", "event_2" etc…? I’m trying to use the generate command with the =_n feature but can’t work out how to add the “event_” part. Eg my data looks like
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float event_month
        372
        375
        376
        end
        format %tmCCYY_Mon event_month
        generate event = "event"_n
        save event, replace
        The other challenge I’m having is that I’m getting an “invalid file specification when I run line “joinby i using `events' “ in the next bit of code, ie
        Code:
        use master.dta
        generate i = 1 // to match to events
        joinby i using `events'
        Thank you for your help, Dan

        Comment


        • #19
          In my code in post #16 the following two commands
          Code:
          ...
          save `events'
          ...
          joinby i using `events'
          ...
          should have had the expanded local macro surrounded in quotation marks in case there are spaces in any of the directory names in the path to where your temporary files are created. (That's not a problem on my Mac and I often forget to use quotation marks, reinforcing a bad habit.)
          Code:
          ...
          save `"`events'"'
          ...
          joinby i using `"`events'"'
          ...
          For safety's sake I have used so-called compound double quotes
          Code:
          `" and "'
          as described in the output of help strings in case the path should itself contain single or double quotes.

          Do keep in mind that tempfiles, like the local macros that name them, are confined to the do-file within which they are run. So the entire batch of code must be run as one run.

          With regard to creating the event variable, I understood from your post #6 that it already existed, not that it needed to be created. There is no point in creating an entirely arbitrary and uninformative string variable and then using encode to create from it a numeric variable with the string variable values as an aribtrary and uninformative value label. Instead, create a a numeric event to begin with.

          Here is the revised code.
          Code:
          use events.dta, clear
          generate i = 1 // to match to master
          sort event_month
          // create a sequential event id
          generate event = _n
          tempfile events
          save `"`events'"'
          
          use master.dta
          generate i = 1 // to match to events
          joinby i using `"`events'"'
          drop i
          order id event event_month
          
          generate diff = month-event_month
          
          bysort diff: egen avg_ff = mean(ff) if inlist(diff,-2,-1,0,1,2)
          egen toprint = tag(diff)
          list diff avg_ff if toprint & inlist(diff,-2,-1,0,1,2), noobs
          drop toprint
          
          sort id event month
          list if id==2708, noobs abbreviate(12) sepby(event)
          describe
          Here is the revised end result.
          Code:
          . list if id==2708, noobs abbreviate(12)
          
            +----------------------------------------------------------------------+
            |   id   event   event_month      month          ff   diff      avg_ff |
            |----------------------------------------------------------------------|
            | 2708       1      2014 Nov   2014 Oct    -.005559     -1    -.002323 |
            | 2708       1      2014 Nov   2014 Nov   -.0023612      0   -.0020188 |
            | 2708       1      2014 Nov   2014 Dec     .000913      1   -.0008208 |
            | 2708       1      2014 Nov   2015 Jan   -.0016764      2   -.0016764 |
            | 2708       1      2014 Nov   2015 Feb   -.0025547      3           . |
            |----------------------------------------------------------------------|
            | 2708       2      2015 Jan   2014 Oct    -.005559     -3           . |
            | 2708       2      2015 Jan   2014 Nov   -.0023612     -2   -.0023612 |
            | 2708       2      2015 Jan   2014 Dec     .000913     -1    -.002323 |
            | 2708       2      2015 Jan   2015 Jan   -.0016764      0   -.0020188 |
            | 2708       2      2015 Jan   2015 Feb   -.0025547      1   -.0008208 |
            +----------------------------------------------------------------------+
          
          . describe
          
          Contains data
            obs:            20                          
           vars:             7                          
           size:           560                          
          ------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ------------------------------------------------------------------------------------------------
          id              long    %12.0g                
          event           float   %9.0g                 
          event_month     float   %tm..                 
          month           float   %tm..                 
          ff              float   %9.0g                 
          diff            float   %9.0g                 
          avg_ff          float   %9.0g                 
          ------------------------------------------------------------------------------------------------
          Sorted by: id  event  month
          Here are the example input datasets.
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long id float(month ff)
          2708 657     -.005559
          2708 658 -.0023612394
          2708 659  .0009129993
          2708 660 -.0016764477
          2708 661 -.0025546604
          2709 657     -.005559
          2709 658 -.0023612394
          2709 659  .0009129993
          2709 660 -.0016764477
          2709 661 -.0025546604
          end
          format %tmCCYY_Mon month
          save master, replace
          
          clear
          input float event_month
          658
          660
          end
          format %tmCCYY_Mon event_month
          save events, replace

          Comment


          • #20
            Thank you for the insights on the role of quotation marks re locations in macros William - I now comprehend what was triggering my errors.
            Yes, your code works very well on the full set of data - I will now incorporate my control variables.
            Sorry to persist with a question that doesn’t seem relevant, but I am learning a lot from this conversation. When you were presenting earlier advice it became obvious that a few of the attributes of how you structure and present the data and results was more useful than my initial attempts (eg your style of monthly formatting can help solve description confusion that I had etc). In a similar way, the “event_1”, “event_2” etc names for the event months (which could potentially become the new event variable names) could enable me to input my existing clumsy names (eg “release of UN report on slavery” etc) to a form which could easily help categorise and organise the event types and then I could rerun the code for relevant subsets of the data (eg “environmental_event_1” or “oil_spill_1” etc).
            I can fully understand if you are tired of responding to this narrow focus of discussion and therefore pls just indicate if that’s the case and I’ll attempt to post this as a stand alone element.
            The specific issue I'm trying to resolve is how to generate a variable which has sequential numbers following a word - eg "event_1", "event_2" etc…? I think it requires using the generate command combined with the =_n feature - but I can only create either a series of numbers of just the word “event” Eg the initial data looked like
            Code:
            * Examples generated by -dataex-. To install: ssc install dataex
            clear
            input float event_month
            372
            375
            376
            end
            And then I’m trying (unsuccessfully) combinations like
            Code:
            generate event = "event"_n
            but what I am hoping to achieve is something like:
            Code:
            clear
            input float event_month str16 event
            658 "event_1"
            660 "event_2"
            end
            I'll then generate event names conditional on the words I have in the current clumsy descriptions. Thank you for all your help, Dan
            Last edited by Dan Daugaard; 04 Jun 2019, 15:25.

            Comment


            • #21
              The specific issue I'm trying to resolve is how to generate a variable which has sequential numbers following a word - eg "event_1", "event_2" etc…?
              Dan, the problem we are having is that you ask questions about what you think it is you need to do to accomplish what you want, but you don't clearly describe what you want. And it is often the case that what you think you need is not what you actually need.

              I'll then generate event names conditional on the words I have in the current clumsy descriptions.
              So there are actually meaningful descriptions associated with each event, as opposed to the meaningless descriptions event_1 etc. It is more productive to provide you with code that works with your real descriptions that it is to first suggest code for meaningless descriptions and later revise it to work with meaningful descriptions.

              Your examples of your data should be as realistic as you can make them. If you plan on attaching a description to each event, attach the description you want, not the description you think you need to compromise on because somewhere downstream you think it may be needed for a variable name. In what follows I add a meaningful description to the events data, and affix a sequence number to these descriptions, and then show that they can be used to limit a command to a subset of observations. I've highlighted my changes from the previous example in red.

              Here is the example data I am using now.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long id float(month ff)
              2708 657     -.005559
              2708 658 -.0023612394
              2708 659  .0009129993
              2708 660 -.0016764477
              2708 661 -.0025546604
              2709 657     -.005559
              2709 658 -.0023612394
              2709 659  .0009129993
              2709 660 -.0016764477
              2709 661 -.0025546604
              end
              format %tmCCYY_Mon month
              save master, replace
              
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float event_month str20 event_descr
              658 "Vampire Attack"   
              660 "Plague of Locusts"
              661 "Vampire Attack"
              end
              format %tmCCYY_Mon event_month
              save events, replace
              Here is the code to process this data.
              Code:
              use events.dta, clear
              generate i = 1 // to match to master
              // sequence number the descriptions
              by event_descr (event_month), sort: replace event_descr = event_descr + " " + strofreal(_n)
              // create a sequential event id
              sort event_month
              generate event = _n
              tempfile events
              save `"`events'"'
              
              use master.dta
              generate i = 1 // to match to events
              joinby i using `"`events'"'
              drop i
              order id event event_month event_descr
              
              generate diff = month-event_month
              
              bysort diff: egen avg_ff = mean(ff) if inlist(diff,-2,-1,0,1,2)
              egen toprint = tag(diff)
              list diff avg_ff if toprint & inlist(diff,-2,-1,0,1,2), noobs abbreviate(12)
              drop toprint
              
              sort id event month
              // data for id 2708 events
              list if id==2708, noobs abbreviate(12) sepby(event)
              // find Vampire Attack 1
              list if event_descr == "Vampire Attack 1" & month==event_month, noobs abbreviate(12)
              // find all the Vampire Attacks
              list id event_month event_descr ///
                  if strpos(event_descr,"Vampire Attack") & month==event_month, noobs abbreviate(12)
              describe
              Here is the output.
              Code:
              . // data for id 2708 events
              . list if id==2708, noobs abbreviate(12) sepby(event)
              
                +--------------------------------------------------------------------------------------------+
                |   id   event   event_month           event_descr      month          ff   diff      avg_ff |
                |--------------------------------------------------------------------------------------------|
                | 2708       1      2014 Nov      Vampire Attack 1   2014 Oct    -.005559     -1   -.0021075 |
                | 2708       1      2014 Nov      Vampire Attack 1   2014 Nov   -.0023612      0   -.0021974 |
                | 2708       1      2014 Nov      Vampire Attack 1   2014 Dec     .000913      1   -.0008208 |
                | 2708       1      2014 Nov      Vampire Attack 1   2015 Jan   -.0016764      2   -.0016764 |
                | 2708       1      2014 Nov      Vampire Attack 1   2015 Feb   -.0025547      3           . |
                |--------------------------------------------------------------------------------------------|
                | 2708       2      2015 Jan   Plague of Locusts 1   2014 Oct    -.005559     -3           . |
                | 2708       2      2015 Jan   Plague of Locusts 1   2014 Nov   -.0023612     -2   -.0007241 |
                | 2708       2      2015 Jan   Plague of Locusts 1   2014 Dec     .000913     -1   -.0021075 |
                | 2708       2      2015 Jan   Plague of Locusts 1   2015 Jan   -.0016764      0   -.0021974 |
                | 2708       2      2015 Jan   Plague of Locusts 1   2015 Feb   -.0025547      1   -.0008208 |
                |--------------------------------------------------------------------------------------------|
                | 2708       3      2015 Feb      Vampire Attack 2   2014 Oct    -.005559     -4           . |
                | 2708       3      2015 Feb      Vampire Attack 2   2014 Nov   -.0023612     -3           . |
                | 2708       3      2015 Feb      Vampire Attack 2   2014 Dec     .000913     -2   -.0007241 |
                | 2708       3      2015 Feb      Vampire Attack 2   2015 Jan   -.0016764     -1   -.0021075 |
                | 2708       3      2015 Feb      Vampire Attack 2   2015 Feb   -.0025547      0   -.0021974 |
                +--------------------------------------------------------------------------------------------+
              
              . // find Vampire Attack 1
              . list if event_descr == "Vampire Attack 1" & month==event_month, noobs abbreviate(12)
              
                +-----------------------------------------------------------------------------------------+
                |   id   event   event_month        event_descr      month          ff   diff      avg_ff |
                |-----------------------------------------------------------------------------------------|
                | 2708       1      2014 Nov   Vampire Attack 1   2014 Nov   -.0023612      0   -.0021974 |
                | 2709       1      2014 Nov   Vampire Attack 1   2014 Nov   -.0023612      0   -.0021974 |
                +-----------------------------------------------------------------------------------------+
              
              . // find all the Vampire Attacks
              . list id event_month event_descr ///
              >     if strpos(event_descr,"Vampire Attack") & month==event_month, noobs abbreviate(12)
              
                +---------------------------------------+
                |   id   event_month        event_descr |
                |---------------------------------------|
                | 2708      2014 Nov   Vampire Attack 1 |
                | 2708      2015 Feb   Vampire Attack 2 |
                | 2709      2014 Nov   Vampire Attack 1 |
                | 2709      2015 Feb   Vampire Attack 2 |
                +---------------------------------------+
              
              . describe
              
              Contains data
                obs:            30                          
               vars:             8                          
               size:         1,440                          
              ------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              ------------------------------------------------------------------------------------------------
              id              long    %12.0g                
              event           float   %9.0g                 
              event_month     float   %tm..                 
              event_descr     str20   %20s                  
              month           float   %tm..                 
              ff              float   %9.0g                 
              diff            float   %9.0g                 
              avg_ff          float   %9.0g                 
              ------------------------------------------------------------------------------------------------
              Sorted by: id  event  month

              Comment


              • #22
                Hi William, you are a champion! thank you for your help and for your guidance re clearly stating what I am hoping to achieve. The code for calculating averages in this form was very useful to identify patterns to guide my research. However, I need to go back to the simpler format for some regression tests. Is it therefore possible to get another very dumb question answered?
                In regard to your advice in an earlier message #5
                The end product at that stage looked similar to this (in the full data set there are many event variables - ie "ev_mon_"'s)
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long id float(month ff ev_mon_1 ev_mon_2)
                2708 657     -.005559 661 670
                2708 658 -.0023612394 661 670
                2708 659  .0009129993 661 670
                2708 660 -.0016764477 661 670
                2708 661 -.0025546604 661 670
                end
                format %tmCCYY_Mon month
                format %tmCCYY_Mon ev_mon_1
                format %tmCCYY_Mon ev_mon_2
                Do you know how I could generate a new variable where the expression references across the full sequence of “ev-mon” variables (ie ev_mon_1, ev_mon_2, ev_mon_3 etc to ev_mon_120). The new variable (named exog_dummy) will be 1 if the current observation’s month is equal or greater than the ev_mon_1 date but less than (ev_mon_1 + 7). However, the exog_dummy will be 1 if this comparison is satisfied by any of the ev_mon_ variables. Does that make sense?
                Thank you again for your help and your patience, Dan

                Comment


                • #23
                  PS just fyi, because of the somewhat generic (introductory?) nature of this last question, I have also placed it
                  https://www.statalist.org/forums/for...iple-variables

                  Comment

                  Working...
                  X