Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using a secondary dataset to generate new variables to main dataset given conditions

    Hi all,

    So as usual, I'm having a hard time describing what I'm trying to do, which I believe is why googling has been ineffective for me. So let me use some examples.

    I have a primary dataset that look like this:

    Table 1
    Address State Year PolicyType1 PolicyType2 PolicyType2 Detail PolicyType3 PolicyType3 Detail
    12 road CA 2006 . . . . .
    234 st RI 2012 . . . . .
    3534 ave RI 2007 . . . . .
    125 road WV 2016 . . . . .
    534 street CA 2014 . . . . .
    2423 lane OH 2009 . . . . .
    I also have a secondary dataset that looks like this


    Table 2
    PolicyType State Year PolicyDetail2 PolicyDetail3
    1 CA 2008 . .
    1 CA 2005 . .
    2 RI 2009 Abc .
    3 OH 2001 . 50
    What I want is to combine these datasets so that each observation in table 1 gets a value added to the policy (counting number of policies) if it is in or after the year of the policy and in the state. I also want the corresponding policy detail to be added. I know that's a bit confusing, so in the example above, I would want table 1 to be:


    Table 1
    Address State Year PolicyType1 PolicyType2 PolicyType2 Detail PolicyType3 PolicyType3 Detail
    12 road CA 2006 1 . . . .
    234 st RI 2012 . 1 Abc . .
    3534 ave RI 2007 . . . . .
    125 road WV 2016 . . . . .
    534 street CA 2014 2 . . . .
    2423 lane OH 2009 . . . 1 50
    Hopefully, this makes some sense, I'm having a bunch of trouble trying to google this, so any help or guidance would be appreciated.

    Best,
    Max

  • #2
    You have 7 posts on Statalist, so you should be familiar with using dataex to present data examples by now. Please do so in the future. You can modify Table 2 and merge with the modified table.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte policytype str2 state int year str3 policydetail2 byte policydetail3
    1 "CA" 2008 "."    .
    1 "CA" 2005 "."    .
    2 "RI" 2009 "Abc"  .
    3 "OH" 2001 "."   50
    end
    
    *KEEP ONLY EARLIEST OBSERVATIONS PER STATE
    bys state (year): keep if _n==1
    
    *DUPLICATE OBSERVATIONS (choose a sufficiently large expansion)
    expand 20
    
    *REPLACE SUBSEQUENT YEARS
    bys state (year): replace year=year[_n-1]+1 if _n>1
    
    *DROP OBSERVATIONS BEYOND YOUR REQUIRED YEAR
    drop if year>2017
    Now you can merge directly with the following (identifying observations with state and year) and only keep matches.

    Code:
    . l, sepby(state)
    
         +-----------------------------------------------+
         | policy~e   state   year   policy~2   policy~3 |
         |-----------------------------------------------|
      1. |        1      CA   2005          .          . |
      2. |        1      CA   2006          .          . |
      3. |        1      CA   2007          .          . |
      4. |        1      CA   2008          .          . |
      5. |        1      CA   2009          .          . |
      6. |        1      CA   2010          .          . |
      7. |        1      CA   2011          .          . |
      8. |        1      CA   2012          .          . |
      9. |        1      CA   2013          .          . |
     10. |        1      CA   2014          .          . |
     11. |        1      CA   2015          .          . |
     12. |        1      CA   2016          .          . |
     13. |        1      CA   2017          .          . |
         |-----------------------------------------------|
     14. |        3      OH   2001          .         50 |
     15. |        3      OH   2002          .         50 |
     16. |        3      OH   2003          .         50 |
     17. |        3      OH   2004          .         50 |
     18. |        3      OH   2005          .         50 |
     19. |        3      OH   2006          .         50 |
     20. |        3      OH   2007          .         50 |
     21. |        3      OH   2008          .         50 |
     22. |        3      OH   2009          .         50 |
     23. |        3      OH   2010          .         50 |
     24. |        3      OH   2011          .         50 |
     25. |        3      OH   2012          .         50 |
     26. |        3      OH   2013          .         50 |
     27. |        3      OH   2014          .         50 |
     28. |        3      OH   2015          .         50 |
     29. |        3      OH   2016          .         50 |
     30. |        3      OH   2017          .         50 |
         |-----------------------------------------------|
     31. |        2      RI   2009        Abc          . |
     32. |        2      RI   2010        Abc          . |
     33. |        2      RI   2011        Abc          . |
     34. |        2      RI   2012        Abc          . |
     35. |        2      RI   2013        Abc          . |
     36. |        2      RI   2014        Abc          . |
     37. |        2      RI   2015        Abc          . |
     38. |        2      RI   2016        Abc          . |
     39. |        2      RI   2017        Abc          . |
         +-----------------------------------------------+

    Comment


    • #3
      The obstacle here is not so much conceptualizing the transformation, but the fact that the data organization is very awkward. Also, what you are showing is not, I think, really like your data--or at least I hope not, because if it is, you have, on top of everything else, incompatibility between string and numeric corresponding variables in the two data sets.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str10 address str2 state int year byte policytype1 str1 policytype1detail byte policytype2 str3 policytype2detail byte policytype3 str3 policytype3detail
      "12 road"    "CA" 2006 . "" . "" . ""
      "234 st"     "RI" 2012 . "" . "" . ""
      "3534 ave"   "RI" 2007 . "" . "" . ""
      "125 road"   "WV" 2016 . "" . "" . ""
      "534 street" "CA" 2014 . "" . "" . ""
      "2423 lane"  "OH" 2009 . "" . "" . ""
      end
      tempfile primary
      save `primary'
      
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte policytype str2 state int year str1 policydetail1 str3 policydetail2 str2 policydetail3
      1 "CA" 2008 "" ""   "" 
      1 "CA" 2005 "" ""   "" 
      2 "RI" 2009 "" "Abc" "" 
      3 "OH" 2001 "" ""   "50"
      end
      tempfile secondary
      save `secondary'
      
      //    RE-ORGANIZE SECONDARY DATA TO FULLY LONG LAYOUT
      use `secondary', clear
      gen detail = ""
      forvalues i = 1/3 {
          replace detail = policydetail`i' if policytype == `i'
      }
      drop policydetail*
      tempfile secondary_long
      save `secondary_long'
      
      //    RE-ORGANIZE THE PRIMARY DATA TO LONG
      use `primary', clear
      rename policytype* has*
      rename has*detail detail*
      reshape long has detail, i(address state year) j(policytype)
      replace has = 0 if missing(has)    // SO WE WILL HAVE 0/1 NOT ./1 IN THE END
      
      //    NOW COMBINE THEM
      rangejoin year 0 . using `secondary_long', by(state policytype)
      replace has = 1 if !missing(year_U)
      replace detail = detail + detail_U if !missing(year_U)
      drop *_U
      reshape wide
      -rangejoin- is written by Robert Picard, and is available from SSC.

      I strongly recommend you skip the final -reshape wide- command and leave the data in long layout, as most Stata commands work better, or only, with long data. I included the -reshape wide- only so that you can see that the code accomplishes the transformation you wanted.

      Note: In creating the datasets to demonstrate this code, I have taken some liberties with your original data, for example, creating some string variables where you had numeric missing values. You may need to do a bit of work to make your data conform to what I have suggested.

      In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use -dataex-.

      Comment


      • #4
        please I need users guide on dataex

        Comment


        • #5
          #4 is unrelated to the topic of this thread. It is important to keep threads on topic. The Forum is not just a series of dialogs between questioners and responders. Other people come to read things here, either on a regular basis, or by searching particular topics of interest. In order for those people to know what is worth reading (for them) and what is not, the content of the threads has to be closely related to the title.

          Please repost this as a new topic.

          Comment

          Working...
          X