Using a secondary dataset to generate new variables to main dataset given conditions

Max Ortel

Join Date: Nov 2018
Posts: 8

Using a secondary dataset to generate new variables to main dataset given conditions

20 Mar 2019, 14:25

Hi all,

So as usual, I'm having a hard time describing what I'm trying to do, which I believe is why googling has been ineffective for me. So let me use some examples.

I have a primary dataset that look like this:

Table 1

Address	State	Year	PolicyType1	PolicyType2	PolicyType2 Detail	PolicyType3	PolicyType3 Detail
12 road	CA	2006	.	.	.	.	.
234 st	RI	2012	.	.	.	.	.
3534 ave	RI	2007	.	.	.	.	.
125 road	WV	2016	.	.	.	.	.
534 street	CA	2014	.	.	.	.	.
2423 lane	OH	2009	.	.	.	.	.

I also have a secondary dataset that looks like this

Table 2

PolicyType	State	Year	PolicyDetail2	PolicyDetail3
1	CA	2008	.	.
1	CA	2005	.	.
2	RI	2009	Abc	.
3	OH	2001	.	50

What I want is to combine these datasets so that each observation in table 1 gets a value added to the policy (counting number of policies) if it is in or after the year of the policy and in the state. I also want the corresponding policy detail to be added. I know that's a bit confusing, so in the example above, I would want table 1 to be:

Table 1

Address	State	Year	PolicyType1	PolicyType2	PolicyType2 Detail	PolicyType3	PolicyType3 Detail
12 road	CA	2006	1	.	.	.	.
234 st	RI	2012	.	1	Abc	.	.
3534 ave	RI	2007	.	.	.	.	.
125 road	WV	2016	.	.	.	.	.
534 street	CA	2014	2	.	.	.	.
2423 lane	OH	2009	.	.	.	1	50

Hopefully, this makes some sense, I'm having a bunch of trouble trying to google this, so any help or guidance would be appreciated.

Best,
Max

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10192

20 Mar 2019, 15:21

You have 7 posts on Statalist, so you should be familiar with using dataex to present data examples by now. Please do so in the future. You can modify Table 2 and merge with the modified table.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte policytype str2 state int year str3 policydetail2 byte policydetail3
1 "CA" 2008 "."    .
1 "CA" 2005 "."    .
2 "RI" 2009 "Abc"  .
3 "OH" 2001 "."   50
end

*KEEP ONLY EARLIEST OBSERVATIONS PER STATE
bys state (year): keep if _n==1

*DUPLICATE OBSERVATIONS (choose a sufficiently large expansion)
expand 20

*REPLACE SUBSEQUENT YEARS
bys state (year): replace year=year[_n-1]+1 if _n>1

*DROP OBSERVATIONS BEYOND YOUR REQUIRED YEAR
drop if year>2017

Now you can merge directly with the following (identifying observations with state and year) and only keep matches.

Code:

. l, sepby(state)

     +-----------------------------------------------+
     | policy~e   state   year   policy~2   policy~3 |
     |-----------------------------------------------|
  1. |        1      CA   2005          .          . |
  2. |        1      CA   2006          .          . |
  3. |        1      CA   2007          .          . |
  4. |        1      CA   2008          .          . |
  5. |        1      CA   2009          .          . |
  6. |        1      CA   2010          .          . |
  7. |        1      CA   2011          .          . |
  8. |        1      CA   2012          .          . |
  9. |        1      CA   2013          .          . |
 10. |        1      CA   2014          .          . |
 11. |        1      CA   2015          .          . |
 12. |        1      CA   2016          .          . |
 13. |        1      CA   2017          .          . |
     |-----------------------------------------------|
 14. |        3      OH   2001          .         50 |
 15. |        3      OH   2002          .         50 |
 16. |        3      OH   2003          .         50 |
 17. |        3      OH   2004          .         50 |
 18. |        3      OH   2005          .         50 |
 19. |        3      OH   2006          .         50 |
 20. |        3      OH   2007          .         50 |
 21. |        3      OH   2008          .         50 |
 22. |        3      OH   2009          .         50 |
 23. |        3      OH   2010          .         50 |
 24. |        3      OH   2011          .         50 |
 25. |        3      OH   2012          .         50 |
 26. |        3      OH   2013          .         50 |
 27. |        3      OH   2014          .         50 |
 28. |        3      OH   2015          .         50 |
 29. |        3      OH   2016          .         50 |
 30. |        3      OH   2017          .         50 |
     |-----------------------------------------------|
 31. |        2      RI   2009        Abc          . |
 32. |        2      RI   2010        Abc          . |
 33. |        2      RI   2011        Abc          . |
 34. |        2      RI   2012        Abc          . |
 35. |        2      RI   2013        Abc          . |
 36. |        2      RI   2014        Abc          . |
 37. |        2      RI   2015        Abc          . |
 38. |        2      RI   2016        Abc          . |
 39. |        2      RI   2017        Abc          . |
     +-----------------------------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

20 Mar 2019, 15:54

The obstacle here is not so much conceptualizing the transformation, but the fact that the data organization is very awkward. Also, what you are showing is not, I think, really like your data--or at least I hope not, because if it is, you have, on top of everything else, incompatibility between string and numeric corresponding variables in the two data sets.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str10 address str2 state int year byte policytype1 str1 policytype1detail byte policytype2 str3 policytype2detail byte policytype3 str3 policytype3detail "12 road" "CA" 2006 . "" . "" . "" "234 st" "RI" 2012 . "" . "" . "" "3534 ave" "RI" 2007 . "" . "" . "" "125 road" "WV" 2016 . "" . "" . "" "534 street" "CA" 2014 . "" . "" . "" "2423 lane" "OH" 2009 . "" . "" . "" end tempfile primary save `primary' * Example generated by -dataex-. To install: ssc install dataex clear input byte policytype str2 state int year str1 policydetail1 str3 policydetail2 str2 policydetail3 1 "CA" 2008 "" "" "" 1 "CA" 2005 "" "" "" 2 "RI" 2009 "" "Abc" "" 3 "OH" 2001 "" "" "50" end tempfile secondary save `secondary' // RE-ORGANIZE SECONDARY DATA TO FULLY LONG LAYOUT use `secondary', clear gen detail = "" forvalues i = 1/3 { replace detail = policydetail`i' if policytype == `i' } drop policydetail* tempfile secondary_long save `secondary_long' // RE-ORGANIZE THE PRIMARY DATA TO LONG use `primary', clear rename policytype* has* rename has*detail detail* reshape long has detail, i(address state year) j(policytype) replace has = 0 if missing(has) // SO WE WILL HAVE 0/1 NOT ./1 IN THE END // NOW COMBINE THEM rangejoin year 0 . using `secondary_long', by(state policytype) replace has = 1 if !missing(year_U) replace detail = detail + detail_U if !missing(year_U) drop *_U reshape wide

-rangejoin- is written by Robert Picard, and is available from SSC.

I strongly recommend you skip the final -reshape wide- command and leave the data in long layout, as most Stata commands work better, or only, with long data. I included the -reshape wide- only so that you can see that the code accomplishes the transformation you wanted.

Note: In creating the datasets to demonstrate this code, I have taken some liberties with your original data, for example, creating some string variables where you had numeric missing values. You may need to do a bit of work to make your data conform to what I have suggested.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Olayiwola Adetutu

Join Date: Sep 2019

Posts: 59
#4

22 Jan 2020, 08:23

please I need users guide on dataex
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

22 Jan 2020, 14:07

#4 is unrelated to the topic of this thread. It is important to keep threads on topic. The Forum is not just a series of dialogs between questioners and responders. Other people come to read things here, either on a regular basis, or by searching particular topics of interest. In order for those people to know what is worth reading (for them) and what is not, the content of the threads has to be closely related to the title.

Please repost this as a new topic.
Comment

Announcement

Using a secondary dataset to generate new variables to main dataset given conditions

Comment

Comment

Comment

Comment