How to balance this dataset

Ben Fairbanks

Join Date: Nov 2020

Posts: 5
#1

02 Dec 2020, 20:14

Hi all

I am having a hard time thinking about the best way to balance this dataset. The format is:
plan ID   State   Year   Drug
1a        AL      2012   Abilify
1a        AL      2012   Humalog
2a        AL      2012   Abilify
2a        AL      2012   Novolog
1a        AL      2013   Abilify
1a        AL      2013   Humalog
1a        AL      2013   Humira

I need each plan/state/year to have an observation for every drug that is listed on any plan in that year. In my example table, Plan 1a in AL in 2012 would need to also have a row for Novolog, because Plan 2a has Novolog in 2012. Plan 2a in AL in 2012 would conversely need a row for Humalog, because Plan 1a has it in that year.

Any advice for how I could code this? Much appreciated, thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

02 Dec 2020, 21:37

The substance of this is handled with the -fillin- command. However, you don't want to completely rectangularize the data, because you don't need any Novolog observations in 2013. So you want to rectangularize your data set separately for each year. Unfortunately, -fillin- does not support the -by- prefix. This is a job for -runby-.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3(planid state) int year str7 drug
"1a " "AL " 2012 "Abilify"
"1a " "AL " 2012 "Humalog"
"2a " "AL " 2012 "Abilify"
"2a " "AL " 2012 "Novolog"
"1a " "AL " 2013 "Abilify"
"1a " "AL " 2013 "Humalog"
"1a " "AL " 2013 "Humira"
end

capture program drop one_year
program define one_year
    fillin planid state year drug
    exit
end

runby one_year, by(year)

-runby- is written by Robert Picard and me, and is available from SSC.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Ben Fairbanks

Join Date: Nov 2020

Posts: 5
#3

03 Dec 2020, 08:59

Thank you for responding! So, would I need to manually enter every drug name for each plan/state/year? My dataset has 38E6 observations so that is not really an option unfortunately.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

03 Dec 2020, 18:02

No, no, no, no, no. Everything in the code from *Example generated by dataex down through the first -end- command is just a way to get some example data into Stata to illustrate the code. Replace all of that with just -use-ing your actual data set. In other words, start from where it says -capture program drop one_year- and take it from there after loading your data set into Stata.

-dataex- is a convenience command for Statalist users. What you saw there is a short block of code that loads a small data set into Stata. -dataex- is a program that produces code like that from a real data set. That way people asking questions on Statalist can show an example of their data in a way that adequately conveys all the information necessary for others to work with it, and those who answer questions can use it to replicate that example data in their own Stata setup to try out code on it.
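
As a minimal sketch of that substitution, the whole script for a real data set might look like the following. The file name mydata.dta is hypothetical, the variable names are assumed to match the example (planid, state, year, drug), and -runby- is assumed to be installed already from SSC.

```stata
* "mydata.dta" is a hypothetical file name; substitute your own data set
use "mydata.dta", clear

* program that rectangularizes the observations of a single year
capture program drop one_year
program define one_year
    fillin planid state year drug
    exit
end

* apply one_year separately to each year's block of observations
runby one_year, by(year)
```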
Ben Fairbanks

Join Date: Nov 2020

Posts: 5
#5

04 Dec 2020, 16:45

Haha I must be sleep deprived, sorry for the misunderstanding! I am trying to run the code now, and it has been running for almost an hour. Would you expect that command to take a while with a large dataset?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

05 Dec 2020, 10:31

In a data set with 38,000,000 observations, yes, I would expect this to take a very long time. It might well be days rather than hours. If, by the time you get this, it has not finished, you are concerned that it is hung, and you are willing to start over, add the -status- option to the -runby- command. That way you will get a periodic progress report showing how many observations have been processed and an estimate of the remaining time.
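
Assuming the one_year program from #2 is still defined in the current session, the restarted call would be along these lines:

```stata
* same call as before, with periodic progress reports turned on
runby one_year, by(year) status
```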
Ben Fairbanks

Join Date: Nov 2020

Posts: 5
#7

06 Dec 2020, 19:30

Thanks for the continued help here. The code finally finished running but I got a r(3900) error:

store_data(): 3900 unable to allocate string <tmp>[1614142241,1]
runby_main(): - function returned error
<istmt>: - function returned error

Any ideas?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

06 Dec 2020, 20:53

Not a clue. I've never seen that before. Sounds like a memory issue, but I can't be sure. Did you get results, or did the code stop without finishing the job?
Ben Fairbanks

Join Date: Nov 2020

Posts: 5
#9

07 Dec 2020, 13:54

It didn't have any other output than the message I copied in my previous post, but it seemed like the code was able to run? Not exactly sure. It did say "end of do-file" so I think that means it finished.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

07 Dec 2020, 14:39

I'm pretty sure it's a memory issue, since the first of those messages says that it tried to create a 1.61 billion by 1 matrix of strings in Mata and failed. Which makes perfect sense to me.

There are two possibilities here. One is that the resulting data set would be too large for Stata no matter how you tried to build it. In that case, it isn't going to happen and we can quit trying now. So you should do a back of the envelope calculation of the number of observations that will be in the resulting data set and compare that to the limit for your flavor of Stata. (See -help limits- to find how many observations you can have.) Also make sure you have enough space on your mass storage device to save the file once it is created in memory.
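
One rough way to do that calculation, sketched under the assumption that the variables are named as in the example: within each year, -fillin- creates every combination of the observed planid, state, and drug values, so the filled-in size for a year is the product of the three distinct counts, and the total is the sum over years. The names first_plan, n_filled, grand_total and so on are made up for illustration.

```stata
preserve
keep planid state year drug

* flag the first occurrence of each value within a year
bysort year planid: gen byte first_plan  = (_n == 1)
bysort year state:  gen byte first_state = (_n == 1)
bysort year drug:   gen byte first_drug  = (_n == 1)

* one row per year holding the number of distinct plans, states, and drugs
collapse (sum) n_plan = first_plan n_state = first_state ///
    n_drug = first_drug, by(year)

* observations that -fillin- would leave in each year, and overall
gen double n_filled = n_plan * n_state * n_drug
egen double grand_total = total(n_filled)
format n_filled grand_total %15.0fc
list year n_plan n_state n_drug n_filled, clean noobs
display "Estimated observations after fillin: " %15.0fc grand_total[1]
restore
```

On the seven-observation example this works out to 6 observations for 2012 (2 plans, 1 state, 3 drugs) and 3 for 2013, so 9 in all. Note that -preserve- writes a temporary copy of the data, which takes a while with a file this size.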

More optimistically, it can be done but needs another way that will be gentler on memory requirements along the way. So what I would do is break up the file you are starting from into several smaller files. Each of the smaller files should consist of all observations for a selected range of years. So, if the years in your data range from, say, 2000-2020, I would make one data set for 2000-2004, another for 2005-2009, etc. Then run the code from #2 separately for each of these smaller data sets. And then append all the results together.

Added: Oh, and remember to add the -status- option to the -runby- command so you can see how things are progressing as you run.
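
A sketch of that split, run, and append workflow, using the illustrative 2000-2020 range in five-year blocks. The file names mydata.dta and filled_*.dta are hypothetical, and the variable names are assumed to match the example.

```stata
capture program drop one_year
program define one_year
    fillin planid state year drug
    exit
end

local pieces
forvalues start = 2000(5)2020 {
    local stop = `start' + 4

    * load only the observations for this block of years
    use if inrange(year, `start', `stop') using "mydata.dta", clear
    if _N == 0 continue

    * rectangularize each year in the block, with progress reports
    runby one_year, by(year) status

    save "filled_`start'.dta", replace
    local pieces `pieces' filled_`start'.dta
}

* stack the filled-in blocks back into a single data set
clear
append using `pieces'
```

Loading each block with -use if ... using- avoids holding the full file in memory just to split it. The final -append- still has to fit in memory, so this only helps if the first possibility above has been ruled out.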