Variable generation

Federico Nutarelli

Join Date: Sep 2018
Posts: 430

Variable generation

15 Jan 2019, 11:45

Hi Statlist,

so, I have a problem constructing two variables. I briefly describe my data here below: the database consists of a series of firms making products through molecules. So to each firm can belong different products using different molecules each one. The variable data_lancio describes the launch date of the product and of the associated molecule, while the variable Year represents the year in which the other variables (not displayed in the data) are available. The dataex his here represented:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(idfirm idproduct) float(id_molecule Year) int data_lancio
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1 1474  895 2013 2013
1 1474  895 2013 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
end

Now, my aim is to generate two dummies: the first, taking on 1 if the molecule is observed for the first time in all the sample and 0 otherwise (so for instance molecule 736 is observed for the first time in the sample in 1986 so I would like to have 1 for 736 just in that launch date) and the other if the molecule is observed for the first time in a firm (so 1 if it is the first time that the ith firm employs the molecule and 0 otherwise). This I tried to accomplish UNSUCCESFULLY with the following:

Code:

bysort id_molecule data_lancio: gen counter_new_marke =_n == 1 // questa dovrebbe essere per new to the market
bysort idfirm id_molecule data_lancio (idpr) : gen counter_new_fir = _n == 1 //questa dvrebbe essere per new_to the firm

Another complication comes from the fact that after I had done the counter I need to collapse twice the dataset, the first time by idfirm idproduct and Year:

Code:

collapse (sum) salesmnf counter_new_fir counter_new_marke (min) anno_numeric data_lancio (first) id_molecule firstposition molecule atc4 crp prd internationalp seq (last) sequence ,by (idfirm idprod Year)

and the second time by idfirm and Year:

Code:

collapse (mean) avsales avsales_existing avsales_new (first) agepr molecule idprod salesmnf data_lancio anno_numeric firstpos atc4 crp prd internationalp (max) numero_nuovi, by(idfirm Year)

Can someone please help me creating the dummy variable for new molecule to the market and to the firm and keep it after the collapses?

Last edited by Federico Nutarelli; 15 Jan 2019, 11:56.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 28556
#2

15 Jan 2019, 21:50

The creation of the two variables you want, if I understand you correctly, can be done with:

Code:

by id_molecule, sort: egen first_year_molecule = min(Year) by idproduct, sort: egen first_year_product = min(Year)

Unfortunately, your example explaining how the results should be for molecule 736 is not helpful, because molecule 736 does not appear in your example data. Also, it is unclear whether "first year in the data set" refers to the variable Year or data_lancio. I assumed it refers to year.

Concerning your problems with -collapse-, I am unable to infer from your description and the code you attempted just what you want the result to look like. Please hand work a small example of the desired results and post back with that.

Finally, I note that you post frequently with questions whose answer involves basic -egen- functions. While there is nothing wrong with that, you might find that you make faster progress by investing time reading -help egen- and the associated section of the PDF documentation that is installed with Stata. (Near the top of -help egen- you will find a link to the documentation in blue.) That way you don't have to sit around all day waiting and hoping that somebody here will answer your question. If you are going to be using Stata regularly, being familiar with most of the -egen- functions is indispensible: they are part of the bedrock of data management. The time you spend learning them will be amply repaid in short order.
Comment

Federico Nutarelli

Join Date: Sep 2018
Posts: 430

16 Jan 2019, 08:56

Sorry for having been unclear and thanks for the replay!

So first year in the data set refers to "data_lancio" in the case of the new to the market molecule (I would like to have 1 if it is the very first time that the molecule appears in the market) and regarding the firm I would like to have 1 the first time the molecule is used by the firm (so it will be Year).

For what concerns even function I will definitely do it. Indeed I am new in STATA and much of the time when I make questions I do not know which is the correct function to use. So yes, I will definitely read the PDF though I will do some questions in the forum looking fo further insights. Many thanks again

From the example I have I will make clear what I would like the result to look like:
new_to_the_firm_molecule_dummy (last variable):

Code:

 
 1  287 1104 2013 2013 1 1  287 1104 2013 2013 1 1  287 1104 2013 2013 1 1  287 1104 2013 2013 1 1  287 1104 2013 2013 1 1  287 1104 2013 2013 1  1  287 1104 2013 2013 1 1  287 1104 2013 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2014 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1  287 1104 2015 2013 1 1 1474  895 2013 2013 1 1 1474  895 2013 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2014 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 1474  895 2015 2013 1 1 3026  301 2014 2014 0 1 3026  301 2014 2014 0  1 3026  301 2014 2014 0 1 3026  301 2014 2014 0  1 3026  301 2014 2014 0 1 3026  301 2014 2014 0 1 3026  301 2014 2014 0 1 3026  301 2014 2014 0 1 3026  301 2014 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0  1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0  1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3026  301 2015 2014 0 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2008 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2009 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 1 3051  301 2010 2008 1 end

For the new to the molecule, I would like it is 1 for the min(data_lancio)

Last edited by Federico Nutarelli; 16 Jan 2019, 09:12.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 28556
#4

16 Jan 2019, 09:31

OK, and I see I misread your question in #1. It should be:

Code:

by id_molecule, sort: egen molecule_introduced = min(data_lancio) by firm_id idmolecule, sort: egen molecule_first_this_firm = min(data_lancio)

Concerning the -collapse- version, something went wrong with your attempt to show what you want. The code block just shows a long list of numbers, with no variable names. It's not workable. Please retry that.
Comment

Federico Nutarelli

Join Date: Sep 2018
Posts: 430

16 Jan 2019, 11:02

To be clearer:
After the code:

Code:

by id_molecule, sort: egen first_year_molecule = min(data_lancio)
by idfirm id_molecule, sort: egen first_year_product = min(Year)

gen counter_new_marke = 0
gen counter_new_fir = 0
replace counter_new_marke = 1 if data_lancio == first_year_molecule
replace counter_new_fir = 1 if Year == first_year_product

sort idfirm idproduct Year
drop first_year*

collapse (sum) salesmnf (min)  data_lancio (first) counter_new_fir counter_new_marke id_molecule,by (idfirm idprod Year)

I have obtained the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double idfirm float id_molecule int data_lancio float(counter_new_fir counter_new_marke Year)
1  895 2008 1 0 2008
1  301 2008 1 0 2008
1 1223 2008 1 0 2008
1  434 2008 1 0 2008
1  895 2008 0 0 2009
1 1223 2008 0 0 2009
1  434 2008 0 0 2009
1  301 2008 0 0 2009
1  895 2008 0 0 2010
1  301 2008 0 0 2010
1  832 2010 1 0 2010
1 1223 2008 0 0 2010
1  434 2008 0 0 2010
1  832 2010 0 0 2011
1  301 2008 0 0 2011
1  434 2008 0 0 2011
1  895 2008 0 0 2011
1 1027 2011 1 0 2011
1 1223 2008 0 0 2011
1 1819 2011 1 0 2011
1  832 2010 0 0 2012
1  800 2012 1 0 2012
1  639 2012 1 0 2012
1 1027 2011 0 0 2012
1  301 2008 0 0 2012
1 1223 2008 0 0 2012
1  895 2008 0 0 2012
1  434 2008 0 0 2012
1  761 2012 1 0 2012
1  301 2008 0 0 2013
1  895 2013 0 0 2013
1 1223 2008 0 0 2013
1  761 2012 0 0 2013
1 1027 2011 0 0 2013
1  800 2012 0 0 2013
1  160 2010 0 0 2013
1 1104 2013 1 0 2013
1  895 2008 0 0 2013
1 1194 2013 1 0 2013
1  519 2013 1 0 2013
1  639 2012 0 0 2013
1  434 2008 0 0 2013
1 1155 2013 1 0 2013
1 1070 2014 1 0 2014
1 1104 2013 0 0 2014
1 1027 2011 0 0 2014
1  800 2014 0 0 2014
1  761 2012 0 0 2014
1  895 2008 0 0 2014
1  895 2013 0 0 2014
1 2132 2014 1 0 2014
1  301 2014 0 0 2014
1  434 2008 0 0 2014
1  639 2012 0 0 2014
1  832 2014 0 0 2014
1 1194 2013 0 0 2014
1  519 2013 0 0 2014
1  800 2012 0 0 2014
1  160 2010 0 0 2014
1 1223 2008 0 0 2014
1  301 2008 0 0 2014
1 1155 2013 0 0 2014
1  639 2012 0 0 2015
1  832 2014 0 0 2015
1  160 2010 0 0 2015
1 1104 2013 0 0 2015
1 1223 2008 0 0 2015
1  800 2014 0 0 2015
1  434 2008 0 0 2015
1 1070 2014 0 0 2015
1 1194 2013 0 0 2015
1  895 2013 0 0 2015
1  301 2008 0 0 2015
1 1155 2013 0 0 2015
1  800 2012 0 0 2015
1 2132 2014 0 0 2015
1  519 2013 0 0 2015
1 1027 2011 0 0 2015
1  761 2012 0 0 2015
1  301 2014 0 0 2015
2 1865 1975 1 1 2004
2  395 2002 1 0 2004
2  301 1976 1 0 2004
2 1392 2002 1 0 2004
2 1795 1988 1 0 2004
2  746 2002 1 0 2004
2  395 2002 0 0 2005
2 1392 2002 0 0 2005
2  746 2002 0 0 2005
2 1865 1975 0 1 2005
2 1795 1988 0 0 2005
2  301 1976 0 0 2005
2 1795 1988 0 0 2006
2 1392 2002 0 0 2006
2  301 1976 0 0 2006
2  746 2002 0 0 2006
2  395 2002 0 0 2006
2 1439 1975 0 0 2006
2  395 2002 0 0 2007
2  301 1976 0 0 2007
end

In the collapse by idfirm and Year it will result a panel of firms repeated in time from 2004 to 2015. I would like to maintain counter_new_fir = 1 if a firm has introduced a new_molecule for the first time in that year (e.g. firm 1 will have in the case above 1 in year 2008 2010 2011 2012 2013 and 2014 and 0 otherwise, that is in 2009 and 2014) and new_mol_marke should be 1 if a firm has introduced a molecule never introduced before in the sample in a Year (so if for instance molecule 895 had been introduced for the first time in all the sample by firm 1 in 2008, the variable should take on value 1 for firm 1 in 2008; if firm 2 has introduced a molecule for the very first time in the sample in a Year, again the variable should take on value 1 in that Year for firm 2 and so on...)

Thank you again!

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 28556

16 Jan 2019, 12:22

I'm sorry, but I find this very confusing. The names you are giving your variables, to me, seem to be quite different from what the variables actually represent according to the code, so I am not sure if you are coding them wrong, or just naming them strangely. And I still don't grasp what you want the end result to look like. Will it look something like this (ignoring whether the actual numbers are correct--I'm just talking about the layout and the panel structure of one observation per firm-year):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double idfirm float Year byte introduced_a_new_molecule
1 2008 1
1 2009 0
1 2010 0
1 2013 1
1 2014 0
1 2015 0
end

And if so, is the last variable, introduced_a_new_molecule supposed to mean that there was some molecule in the data set which was associated with this firm in this year but does not occur in the data set in any earlier year?

If so, I think, going back to the data example you gave in #1, then the entire sequence of code should be

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(idfirm idproduct) float(id_molecule Year) int data_lancio
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2013 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2014 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1  287 1104 2015 2013
1 1474  895 2013 2013
1 1474  895 2013 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2014 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 1474  895 2015 2013
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2014 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3026  301 2015 2014
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2008 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2009 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
1 3051  301 2010 2008
end

by id_molecule, sort: egen first_year_molecule = min(data_lancio)
by idfirm id_molecule, sort: egen first_year_product = min(Year)
gen byte introduction = (Year == first_year_molecule)


collapse (max) introduced_a_new_molecule = introduction, by(idfirm Year)

Comment

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#7

20 Jan 2019, 08:47

Thank you again very much,
I will try the code and let you know as soon a possible. Actually however, I would like to generate two "byte types" variables, one for the first year the molecule has been introduced in the market which I think is the one that is going to be generated with your code (right?) and another one taking on value 1 when the molecule has been first introduced in the firm which I think will be something like:

Code:

gen byte introduction = (Year == first_year_product)

right?
For the moment many thanks!

Last edited by Federico Nutarelli; 20 Jan 2019, 09:05.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 28556
#8

20 Jan 2019, 10:46

one for the first year the molecule has been introduced in the market which I think is the one that is going to be generated with your code (right?)

Right.

and another one taking on value 1 when the molecule has been first introduced in the firm

That would be

Code:

by id_firm id_molecule (Year), sort: gen byte wanted = (_n == 1)
1 like
Comment
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#9

21 Jan 2019, 08:04

Perfect, many thanks Professor!
1 like
Comment

Announcement

Variable generation

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment