Readying My Data for Panelwise Analysis

Tom Schaars

Join Date: Apr 2019
Posts: 3

Readying My Data for Panelwise Analysis

05 Jul 2019, 05:53

Hello Statalist,

I have been working on a dataset in Stata for the past month or so, but have recently run into a couple issues that go far beyond my current skills. I am using Stata version 14.0.

Some quick background information: I've been working with a fairly large dataset (~70k obs) that describes contracts between car manufacturers and parts suppliers. The components range from tires to alternators to carpets, but are unfortunately not uniquely described (eg descriptions for tires can be "tires" "tyres" "tires: all four" "tires: performance" "tyre: 225/40R18" ....).
What I've done is selected the observations that all describe the same component and have done some data cleaning such as dropping duplicates etc.
The end result is as follows:

comp_id	regcode	parentgr	oembrand	uniq_modelname	launchyear	suppliername
tires	WEU	GM	Opel	WEU Astra	1998	Continental
tires	WEU	GM	Opel	WEU Astra	2004	Bridgestone
tires	WEU	GM	Opel	WEU Astra	2009	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	1996	Sumitomo
tires	WEU	VW Group	Porsche	WEU Boxster	1996	Continental
tires	WEU	VW Group	Porsche	WEU Boxster	1996	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2004	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2012	Pirelli
tires	NAF	Toyota	Toyota	NAF Camry	2001	Goodyear
tires	...	...	...	...	...	...

Now I have done this for a total of 20 different components, which means I have 20 different Stata .do files that each select certain observations, drop the irrelevant ones, clean up the data, and sort in a way as shown above. Each .do file has about 40 to 80 different models, with the important models showing up in almost every .do file.
I also have or I am planning to add other descriptive information in extra columns such as 'total # of cars sold' or a binary 'big refresh YN' , ...

A couple observations:

The variable 'uniq_modelname' uniquely defines the car model and is a concatenated string of regcode + an earlier 'modelname' variable. The reasoning being that I sometimes have the same model in multiple regions, but for the sake of this application, I treat these as separate models.
The variable 'launchyear' denotes the different modelyears of each car model, and may later be replaced by a 'generation proxy' variable that simply counts the number of generations (1 2 3 4 ..). I have not done this yet as I don't have full information for each component: for instance you can see in the table above I am missing information for the WEU Boxster from 2008, and so renaming the three generations I have to 1 2 and 3 would not be correct.
I sometimes have multiple suppliers per launchyear, as can be seen by the 1996 generation of the WEU Boxster. I would like to reshape this into 'supplier1' 'supplier2' and 'supplier3', but am unsure when or how to do this. I will come back to this in my first main question.

From these 20 different .do files I wish to arrive to a table that looks like the one below, as I want to later declare this data a time series panel with the panel_id being the comp_id, and the timevar being a 'generation_proxy' that replaces the 'launchyear' and goes 1 to 4 (or however many gens I have). As you can see from the example provided below, sometimes I am missing information for a certain generation, and I wish to impute this for certain observations. This is my 2nd main question posted below.

comp_id	regcode	parentgr	oembrand	uniq_modelname	launchyear	genproxy	supplier1	supplier2	supplier3
tires	WEU	VW Group	Porsche	WEU Boxster	1996	1	Sumitomo	Continental	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2004	2	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2008	3
tires	WEU	VW Group	Porsche	WEU Boxster	2012	4	Pirelli
alternator	WEU	VW Group	Porsche	WEU Boxster	1996	1
alternator	WEU	VW Group	Porsche	WEU Boxster	2004	2	Bosch	Denso
alternator	WEU	VW Group	Porsche	WEU Boxster	2008	3
alternator	WEU	VW Group	Porsche	WEU Boxster	2012	4	Denso
..	..	..	..	..	..	..	..	..	..
tires	NAF	Toyota	Toyota	NAF Camry	2001	1	Goodyear
tires	NAF	Toyota	Toyota	NAF Camry	2006	2
tires	NAF	Toyota	Toyota	NAF Camry	2010	3	Goodyear	Bridgestone
alternator	NAF	Toyota	Toyota	NAF Camry	2001	1	Denso
..	..	..	..	..	..	..	..	..	..

As such my two main questions are the following:

I want to reshape the data, but I am unable to do so when it is not consistently the same size in terms of i(i) and j(j). I want to reshape it into wide format, but for each launchyear I can have anywhere from 0 to 5 suppliers. Every model can also have between 3 and 6 generations. Both of these prevent me from using the reshape command correctly.

>To help with the amount of suppliers each generation has, I wanted to add 'extra' observations where the suppliername variable is empty, to ensure that each launchyear has exactly 5 observations, empty or otherwise.
This would mean the 2008 'WEU Boxster' + 'tires' generation would get 5 observations added where the suppliername is empty, while the 2004 generation would get 4 obs added (since it already has 1 non-empty observation). However, I am unable to code this correctly to ensure each generation gets exactly the required number of empty observations added.

>To combat the issue of the different number of generations per modelname, I honestly have no clue. I wanted to use bys uniq_modelname reshape to do this for each distinct number of generations per modelname, but bysort is not allowed with reshape.
After reshaping the dataset, I want to impute the suppliers for a certain set of observations. I assume that if there is no information regarding a certain component, then that means that the suppliers from the previous generation stay the same (for instance, long term multi-generational contracts). If, however, this empty observation is in fact the first generation, then obviously there is nothing to impute as I don't have the information from the previous generation.
From the tabel above I have included in red the instances of the suppliers that I wish to impute. As you can see I wish to repeat the suppliers from the previous generation (no matter if this means imputing 1 or more suppliers, or if this information is itself imputed from the generation before it).
The 1996 'WEU Boxster' + 'tires' observation is the first in the dataset and has no information regarding the suppliers of this part, so this needs to remain empty as I cannot know this information.

In a sense I want to only ""impute downwards""

comp_id	regcode	parentgr	oembrand	uniq_modelname	launchyear	genproxy	supplier1	supplier2	supplier3
tires	WEU	VW Group	Porsche	WEU Boxster	1996	1	Sumitomo	Continental	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2004	2	Bridgestone
tires	WEU	VW Group	Porsche	WEU Boxster	2008	3	(Bridgestone)
tires	WEU	VW Group	Porsche	WEU Boxster	2012	4	Pirelli
alternator	WEU	VW Group	Porsche	WEU Boxster	1996	1	(empty)	(empty)	(empty)
alternator	WEU	VW Group	Porsche	WEU Boxster	2004	2	Bosch	Denso
alternator	WEU	VW Group	Porsche	WEU Boxster	2008	3	(Bosch)	(Denso)
alternator	WEU	VW Group	Porsche	WEU Boxster	2012	4	(Bosch)	(Denso)
..	..	..	..	..	..	..	..	..	..
tires	NAF	Toyota	Toyota	NAF Camry	2001	1	Goodyear
tires	NAF	Toyota	Toyota	NAF Camry	2006	2	(Goodyear)
tires	NAF	Toyota	Toyota	NAF Camry	2010	3	Goodyear	Bridgestone
alternator	NAF	Toyota	Toyota	NAF Camry	2001	1	Denso
..	..	..	..	..	..	..	..	..	..

To anyone who has taken the time to read this very long post, THANK YOU. If anyone could help me solve these issues, I would be forever grateful, as this would save me days if not weeks of struggling with the Stata syntax.

Kind regards,
Tom

Last edited by Tom Schaars; 05 Jul 2019, 06:00. Reason: edited for clarity

Tags: impute, panel data, reshape, syntax

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

05 Jul 2019, 09:49

As you recognize, your post is very long, so I won't respond to all features by any means

0) You say you have a sizeable number of "do-files" with multiple models in each one. I'm guessing you mean you have many different *dta* files. Be careful about terminology here would help. Anyway, dividing things up this way was likely not a good idea, and may well make the rest of your work harder. However, you don't show any examples of these different files, so it's hard to tell what you mean. I'd encourage you to take another look at the StataList FAQ (tab at the top left of the page) and read about using -dataex- to create example data to help describe your problem. As you're using version 14, this will require installing -dataex-, but that's a matter of a few seconds to install with -ssc dataex install-. With that tool, you can make some example datasets, and following other things in the FAQ, will be in a position to post your question differently.

1) I'd suggest you ask about a smaller piece of your problem first, or a reduced version of it.

2) For doing any panel analysis or almost anything else analytic in Stata, you don't want to reshape to wide format. Panel data in Stata needs to be represented in long format. Your concerns about differing i() and j(), as nearly as I can tell, are misguided.

3) I don't understand your explanation of "launch year" and how it is determined.
4) Adding observations, as you describe, is likely not a good idea, but I can't understand your purpose for it, so I can't comment. At any rate, I would suggest you set that issue aside, as I think it's confusing you (and potential helpers). If for some reason adding observations turns out to be relevant, it can be done after your other issues are solved. Similarly, your needs to do what you call "impute" suppliers can be set aside for the time being in the interest of you presenting a simpler posting of your problem. These are a couple of things I'd encourage you to eliminate in the interests of a reduced presentation of your problem.
Comment

Announcement

Readying My Data for Panelwise Analysis

Comment