Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Readying My Data for Panelwise Analysis

    Hello Statalist,

    I have been working on a dataset in Stata for the past month or so, but have recently run into a couple issues that go far beyond my current skills. I am using Stata version 14.0.

    Some quick background information: I've been working with a fairly large dataset (~70k obs) that describes contracts between car manufacturers and parts suppliers. The components range from tires to alternators to carpets, but are unfortunately not uniquely described (eg descriptions for tires can be "tires" "tyres" "tires: all four" "tires: performance" "tyre: 225/40R18" ....).
    What I've done is selected the observations that all describe the same component and have done some data cleaning such as dropping duplicates etc.
    The end result is as follows:
    comp_id regcode parentgr oembrand uniq_modelname launchyear suppliername
    tires WEU GM Opel WEU Astra 1998 Continental
    tires WEU GM Opel WEU Astra 2004 Bridgestone
    tires WEU GM Opel WEU Astra 2009 Bridgestone
    tires WEU VW Group Porsche WEU Boxster 1996 Sumitomo
    tires WEU VW Group Porsche WEU Boxster 1996 Continental
    tires WEU VW Group Porsche WEU Boxster 1996 Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2004 Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2012 Pirelli
    tires NAF Toyota Toyota NAF Camry 2001 Goodyear
    tires ... ... ... ... ... ...
    Now I have done this for a total of 20 different components, which means I have 20 different Stata .do files that each select certain observations, drop the irrelevant ones, clean up the data, and sort in a way as shown above. Each .do file has about 40 to 80 different models, with the important models showing up in almost every .do file.
    I also have or I am planning to add other descriptive information in extra columns such as 'total # of cars sold' or a binary 'big refresh YN' , ...

    A couple observations:
    • The variable 'uniq_modelname' uniquely defines the car model and is a concatenated string of regcode + an earlier 'modelname' variable. The reasoning being that I sometimes have the same model in multiple regions, but for the sake of this application, I treat these as separate models.
    • The variable 'launchyear' denotes the different modelyears of each car model, and may later be replaced by a 'generation proxy' variable that simply counts the number of generations (1 2 3 4 ..). I have not done this yet as I don't have full information for each component: for instance you can see in the table above I am missing information for the WEU Boxster from 2008, and so renaming the three generations I have to 1 2 and 3 would not be correct.
    • I sometimes have multiple suppliers per launchyear, as can be seen by the 1996 generation of the WEU Boxster. I would like to reshape this into 'supplier1' 'supplier2' and 'supplier3', but am unsure when or how to do this. I will come back to this in my first main question.
    From these 20 different .do files I wish to arrive to a table that looks like the one below, as I want to later declare this data a time series panel with the panel_id being the comp_id, and the timevar being a 'generation_proxy' that replaces the 'launchyear' and goes 1 to 4 (or however many gens I have). As you can see from the example provided below, sometimes I am missing information for a certain generation, and I wish to impute this for certain observations. This is my 2nd main question posted below.
    comp_id regcode parentgr oembrand uniq_modelname launchyear genproxy supplier1 supplier2 supplier3
    tires WEU VW Group Porsche WEU Boxster 1996 1 Sumitomo Continental Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2004 2 Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2008 3
    tires WEU VW Group Porsche WEU Boxster 2012 4 Pirelli
    alternator WEU VW Group Porsche WEU Boxster 1996 1
    alternator WEU VW Group Porsche WEU Boxster 2004 2 Bosch Denso
    alternator WEU VW Group Porsche WEU Boxster 2008 3
    alternator WEU VW Group Porsche WEU Boxster 2012 4 Denso
    .. .. .. .. .. .. .. .. .. ..
    tires NAF Toyota Toyota NAF Camry 2001 1 Goodyear
    tires NAF Toyota Toyota NAF Camry 2006 2
    tires NAF Toyota Toyota NAF Camry 2010 3 Goodyear Bridgestone
    alternator NAF Toyota Toyota NAF Camry 2001 1 Denso
    .. .. .. .. .. .. .. .. .. ..
    As such my two main questions are the following:
    1. I want to reshape the data, but I am unable to do so when it is not consistently the same size in terms of i(i) and j(j). I want to reshape it into wide format, but for each launchyear I can have anywhere from 0 to 5 suppliers. Every model can also have between 3 and 6 generations. Both of these prevent me from using the reshape command correctly.

      >To help with the amount of suppliers each generation has, I wanted to add 'extra' observations where the suppliername variable is empty, to ensure that each launchyear has exactly 5 observations, empty or otherwise.
      This would mean the 2008 'WEU Boxster' + 'tires' generation would get 5 observations added where the suppliername is empty, while the 2004 generation would get 4 obs added (since it already has 1 non-empty observation). However, I am unable to code this correctly to ensure each generation gets exactly the required number of empty observations added.

      >To combat the issue of the different number of generations per modelname, I honestly have no clue. I wanted to use bys uniq_modelname reshape to do this for each distinct number of generations per modelname, but bysort is not allowed with reshape.
    2. After reshaping the dataset, I want to impute the suppliers for a certain set of observations. I assume that if there is no information regarding a certain component, then that means that the suppliers from the previous generation stay the same (for instance, long term multi-generational contracts). If, however, this empty observation is in fact the first generation, then obviously there is nothing to impute as I don't have the information from the previous generation.
      From the tabel above I have included in red the instances of the suppliers that I wish to impute. As you can see I wish to repeat the suppliers from the previous generation (no matter if this means imputing 1 or more suppliers, or if this information is itself imputed from the generation before it).
      The 1996 'WEU Boxster' + 'tires' observation is the first in the dataset and has no information regarding the suppliers of this part, so this needs to remain empty as I cannot know this information.

      In a sense I want to only ""impute downwards""
    comp_id regcode parentgr oembrand uniq_modelname launchyear genproxy supplier1 supplier2 supplier3
    tires WEU VW Group Porsche WEU Boxster 1996 1 Sumitomo Continental Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2004 2 Bridgestone
    tires WEU VW Group Porsche WEU Boxster 2008 3 (Bridgestone)
    tires WEU VW Group Porsche WEU Boxster 2012 4 Pirelli
    alternator WEU VW Group Porsche WEU Boxster 1996 1 (empty) (empty) (empty)
    alternator WEU VW Group Porsche WEU Boxster 2004 2 Bosch Denso
    alternator WEU VW Group Porsche WEU Boxster 2008 3 (Bosch) (Denso)
    alternator WEU VW Group Porsche WEU Boxster 2012 4 (Bosch) (Denso)
    .. .. .. .. .. .. .. .. .. ..
    tires NAF Toyota Toyota NAF Camry 2001 1 Goodyear
    tires NAF Toyota Toyota NAF Camry 2006 2 (Goodyear)
    tires NAF Toyota Toyota NAF Camry 2010 3 Goodyear Bridgestone
    alternator NAF Toyota Toyota NAF Camry 2001 1 Denso
    .. .. .. .. .. .. .. .. .. ..

    To anyone who has taken the time to read this very long post, THANK YOU. If anyone could help me solve these issues, I would be forever grateful, as this would save me days if not weeks of struggling with the Stata syntax.

    Kind regards,
    Tom
    Last edited by Tom Schaars; 05 Jul 2019, 06:00. Reason: edited for clarity

  • #2
    As you recognize, your post is very long, so I won't respond to all features by any means

    0) You say you have a sizeable number of "do-files" with multiple models in each one. I'm guessing you mean you have many different *dta* files. Be careful about terminology here would help. Anyway, dividing things up this way was likely not a good idea, and may well make the rest of your work harder. However, you don't show any examples of these different files, so it's hard to tell what you mean. I'd encourage you to take another look at the StataList FAQ (tab at the top left of the page) and read about using -dataex- to create example data to help describe your problem. As you're using version 14, this will require installing -dataex-, but that's a matter of a few seconds to install with -ssc dataex install-. With that tool, you can make some example datasets, and following other things in the FAQ, will be in a position to post your question differently.

    1) I'd suggest you ask about a smaller piece of your problem first, or a reduced version of it.

    2) For doing any panel analysis or almost anything else analytic in Stata, you don't want to reshape to wide format. Panel data in Stata needs to be represented in long format. Your concerns about differing i() and j(), as nearly as I can tell, are misguided.

    3) I don't understand your explanation of "launch year" and how it is determined.
    4) Adding observations, as you describe, is likely not a good idea, but I can't understand your purpose for it, so I can't comment. At any rate, I would suggest you set that issue aside, as I think it's confusing you (and potential helpers). If for some reason adding observations turns out to be relevant, it can be done after your other issues are solved. Similarly, your needs to do what you call "impute" suppliers can be set aside for the time being in the interest of you presenting a simpler posting of your problem. These are a couple of things I'd encourage you to eliminate in the interests of a reduced presentation of your problem.

    Comment

    Working...
    X