Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating and then applying different growth factors to the same variable in different datasets

    My original data is in panel format and contains a large number of countries. Here's an extract:

    Code:
    input int time str6 timecode str16 country str3 countrycode float cons_pc
    2005 "YR2005" "Burkina Faso"     "bfa" 154960.66
    2006 "YR2006" "Burkina Faso"     "bfa" 158090.44
    2007 "YR2007" "Burkina Faso"     "bfa" 155224.17
    2008 "YR2008" "Burkina Faso"     "bfa"  177343.6
    2009 "YR2009" "Burkina Faso"     "bfa"  178538.2
    2010 "YR2010" "Burkina Faso"     "bfa"    178796
    2011 "YR2011" "Burkina Faso"     "bfa" 187291.05
    2005 "YR2005" "Burundi"          "bdi" 127747.41
    2006 "YR2006" "Burundi"          "bdi" 155927.84
    2007 "YR2007" "Burundi"          "bdi" 129821.98
    2008 "YR2008" "Burundi"          "bdi" 155126.52
    2009 "YR2009" "Burundi"          "bdi"    197272
    2010 "YR2010" "Burundi"          "bdi"  179738.9
    2011 "YR2011" "Burundi"          "bdi"  220682.9
    end
    I need to calculate a growth factor for cons_pc from year x to 2009 for each country. Year x varies from country to country and is not listed nor included in the dataset. The way I tackled the problem is converting the dataset from long to wide and then calculate the growth factor for each country, one by one by entering the below code, which requires manual entry of the years. I save it as one variable per country and then overlap them to have one single column of growth factors. For instance:

    Code:
    local x cons_pc
    
    gen grx_1=`x'2009/`x'2006 if countrycode=="bdi"
    gen grx_2=`x'2009/`x'2009 if countrycode=="bfa"
    
    egen gr_x= rowmax(grx_1-grx_150)
    Two issues, here:

    1. Is there a more effective way of generating the above growth factors? (may be related to my next question)
    2. I have another set of .dta files, one per country. I want to apply each country's growth factor to a specific variable found in each country's respective .dta file. How can I do this?

    For question 2, I am aware I can loop over each of the country files and create a empty factor variable in each. Then, replace it with the appropriate growth factor, such as :

    Code:
    replace factor=xxxxx if "`dataset'" == "bdi"
    replace factor=xxxxx if "`dataset'" == "bfa"
    ..for each of the country .dta files. This; however, entails manually typing the growth factor I calculated - which I'd prefer not to..

    All advice is welcomed!

  • #2
    I'm not sure I understand completely what you are trying to do, but I have a few questions in any case.

    1. For country bdi the reference year for your growth ratio is 2006, whereas for bfa it's 2009. You say that the year varies from country to country and is not included in the data set. OK, but then where does it come from? Is this a list that you just have in your head, or does it exist in some other data file (ideally in Stata, but any other format we can import from will do)? It shouldn't be necessary to type this out one by one for each country, or at least it shouldn't be necessary more than once in your lifetime. If it really does not exist in any data file at the moment, you would be well advised to create such a Stata file, containing one observation for each country with just two variables: countrycode and reference_year. Let's call that file crosswalk.dta

    2. What is the reason for creating a separate variable for each country's growth ratio? It seems to me it would make more sense to have it all in a single variable that is constant within countries, but varies across countries, which is what you seem to be calculating as gr_x. That seems to be an artifact of having chosen to -reshape wide-, which was, I think, a poor choice that made life more complicated than it needed to be. (Most things in Stata are easier in long layout than in wide.)

    So after you create the data set I describe in question 1, I would calculate the growth factors as follows:

    Code:
    clear
    input int time str6 timecode str16 country str3 countrycode float cons_pc
    2005 "YR2005" "Burkina Faso"     "bfa" 154960.66
    2006 "YR2006" "Burkina Faso"     "bfa" 158090.44
    2007 "YR2007" "Burkina Faso"     "bfa" 155224.17
    2008 "YR2008" "Burkina Faso"     "bfa"  177343.6
    2009 "YR2009" "Burkina Faso"     "bfa"  178538.2
    2010 "YR2010" "Burkina Faso"     "bfa"    178796
    2011 "YR2011" "Burkina Faso"     "bfa" 187291.05
    2005 "YR2005" "Burundi"          "bdi" 127747.41
    2006 "YR2006" "Burundi"          "bdi" 155927.84
    2007 "YR2007" "Burundi"          "bdi" 129821.98
    2008 "YR2008" "Burundi"          "bdi" 155126.52
    2009 "YR2009" "Burundi"          "bdi"    197272
    2010 "YR2010" "Burundi"          "bdi"  179738.9
    2011 "YR2011" "Burundi"          "bdi"  220682.9
    end
    // OR OPEN YOUR EXISTING DATA SET IN REAL LIFE
    // DON'T RESHAPE WIDE
    
    merge m:1 countrycode using crosswalk, keep(match master) 
    
    // FIND ANY COUNTRY CODES WITH NO MATCH IN THE CROSSWALK
    // THESE MAY BE ERRORS
    tab country_code if _merge == 1
    
    egen ref_year_cons_pc = max(cond(time == reference_year, cons_pc, .)), by(countrycode)
    egen year_2009_cons_pc = max(cond(time == 2009, cons_pc, .)), by(countrycode)
    gen gr_x = year_2009_cons_pc/ref_year_cons_pc
    It is hard to give you concrete advice about your second question because you don't say much about the filenames or what's in them. Just as an illustration, I'm going to assume that each country's filename is its countrycode.dta. So Burkina Faso's file is called bfa.dta. I'm also going to (optimistically, because it gets more complicated otherwise) that each country's file contains the same variables, named the same, as every other, and that what is numeric in one file is numeric in the others, etc. That is, the files are the same except for whose data they happen to contain. I will assume the variable you want to apply the growth factor to is called target_var, and I will assume that "applying the growth factor" means multiplying by it.

    Code:
    // CREATE A TEMPORARY FILE WITH A CROSSWALK BETWEEN
    // COUNTRY CODES AND CORRESPONDING gr_x
    keep countrycode gr_x
    duplicates drop
    tempfile gr_x_crosswalk
    save `gr_x_crosswalk'
    
    // NOW APPEND ALL OF THE COUNTRY FILES TOGETHER
    levelsof countrycode, local(codes)
    
    tempfile building
    
    foreach c of local codes {
        use `c', clear
        gen country_code = "`c'" // UNLESS THIS ALREADY EXISTS IN THESE FILES
        capture confirm file `building'
        if c(rc) == 0 {
            append using `building'
        }
        save `"`building'"', replace
    }
    
    // NOW MERGE IN THE GROWTH FACTORS
    merge m:1 countrycode using `gr_x_crosswalk'
    gen modified_target_var = gr_x * target_var
    You now have a single file with all of the countries' variables modified by their growth factor. You can save it as is, or break it up into separate country files, or whatever.

    -merge- and -append- are indispensable data management commands when you need to reference information from different data sets for a single analysis. The time spent learning how to use them well will be amply repaid. The [D] user manual sections on these are clearly written and have many helpful examples. They are well worth a read. Figuring out how to use -merge- takes a bit of practice and patience: it's not immediately obvious what to do the first time you try it--but at a certain point it kind of "clicks" in your head and it's easy from that point on.


    Comment


    • #3
      Thank you for your help, Clyde.

      My response to your questions:

      1. Great suggestion. I created the file as suggested and implemented your code. Works great.
      2. Converting to -wide- was clearly a bad idea. This was the only reason I chose to create separate variable for each country's growth ratio.

      Now, regarding your suggestion for (2), all of your assumptions that precede your suggestion describe precisely the structure of the files I'm working with and what I'm after. One thing: how would I go about applying the growth rates to each of the original country files (bfa.dta, bdi.dta, etc.), so that each of these country files has a target_var that his been modified by its corresponding growth factor, all while keeping the rest of variables in each country file intact.

      Finally, can you explain to me the meaning of following lines of code (I'm relatively new to Stata):
      Code:
      levelsof countrycode, local(codes)
      Code:
      c(rc)
      Thanks very much for you help.

      Comment


      • #4
        Now, regarding your suggestion for (2), all of your assumptions that precede your suggestion describe precisely the structure of the files I'm working with and what I'm after. One thing: how would I go about applying the growth rates to each of the original country files (bfa.dta, bdi.dta, etc.), so that each of these country files has a target_var that his been modified by its corresponding growth factor, all while keeping the rest of variables in each country file intact.
        So, unless these files are very large and you are running out of space on your storage device, I would not overwrite the original files. Better practice is to create no ones. So, after you do what I suggested in #2, you have a single file containing all the countries. You just need to now break it up into country files. So picking up right where we left off:
        Code:
        foreach c of local codes {
            preserve
            keep if countrycode == "`c'"
            save `c'_2, replace
            restore
        }
        The -levelsof- command identifies all of the distinct values of the variable named and stores them in a local macro with the name specified in the local() option. So after that command executes, local macro codes contains "bfa bdi" if run on your sample data. Presumably in your real data the list is longer. This type of command is used all the time when we need to loop over the values of a variable and cannot accomplish it with the -by- prefix. See -help levelsof- and the manual section for more details. It's an extremely useful command.

        To understand -c(rc)- you first have to understand -capture-. Notice that the first time through the -foreach c of local codes- loop, the file `building' does not yet exist. The -tempfile- command reserves its name, but does not actually create a file. So while I really just want to keep appending, I have to make an exception for the first time. The command -confirm file `building'- tells Stata to check whether the file `building' already exists. If it does, as is the case every time but the first through the loop, Stata does nothing and proceeds to the next command. But on that first iteration, the file does not yet exist. Without the -capture- in front of it, the behavior of -confirim- is to throw an error message and halt execution when the thing it's asked to confirm isn't really so. You may have noticed in your experience with Stata that when you get an error message, it is usually preceded by r(#), where # is some number. Actually, even when there is no error, Stata returns an error code, namely zero, but it doesn't tell you about it. That number is called the error code, and it identifies, as best Stata could, what the problem is. When you put -capture- in front of a command, two things about its execution are changed:

        1. Output is suppressed; the command is executed -quietly-. Since -confirm- doesn't normally generate any output, this makes no real difference if there is no error. But if there is an error, the error message is not displayed.

        2. The return code that would appear in the error message is saved in c(rc).

        There are a lot of system parameters and other information stored in c(). To see them all, just run -creturn list-. Many of them are useful in many contexts. c(rc) is useful for determining whether the most recently -capture-d command executed normally or found an error. If the command worked normally, c(rc) == 0. If there was an error, then c(rc) will be the non-zero error code.

        So here's a translation of the general idiom:
        Code:
        capture do something  // TRY TO do something. But if it doesn't work, keep quiet and keep going.
        
        if c(rc) == 0 { // If that earlier trial didn't encounter any errors
            do this
        }
        
        else { // If that earlier trial did encounter an error
           do that
        }
        This general type of construct allows the code to flexibly deal with situations such as the non-existence of a file it expects, or the non-existence of a variable in your data set, or the possibility that a variable could be either string or numeric, or lots of other things.

        Comment


        • #5
          Thanks again for your help and very clear explanations.

          I tried running the code and it did not work. I believe the issue is that I am not specifying the file directory that contains the country files (e.g. countrycode.dta).

          Where in your code should I specify the directory with all the country files? I presume I should be somehow redirecting the local macro, below, to such directory.
          Code:
          use `c', clear

          Comment


          • #6
            Yes. So let's say your files are in "C:\Users\John Pevedia\Documents\Country Files" (a typical Windows directory name). Then you would write

            Code:
            use "C:/Users/John Pevedia/Documents/Country Files/`c'", clear
            Note the use of forward slashes (/), not backslashes. That is critical, because if you have \`, Stata will take that as a literal ` character and will not expand local macro c! Don't worry that Windows uses backslashes to separate directories: when Stata passes the request for the file handle to the operating system, it takes care of this for you. (Of course, if you are running on Mac or Unis, the forward slash is the usual separator and there is nothing special here. Just use the normal directory name.)

            That is probably the least complicated way to do it. Another approach is to use the -cd- or -pwd- command to change the working directory to the one that contains the country files. But then you may need to change back later in the code, which means that you first need to store the name of the original working directory somewhere. None of that is terribly difficult, but if the only reason you need to access this other directory is to read these files, then changing the -user- command is less of a bother.

            That said, in most projects that I work on, I have my Stata files (.dta, .do, project-specific .ado, and .smcl or .log) in one directory. And if my original .dta files are imported from other sources (Excel, SAS, SPSS, etc.) I have those in that same directory, too. That way, I don't need to specify full path names when I -use- or -import- files: everything is all in one place. There are some circumstances where the directory starts to get too big and I have to deviate from this practice, and others where the files I"m trying to access are on a remote server and I do not have access that lets me copy them to my own working directory (or I don't want to copy them for reasons of disk space or data security). But I generally recommend the practice of keeping all files relating to data management and analysis of a single project in a single directory. (Administrative files I usually keep in a separate sub-directory or sister directory.)

            Comment


            • #7
              Clyde, a bit late, but thank you for taking the time to help me out with this and provide some thorough explanations. I also took note on your tips and they've worked well for me thus far.

              Comment

              Working...
              X