Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deciles + Average

    I am new on Stata and do not really understand some basic commands as well as the terminology used, so I would really appreciate if you could provide me a step-by-step answer if it is possible. I will explain, along the following points, the analysis I want to perform:

    1) I have 120 variables with 9,630 observations each (i.e. 120 columns and 9,630 rows), although some observations are null (i.e. they display "-"). For each variable, I want to assign to all observations a number from 1 to 10, being number 1 the observations with the highest values and number 10 the observations with the lowest values. In other words, I want to assign a decile (from 1 to 10) to all observations that comprise each of the 120 variables. It is important to note that all observations correspond to the same characteristic (i.e. a given type of fee), while each variable (i.e. each column) corresponds to each month from January 2004 to December 2013 (i.e. 120 months).

    2) Once I have sorted all the observations into deciles for each of the 120 variables, I want to calculate the average of a different characteristic (i.e. returns) for each decile and variable based on the sorting performed previously. It would be really helpful if you could clarify me whether I should import a dataset containing information on the two characteristics (i.e. 120 (months)*2(characteristics) columns and 9,630 rows). As I stated before, I am relatively new on Stata and I am used to import data from Excel.

    3) The final output table containing the averages should exhibit 10 columns (i.e. one for each decile) and 120 rows (i.e. one for each month from January 2004 to December 2013)

    Thank you in advance for your help.

  • #2
    1) You don't tell us what your 120 variables are called: you tell us what they represent, but that isn't really helpful. I'll just assume they are called var1-var120 and they appear in consecutive order in your data set.
    Code:
    foreach v of varlist var1-var120 {
        xtile decile_`v' = `v', nq(10)
        replace decile_`v' = 11-decile_`v'
    }
    generates the decile rankings. Conventionally, in statistics, decile #1 is the one with the lowest values, decile 10 would be the one with the highest values. It appears you want the reverse of that, hence the -replace decile_`v' = 11-decile_`v'- command.

    2) You now refer mysteriously to two characteristics. One of them, I guess, is the values in your data set in the 120 variables and 9630 observations. You want the decile-specific means for that:
    Code:
    foreach v of varlist var1-var120 {
        egen mean_decile_`v' = mean(`v'), by(decile_`v')
    }
    drop var1-var120
    gen long obs_no = _n
    reshape long decile_ mean_decile_, i(obs_no) j(varname) string
    drop obs_no
    duplicates drop
    reshape wide mean_decile_, i(varname) j(decile_)
    That will do it with respect to the one "characteristic."

    You give no clue where the second "characteristic" comes from, and I cannot guess. So I will leave it to you to figure out what to do about that.
    Last edited by Clyde Schechter; 27 Jun 2016, 20:21. Reason: Correct serious code errors in original post.

    Comment


    • #3
      With regard to the second point, yes, one of the characteristics is the original one displaying the 120 variables and 9,630 observations. I do not know if the second code performs the analysis I want to carry out. First, I want to sort the observations into deciles for each of the 120 variables, which is what Code 1 does, I guess. After, I want to calculate the averages of those same deciles. But not the averages of the first characteristic, which has been used to sort into deciles, but rather the averages of the second characteristic based on the ranking performed with the first characteristic.

      One of my questions is whether I should include both characteristics on my initial dataset, which would have 240 variables (i.e. 120 for each characteristic) and 9,630 observations. I do not know if I am clear on my explanations, if not, please let me know. Thank you in advance.

      Comment


      • #4
        I still have no idea what you mean when you refer to two characteristics. First, the word "characteristic" has a meaning in Stata that is completely unrelated to your use of the term, and is almost certainly not germane here. Second, leaving aside the Stata meaning of the term, apparently you want to average this second characteristic based on the deciles of the first. That would seem to mean that this "second characteristic" is also a variable--yet from your description of the data it is not. So I don't see anyway for Stata to average some variable for which you have no data. What am I missing here?

        Comment


        • #5
          Ok, I will assume that I have to merge both "chracteristics" into one dataset. So the initial dataset would contain 240 variables, 120 variables for the first "characteristic", that will serve to sort into deciles and 120 variables (i.e. the second "characteristic") that will be averaged based on the previously estimated deciles. The number of observations would be 9,630, with some of them displaying null values (i.e. "-").

          I have entered code 1 from #2, but I have noticed that Stata does not produce deciles 2 and 3. The reason behind this may be that the observations exhibit duplicates and null values (i.e. “-“). How can I sort the observations into deciles taking into account that the observations are repeated and not considering null observations?

          Do you need more information to give an accurate answer? Thank you in advance.


          Comment


          • #6
            First, "-" is not an appropriate "null value" in Stata. If an observation is missing, in Stata it should appear as one of Stata's missing values, usually just a period. If your data contains observations "-" then your variable is a string and you won't be able to calculate anything with it. You have to replace those "-" values with "", and then destring the variable. The code below assumes you already have the data imported into Stata, in memory, and with the relevant variables all numeric.

            I'm going to assume that your first "characteristic" is expressed in var1-var120 and that the second one is in second1-second120. I believe the following code will do what you want:
            Code:
            foreach v of varlist var1-var120 {
                xtile decile_`v' = `v', nq(10)
                replace decile_`v' = 11-decile_`v'
            }
            forvalues i / 1/120 {
                egen mean_decile_second`i' = mean(second`i'), by(decile_var`i')
            }
            keep mean_decile_* decile_*
            gen long obs_no = _n
            reshape long decile_ mean_decile_, i(obs_no) j(varname) string
            drop obs_no
            duplicates drop
            reshape wide mean_decile_, i(varname) j(decile_)
            As for your concern about the omission of certain deciles, that is not a problem you can solve. It arises from ties in the data, as you noted. Stata cannot and will not take two observations with the same value and assign them to different deciles. Such a decision would be entirely arbitrary and would be irreproducible from one run to the next so that results based on it would be meaningless. Apparently your data cannot be separated into 10 approximately equal sized blocks because the values are too "clumped together." If that is so, the problem you are trying to solve has no solution, and you need to take a different approach.

            Comment


            • #7
              How can I replace "-" values with "" and then destring all variables?

              With regard to the composition of the deciles, it is not possible to sort the observations into deciles that display equal number of observations? You mentioned taking a different approach, what could I do to solve this issue?

              Thank you so much.

              Comment


              • #8
                You need to be very careful about replacing "-" characters as you could be zapping minus signs. Hence replace such characters with empty if and only if there are at most surrounding spaces.

                Code:
                ds, has(type string) 
                
                foreach v in `r(varlist)' { 
                     replace `v' = "" if trim(`v') == "-" 
                }

                Comment


                • #9
                  I have entered code #4 but nothing has changed, there are still "-" null values...

                  Comment


                  • #10
                    Code #8 sorry

                    Comment


                    • #11
                      With regards to #10, maybe it is due to a difference between "-" and "–", as already remarked in #8.
                      Best regards,

                      Marcos

                      Comment


                      • #12
                        Inigo, we were all beginners once. And using a Forum like this is a great way to advance your Stata skills. But I think there is a suite of basic commands that everybody needs to understand and be able to use in order to get any work done with Stata. That background can be more efficiently learned by reading Getting Started with Stata [U], and in the User's Guide [U]. The time you spend reading and mastering that material will be amply repaid. Those documents are part of the PDF documentation that comes with your Stata installation (if you are using version 11 or beyond). To access it, launch Stata and drop down the Help menu and select PDF Documentation. Then follow the links to the GS and U sections.

                        The data management question you posed in your initial post is, in my view, at the intermediate to advanced level. So, in my response, I assumed that, despite a modest claim to be a beginner, you would be comfortable with things like cleaning strings and converting them to numeric data. I think you will find it more efficient and effective to pursue a project that entails the level of complexity implied by your initial post after giving yourself a quick introduction to the Stata basics. Feel free to ask questions on this forum as you work your way through that.

                        Returning to your questions in #7
                        With regard to the composition of the deciles, it is not possible to sort the observations into deciles that display equal number of observations?
                        In a word, no. Consider the following data set
                        Code:
                        input x y
                        1 0
                        1 1
                        1 2
                        1 3
                        1 4
                        1 5
                        1 6
                        1 7
                        1 8
                        1 9
                        1 10
                        1 11
                        2 12
                        2 13
                        2 14
                        2 15
                        2 16
                        end
                        It contains 17 observations, so it can't be divided into 10 equal sizegroups in any way at all. Of course, you could get 10 approximately equal sized groups, of size 1 or 2 each. But even that won't give you deciles that would be useful for your purposes. Look at all the tied values of x. There are 12 1's and 5 2's. How will you break those up. There is no reason to put any particular observation with x = 1 into the lowest decile of x. They all have an equal claim to it. And, for your purposes, it matters greatly which goes in which decile, because the mean values of y by decile of x will differ radically depending on how you partition the observations into groups.

                        What are the alternatives? Without knowing what these variables are and why you're trying to do this in the first place, nobody can really advise. Even knowing those things, there may be conventions in your discipline that govern how they are done that I would probably be unfamiliar with. I think your best bet is to explain to an experienced colleague or your supervisor the problem that tied values in the data are creating, and ask that person's advice about an different strategy. Perhaps it would be possible to get quintiles and that would be adequate for your purpose. Or perhaps it is best to abandon the whole approach of breaking up into groups: if you are looking to see how "characteristic 2" changes as we go up the value scale for "characteristic 1," it might make more sense to just calculate a correlation coefficient. Or perhaps your colleague/supervisor can suggest something else that is more common in your field.

                        Comment


                        • #13
                          On the report in #9 and #10, and indeed more generally: Show us sample data where the problem arises.

                          Comment


                          • #14
                            How can I show my sample data?

                            Comment


                            • #15
                              download and install and use "dataex" ; see #12 in the FAQ (which you should have read prior to posting) for more information

                              Comment

                              Working...
                              X